I have a basic Scrapy script that does the following. It visits the website and uses a rule to follow all of the pagination pages:
    rules = (
        Rule(
            LinkExtractor(allow=(), restrict_xpaths=('//*[@id="pagination_top"]/a',)),
            callback="parse_page",
            follow=True,
        ),
    )
On each page, it collects all the links to product pages:
    def parse_page(self, response):
        for href in response.css("#prod_category > ul > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)
It then visits each product page to get details about the product. After that, I fetch more details from a different link:
    def parse_dir_contents(self, response):
        # select xpath here
        print('________________________BEGIN PRODUCT________________________')
        item = detailedItem()
        item['title'] = sites.xpath('//*[@id="product-name"]/text()').extract()
        # get url_2 from this page
        request = scrapy.Request(url_2, callback=self.parse_detailed_contents)
        request.meta['item'] = item
        yield request
Finally, here is the function that extracts the product details. I think this last function, parse_detailed_contents, is where the problem lies:
    def parse_detailed_contents(self, response):
        item = response.meta['item']
        sel = Selector(response)
        sites = sel.xpath('//*[@id="prod-details"]')
        print('________________________GETTING DETAILS________________________')
        item['prod_details'] = sites.xpath('//*[@id="prod-details"]/div/text()').extract()
        return item
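The problem is that my script returns item['prod_details'] for the first link but does not return any items for the subsequent links.
Is that because the url_2 being passed is the same for all products?
Can anyone help? Thanks a lot in advance!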
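Best answer
Try adding dont_filter=True to the request. By default, Scrapy's scheduler drops requests it considers duplicates, so if url_2 is the same for every product, only the first request is actually sent and only the first item comes back: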
    def parse_dir_contents(self, response):
        # select xpath here
        print('________________________BEGIN PRODUCT________________________')
        item = detailedItem()
        item['title'] = sites.xpath('//*[@id="product-name"]/text()').extract()
        # get url_2 from this page
        request = scrapy.Request(url_2, callback=self.parse_detailed_contents, dont_filter=True)
        request.meta['item'] = item
        yield request
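
To illustrate why the flag can matter, here is a minimal standalone sketch of duplicate filtering. It does not use Scrapy at all: the Request and Scheduler classes, the crawl helper, and the example.com URL are hypothetical stand-ins, and the toy filter compares URLs only, whereas Scrapy's real filter works on a request fingerprint. The assumption being modelled is the one raised in the question, namely that url_2 is the same for every product.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical stand-in for a request; this is not scrapy.Request."""
    url: str
    meta: dict = field(default_factory=dict)
    dont_filter: bool = False


class Scheduler:
    """Toy scheduler that drops requests whose URL it has already seen."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def enqueue(self, request):
        # A request flagged with dont_filter bypasses the duplicate check.
        if not request.dont_filter and request.url in self.seen:
            return False  # dropped as a duplicate, its callback never runs
        self.seen.add(request.url)
        self.queue.append(request)
        return True


def crawl(titles, url_2, dont_filter):
    """Schedule one detail request per product; return the titles that got through."""
    scheduler = Scheduler()
    for title in titles:
        item = {"title": title}
        scheduler.enqueue(Request(url_2, meta={"item": item}, dont_filter=dont_filter))
    return [request.meta["item"]["title"] for request in scheduler.queue]


titles = ["product A", "product B", "product C"]
shared_url = "http://example.com/details"  # assumption: same url_2 for every product

filtered = crawl(titles, shared_url, dont_filter=False)
unfiltered = crawl(titles, shared_url, dont_filter=True)

print(filtered)    # only the first product survives the duplicate filter
print(unfiltered)  # every product gets its own detail request
```

In this model, the filtered run keeps only the first product, which matches the symptom in the question. If url_2 actually differs per product, the duplicate filter is probably not the cause and dont_filter=True would not change anything.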
For more on python - Scrapy callback functions, see the similar question on Stack Overflow: https://stackoverflow.com/questions/36440334/