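I'm new to scrapy and am trying to crawl a website with CrawlSpider, following the "Next" button recursively. It isn't working. I suspect the regular expression, but I've checked it many times and can't find the mistake. The spider scrapes only the landing page and never moves on to the next page.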
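# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from bedbugs.items import BedbugsItem  # assumed module path; not shown in the question

class MerchantRatingSpider(CrawlSpider):  # class name and `name` are assumed
    name = 'merchantrating'
    start_urls = ['https://shopping.yahoo.com/merchantrating/?mid=13652']
    rules = (
        # Follow only links whose URL matches this pattern (raw string keeps \? and \d intact).
        Rule(LinkExtractor(allow=r"/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F\?mid=13652&sort=1&start=\d+"), callback='parse_start_url', follow=True),
    )

    def parse_start_url(self, response):
        sel = Selector(response)
        contents = sel.xpath('//p')
        for content in contents:
            item = BedbugsItem()
            item['pageContent'] = content.xpath('text()').extract()
            # Yield each item instead of appending to self.items and returning the whole list.
            yield item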
Best answer
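Use XPath instead. The `allow` pattern hard-codes the `;_ylt=...` part of the URL, which makes it fragile if that value differs between pages; selecting the "Next" link by where it sits in the page avoids depending on the URL at all: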
rules = (
    Rule(
        LinkExtractor(
            restrict_xpaths=[
                "//div[@class='pagination']//a[contains(., 'Next')]"
            ]
        ),
        callback='parse_start_url',
        follow=True,
    ),
)
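If you would rather keep a regex, here is a rough sketch of how the original pattern can be checked outside scrapy with the standard `re` module (LinkExtractor applies `allow` patterns to the extracted URLs as regex searches). It assumes the `_ylt` value can change from one page to the next, which the question does not confirm. The second test URL and its token are made up for illustration, and the looser pattern is a suggestion, not part of the original answer.

```python
import re

# Pattern from the question: the `_ylt` token is hard-coded.
STRICT = re.compile(
    r"/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F\?mid=13652&sort=1&start=\d+"
)

# Looser variant (an assumption, not from the answer): accept any `_ylt` token,
# or none at all, before the query string.
LOOSE = re.compile(r"/merchantrating/(;_ylt=[^?]*)?\?mid=13652&sort=1&start=\d+")

BASE = "https://shopping.yahoo.com/merchantrating/"

# Hypothetical "Next" links: the first reuses the token from the question,
# the second carries an invented token.
same_token = BASE + ";_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F?mid=13652&sort=1&start=10"
new_token = BASE + ";_ylt=HypotheticalToken123?mid=13652&sort=1&start=10"


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in (same_token, new_token):
        print(url)
        print("  strict:", matches(STRICT, url), " loose:", matches(LOOSE, url))
```

If the strict pattern fails on the links the site actually serves, that would explain why only the landing page gets scraped. The XPath rule above sidesteps the issue because it never looks at the URL.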
Regarding "python - Need help with this scrapy regex", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26648027/