I wrote the Scrapy spider below, but after the initial request it does not continue crawling, even though I have yielded more scrapy.Request objects for Scrapy to follow.
import regex as re
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Spider

class myspider(Spider):
    name = 'haha'
    allowed_domains = ['https://blog.scrapinghub.com/']
    start_urls = ['https://blog.scrapinghub.com/']
    extractor = LinkExtractor(allow=allowed_domains)

    def parse(self, response):
        # To extract all the links on this page
        links_in_page = self.extractor.extract_links(response)
        for link in links_in_page:
            yield scrapy.Request(link.url, callback=self.parse)
Best Answer
allowed_domains needs to be a list of domains, not a list of URLs.

So it should be:
allowed_domains = ['blog.scrapinghub.com']
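The reason the crawl stalls is that Scrapy's OffsiteMiddleware compares each outgoing request's hostname against the entries in allowed_domains. A full URL such as 'https://blog.scrapinghub.com/' can never equal (or be a parent domain of) a hostname, so every follow-up request is silently filtered as offsite. The sketch below illustrates that matching idea with only the standard library; it is a simplified illustration, not Scrapy's actual implementation, and is_offsite is a hypothetical helper:

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Hypothetical helper mirroring the idea behind OffsiteMiddleware:
    # a request is kept only if its hostname equals, or is a subdomain
    # of, one of the allowed domains.
    host = urlparse(url).netloc
    return not any(host == d or host.endswith('.' + d) for d in allowed_domains)

# With a full URL as the "domain", every request is treated as offsite:
is_offsite('https://blog.scrapinghub.com/page',
           ['https://blog.scrapinghub.com/'])   # → True (filtered, crawl stalls)

# With a bare domain, follow-up requests are kept and the crawl continues:
is_offsite('https://blog.scrapinghub.com/page',
           ['blog.scrapinghub.com'])            # → False (kept)
```

This is also why Scrapy logs "Filtered offsite request" messages at DEBUG level when allowed_domains is misconfigured this way.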
Regarding "scrapy yield Request not working", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/40035099/