I can't figure out why my spider only scrapes the start_url and never follows any of the extracted URLs that match the allow parameter.
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import DropItem
from scrapy.settings import Settings
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com/"]
    rules = [Rule(LinkExtractor(allow='/product_page/'), callback='parse', follow=True)]
    start_urls = ["http://www.website.com/list_of_products.php"]
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except (DropItem, Exception) as e:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)


if __name__ == '__main__':
    settings = Settings()
    settings.set('ITEM_PIPELINES', {
        'pipelines.csv_pipeline.CsvPipeline': 100
    })
    process = CrawlerProcess(settings)
    process.crawl(MySpider)
    process.start()
I'm not sure whether the problem arises because the crawl is launched from the if __name__ == '__main__': block.
Best answer
The problem is probably that you are redefining the parse method, which should be avoided. From the crawling rules docs:
Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
So, try naming the function something else (I renamed it to parse_item, like the CrawlSpider example in the docs, but you can use any name):
class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com"]  # note: a bare domain, no trailing "/"
    rules = [
        Rule(LinkExtractor(allow='/product_page/.+'), callback='parse_item', follow=True),
        # Follow the listing pages without naming a callback, so that
        # CrawlSpider's built-in parse method is left alone.
        Rule(LinkExtractor(allow='/list_of_products.+'), follow=True),
    ]
    start_urls = ["http://www.website.com/list_of_products.php"]
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse_item(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except Exception:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)
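As an aside, launching the spider with process.crawl() from an if __name__ == '__main__': block is the documented way to run Scrapy from a script, so that part is not the cause. And if you also need to extract data from the start URL itself, CrawlSpider provides the parse_start_url() hook for exactly that, so you still never have to override parse. A minimal sketch (the bodies of parse_start_url and parse_item here are illustrative assumptions, not part of the original answer):

from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com"]
    rules = [Rule(LinkExtractor(allow='/product_page/.+'), callback='parse_item', follow=True)]
    start_urls = ["http://www.website.com/list_of_products.php"]

    def parse_start_url(self, response):
        # CrawlSpider calls this hook for responses to start_urls;
        # the rules are still applied to the same response afterwards.
        self.logger.info("Listing page: %s", response.url)
        return []  # return items or requests here if the listing page itself has data

    def parse_item(self, response):
        # hypothetical product-page callback, as in the answer above
        yield {"url": response.url}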
Regarding "python - Scrapy CrawlSpider does not execute LinkExtractor if run via process.crawl()", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59740252/