I can't figure out why my spider only scrapes the start_url and never follows any of the extracted URLs that match the allow parameter.
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import DropItem
from scrapy.settings import Settings
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com/"]
    rules = [Rule(LinkExtractor(allow='/product_page/'), callback='parse', follow=True)]
    start_urls = ["http://www.website.com/list_of_products.php"]
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except (DropItem, Exception) as e:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)


if __name__ == '__main__':
    settings = Settings()
    settings.set('ITEM_PIPELINES', {
        'pipelines.csv_pipeline.CsvPipeline': 100
    })
    process = CrawlerProcess(settings)
    process.crawl(MySpider)
    process.start()
I'm not sure whether the problem arises because the crawl is launched from the if __name__ == '__main__': block.
Best answer
The problem is probably that you are redefining the parse method, which should be avoided. From the crawling rules docs:
Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
So, try naming the function something else (I renamed it to parse_item, like the CrawlSpider example in the docs, but you can use any name):
class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com"]  # note: a bare domain, no trailing "/"
    rules = [
        Rule(LinkExtractor(allow='/product_page/.+'), callback='parse_item', follow=True),
        # Follow the listing pages without naming a callback, so that
        # CrawlSpider's built-in parse method is left alone.
        Rule(LinkExtractor(allow='/list_of_products.+'), follow=True),
    ]
    start_urls = ["http://www.website.com/list_of_products.php"]
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse_item(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except Exception:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)
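As an aside, launching the spider with process.crawl() from an if __name__ == '__main__': block is the documented way to run Scrapy from a script, so that part is not the cause. And if you also need to extract data from the start URL itself, CrawlSpider provides the parse_start_url() hook for exactly that, so you still never have to override parse. A minimal sketch (the bodies of parse_start_url and parse_item here are illustrative assumptions, not part of the original answer):

from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com"]
    rules = [Rule(LinkExtractor(allow='/product_page/.+'), callback='parse_item', follow=True)]
    start_urls = ["http://www.website.com/list_of_products.php"]

    def parse_start_url(self, response):
        # CrawlSpider calls this hook for responses to start_urls;
        # the rules are still applied to the same response afterwards.
        self.logger.info("Listing page: %s", response.url)
        return []  # return items or requests here if the listing page itself has data

    def parse_item(self, response):
        # hypothetical product-page callback, as in the answer above
        yield {"url": response.url}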
Regarding "python - Scrapy CrawlSpider does not execute LinkExtractor if run via process.crawl()", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59740252/