python - Scrapy CrawlSpider does not execute LinkExtractor if run via process.crawl()

Tags: python web-scraping scrapy

I don't understand why my spider only scrapes the start_url and never extracts any of the URLs matching the allow argument.

from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import DropItem
from scrapy.settings import Settings
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com/"]
    rules = [Rule(LinkExtractor(allow='/product_page/'), callback='parse', follow=True)]
    start_urls = ["http://www.website.com/list_of_products.php"]    
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except (DropItem, Exception) as e:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)


if __name__ == '__main__':
    settings = Settings()
    settings.set('ITEM_PIPELINES', {
        'pipelines.csv_pipeline.CsvPipeline': 100
    })
    process = CrawlerProcess(settings)
    process.crawl(MySpider)
    process.start()

I am not sure whether the problem happens because the spider is started from the __main__ block.

Best answer

Chances are the problem is that you are redefining the parse method, which should be avoided. From the crawling rules docs:

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
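To see why this breaks link extraction, here is roughly what CrawlSpider does internally (a simplified sketch based on Scrapy 1.x, not the verbatim source):

from scrapy.spiders import Spider

class CrawlSpider(Spider):
    def parse(self, response):
        # Every downloaded response enters through parse();
        # _parse_response() runs the LinkExtractor rules and schedules
        # the follow-up requests. Overriding parse() in a subclass
        # therefore skips rule processing entirely.
        return self._parse_response(response, self.parse_start_url,
                                    cb_kwargs={}, follow=True)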

So I would try naming the callback something else (I renamed it to parse_item, similar to the CrawlSpider example in the docs, but any name other than parse works). I also trimmed allowed_domains to a bare domain, since it expects domain names rather than URLs, and added a second rule so that further list pages are followed too (no callback is needed there, because follow=True already re-applies the rules):

class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com"]
    rules = [Rule(LinkExtractor(allow='/product_page/.+'), callback='parse_item', follow=True),
             Rule(LinkExtractor(allow='/list_of_products.+'), follow=True)]
    start_urls = ["http://www.website.com/list_of_products.php"]
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse_item(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except (DropItem, Exception) as e:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)
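If you also need to scrape data from the list page itself, CrawlSpider provides the parse_start_url() hook for responses to start_urls, which leaves the built-in parse() untouched. As for the original suspicion: launching the crawl through CrawlerProcess/process.crawl() is not the problem, because CrawlSpider compiles its rules in __init__ regardless of how it is started. A minimal sketch (the logging and placeholder item are illustrative, not taken from the original code):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com"]
    start_urls = ["http://www.website.com/list_of_products.php"]
    rules = [Rule(LinkExtractor(allow='/product_page/.+'),
                  callback='parse_item', follow=True)]

    def parse_start_url(self, response):
        # CrawlSpider calls this for start_urls responses; return an
        # iterable of items/requests (empty here, we only log the page).
        self.logger.info("fetched list page: %s", response.url)
        return []

    def parse_item(self, response):
        # placeholder extraction; real field parsing goes here
        yield {"url": response.url}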

Regarding "python - Scrapy CrawlSpider does not execute LinkExtractor if run via process.crawl()", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59740252/
