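python - Run custom code after the crawl finishes (scrapy)

Tags: python scrapy

I need to test all the scraped data once the crawl has finished (for example, the percentage of items in which certain fields are filled). The data ends up in a csv file, so I decided to use Pandas for the checks. Is there a way, from within the Scrapy spider, to launch the code that tests the .csv file once Scrapy reports that the crawl is done? I tried using an extension but could not get it to work. Thanks.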

import scrapy


class Spider(scrapy.Spider):
    name = 'scrapyspider'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com/1/', 'https://www.example.com/2/']


    def parse(self, response):
        for product_link in response.xpath(
                '//a[@class="product-link"]/@href').extract():
            absolute_url = response.urljoin(product_link)
            yield scrapy.Request(absolute_url, self.parse_product)
        for category_link in response.xpath(
            '//a[@class="navigation-item-link"]/@href').extract():
            absolute_url = response.urljoin(category_link)
            yield scrapy.Request(absolute_url, self.parse)

    def parse_product(self, response):
        ...
        yield item

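Best answer

Scrapy lets you control the flow of items through item pipelines.

In a pipeline you can validate each item or apply any check you need to it. If an item does not meet your criteria, or you want to update its data based on some attribute value, the pipeline is the place to do it.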

For more information about pipelines, see the Item Pipeline section of the Scrapy documentation.
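As a rough illustration of the pipeline approach, the sketch below counts how often each field is filled while items pass through, then logs the percentages when the spider closes. The class name `FieldCoveragePipeline`, the `coverage` helper, and the field names `name`, `price`, and `url` are assumptions made up for this example, not part of the original question. The hooks `open_spider`, `process_item`, and `close_spider` are the ones Scrapy's item pipeline documentation describes, but check them against the Scrapy version you run.

```python
from collections import Counter


class FieldCoveragePipeline:
    """Hypothetical pipeline that tracks how often each field is filled."""

    # Assumed field names, for illustration only; replace with your item's fields.
    fields = ("name", "price", "url")

    def open_spider(self, spider):
        # Called once when the spider starts: reset the counters.
        self.total = 0
        self.filled = Counter()

    def process_item(self, item, spider):
        # Called for every scraped item: count the non-empty fields.
        self.total += 1
        data = dict(item)
        for field in self.fields:
            if data.get(field) not in (None, ""):
                self.filled[field] += 1
        return item  # pass the item on unchanged

    def coverage(self):
        # Percentage of items in which each field was filled.
        if not self.total:
            return {}
        return {f: 100.0 * self.filled[f] / self.total for f in self.fields}

    def close_spider(self, spider):
        # Called once when the spider closes: report the results.
        for field, pct in self.coverage().items():
            spider.logger.info(
                "%s filled in %.1f%% of %d items", field, pct, self.total
            )
```

To try it, the class would be registered in the project settings, for instance `ITEM_PIPELINES = {"myproject.pipelines.FieldCoveragePipeline": 300}`, where the module path is a placeholder. Because the sketch counts items as they pass through instead of re-reading the csv file, it does not depend on when the feed export finishes writing the file. A Pandas check on the file itself would be a separate step.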

This page about python - running custom code after the crawl finishes (scrapy) is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/54224645/
