python - Configure a spider to ignore URL parameters so Scrapy doesn't scrape the same page twice

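Is it possible to configure a Scrapy spider to ignore the URL parameters of visited URLs, so that a page is not fetched again with different parameters once www.example.com/page?p=value1 has already been visited?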

Best answer

You can't configure this directly, but according to the documentation you can subclass the standard duplicate filter class and override its request_fingerprint method.

This is untested, but it should work. First, subclass the standard duplicate filter class (for example, in dupefilters.py):

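from w3lib.url import url_query_cleaner
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class MyRFPDupeFilter(RFPDupeFilter):
    """Duplicate filter that ignores query strings when comparing URLs."""

    def request_fingerprint(self, request):
        # With no parameter list, url_query_cleaner drops every query
        # parameter, so /page?p=value1 is reduced to /page.
        new_request = request.replace(url=url_query_cleaner(request.url))
        # Fingerprint the cleaned copy; the original request is unchanged.
        return request_fingerprint(new_request)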

Then, in settings.py, set DUPEFILTER_CLASS to your class:

DUPEFILTER_CLASS = 'myproject.dupefilters.MyRFPDupeFilter'
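
To illustrate why stripping the query string makes the filter treat both URLs as the same page, here is a minimal standard-library sketch. It is not Scrapy's implementation: strip_query and SeenUrls are hypothetical stand-ins for url_query_cleaner and the duplicate filter, the p=value2 URL is an assumed second request, and the fingerprint here hashes only the URL, whereas Scrapy's fingerprint also covers the request method and body.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


class SeenUrls:
    """Toy duplicate filter: remembers fingerprints of cleaned URLs."""

    def __init__(self):
        self.fingerprints = set()

    def fingerprint(self, url):
        # Hash the cleaned URL so differing query strings collapse together.
        return hashlib.sha1(strip_query(url).encode("utf-8")).hexdigest()

    def request_seen(self, url):
        # Return True if an equivalent URL was already recorded.
        fp = self.fingerprint(url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


seen = SeenUrls()
first = seen.request_seen("http://www.example.com/page?p=value1")
second = seen.request_seen("http://www.example.com/page?p=value2")
print(first, second)  # prints: False True
```

The second URL is reported as already seen because both URLs reduce to http://www.example.com/page before hashing, which is the effect the overridden request_fingerprint is meant to have.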

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/45939963/
