python - Counting the number of redirects in Scrapy


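I am using Scrapy to fetch a list of URLs. Some of the URLs are redirected to another URL with a 302 response. What I want is the number of redirects that occur for a single URL, together with the complete set of intermediate redirect URLs, for example: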

Fetching - http://ign.com

Redirected to - http://de.ign.com/

redirect_count = 1

url_set = ['http://ign.com', 'http://de.ign.com/']

Best answer

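What you need is to handle the 302 HTTP status yourself, so that the redirect response reaches your callback instead of being followed silently: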

handle_httpstatus_list = [200, 302, 404] # any other if you want
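Before wiring this into Scrapy, the bookkeeping a callback has to do can be sketched in plain Python. This is only an illustration under stated assumptions: `FAKE_RESPONSES` and `follow_redirects` are hypothetical names, and the dictionary stands in for the network so the logic can be read and run without a crawl.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for the network: URL -> (status, Location header or None).
FAKE_RESPONSES = {
    "http://ign.com": (302, "http://de.ign.com/"),
    "http://de.ign.com/": (200, None),
}

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def follow_redirects(start_url, responses, max_redirects=20):
    """Return (redirect_count, url_set) for start_url."""
    url_set = [start_url]
    url = start_url
    # Stop after max_redirects hops so a redirect loop cannot run forever.
    while len(url_set) - 1 < max_redirects:
        status, location = responses[url]
        if status not in REDIRECT_STATUSES or not location:
            break
        # The Location header may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
        url_set.append(url)
    return len(url_set) - 1, url_set


redirect_count, url_set = follow_redirects("http://ign.com", FAKE_RESPONSES)
print(redirect_count)
print(url_set)
```

The count is simply the number of hops taken, so it always equals the length of the URL list minus one.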

Here is an example:

Define your items.py as:

from scrapy.item import Item, Field

class myItems(Item):
    redirect_count = Field()

Then, in your spider.py:

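from scrapy import Request, Spider
from .items import myItems

class mainSpider(Spider):
    name = "crazyCrawler"
    # allowed_domains takes bare domain names, not URLs; subdomains are covered
    allowed_domains = ['ign.com']
    # let 302 responses reach parse() instead of being followed automatically
    handle_httpstatus_list = [200, 302, 404] # any other if you want

    start_urls = [
        "http://ign.com"
    ]

    def parse(self, response):
        # redirects followed so far for this start URL (0 on the first request)
        redirect_count = response.meta.get('redirect_count', 0)

        if response.status == 302 and 'Location' in response.headers:
            # follow the redirect manually, carrying the counter in meta
            location = response.headers['Location'].decode('latin-1')
            yield Request(response.urljoin(location), callback=self.parse,
                          meta={'redirect_count': redirect_count + 1})
        else:
            # end of the chain: emit the total for this start URL
            item = myItems()
            item['redirect_count'] = redirect_count
            yield item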

Now you can run it:

scrapy crawl crazyCrawler -o items.json
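As an alternative sketch: if 302 is left out of handle_httpstatus_list, Scrapy's built-in RedirectMiddleware follows redirects itself and records them in the request meta under the keys `redirect_times` and `redirect_urls`. The helper below, with the hypothetical name `redirect_summary`, turns such a meta dict into the count and URL list from the question. A plain dict is used in place of `response.meta` so the snippet runs without a crawl.

```python
def redirect_summary(meta, final_url):
    """Build (redirect_count, url_set) from a Scrapy-style meta dict."""
    # redirect_urls lists every URL that was redirected away from, in order;
    # the key is absent when no redirect happened.
    visited = list(meta.get("redirect_urls", []))
    return len(visited), visited + [final_url]


# Plain dict standing in for response.meta after one redirect (assumed values).
meta = {"redirect_times": 1, "redirect_urls": ["http://ign.com"]}
redirect_count, url_set = redirect_summary(meta, "http://de.ign.com/")
print(redirect_count)
print(url_set)
```

Inside a spider callback this would be called as redirect_summary(response.meta, response.url). The trade-off is that the intermediate 302 responses themselves never reach the callback, only the final one.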

Regarding python - Counting the number of redirects in Scrapy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26930572/
