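I am using scrapy to fetch a list of URLs. Some of the URLs are redirected to another URL (302). What I want is to count the number of redirects that occur for a single URL, and to collect the complete set of all intermediate redirect URLs, for example: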
Fetching - http://ign.com
Redirected to - http://de.ign.com/
redirect_count = 1
url_set = ['http://ign.com', 'http://de.ign.com/']
Best answer
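What you need is to handle the 302 HTTP status in your spider. Listing it in handle_httpstatus_list makes Scrapy hand the 302 response to your callback instead of following the redirect: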
handle_httpstatus_list = [200, 302, 404] # any other if you want
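Once 302 responses reach the callback, the spider has to read each Location header and keep track of the chain itself. The bookkeeping can be sketched in plain Python, independent of Scrapy. Note that `trace_redirects`, the `hops` mapping (a stand-in for the Location header each 302 response would carry), and `max_redirects` are hypothetical names for illustration, not part of Scrapy.

```python
from urllib.parse import urljoin


def trace_redirects(start_url, hops, max_redirects=20):
    """Walk a redirect chain and return (redirect_count, url_set).

    `hops` is a hypothetical stand-in for the network: it maps a URL to the
    Location header value that a 302 response for that URL would carry.
    """
    url_set = [start_url]
    current = start_url
    # Stop when the current URL no longer redirects, or after max_redirects hops.
    while current in hops and len(url_set) <= max_redirects:
        # Location may be relative, so resolve it against the current URL.
        current = urljoin(current, hops[current])
        url_set.append(current)
    return len(url_set) - 1, url_set


count, urls = trace_redirects("http://ign.com", {"http://ign.com": "http://de.ign.com/"})
print("redirect_count =", count)
print("url_set =", urls)
```

The cap on hops is there so that a redirect loop cannot run forever.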
Here is an example.
Define your items.py as:
from scrapy.item import Item, Field

class myItems(Item):
    redirect_count = Field()
Later, in your spider.py:
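from scrapy import Request, Spider
from .items import myItems

class mainSpider(Spider):
    name = "crazyCrawler"
    # allowed_domains takes bare domain names, not URLs; de.ign.com is covered as a subdomain
    allowed_domains = ['ign.com']
    # pass 302 responses to parse() instead of letting Scrapy follow them
    handle_httpstatus_list = [200, 302, 404]  # any other if you want
    start_urls = [
        "http://ign.com"
    ]

    def parse(self, response):
        # redirects seen so far for this start URL (0 on the first request)
        redirect_count = response.meta.get('redirect_count', 0)
        location = response.headers.get('Location')
        if response.status == 302 and location:
            # follow the redirect by hand and carry the counter along
            yield Request(
                response.urljoin(location.decode()),
                callback=self.parse,
                meta={'redirect_count': redirect_count + 1},
            )
        else:
            # final response: emit the total for this chain
            item = myItems()
            item['redirect_count'] = redirect_count
            yield item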
Now you can run it:
scrapy crawl crazyCrawler -o items.json
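If only the count and the URL chain are needed, an alternative is to leave 302 out of handle_httpstatus_list so that Scrapy's RedirectMiddleware follows the redirects itself. It records the URLs a request passed through under the `redirect_urls` key of the request meta. A minimal sketch of turning that into the output shape from the question follows. Note that `summarize_redirects` is a hypothetical helper name, and it takes a plain dict so it can be tried without a running crawl.

```python
def summarize_redirects(meta, final_url):
    """Build the redirect summary from a response's meta dict and final URL."""
    # 'redirect_urls' is filled in by Scrapy's RedirectMiddleware when it follows
    # redirects; the key is absent when the request was never redirected.
    visited = list(meta.get("redirect_urls", []))
    return {
        "redirect_count": len(visited),
        "url_set": visited + [final_url],
    }


# Stand-in for response.meta and response.url after one followed redirect.
summary = summarize_redirects({"redirect_urls": ["http://ign.com"]}, "http://de.ign.com/")
print(summary)
```

Inside a spider callback this would be called as summarize_redirects(response.meta, response.url).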
Regarding python - counting the number of redirects in Scrapy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26930572/