python - Scrapy problem with multiple URLs

Tags: python, scrapy

I am scraping data from multiple URLs like this:

import scrapy

from pogba.items import PogbaItem

class DmozSpider(scrapy.Spider):
    name = "pogba"
    allowed_domains = ["fourfourtwo.com"]
    start_urls = [
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459525/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459571/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459585/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459614/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459635/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459644/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459662/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459674/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459686/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459694/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459705/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459710/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459737/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459744/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459765/player-stats/74208/OVERALL_02"
    ]

    def parse(self, response):
        coords = []
        for sel in response.xpath('//*[@id="pitch"]/*[contains(@class,"success")]'):
            item = PogbaItem()
            # Some elements carry @x/@y, others @x1/@y1 — take whichever exists.
            item['x'] = sel.xpath('(@x|@x1)').extract()
            item['y'] = sel.xpath('(@y|@y1)').extract()
            coords.append(item)
        return coords

The problem is that in this case I get a CSV with about 200 rows, whereas each URL should yield about 50 rows. Scraping one URL at a time works fine, so why do I get different results when I set multiple URLs?

Best answer

I would try tuning the crawl to slow it down a bit, by increasing the delay between requests (the DOWNLOAD_DELAY setting) and reducing the number of concurrent requests (the CONCURRENT_REQUESTS setting), for example:

DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 4
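As a minimal sketch, these overrides go into the project's settings.py; the AutoThrottle line is an optional extra beyond the original answer, letting Scrapy adapt the delay to the server's response times instead of using a fixed value:

```python
# settings.py — slow the crawl so each response comes back complete
DOWNLOAD_DELAY = 1        # wait ~1 second between consecutive requests
CONCURRENT_REQUESTS = 4   # cap simultaneous requests (Scrapy's default is 16)

# Optional alternative: let Scrapy adjust the delay based on server latency
AUTOTHROTTLE_ENABLED = True
```

For a one-off run, the same settings can also be passed on the command line with `scrapy crawl pogba -s DOWNLOAD_DELAY=1 -s CONCURRENT_REQUESTS=4`.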

This answer to "python - Scrapy problem with multiple URLs" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/36986116/
