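python - Scrapy script returns 502 errors

Tags: python web-scraping scrapy

Scraping newbie here. I'm using Scrapy to pull a large amount of data from a single site. When I run the script, it works fine for a few minutes, then slows down, nearly stops, and keeps throwing the following errors for the different URLs it tries to scrape: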

2013-07-20 14:15:17-0700 [billboard_spider] DEBUG: Retrying <GET http://www.billboard.com/charts/1981-01-17/hot-100> (failed 1 times): Getting http://www.billboard.com/charts/1981-01-17/hot-100 took longer than 180 seconds.

2013-07-20 14:16:56-0700 [billboard_spider] DEBUG: Crawled (502) <GET http://www.billboard.com/charts/1981-01-17/hot-100> (referer: None) 

Errors like the ones above pile up for different URLs, and I'm not sure what is causing them...

Here is the script:

import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

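    def __init__(self, *args, **kwargs):
        # Let the base spider run its own setup before start_urls is built.
        super(BillBoardSpider, self).__init__(*args, **kwargs)
        date = datetime.date(year=1975, month=12, day=27)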

        self.start_urls = []
        while True:
            if date.year >= 2013:
                break

            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]

        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        item = BillBoardItem()
        item['date'] = date
        for song in songs:
            try:
                track = song.select('.//header/h1/text()').extract()[0]
                track = track.rstrip()
                item['song'] = track
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
                break
            except IndexError:
                # Skip entries that lack a title or artist node.
                continue

        yield item

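Best answer

The spider works for me and scrapes the data without any problems. So, as @Tiago assumed, you have been banned.

Read how to avoid getting banned and adjust your Scrapy settings accordingly in future. I would start by increasing DOWNLOAD_DELAY and rotating your IPs.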
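
As a rough sketch of what "adjusting your Scrapy settings" could look like, here is a settings.py fragment that slows the crawl down. The setting names are standard Scrapy settings, but every value is an illustrative assumption and not something tuned for billboard.com. The 180 seconds in the log above is Scrapy's default DOWNLOAD_TIMEOUT.

```python
# settings.py (sketch; all values are illustrative assumptions)

# Wait between consecutive requests to the same site, in seconds.
DOWNLOAD_DELAY = 2.0
# Vary the actual wait between 0.5x and 1.5x of DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True

# Send at most one request at a time to a given domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let the AutoThrottle extension adapt the delay to server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 60.0

# Give up on a single response sooner than the 180 second default.
DOWNLOAD_TIMEOUT = 60

# Retry transient server errors, including 502, a limited number of times.
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
```

These settings only make the crawl more polite. Rotating IPs needs a proxy setup on top of this, which is not shown here.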

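Also, consider switching to a real automated browser, such as selenium.

Also, see if you can get the data from the RSS XML feed: http://www.billboard.com/rss.

Hope that helps.

This question and answer, "python - Scrapy script returns 502 errors", comes from Stack Overflow: https://stackoverflow.com/questions/17767324/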
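
If the feed turns out to be usable, reading it takes very little code. The sketch below parses a generic RSS 2.0 document with the standard library. SAMPLE_RSS and its contents are made up for illustration, and parse_feed is a hypothetical helper; the real Billboard feed may use different elements, so check its actual structure first.

```python
import xml.etree.ElementTree as ET

# Made-up sample in generic RSS 2.0 shape; NOT the real Billboard feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example chart feed</title>
    <item>
      <title>1: Example Song</title>
      <description>Example Artist</description>
    </item>
    <item>
      <title>2: Another Song</title>
      <description>Another Artist</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title", default="").strip(),
            "description": item.findtext("description", default="").strip(),
        })
    return entries


entries = parse_feed(SAMPLE_RSS)
for entry in entries:
    print(entry["title"], "-", entry["description"])
```

A single feed request replaces many page fetches, which also lowers the chance of being throttled.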
