python - Why does Python Scrapy show a "twisted.internet.error.TimeoutError" error

Tags: python, scrapy

I am trying to scrape a page with Python Scrapy. After a number of successful requests, Scrapy exits with

twisted.internet.error.TimeoutError error

Here is my code:

#infobel_spider.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy.http import FormRequest
from infobel.items import InfobelItem
import sys
import xlwt
import re
import codecs    

class InfobelSpider(BaseSpider):
    name = 'infobel'
    start_urls = ['http://www.infobel.com/en/italy/business/20300/accessories']

    def parse(self,response):

        hxs = HtmlXPathSelector(response)

        next_page = hxs.select("//a[@id='Pagination1_lnkNextRec']/@href").extract()
        # follow the "next page" link if there is one
        if next_page:
            yield Request("http://www.infobel.com" + next_page[0], self.parse)

        qs = hxs.select("//div[@class='result-item clearfix']")
        for q in qs:
            item = InfobelItem()
            item['name'] = q.select('div/div/h2/a/span/text()').extract()
            item['address'] = q.select('div/div/ul/li[1]/div/span/text()').extract()
            item['phone'] = q.select('div/div/ul/li[2]/div/text()').extract()
            item['email'] = q.select('div/div/ul/li[3]/div/a/text()').extract()
            item['website'] = q.select('div/div/ul/li[4]/div/a/@href').extract()
            item['category'] = q.select("div/div[@class='categories']/div/ul/li/text()").extract()
            yield item

#items.py    
from scrapy.item import Item, Field

class InfobelItem(Item):
    # define the fields for your item here like:
    name = Field()
    address = Field()
    phone = Field()
    email = Field()
    category = Field()
    website = Field()

#middlewares.py    
import base64
import random
from settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # rotate proxies: pick one at random for every request
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        # only send credentials when some are actually configured (the
        # empty-string entries in PROXIES would otherwise produce a bogus
        # header), and strip the trailing newline that encodestring appends
        if proxy['user_pass']:
            encoded_user_pass = base64.encodestring(proxy['user_pass']).strip()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

#pipelines.py
import csv

class InfobelPipeline(object):
    def __init__(self):
        self.file = csv.writer(open('items.csv','wb'))
    def process_item(self, item, spider):
        name = item['name']
        address = item['address']
        phone = item['phone']
        email = item['email']
        category = item['category']
        website = item['website']
        self.file.writerow((name,address,phone,email,category,website))
        return item

#settings.py    
BOT_NAME = 'infobel'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['infobel.spiders']
NEWSPIDER_MODULE = 'infobel.spiders'
DEFAULT_ITEM_CLASS = 'infobel.items.InfobelItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
ITEM_PIPELINES = ['infobel.pipelines.InfobelPipeline']
PROXIES = [{'ip_port': '41.43.31.226:8080', 'user_pass': ''},
           {'ip_port': '64.120.226.94:8080', 'user_pass': ''},
           {'ip_port': '196.2.73.246:3128', 'user_pass': ''},]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'infobel.middlewares.ProxyMiddleware': 100,
}

Here is the output:

[infobel] INFO: Passed InfobelItem(website=[u'track.aspx?id=0&url=http://www.bbmodena.it'], category=[u'TELEVISION, VIDEO AND HI-FI EMERGENCY BREAKDOWN SERVICES, REPAIRS AND SPARE PARTS'], name=[u'B & B (S.R.L.) (RIP.TVC VIDEO HI-FI)'], phone=[u'059254545'], address=[u'V. MALAVOLTI\xa047', u'41100', u'MODENA'], email=[u'info@bbmodena.it'])
[infobel] DEBUG: Scraped InfobelItem(website=[u'track.aspx?id=0&url=http://sitoinlavorazione.seat.it/boninispa'], category=[u'AUTOMOBILE AGENTS, DEALERS AND DEALERSHIPS'], name=[u'BONINI (S.P.A.) (CONCESSIONARIA RENAULT)'], phone=[u'035310333'], address=[u'V. S. BERNARDINO\xa0151', u'24126', u'BERGAMO'], email=[u'info@boniniautospa.it']) in <http://www.infobel.com/en/italy/business/20300/accessories>
[infobel] INFO: Passed InfobelItem(website=[u'track.aspx?id=0&url=http://sitoinlavorazione.seat.it/boninispa'], category=[u'AUTOMOBILE AGENTS, DEALERS AND DEALERSHIPS'], name=[u'BONINI (S.P.A.) (CONCESSIONARIA RENAULT)'], phone=[u'035310333'], address=[u'V. S. BERNARDINO\xa0151', u'24126', u'BERGAMO'], email=[u'info@boniniautospa.it'])
[infobel] DEBUG: Retrying <GET http://www.infobel.com/en/italy/business/20300/accessories/10> (failed 1 times): 200 OK
[infobel] DEBUG: Retrying <GET http://www.infobel.com/en/italy/business/20300/accessories/10> (failed 2 times): 200 OK
[infobel] DEBUG: Discarding <GET http://www.infobel.com/en/italy/business/20300/accessories/10> (failed 3 times): User timeout caused connection failure.
[infobel] ERROR: Error downloading <http://www.infobel.com/en/italy/business/20300/accessories/10>: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.TimeoutError'>: User timeout caused connection failure.

[infobel] INFO: Closing spider (finished)
[infobel] INFO: Spider closed (finished)

Best answer

I found this question while having the same problem. The user who asked it had already solved it himself, so I am posting the fix here to make it more visible:

Setting a delay between page downloads from the site helps resolve timeout errors caused by overly frequent requests. This is done with the DOWNLOAD_DELAY setting in the project's settings.py file.

The Scrapy documentation says:

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example:

DOWNLOAD_DELAY = 0.25 # 250 ms of delay
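
Applied to this project, that means adding the setting to settings.py. A minimal sketch (the delay value is illustrative; DOWNLOAD_TIMEOUT and RETRY_TIMES are related stock Scrapy settings shown here at their defaults, not part of the original answer):

# settings.py additions -- illustrative values, tune them for the target site
DOWNLOAD_DELAY = 2        # wait 2 seconds between requests to the same site
DOWNLOAD_TIMEOUT = 180    # seconds before a request is considered timed out
RETRY_TIMES = 2           # retries attempted by RetryMiddleware on failure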

Update from the comments

In addition to DOWNLOAD_DELAY, when the RANDOMIZE_DOWNLOAD_DELAY setting is True, Scrapy waits a random amount of time between 0.5 and 1.5 times DOWNLOAD_DELAY. Their documentation adds:

This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests.

Note that RANDOMIZE_DOWNLOAD_DELAY is True by default.
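
As a concrete illustration of the documented behaviour (plain Python mirroring the description above, not Scrapy's internal code):

# with DOWNLOAD_DELAY = 0.25 and RANDOMIZE_DOWNLOAD_DELAY = True,
# each wait is drawn uniformly from 0.5x to 1.5x the configured delay,
# i.e. from the range 0.125 s to 0.375 s here
import random

DOWNLOAD_DELAY = 0.25
delay = random.uniform(0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY)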

The original question, "python - Why does Python Scrapy show a 'twisted.internet.error.TimeoutError' error", can be found on Stack Overflow: https://stackoverflow.com/questions/10395161/
