I am trying to scrape a page with Python Scrapy. After a few successful requests, Scrapy exits with
twisted.internet.error.TimeoutError
Here is my code:
#infobel_spider.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy.http import FormRequest
from infobel.items import InfobelItem
import sys
import xlwt
import re
import codecs

class InfobelSpider(BaseSpider):
    name = 'infobel'
    start_urls = ['http://www.infobel.com/en/italy/business/20300/accessories']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Follow the "next page" link, if there is one
        next_page = hxs.select("//a[@id='Pagination1_lnkNextRec']/@href").extract()
        if next_page:
            yield Request("http://www.infobel.com" + next_page[0], self.parse)
        qs = hxs.select("//div[@class='result-item clearfix']")
        for q in qs:
            item = InfobelItem()
            item['name'] = q.select('div/div/h2/a/span/text()').extract()
            item['address'] = q.select('div/div/ul/li[1]/div/span/text()').extract()
            item['phone'] = q.select('div/div/ul/li[2]/div/text()').extract()
            item['email'] = q.select('div/div/ul/li[3]/div/a/text()').extract()
            item['website'] = q.select('div/div/ul/li[4]/div/a/@href').extract()
            item['category'] = q.select("div/div[@class='categories']/div/ul/li/text()").extract()
            yield item
#items.py
from scrapy.item import Item, Field

class InfobelItem(Item):
    # define the fields for your item here
    name = Field()
    address = Field()
    phone = Field()
    email = Field()
    category = Field()
    website = Field()
#middlewares.py
import base64
import random
from settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        # The PROXIES entries use '' (empty string), not None, so test
        # truthiness here; also note base64.encodestring appends a trailing
        # newline that must be stripped before it goes into the header.
        if proxy['user_pass']:
            encoded_user_pass = base64.encodestring(proxy['user_pass']).strip()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
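As a side note, `base64.encodestring` is deprecated and appends a trailing newline, which corrupts the Proxy-Authorization header if left in place. A minimal sketch of a clean encoder using `base64.b64encode` (the credentials here are made up for illustration):

```python
import base64

def proxy_auth_header(user_pass):
    # Encode "user:password" for a Proxy-Authorization header.
    # b64encode works on bytes and, unlike encodestring, adds no newline.
    token = base64.b64encode(user_pass.encode('utf-8')).decode('ascii')
    return 'Basic ' + token

# Hypothetical credentials, for illustration only
print(proxy_auth_header('alice:secret'))  # → Basic YWxpY2U6c2VjcmV0
```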
#pipelines.py
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join
import re
import json
import csv

class InfobelPipeline(object):
    def __init__(self):
        self.file = csv.writer(open('items.csv', 'wb'))

    def process_item(self, item, spider):
        self.file.writerow((item['name'], item['address'], item['phone'],
                            item['email'], item['category'], item['website']))
        return item
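One thing worth noting, unrelated to the timeout: `extract()` returns lists of strings, so the rows written above contain Python list literals like `[u'...']` rather than plain text. A small helper that flattens each field before writing is one way around that (the sample values below are taken from the log output further down):

```python
def flatten(value):
    # extract() returns a list of strings; join them so the CSV cell
    # holds plain text instead of a Python list literal.
    if isinstance(value, (list, tuple)):
        return u' '.join(value)
    return value

row = [flatten(v) for v in ([u'B & B (S.R.L.)'],
                            [u'V. MALAVOLTI 47', u'41100', u'MODENA'])]
print(row)  # → [u'B & B (S.R.L.)', u'V. MALAVOLTI 47 41100 MODENA']
```

In the pipeline this would become `self.file.writerow([flatten(item[f]) for f in fields])`.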
#settings.py
BOT_NAME = 'infobel'
BOT_VERSION = '1.0'
SPIDER_MODULES = ['infobel.spiders']
NEWSPIDER_MODULE = 'infobel.spiders'
DEFAULT_ITEM_CLASS = 'infobel.items.InfobelItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
ITEM_PIPELINES = ['infobel.pipelines.InfobelPipeline']
PROXIES = [{'ip_port': '41.43.31.226:8080', 'user_pass': ''},
{'ip_port': '64.120.226.94:8080', 'user_pass': ''},
{'ip_port': '196.2.73.246:3128', 'user_pass': ''},]
DOWNLOADER_MIDDLEWARES = {
'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
'infobel.middlewares.ProxyMiddleware': 100,
}
Here is the output:
[infobel] INFO: Passed InfobelItem(website=[u'track.aspx?id=0&url=http://www.bbmodena.it'], category=[u'TELEVISION, VIDEO AND HI-FI EMERGENCY BREAKDOWN SERVICES, REPAIRS AND SPARE PARTS'], name=[u'B & B (S.R.L.) (RIP.TVC VIDEO HI-FI)'], phone=[u'059254545'], address=[u'V. MALAVOLTI\xa047', u'41100', u'MODENA'], email=[u'info@bbmodena.it'])
[infobel] DEBUG: Scraped InfobelItem(website=[u'track.aspx?id=0&url=http://sitoinlavorazione.seat.it/boninispa'], category=[u'AUTOMOBILE AGENTS, DEALERS AND DEALERSHIPS'], name=[u'BONINI (S.P.A.) (CONCESSIONARIA RENAULT)'], phone=[u'035310333'], address=[u'V. S. BERNARDINO\xa0151', u'24126', u'BERGAMO'], email=[u'info@boniniautospa.it']) in <http://www.infobel.com/en/italy/business/20300/accessories>
[infobel] INFO: Passed InfobelItem(website=[u'track.aspx?id=0&url=http://sitoinlavorazione.seat.it/boninispa'], category=[u'AUTOMOBILE AGENTS, DEALERS AND DEALERSHIPS'], name=[u'BONINI (S.P.A.) (CONCESSIONARIA RENAULT)'], phone=[u'035310333'], address=[u'V. S. BERNARDINO\xa0151', u'24126', u'BERGAMO'], email=[u'info@boniniautospa.it'])
[infobel] DEBUG: Retrying <GET http://www.infobel.com/en/italy/business/20300/accessories/10> (failed 1 times): 200 OK
[infobel] DEBUG: Retrying <GET http://www.infobel.com/en/italy/business/20300/accessories/10> (failed 2 times): 200 OK
[infobel] DEBUG: Discarding <GET http://www.infobel.com/en/italy/business/20300/accessories/10> (failed 3 times): User timeout caused connection failure.
[infobel] ERROR: Error downloading <http://www.infobel.com/en/italy/business/20300/accessories/10>: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.TimeoutError'>: User timeout caused connection failure.
[infobel] INFO: Closing spider (finished)
[infobel] INFO: Spider closed (finished)
Best Answer
I found this question while running into the same problem. The original asker solved it himself, and I am posting the solution here to make it more visible:
Setting a delay between page downloads from the site helps with timeout errors caused by requesting pages too frequently. This is done with the DOWNLOAD_DELAY setting in the project's settings.py file.
The Scrapy documentation says:
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example:
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
Update from the comments
In addition to DOWNLOAD_DELAY, if the RANDOMIZE_DOWNLOAD_DELAY setting is True, Scrapy will wait a random amount of time between 0.5 and 1.5 times DOWNLOAD_DELAY. Their documentation adds:
This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests.
Note that RANDOMIZE_DOWNLOAD_DELAY defaults to True.
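Putting the pieces together, the fix amounts to a few lines in settings.py. The 2-second delay is only a guess — tune it for the target site; raising DOWNLOAD_TIMEOUT and RETRY_TIMES can also help when going through slow free proxies like the ones in the question:

```python
# settings.py — example values only; tune the delay for the target site
DOWNLOAD_DELAY = 2               # wait ~2 s between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # actual wait: 0.5x to 1.5x DOWNLOAD_DELAY
DOWNLOAD_TIMEOUT = 180           # give slow proxies more time before TimeoutError
RETRY_TIMES = 5                  # retry failed pages a few more times
```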
Regarding python - Why does Python Scrapy show the "twisted.internet.error.TimeoutError" error, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10395161/