python - scrapy - calling process_exception on HTTP codes

Tags: python proxy scrapy

I want to change the proxy service based on the HTTP response code (e.g. 500 or 404); on those codes I want to trigger process_exception so it changes the proxy address. I created my own ProxyMiddleware, in which I set the proxy in process_request. When a proxy timeout occurs, process_exception is called by default. But how can I trigger it on a custom HTTP status?

From the Scrapy documentation:

Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception)

But I don't know how to implement this.
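For reference, a middleware can trigger this path itself: an exception raised in process_request is what makes Scrapy route the request through each middleware's process_exception. A minimal sketch of the quoted behavior (the middleware name and the banned flag are hypothetical, for illustration only):

from scrapy.exceptions import IgnoreRequest

class BanCheckMiddleware(object):
    # hypothetical middleware illustrating the quoted docs: raising
    # from process_request routes through process_exception
    def process_request(self, request, spider):
        if request.meta.get('banned'):  # hypothetical flag
            raise IgnoreRequest('proxy looks banned')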

EDIT My spider code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor


class Spider1(CrawlSpider):
    name = 'spider1'
    keyword = ''
    page = range(0, 40, 10)

    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['somedomain.com']
    start_urls = ['http://somedomain.com/search.html?query=football']
    # CrawlSpider uses parse() internally, so the rule callback
    # must have a different name
    rules = (Rule(LxmlLinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        item = {}  # populate the item from the response here
        return item

My settings.py:

DOWNLOADER_MIDDLEWARES = {
    't.useragentmiddleware.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    't.cookiesmiddleware.CookiesMiddleware': 700,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 720,
    't.proxymiddleware.ProxyMiddleware': 750,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 760,
}
REDIRECT_ENABLED = True

And proxymiddleware.py:

import json, os, random, socket
import t  # the spider package; exposes t.location
import scrapy.exceptions as exception

socket.setdefaulttimeout(5)

class ProxyMiddleware(object):

    proxy = ''
    proxyList = []
    # note: handle_httpstatus_list only has an effect as a spider
    # attribute; it does nothing on a downloader middleware
    handle_httpstatus_list = [302, 400]

    def __init__(self, settings):
        with open(t.location + '/data/proxy.json') as f:
            self.proxyList = json.load(f)['proxy']

    def process_request(self, request, spider):
        if 'proxy' in request.meta:
            return

        self.proxy = 'http://' + random.choice(self.proxyList)

        os.environ['http_proxy'] = self.proxy
        request.meta['proxy'] = self.proxy


    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_exception(self, request, exception, spider):
        proxy = request.meta['proxy']

        try:
            # strip the 'http://' prefix (7 characters, not 8)
            self.proxyList.remove(proxy[len('http://'):])
        except ValueError:
            pass
        prox = 'http://' + random.choice(self.proxyList)
        request.meta['proxy'] = prox
        os.environ['http_proxy'] = prox

    def process_response(self, request, response, spider):
        # raising NotConfigured here doesn't work: process_response
        # must return a Response or a Request object
        #raise exception.NotConfigured()
        return response
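The middleware above reads a proxy list from t.location + '/data/proxy.json'. A minimal example of the format it assumes (host:port strings under a 'proxy' key; the addresses are placeholders):

{
    "proxy": [
        "10.0.0.1:8080",
        "10.0.0.2:3128"
    ]
}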

Best Answer

Valid HTTP status codes are not "exceptions", so they are routed through process_response. Extract a method and call it from both process_exception and process_response:

CHANGE_PROXY_STATUS_LIST = [502, 404]

class ProxyMiddleware(object):
    def change_proxy(self, request):
        # Change proxy here
        # Then check number of retries on the request
        # and decide if you want to give it another chance.
        # If not - return None else
        return request

    def process_exception(self, request, exception, spider):
        return_request = self.change_proxy(request)
        if return_request:
            return return_request

    def process_response(self, request, response, spider):
        if response.status in CHANGE_PROXY_STATUS_LIST:
            return_request = self.change_proxy(request)
            if return_request:
                return return_request
        return response
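For completeness, here is one way the change_proxy placeholder could be filled in, tracking retries in request.meta. This is a sketch, not part of the original answer: the proxy_retry_times key, the MAX_PROXY_RETRIES limit, and the PROXY_LIST pool are all assumptions.

import random

PROXY_LIST = ['10.0.0.1:8080', '10.0.0.2:3128']  # hypothetical pool
MAX_PROXY_RETRIES = 3                            # assumed limit
CHANGE_PROXY_STATUS_LIST = [502, 404]

class ProxyMiddleware(object):
    def change_proxy(self, request):
        retries = request.meta.get('proxy_retry_times', 0)
        if retries >= MAX_PROXY_RETRIES:
            return None  # give up; let the response pass through
        # re-issue the request through a different proxy; dont_filter
        # bypasses the duplicates filter so the retry isn't dropped
        new_request = request.copy()
        new_request.meta['proxy'] = 'http://' + random.choice(PROXY_LIST)
        new_request.meta['proxy_retry_times'] = retries + 1
        new_request.dont_filter = True
        return new_request

    def process_exception(self, request, exception, spider):
        return self.change_proxy(request)

    def process_response(self, request, response, spider):
        if response.status in CHANGE_PROXY_STATUS_LIST:
            retry_request = self.change_proxy(request)
            if retry_request:
                return retry_request
        return response

The copy-and-reschedule pattern mirrors what Scrapy's built-in RetryMiddleware does with its retry_times meta key.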

Regarding python - scrapy - calling process_exception on HTTP codes, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29276260/
