python - Proxy pool system for Scrapy that temporarily stops using slow/timed-out proxies

Tags: python proxy scrapy

I have been searching around trying to find a decent proxy pooling system for Scrapy, but I cannot find anything that does what I need/want.

I am looking for a solution that does the following:

Rotating proxies

  • I want it to switch between proxies at random, but never pick the same proxy twice in a row. (Scrapoxy has this)

Spoofing known browsers

  • Spoof Chrome, Firefox, Internet Explorer, Edge, Safari, etc. (Scrapoxy has this)

Blacklisting slow proxies

  • If a proxy times out or is slow, it should be blacklisted according to a set of rules... (Scrapoxy only blacklists based on the number of instances/restarts)

  • If a proxy is slow (takes over x time), it should be marked as Slow, a timestamp should be taken, and a counter should be increased.

  • If a proxy times out, it should be marked as Failed, a timestamp should be taken, and a counter should be increased.
  • If a proxy has no slows for 15 minutes after its last slow, the counter and timestamp should be reset and the proxy returns to a fresh state.
  • If a proxy has no fails for 30 minutes after its last fail, the counter and timestamp should be reset and the proxy returns to a fresh state.
  • If a proxy is slow 5 times in 1 hour, it should be removed from the pool for 1 hour.
  • If a proxy times out 5 times in 1 hour, it should be blacklisted for 1 hour.
  • If a proxy gets blocked twice in 3 hours, it should be blacklisted for 12 hours and marked as bad.
  • If a proxy gets marked as bad twice in 48 hours, it should notify me (email, Pushbullet... anything).

Does anyone know of any solution like this (the main feature being blacklisting slow/timed-out proxies)?
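The slow/fail bookkeeping in the rules above boils down to a sliding-window counter per proxy. A minimal sketch of just the "slow" rule (the `ProxyState` name is mine; the thresholds are taken from the rules: 5 slows in 1 hour removes the proxy for 1 hour, and 15 minutes without a slow resets it):

```python
import time

class ProxyState:
    """Hypothetical per-proxy tracker for the 'slow' rule."""

    def __init__(self):
        self.slow_times = []      # timestamps of recent slow responses
        self.removed_until = 0.0  # epoch time until which the proxy is out

    def record_slow(self, now=None):
        now = now if now is not None else time.time()
        # no slows for 15 minutes since the last one -> back to a fresh state
        if self.slow_times and now - self.slow_times[-1] > 15 * 60:
            self.slow_times = []
        self.slow_times.append(now)
        # only slows within the last hour count toward removal
        self.slow_times = [t for t in self.slow_times if now - t <= 3600]
        if len(self.slow_times) >= 5:
            self.removed_until = now + 3600  # out of the pool for 1 hour
            self.slow_times = []

    def usable(self, now=None):
        now = now if now is not None else time.time()
        return now >= self.removed_until
```

The timeout/fail rule would be the same shape with a 30-minute reset window, and the "bad" rule one more layer on top counting blacklist events.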

Best Answer

Since your rotation rules are very specific, you can write your own code. See the code below, which implements some of the rules (you will have to implement the rest):

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import time
from random import shuffle

import pexpect

# this function tests a single proxy by attempting a telnet connection
def test_proxy(ip,port,max_timeout=1):
    child = pexpect.spawn("telnet " + ip + " " +str(port))
    time_send_request=time.time()
    try:
        i=child.expect(["Connected to","Connection refused"], timeout=max_timeout) #max timeout in seconds
    except pexpect.TIMEOUT:
        i=-1
    if i==0:
        time_request_ok=time.time()
        return {"status":True,"time_to_answer":time_request_ok-time_send_request}
    else:
        return {"status":False,"time_to_answer":max_timeout}


# this function tests every current proxy, updates its status and applies your custom rules
def update_proxy_list_status(proxy_list):
    for i in range(0,len(proxy_list)):
        print ("testing proxy "+str(i)+" "+proxy_list[i]["ip"]+":"+str(proxy_list[i]["port"]))
        proxy_status = test_proxy(proxy_list[i]["ip"],proxy_list[i]["port"])
        proxy_list[i]["status_ok"]= proxy_status["status"]


        print(proxy_status)

        #here it is time to treat your own rule to update respective proxy dict

        #~ If a proxy is slow (takes over x time) it should be marked as Slow and a timestamp should be taken and a counter should be increased.
        #~ If a proxy timeout's it should be marked as Fail and a timestamp should be taken and a counter should be increased.
        #~ If a proxy has no slows for 15 minutes after receiving its last slow then the counter & timestamp should be zeroed and the proxy gets returns back to a fresh state.
        #~ If a proxy has no fails for 30 minutes after receiving its last fail then the counter & timestamp should be zeroed and the proxy gets returns back to a fresh state.
        #~ If a proxy is slow 5 times in 1 hour then it should be removed from the pool for 1 hour.
        #~ If a proxy timeout's 5 times in 1 hour then it should be blacklisted for 1 hour
        #~ If a proxy get's blocked twice in 3 hours it should be blacklisted for 12 hours and marked as bad
        #~ If a proxy gets marked as bad twice in 48 hours then it should notify me (email, push bullet... anything)        

        if proxy_status["status"]==True:
            #modify proxy dict with your own rules (adding timestamp, last check time, last down, last up etc...)
            #...
            pass
        else:
            #modify proxy dict with your own rules (adding timestamp, last check time, last down, last up etc...)
            #...
            pass        

    return proxy_list


# this function selects a good proxy and does the job
def main():

    #first populate a proxy list | I got these example proxies from http://free-proxy.cz/en/
    proxy_list=[
        {"ip":"167.99.2.12","port":8080}, #bad proxy
        {"ip":"167.99.2.17","port":8080},
        {"ip":"66.70.160.171","port":1080},
        {"ip":"192.99.220.151","port":8080},
        {"ip":"142.44.137.222","port":80}
        # [...]
    ]



    #this variable is used to keep track of the last used proxy (to avoid using the same one two consecutive times)
    previous_proxy_ip=""

    the_job=True
    while the_job:

        #here we update each proxy status
        proxy_list = update_proxy_list_status(proxy_list)

        #keep only the proxies considered ok
        good_proxy_list = [d for d in proxy_list if d['status_ok']==True]

        #here you can shuffle the list
        shuffle(good_proxy_list)

        #select a proxy (not the same as the previous one)
        current_proxy={}
        for i in range(0,len(good_proxy_list)):
            if good_proxy_list[i]["ip"]!=previous_proxy_ip:
                previous_proxy_ip=good_proxy_list[i]["ip"]
                current_proxy=good_proxy_list[i]
                break

        #use this selected proxy to do the job
        print ("the current proxy is: "+str(current_proxy))

        #UPDATE SCRAPY PROXY

        #DO THE SCRAPY JOB
        print("DO MY SCRAPY JOB with the current proxy settings")

        #wait some seconds
        time.sleep(5)

if __name__ == "__main__":
    main()
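To actually route Scrapy requests through the selected proxy (the `#UPDATE SCRAPY PROXY` placeholder above), the usual approach is to set `request.meta['proxy']` from a downloader middleware, which Scrapy's built-in `HttpProxyMiddleware` honours. A minimal sketch, reusing the selection loop above (the class name and `get_proxy` helper are mine, not part of any library):

```python
import random

class RotatingProxyMiddleware:
    """Downloader-middleware-style sketch: pick a good proxy per request,
    never the same one twice in a row (mirrors the selection loop above)."""

    def __init__(self, proxy_list):
        self.proxy_list = proxy_list  # list of {"ip": ..., "port": ...} dicts
        self.previous_ip = ""

    def get_proxy(self):
        good = [p for p in self.proxy_list if p.get("status_ok", True)]
        random.shuffle(good)
        for p in good:
            if p["ip"] != self.previous_ip:
                self.previous_ip = p["ip"]
                return p
        return good[0] if good else None  # fall back if only one proxy is up

    def process_request(self, request, spider):
        proxy = self.get_proxy()
        if proxy is not None:
            request.meta["proxy"] = "http://%s:%s" % (proxy["ip"], proxy["port"])
```

You would register such a class in `DOWNLOADER_MIDDLEWARES` in your Scrapy settings; the status-update loop could then run in a background thread or a periodic signal handler instead of blocking the crawl.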

Regarding "python - Proxy pool system for Scrapy that temporarily stops using slow/timed-out proxies", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48910982/
