python - Calling urllib.urlopen().getcode() in a for loop is very slow

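I have a Python program that runs through every combination of a string (stored in the list comb) and checks whether the corresponding website exists. The program works, but it is very slow. After some experimenting, I believe the bottleneck is the getcode method, because everything except that line runs quickly. How can I make this program faster?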

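It uses less than 1% of the CPU and very little network bandwidth. I tried running 3 instances of the program at the same time, and each one ran just as fast as a single instance on its own. Is it possible to replicate this inside the program?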

for p in comb: 
    if urllib.urlopen(url + p).getcode()!=404:
        print "Successful: " + str(p)
        break
    else:
        print "Failure:" + str(p)

Best answer

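An alternative to multithreading is to use asynchronous requests. You can do this with grequests (a variant of the requests library), which is built on Gevent. Here is the code from the project's GitHub page itself: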

import grequests

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

rs = (grequests.get(u) for u in urls)

for i in grequests.imap(rs):
    print i, i.url

My run took 7 seconds.

<Response [200]> http://docs.python-tablib.org/en/latest/
<Response [200]> https://www.heroku.com/
<Response [200]> http://httpbin.org/
<Response [200]> http://docs.python-requests.org/en/latest/
<Response [200]> http://www.kennethreitz.org/
[Finished in 7.0s]

Here is my take on the multithreaded approach.

import requests as rq
import threading

urls = ["...={}".format(x) for x in range(100)]

def get_status(url):
    if rq.get(url, verify=False).status_code != 404:
        print "Successful: {}\n".format(url)
    else:
        print "Failed: {}".format(url)

for url in urls:
    # One thread per URL; the thread is named after the URL it checks.
    t = threading.Thread(target=get_status, name=url, args=(url,))
    t.start()

This was able to get the status of 100 websites in about 10 seconds.
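The threaded snippet above starts one thread per URL and checks every URL, whereas the loop in the question stops at the first page that is not a 404. As a rough sketch of combining the two, the following uses a bounded thread pool and returns on the first hit. It assumes Python 3 (the snippets on this page are Python 2) and only the standard library. The names `http_status` and `first_existing`, the pool size of 20, and the 10-second timeout are illustrative choices, not part of the original answer. Treating a failed connection as "not found" is also an assumption.

```python
# Python 3 sketch; names, pool size and timeout are illustrative assumptions.
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as err:
        # In Python 3, urlopen raises on 4xx/5xx; the status is on the exception.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar errors.
        return None


def first_existing(base_url, suffixes, fetch=http_status, max_workers=20):
    """Return the first suffix, in completion order, whose page is not a 404.

    Returns None if every request came back 404 or failed.
    """
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {}
    try:
        # Map each pending request back to the suffix it was built from.
        futures = {pool.submit(fetch, base_url + s): s for s in suffixes}
        for done in as_completed(futures):
            status = done.result()
            if status is not None and status != 404:
                return futures[done]
        return None
    finally:
        # Cancel requests that have not started; ones already running finish
        # in the background and their results are ignored.
        for future in futures:
            future.cancel()
        pool.shutdown(wait=False)
```

With the variables from the question, the call would look like `hit = first_existing(url, comb)`. Because requests finish in arbitrary order, the returned suffix is the first one to respond with a non-404 status, which is not necessarily the earliest match in `comb`. The `fetch` parameter exists only so the status check can be swapped out, for example with a stub when no network is available.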

This page is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/27043693/
