For the past two days I have been trying to build a crawler with multithreading, and I still can't get it to work. At first I tried the usual approach with the threading module, but it was no faster than a single thread. Later I read that requests is blocking and that the multithreaded approach wasn't really doing anything, so I did some research and found out about grequests and gevent. Now I'm running tests with gevent and it is still no faster than a single thread. Is my code wrong?
Here is the relevant part of my class:
```python
import gevent.monkey
from gevent.pool import Pool
import requests
gevent.monkey.patch_all()

import logging


class Test:
    def __init__(self):
        self.session = requests.Session()
        self.headers = {'User-Agent': 'test-crawler'}
        self.logger = logging.getLogger(__name__)
        self.pool = Pool(20)
        self.urls = [...urls...]

    def fetch(self, url):
        try:
            response = self.session.get(url, headers=self.headers)
        except requests.RequestException:
            self.logger.error('Problem: %s', url, exc_info=True)
            return
        self.doSomething(response)

    def run_async(self):  # renamed from 'async', a reserved word in Python 3.7+
        for url in self.urls:
            self.pool.spawn(self.fetch, url)
        self.pool.join()


test = Test()
test.run_async()
```
Best Answer
Install the grequests module, which works together with gevent (requests is not designed for async):

    pip install grequests

Then change the code to something like this:
```python
import grequests


class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com',
            'http://www.yahoo.com',
            'http://www.stackoverflow.com/',
            'http://www.reddit.com/'
        ]

    def exception(self, request, exception):
        print("Problem: {}: {}".format(request.url, exception))

    def run_async(self):  # renamed from 'async', a reserved word in Python 3.7+
        results = grequests.map(
            (grequests.get(u) for u in self.urls),
            exception_handler=self.exception,
            size=5)
        print(results)


test = Test()
test.run_async()
```
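For completeness on the original gevent attempt: a `Pool`-based fan-out like yours does overlap the waits once monkey-patching is in effect, and gevent's documentation recommends calling `patch_all()` as early as possible, before modules that use sockets are imported. A minimal sketch of the pattern (assuming gevent is installed; the URLs are placeholders and `time.sleep` stands in for a blocking `session.get()` call, since `patch_all()` also makes `sleep` cooperative):

```python
import gevent.monkey
gevent.monkey.patch_all()  # patch before importing anything that does IO

import time
from gevent.pool import Pool

urls = ['task-%d' % i for i in range(5)]  # placeholder "URLs"

def fetch(url):
    time.sleep(0.2)  # stand-in for a blocking HTTP request
    return url

pool = Pool(5)
start = time.time()
results = list(pool.imap_unordered(fetch, urls))
elapsed = time.time() - start
print(sorted(results), round(elapsed, 2))
```

With a pool size of 5, the five 0.2 s waits run concurrently, so the whole batch finishes in roughly 0.2 s instead of 1 s.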
This approach is officially recommended by the requests project:
> **Blocking Or Non-Blocking?**
>
> With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The `Response.content` property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.
>
> If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are `grequests` and `requests-futures`.
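The `requests-futures` option mentioned in that quote is a thin wrapper over the standard library's `concurrent.futures` thread pool, so the same fan-out pattern can be sketched with no extra dependencies. The URLs and the body of `fetch` below are placeholders (`time.sleep` stands in for a blocking `requests.get` call):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ['http://example.invalid/%d' % i for i in range(5)]  # placeholders

def fetch(url):
    time.sleep(0.2)  # stand-in for requests.get(url)
    return url, 200  # pretend (url, status_code) pair

start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(fetch, u) for u in urls]
    results = [f.result() for f in as_completed(futures)]
elapsed = time.time() - start
print(len(results), round(elapsed, 2))
```

Because blocking IO releases the GIL while a thread waits, the five waits overlap and the batch takes roughly 0.2 s rather than 1 s.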
This method gives a significant performance improvement for 10 URLs: 0.877s versus 3.852s with your original approach.
Related Stack Overflow question about Python requests with multithreading: https://stackoverflow.com/questions/38280094/