For the past two days I have been trying to build a crawler with multithreading, and I still can't get it to work. At first I tried the usual approach with the threading module, but it was no faster than a single thread. Later I read that requests is blocking and that the multithreaded approach wasn't really doing anything, so I did some research and found out about grequests and gevent. Now I'm running tests with gevent and it is still no faster than a single thread. Is my code wrong?
Here is the relevant part of my class:
```python
import gevent.monkey
from gevent.pool import Pool
import requests
gevent.monkey.patch_all()

import logging


class Test:
    def __init__(self):
        self.session = requests.Session()
        self.headers = {'User-Agent': 'test-crawler'}
        self.logger = logging.getLogger(__name__)
        self.pool = Pool(20)
        self.urls = [...urls...]

    def fetch(self, url):
        try:
            response = self.session.get(url, headers=self.headers)
        except requests.RequestException:
            self.logger.error('Problem: %s', url, exc_info=True)
            return
        self.doSomething(response)

    def run_async(self):  # renamed from 'async', a reserved word in Python 3.7+
        for url in self.urls:
            self.pool.spawn(self.fetch, url)
        self.pool.join()


test = Test()
test.run_async()
```
Best Answer
Install the grequests module, which works together with gevent (requests is not designed for async):

    pip install grequests

Then change the code to something like this:
```python
import grequests


class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com',
            'http://www.yahoo.com',
            'http://www.stackoverflow.com/',
            'http://www.reddit.com/'
        ]

    def exception(self, request, exception):
        print("Problem: {}: {}".format(request.url, exception))

    def run_async(self):  # renamed from 'async', a reserved word in Python 3.7+
        results = grequests.map(
            (grequests.get(u) for u in self.urls),
            exception_handler=self.exception,
            size=5)
        print(results)


test = Test()
test.run_async()
```
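For completeness on the original gevent attempt: a `Pool`-based fan-out like yours does overlap the waits once monkey-patching is in effect, and gevent's documentation recommends calling `patch_all()` as early as possible, before modules that use sockets are imported. A minimal sketch of the pattern (assuming gevent is installed; the URLs are placeholders and `time.sleep` stands in for a blocking `session.get()` call, since `patch_all()` also makes `sleep` cooperative):

```python
import gevent.monkey
gevent.monkey.patch_all()  # patch before importing anything that does IO

import time
from gevent.pool import Pool

urls = ['task-%d' % i for i in range(5)]  # placeholder "URLs"

def fetch(url):
    time.sleep(0.2)  # stand-in for a blocking HTTP request
    return url

pool = Pool(5)
start = time.time()
results = list(pool.imap_unordered(fetch, urls))
elapsed = time.time() - start
print(sorted(results), round(elapsed, 2))
```

With a pool size of 5, the five 0.2 s waits run concurrently, so the whole batch finishes in roughly 0.2 s instead of 1 s.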
This approach is officially recommended by the requests project:
> **Blocking Or Non-Blocking?**
>
> With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The `Response.content` property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.
>
> If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are `grequests` and `requests-futures`.
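The `requests-futures` option mentioned in that quote is a thin wrapper over the standard library's `concurrent.futures` thread pool, so the same fan-out pattern can be sketched with no extra dependencies. The URLs and the body of `fetch` below are placeholders (`time.sleep` stands in for a blocking `requests.get` call):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ['http://example.invalid/%d' % i for i in range(5)]  # placeholders

def fetch(url):
    time.sleep(0.2)  # stand-in for requests.get(url)
    return url, 200  # pretend (url, status_code) pair

start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(fetch, u) for u in urls]
    results = [f.result() for f in as_completed(futures)]
elapsed = time.time() - start
print(len(results), round(elapsed, 2))
```

Because blocking IO releases the GIL while a thread waits, the five waits overlap and the batch takes roughly 0.2 s rather than 1 s.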
This method gives a significant performance improvement for 10 URLs: 0.877s versus 3.852s with your original approach.
Related Stack Overflow question about Python requests with multithreading: https://stackoverflow.com/questions/38280094/