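As you know, I can use multiple threads to download files from the Internet faster. But if I send too many requests to the same website, I could get blacklisted.
So could you help me implement something like: "I have a list of URLs. I want you to download all of these files, but if 10 downloads are already running, wait for a free slot."
I would appreciate any help.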
比努阿
This is the code I'm using (it doesn't work).
class PDBDownloader(threading.Thread):

    prefix = 'http://www.rcsb.org/pdb/files/'

    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.pdbid = None
        self.urlstr = ''
        self.content = ''

    def run(self):
        while True:
            self.pdbid = self.queue.get()
            self.urlstr = self.prefix + pdbid + '.pdb'
            print 'downloading', pdbid
            self.download()

            filename = '%s.pdb' %(pdbid)
            f = open(filename, 'wt')
            f.write(self.content)
            f.close()

            self.queue.task_done()

    def download(self):
        try:
            f = urllib2.urlopen(self.urlstr)
        except urllib2.HTTPError, e:
            msg = 'HTTPError while downloading file %s at %s. '\
                  'Details: %s.' %(self.pdbid, self.urlstr, str(e))
            raise OstDownloadException, msg
        except urllib2.URLError, e:
            msg = 'URLError while downloading file %s at %s. '\
                  'RCSB server unavailable.' %(self.pdbid, self.urlstr)
            raise OstDownloadException, msg
        except Exception, e:
            raise OstDownloadException, str(e)
        else:
            self.content = f.read()

if __name__ == '__main__':

    pdblist = ['1BTA', '3EAM', '1EGJ', '2BV9', '2X6A']

    for i in xrange(len(pdblist)):
        pdb = PDBDownloader(queue)
        pdb.setDaemon(True)
        pdb.start()

    while pdblist:
        pdbid = pdblist.pop()
        queue.put(pdbid)

    queue.join()
Best answer
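Using threads does not "download files from the Internet faster". You have only one network card and one Internet connection, so that claim is not true.
The threads are only being used to wait, and you can't wait faster.
You can use a single thread and be just as fast, or even faster: just don't wait for the response to one file before starting the next. In other words, use asynchronous, non-blocking network programming.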
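To make the "don't wait for one response before starting the next" point concrete, here is a minimal Python 3 sketch that uses the standard library's asyncio rather than Twisted. It is not a real downloader: `fake_download` is a hypothetical stand-in that only sleeps, and the names `download_all` and `limit` are assumptions made for illustration. A semaphore caps how many downloads are in flight at once, and everything runs on a single thread.

```python
import asyncio
import time


async def fake_download(url, delay=0.1):
    """Hypothetical stand-in for a network request: it only waits."""
    await asyncio.sleep(delay)
    return "contents of %s" % url


async def download_all(urls, limit=10):
    """Run fake_download for every URL, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def bounded(url):
        nonlocal in_flight, peak
        async with semaphore:  # wait here until a slot is free
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_download(url)
            finally:
                in_flight -= 1

    # gather() returns results in the same order as the input URLs.
    results = await asyncio.gather(*(bounded(u) for u in urls))
    return results, peak


if __name__ == "__main__":
    urls = ["http://example.invalid/%d" % i for i in range(20)]
    start = time.monotonic()
    results, peak = asyncio.run(download_all(urls, limit=10))
    elapsed = time.monotonic() - start
    print("%d results, peak concurrency %d, %.2fs"
          % (len(results), peak, elapsed))
```

With 20 simulated downloads of 0.1 seconds each and a limit of 10, the waits overlap, so the run should take roughly two rounds of waiting instead of twenty, without any threads.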
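Here is a complete script that uses twisted.internet.task.coiterate to start several downloads at the same time, without using any kind of thread, while respecting the pool size (I use 2 simultaneous downloads for the demonstration, but you can change the size):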
from twisted.internet import defer, task, reactor
from twisted.web import client
from twisted.python import log

@defer.inlineCallbacks
def deferMap(job, dataSource, size=1):
    successes = []
    failures = []

    def _cbGather(result, dataUnit, succeeded):
        """This will be called when any download finishes"""
        if succeeded:
            # you could save the file to disk here
            successes.append((dataUnit, result))
        else:
            failures.append((dataUnit, result))

    @apply
    def work():
        for dataUnit in dataSource:
            d = job(dataUnit).addCallbacks(_cbGather, _cbGather,
                callbackArgs=(dataUnit, True), errbackArgs=(dataUnit, False))
            yield d

    yield defer.DeferredList([task.coiterate(work) for i in xrange(size)])
    defer.returnValue((successes, failures))

def printResults(result):
    successes, failures = result
    print "*** Got %d pages total:" % (len(successes),)
    for url, page in successes:
        print ' * %s -> %d bytes' % (url, len(page))
    if failures:
        print "*** %d pages failed download:" % (len(failures),)
        for url, failure in failures:
            print ' * %s -> %s' % (url, failure.getErrorMessage())

if __name__ == '__main__':
    import sys
    log.startLogging(sys.stdout)
    urls = ['http://twistedmatrix.com',
            'XXX',
            'http://debian.org',
            'http://python.org',
            'http://python.org/foo',
            'https://launchpad.net',
            'noway.com',
            'somedata',
           ]
    pool = deferMap(client.getPage, urls, size=2) # download 2 at once
    pool.addCallback(printResults)
    pool.addErrback(log.err).addCallback(lambda ign: reactor.stop())
    reactor.run()
Note that I deliberately included some bad URLs so that we can see some failures in the result:
...
2010-06-29 08:18:04-0300 [-] *** Got 4 pages total:
2010-06-29 08:18:04-0300 [-] * http://twistedmatrix.com -> 16992 bytes
2010-06-29 08:18:04-0300 [-] * http://python.org -> 17207 bytes
2010-06-29 08:18:04-0300 [-] * http://debian.org -> 13820 bytes
2010-06-29 08:18:04-0300 [-] * https://launchpad.net -> 18511 bytes
2010-06-29 08:18:04-0300 [-] *** 4 pages failed download:
2010-06-29 08:18:04-0300 [-] * XXX -> Connection was refused by other side: 111: Connection refused.
2010-06-29 08:18:04-0300 [-] * http://python.org/foo -> 404 Not Found
2010-06-29 08:18:04-0300 [-] * noway.com -> Connection was refused by other side: 111: Connection refused.
2010-06-29 08:18:04-0300 [-] * somedata -> Connection was refused by other side: 111: Connection refused.
...
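The pool size in the script above comes from handing the same generator to `size` consumers: each consumer takes the next job only after its previous one has finished, so at most `size` jobs are in progress at any time. The following is a rough Python 3 asyncio sketch of that idea only, not of how Twisted's `coiterate` is actually implemented. `pooled_map`, `worker`, and the sleeping `job` are hypothetical names, and successes and failures are collected into two lists much as `deferMap` does.

```python
import asyncio


async def pooled_map(job, data_source, size=1):
    """Apply `job` to every item, with at most `size` jobs running at once."""
    successes = []
    failures = []
    items = iter(data_source)  # one iterator shared by all workers

    async def worker():
        # Each worker pulls the next item only when its previous job is done.
        for item in items:
            try:
                result = await job(item)
            except Exception as exc:
                failures.append((item, exc))
            else:
                successes.append((item, result))

    await asyncio.gather(*(worker() for _ in range(size)))
    return successes, failures


async def job(url):
    """Hypothetical job: pretend to fetch, fail for strings without a scheme."""
    await asyncio.sleep(0.01)
    if "://" not in url:
        raise ValueError("not a URL: %r" % url)
    return len(url)


if __name__ == "__main__":
    urls = ["http://a.invalid", "XXX", "http://b.invalid", "somedata"]
    ok, bad = asyncio.run(pooled_map(job, urls, size=2))
    print("succeeded:", sorted(u for u, _ in ok))
    print("failed:", sorted(u for u, _ in bad))
```

Because every worker draws from the same iterator, changing `size` is the only knob needed to go from 2 simultaneous jobs to the 10 mentioned in the question.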
This question and answer about limiting multithreading in Python come from a similar question on Stack Overflow: https://stackoverflow.com/questions/3139513/