python - Optimizing a Python script with multithreading

Tags: python multithreading web-crawler web-scraping python-multithreading


Hi all! I have written a small web crawler function, but I am new to multithreading and could not work out how to optimize it. My code is:

import hashlib

alreadySeenURLs = dict() #already seen URLs, keyed by SHA-1 fingerprint
candidates = set() #the set of URL candidates to crawl

def initializeCandidates(url):

    #gets page with urllib2
    page = getPage(url)

    #parses page with BeautifulSoup
    parsedPage = getParsedPage(page)

    #returns all links from the parsed page as a set
    initialURLsFromRoot = getLinksFromParsedPage(parsedPage)

    return initialURLsFromRoot 

def updateCandidates(oldCandidates, newCandidates):
    return oldCandidates.union(newCandidates)

candidates = initializeCandidates(rootURL)

#loop until no candidates are left; a for loop over the set would
#never visit URLs added by the reassignment at the bottom
while candidates:

    print len(candidates)

    #take one URL out of the candidate set
    url = candidates.pop()

    #fingerprint of URL
    fp = hashlib.sha1(url).hexdigest()

    #skip URLs that have already been crawled
    if fp in alreadySeenURLs:
        continue

    alreadySeenURLs[fp] = url

    #do some processing
    print url

    page = getPage(url)
    parsedPage = getParsedPage(page, fix=True)
    newCandidates = getLinksFromParsedPage(parsedPage)

    candidates = updateCandidates(candidates, newCandidates)

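As you can see, it takes only one URL from candidates at a time. I would like to make this script multithreaded, so that it can take at least N URLs from candidates at once and do the work on them. Can anyone guide me? Any links or suggestions?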

Best answer

You can start with these two links:

Basic reference for threading in Python: http://docs.python.org/library/threading.html
A tutorial that actually implements a multithreaded URL crawler in Python: http://www.ibm.com/developerworks/aix/library/au-threadingpython/
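
The worker-thread-plus-queue pattern that the threading documentation supports might look roughly like the sketch below. It is written for Python 3 (the question's code targets Python 2) and is not taken from either link. The names crawl, fetch_links, mark_seen, FAKE_SITE and fake_fetch_links are hypothetical; fetch_links stands in for the question's getPage, getParsedPage and getLinksFromParsedPage chain, and a small in-memory link graph replaces real HTTP requests so the sketch runs offline.

```python
import hashlib
import queue
import threading


def crawl(root_url, fetch_links, num_workers=4):
    """Crawl from root_url with num_workers threads.

    fetch_links(url) is a hypothetical callable that returns the set of
    links found on the page at url (download + parse).
    """
    seen = {}                     # fingerprint -> URL, shared by all threads
    seen_lock = threading.Lock()  # guards every access to `seen`
    todo = queue.Queue()          # thread-safe queue of URLs still to crawl

    def mark_seen(url):
        """Record url and return True if it was not seen before."""
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        with seen_lock:
            if fp in seen:
                return False
            seen[fp] = url
            return True

    def worker():
        while True:
            url = todo.get()
            if url is None:       # sentinel: this worker should stop
                todo.task_done()
                return
            try:
                for link in fetch_links(url):
                    if mark_seen(link):
                        todo.put(link)
            except Exception as exc:
                print("failed to crawl %s: %s" % (url, exc))
            finally:
                todo.task_done()

    mark_seen(root_url)
    todo.put(root_url)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    todo.join()                   # returns once every queued URL is processed
    for _ in threads:
        todo.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return set(seen.values())


# Assumption: a tiny fake site instead of real network access.
FAKE_SITE = {
    "http://example.com/": {"http://example.com/a", "http://example.com/b"},
    "http://example.com/a": {"http://example.com/b", "http://example.com/c"},
    "http://example.com/b": {"http://example.com/"},
    "http://example.com/c": set(),
}


def fake_fetch_links(url):
    return FAKE_SITE.get(url, set())


found = crawl("http://example.com/", fake_fetch_links, num_workers=3)
print(sorted(found))
```

A new link is put on the queue before the worker marks its current URL as done, so todo.join() cannot return while work is still pending. The lock matters because several workers may discover the same link at the same moment.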

Also, a ready-made crawler for Python already exists: http://scrapy.org/
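
If a full framework is more than the script needs, the "N URLs at a time" behaviour asked about in the question can probably also be approximated with the standard library's concurrent.futures module. The following is only a sketch for Python 3, not something the answer itself proposes. crawl_in_batches, fetch_links, DEMO_LINKS and demo_fetch_links are hypothetical names, and the link graph is fake so that nothing touches the network.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_in_batches(root_url, fetch_links, batch_size=8):
    """Crawl from root_url, fetching up to batch_size pages concurrently.

    fetch_links(url) is a hypothetical callable that returns the set of
    links found on the page at url.
    """
    seen = {root_url}      # only the main thread touches this, so no lock
    frontier = [root_url]  # URLs discovered but not yet fetched
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        while frontier:
            # Take up to batch_size URLs off the front of the frontier.
            batch, frontier = frontier[:batch_size], frontier[batch_size:]
            # pool.map runs fetch_links on the batch in worker threads and
            # yields the results in the same order as the input URLs.
            for links in pool.map(fetch_links, batch):
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return seen


# Assumption: a tiny fake link graph instead of real network access.
DEMO_LINKS = {
    "http://example.org/": {"http://example.org/x", "http://example.org/y"},
    "http://example.org/x": {"http://example.org/z"},
    "http://example.org/y": {"http://example.org/"},
    "http://example.org/z": set(),
}


def demo_fetch_links(url):
    return DEMO_LINKS.get(url, set())


pages = crawl_in_batches("http://example.org/", demo_fetch_links, batch_size=2)
print(sorted(pages))
```

Because the worker threads only download and parse, while the main thread alone updates seen and frontier, no locking is needed here. The trade-off is that each batch waits for its slowest page, and an exception raised inside fetch_links surfaces when pool.map's results are iterated, so a real crawler would want to catch errors inside fetch_links.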

Original question on Stack Overflow: https://stackoverflow.com/questions/10722455/
