python - Multithreading in Python to open thousands of URLs and process them faster

Tags: python multithreading

I wrote a Python script to open about 1,000 URLs and process them to get the desired result, but even after introducing multithreading it runs slowly, and after processing some of the URLs the process seems to hang - I can't tell whether it is still running or has stopped. How can I create multiple threads so the URLs are processed faster? Any help would be appreciated. Thanks in advance. My script is below.

import threading
from multiprocessing.pool import ThreadPool
from selenium import webdriver
from selenium.webdriver.phantomjs.service import Service
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.remote.webdriver import WebDriver as RemoteWebDriver
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count
import csv

def fetch_url(url):
    driver = webdriver.PhantomJS()
    driver.get(url)
    html = driver.page_source
    print(html)
    print("'%s\' fetched in %ss" % (url[0], (time.time() - start)))

def thread_task(lock,data_set):
    lock.acquire()
    fetch_url(url)
    lock.release()

if __name__ == "__main__":
    data_set = []
    with open('file.csv', 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
        for row in spamreader:
            data_set.append(row)

    lock = threading.Lock()
    # data set will contain a list of 1k urls
    for url in data_set:
        t1 = threading.Thread(target=thread_task, args=(lock,url,))
        # start threads
        t1.start()

        # wait until threads finish their job
        t1.join()

    print("Elapsed Time: %s" % (time.time() - start))

Best Answer

You are defeating the multithreading twice over: first by waiting, inside the for url in data_set: loop, for each thread to finish before starting the next one, and then by using a lock that lets only one instance of the fetch_url function run at a time. You have already imported ThreadPool, which is a reasonable tool for this job. Here is how to use it:
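To make the first problem concrete, here is a minimal, hypothetical sketch (a time.sleep worker stands in for a real page fetch - no browser involved) contrasting join() inside the loop with starting all threads before joining any of them:

```python
import threading
import time

def work(_):
    time.sleep(0.05)  # stand-in for a slow page fetch

def run_serialized(n):
    # join() right after start() waits for each thread before the next
    # one even starts -- identical in speed to a plain loop.
    t0 = time.time()
    for i in range(n):
        t = threading.Thread(target=work, args=(i,))
        t.start()
        t.join()  # blocks here; no overlap between threads
    return time.time() - t0

def run_concurrent(n):
    # Start everything first, then join everything: threads overlap.
    t0 = time.time()
    threads = [threading.Thread(target=work, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - t0
```

The first version takes roughly n times the sleep duration, the second roughly one sleep duration. The pool-based script below gets the second behaviour without managing the threads by hand.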

import csv
import time
from multiprocessing.pool import ThreadPool

from selenium import webdriver

def fetch_url(row):
    url = row[0]  # each csv row is a list; the url is its first field
    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        html = driver.page_source
        print(html)
        print("'%s' fetched in %ss" % (url, time.time() - start))
    finally:
        driver.quit()  # always release the browser process

if __name__ == "__main__":
    start = time.time()
    with open('file.csv', 'r') as csvfile:
        dataset = list(csv.reader(csvfile, delimiter=' ', quotechar='|'))

    # Guess a thread pool size; it is a tradeoff between the number of
    # cpu cores, the expected i/o wait time and memory size.
    with ThreadPool(20) as pool:
        pool.map(fetch_url, dataset, chunksize=1)

    print("Elapsed Time: %s" % (time.time() - start))
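One remaining cost in the script above is that it launches a fresh PhantomJS process for every URL. A sketch of reusing one driver per worker thread via threading.local follows; the factory parameter is an assumption introduced here so the pattern can be shown without a real browser - with Selenium you would pass webdriver.PhantomJS (or, since PhantomJS is deprecated, a headless Chrome/Firefox factory):

```python
import threading
from multiprocessing.pool import ThreadPool

# Each worker thread gets its own slot, so threads never share a driver.
_tls = threading.local()

def get_driver(factory):
    # Create the browser lazily, once per thread, then cache it.
    if not hasattr(_tls, "driver"):
        _tls.driver = factory()
    return _tls.driver

def fetch_with_shared_driver(url, factory):
    # Reuses this thread's cached driver instead of starting a new one.
    driver = get_driver(factory)
    driver.get(url)
    return driver.page_source
```

With a pool of 20 workers, at most 20 browser processes are ever created no matter how many URLs are fetched; remember to quit each thread's driver once the pool shuts down.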

Regarding python - Multithreading in Python to open thousands of URLs and process them faster, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49590561/
