python - Multithreading in Python to open thousands of URLs and process them faster

Tags: python multithreading

I wrote a Python script to open about 1,000 URLs and process them to get the desired result, but even after introducing multithreading it runs slowly, and after processing some of the URLs the process seems to hang - I can't tell whether it is still running or has stopped. How can I create multiple threads so the URLs are processed faster? Any help would be appreciated. Thanks in advance. My script is below.

import threading
from multiprocessing.pool import ThreadPool
from selenium import webdriver
from selenium.webdriver.phantomjs.service import Service
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.remote.webdriver import WebDriver as RemoteWebDriver
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count
import csv

def fetch_url(url):
    driver = webdriver.PhantomJS()
    driver.get(url)
    html = driver.page_source
    print(html)
    print("'%s\' fetched in %ss" % (url[0], (time.time() - start)))

def thread_task(lock,data_set):
    lock.acquire()
    fetch_url(url)
    lock.release()

if __name__ == "__main__":
    data_set = []
    with open('file.csv', 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
        for row in spamreader:
            data_set.append(row)

    lock = threading.Lock()
    # data set will contain a list of 1k urls
    for url in data_set:
        t1 = threading.Thread(target=thread_task, args=(lock,url,))
        # start threads
        t1.start()

        # wait until threads finish their job
        t1.join()

    print("Elapsed Time: %s" % (time.time() - start))

Best Answer

You are defeating the multithreading twice over: first by waiting, inside the for url in data_set: loop, for each thread to finish before starting the next one, and then by using a lock that lets only one instance of the fetch_url function run at a time. You have already imported ThreadPool, which is a reasonable tool for this job. Here is how to use it:
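To make the first problem concrete, here is a minimal, hypothetical sketch (a time.sleep worker stands in for a real page fetch - no browser involved) contrasting join() inside the loop with starting all threads before joining any of them:

```python
import threading
import time

def work(_):
    time.sleep(0.05)  # stand-in for a slow page fetch

def run_serialized(n):
    # join() right after start() waits for each thread before the next
    # one even starts -- identical in speed to a plain loop.
    t0 = time.time()
    for i in range(n):
        t = threading.Thread(target=work, args=(i,))
        t.start()
        t.join()  # blocks here; no overlap between threads
    return time.time() - t0

def run_concurrent(n):
    # Start everything first, then join everything: threads overlap.
    t0 = time.time()
    threads = [threading.Thread(target=work, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - t0
```

The first version takes roughly n times the sleep duration, the second roughly one sleep duration. The pool-based script below gets the second behaviour without managing the threads by hand.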

import csv
import time
from multiprocessing.pool import ThreadPool

from selenium import webdriver

def fetch_url(row):
    url = row[0]  # each csv row is a list; the url is its first field
    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        html = driver.page_source
        print(html)
        print("'%s' fetched in %ss" % (url, time.time() - start))
    finally:
        driver.quit()  # always release the browser process

if __name__ == "__main__":
    start = time.time()
    with open('file.csv', 'r') as csvfile:
        dataset = list(csv.reader(csvfile, delimiter=' ', quotechar='|'))

    # Guess a thread pool size; it is a tradeoff between the number of
    # cpu cores, the expected i/o wait time and memory size.
    with ThreadPool(20) as pool:
        pool.map(fetch_url, dataset, chunksize=1)

    print("Elapsed Time: %s" % (time.time() - start))
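One remaining cost in the script above is that it launches a fresh PhantomJS process for every URL. A sketch of reusing one driver per worker thread via threading.local follows; the factory parameter is an assumption introduced here so the pattern can be shown without a real browser - with Selenium you would pass webdriver.PhantomJS (or, since PhantomJS is deprecated, a headless Chrome/Firefox factory):

```python
import threading
from multiprocessing.pool import ThreadPool

# Each worker thread gets its own slot, so threads never share a driver.
_tls = threading.local()

def get_driver(factory):
    # Create the browser lazily, once per thread, then cache it.
    if not hasattr(_tls, "driver"):
        _tls.driver = factory()
    return _tls.driver

def fetch_with_shared_driver(url, factory):
    # Reuses this thread's cached driver instead of starting a new one.
    driver = get_driver(factory)
    driver.get(url)
    return driver.page_source
```

With a pool of 20 workers, at most 20 browser processes are ever created no matter how many URLs are fetched; remember to quit each thread's driver once the pool shuts down.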

Regarding python - Multithreading in Python to open thousands of URLs and process them faster, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49590561/
