I wrote a Python script to open roughly 1k URLs and process them to get the desired results, but it seems to run slowly even after introducing multithreading, and after processing some of the URLs the process appears to hang, so I cannot tell whether it is still running or has stalled. How can I create multiple threads to process them faster? Any help would be appreciated. Thanks in advance. My script is below.
import threading
from multiprocessing.pool import ThreadPool
from selenium import webdriver
from selenium.webdriver.phantomjs.service import Service
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.remote.webdriver import WebDriver as RemoteWebDriver
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count
import csv
import time

def fetch_url(url):
    driver = webdriver.PhantomJS()
    driver.get(url)
    html = driver.page_source
    print(html)
    print("'%s' fetched in %ss" % (url[0], (time.time() - start)))

def thread_task(lock, data_set):
    lock.acquire()
    fetch_url(url)
    lock.release()

if __name__ == "__main__":
    start = time.time()
    data_set = []
    with open('file.csv', 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
        for row in spamreader:
            data_set.append(row)
    lock = threading.Lock()
    # data_set will contain a list of 1k urls
    for url in data_set:
        t1 = threading.Thread(target=thread_task, args=(lock, url,))
        # start thread
        t1.start()
        # wait until the thread finishes its job
        t1.join()
    print("Elapsed Time: %s" % (time.time() - start))
Best answer
You are defeating the multithreading twice: first by waiting for each thread to finish inside the for url in data_set: loop before starting the next one, and then by using a lock so that only one instance of the fetch_url function can run at a time. You have already imported ThreadPool, which is a reasonable tool for this job. Here is how to use it:
import time
import csv
from multiprocessing.pool import ThreadPool
from selenium import webdriver

def fetch_url(row):
    # each row from csv.reader is a list; the url is its first field
    url = row[0]
    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        html = driver.page_source
        print(html)
        print("'%s' fetched in %ss" % (url, (time.time() - start)))
    finally:
        # quit the browser; leaked PhantomJS processes are a common
        # cause of the script appearing to hang
        driver.quit()

if __name__ == "__main__":
    start = time.time()
    with open('file.csv', 'r') as csvfile:
        dataset = list(csv.reader(csvfile, delimiter=' ', quotechar='|'))
    # guess a thread pool size which is a tradeoff of number of cpu cores,
    # expected wait time for i/o and memory size.
    with ThreadPool(20) as pool:
        pool.map(fetch_url, dataset, chunksize=1)
    print("Elapsed Time: %s" % (time.time() - start))
Regarding "python - Multithreading in Python to open thousands of urls and process them faster", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49590561/