python : efficiency concerns in parallel async calls to fetch data from web services

Tags: python multithreading asynchronous concurrent.futures

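I'm writing a Python script that fetches the list of hosts belonging to a given group_id, using a web service call. There may be around 10,000 hosts. Then, for each host, I fetch a value called a "property" from another web service.
So: group-id ----(ws1)-----10000 hosts --(ws2)----property for each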

I'm using concurrent.futures as shown in the code below, but it doesn't seem like a clean design and is unlikely to scale well.

import concurrent.futures


def call_ws_1(group_id):
    # fetch list of hosts for group_id
    return []  # placeholder


def call_ws_2(host):
    # find property for host
    return ""  # placeholder


def fetch_hosts(group_ids):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_grp_id = {executor.submit(call_ws_1, group_id): group_id for group_id in group_ids}
        for future in concurrent.futures.as_completed(future_to_grp_id):
            group_id = future_to_grp_id[future]
            try:
                hosts = future.result()  # this is a list
            except Exception as exp:
                pass  # logging etc.
            else:
                fetch_property(hosts)


def fetch_property(hosts):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_host = {executor.submit(call_ws_2, host): host for host in hosts}
        for future in concurrent.futures.as_completed(future_to_host):
            host = future_to_host[future]
            try:
                host_prop = future.result()  # a string
            except Exception as exp:
                pass  # logging etc.
            else:
                pass  # Save host and property to DB

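  1. Is there any advantage to using ProcessPoolExecutor?
  2. What about fetching all the hosts first (around 40,000) and then calling the web service to get the properties?
  3. Any other suggestions for improving this design?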

Best answer

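  1. The advantage of ProcessPoolExecutor is that it is unaffected by the GIL. With ThreadPoolExecutor, the GIL prevents more than one thread from running at a time, unless you're doing I/O. The good news is that both of your workers look like they will mostly be doing I/O, but any processing that happens in a thread before or after its web service call won't truly run concurrently, which will hurt your performance. ProcessPoolExecutor doesn't have this limitation, but it adds the overhead of sending the group_id and host data between processes. If you have tens of thousands of hosts, sending them between processes one at a time will have pretty significant overhead.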

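  2. I don't think this change alone will affect performance much, since in the end you're still sending each host, one at a time, to a thread for processing.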
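
For reference, the restructuring described in question 2 might look like the sketch below: gather the host lists for every group first, then push all hosts through a single pool instead of creating a new pool per group. The worker bodies are stand-ins (assumptions, not the real web service calls), and `fetch_all` is a hypothetical name. Each host is still handed to a thread one at a time, so this mainly tidies the structure rather than changing throughput.

```python
import concurrent.futures
from itertools import chain


def call_ws_1(group_id):
    # Stand-in for the first web service: pretend each group has 3 hosts.
    return ["{}-host{}".format(group_id, i) for i in range(3)]


def call_ws_2(host):
    # Stand-in for the second web service.
    return "{} property".format(host)


def fetch_all(group_ids, max_workers=5):
    """Fetch every host list first, then every property, using one pool."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Phase 1: one task per group; map() yields results in input order.
        host_lists = list(executor.map(call_ws_1, group_ids))
        hosts = list(chain.from_iterable(host_lists))
        # Phase 2: one task per host, reusing the same worker threads.
        properties = list(executor.map(call_ws_2, hosts))
    return dict(zip(hosts, properties))


if __name__ == "__main__":
    result = fetch_all(["a", "b"])
    print(len(result))        # 6 hosts in total
    print(result["a-host0"])  # a-host0 property
```

Note that `map()` re-raises the first worker exception when its results are consumed, so the per-task try/except from the original `as_completed` loop would have to be moved into the workers if individual failures should be logged and skipped.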

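As for number 3: if your worker threads really do almost nothing besides I/O, this approach may work fine. But with threads, any CPU-bound work done in the workers will kill your performance. I took your exact program layout and implemented your two workers like this: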

def call_ws_1(group_id):
    return list(range(20))

def call_ws_2(host):
    sum(range(33000000))  # CPU-bound
    #time.sleep(1)  # I/O-bound
    return "{} property".format(host)

And ran everything like this:

if __name__ == "__main__":
    start = time.time()
    fetch_hosts(['a', 'b', 'c', 'd', 'e'])
    end = time.time()
    print("Total time: {}".format(end-start))

With time.sleep, the output is:

Fetching hosts for d
Fetching hosts for a
Fetching hosts for c
Fetching hosts for b
Fetching hosts for e
Total time: 25.051292896270752

With the sum(range(33000000)) calculation, performance is much worse:

Fetching hosts for d
Fetching hosts for a
Fetching hosts for c
Fetching hosts for b
Fetching hosts for e
Total time: 75.81612730026245

Note that the calculation takes roughly one second on my laptop:

>>> timeit.timeit("sum(range(33000000))", number=1)
1.023313045501709
>>> timeit.timeit("sum(range(33000000))", number=1)
1.029937982559204

So each worker takes about a second. But because the threads are CPU-bound, and therefore affected by the GIL, the threaded version performs horribly.
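
To reproduce this kind of comparison on your own machine, a minimal timing harness along these lines could be used. The names `io_worker`, `cpu_worker`, and `timed_run` are made up for this sketch, and the sleep duration and loop size are arbitrary, so the absolute numbers will differ from the ones above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def io_worker(host):
    time.sleep(0.05)  # simulates waiting on a web service; the GIL is released
    return "{} property".format(host)


def cpu_worker(host):
    # Pure-Python computation: on a standard CPython build with the GIL,
    # only one thread can execute this at a time.
    sum(range(200000))
    return "{} property".format(host)


def timed_run(worker, hosts, max_workers=5):
    """Run worker over hosts in a thread pool; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(worker, hosts))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    hosts = list(range(20))
    for worker in (io_worker, cpu_worker):
        _, elapsed = timed_run(worker, hosts)
        print("{}: {:.2f}s".format(worker.__name__, elapsed))
```

Comparing each elapsed time against the single-threaded cost of the same worker shows how much the pool actually helps: the sleeping worker should scale with the number of threads, while the computing worker should not.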

Here is ProcessPoolExecutor with time.sleep:

Fetching hosts for a
Fetching hosts for b
Fetching hosts for c
Fetching hosts for d
Fetching hosts for e
Total time: 25.169482469558716

Now with sum(range(33000000)):

Fetching hosts for a
Fetching hosts for b
Fetching hosts for c
Fetching hosts for d
Fetching hosts for e
Total time: 43.54587936401367

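As you can see, while performance is still worse than with time.sleep (probably because the computation takes a little longer than one second, and the CPU-bound work has to compete with everything else running on the laptop), it still greatly outperforms the threaded version.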

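However, I suspect that as the number of hosts goes up, the cost of IPC will slow you down quite a bit. Here is how ThreadPoolExecutor does with 10,000 hosts, but with a worker that does nothing (it just returns):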

Fetching hosts for c
Fetching hosts for b
Fetching hosts for d
Fetching hosts for a
Fetching hosts for e
Total time: 9.535644769668579

Compare with ProcessPoolExecutor:

Fetching hosts for c
Fetching hosts for b
Fetching hosts for a
Fetching hosts for d
Fetching hosts for e
Total time: 36.59257411956787

So ProcessPoolExecutor is about 4x slower here, all because of the cost of IPC.

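So, what does all this mean? I think your best performance will probably come from using ProcessPoolExecutor while also batching the IPC, so that you send large chunks of hosts into a child process rather than one host at a time.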

Something like this (untested, but it gives you the idea):

import time
import itertools
import concurrent.futures
from concurrent.futures import ProcessPoolExecutor as Pool

def call_ws_1(group_id):
    return list(range(10000))

def call_ws_2(hosts):  # This worker now works on a list of hosts
    host_results = []
    for host in hosts:
        host_results.append(( host, "{} property".format(host)))  # returns a list of (host, property) tuples
    return host_results

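def chunk_list(l, n_chunks=16):
    # Break the list into roughly n_chunks slices. Plain slicing (instead of
    # zip_longest + filter(None, ...)) keeps falsy hosts such as 0 and still
    # works when the list has fewer than n_chunks items.
    chunksize = max(1, len(l) // n_chunks)
    for i in range(0, len(l), chunksize):
        yield l[i:i + chunksize]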

def fetch_property(hosts):
    with Pool(max_workers=4) as executor:
        futs = []
        for chunk in chunk_list(hosts):
            futs.append(executor.submit(call_ws_2, chunk))
        for future in concurrent.futures.as_completed(futs):
            try:
                results = future.result()
            except Exception as exp:
                print("Got %s" % exp)
            else:
                for result in results:
                    host, prop = result  # avoid shadowing the built-in property
                    # Save host and property to DB

def fetch_hosts(group_ids):
    with Pool(max_workers=4) as executor:
        future_to_grp_id = {executor.submit(call_ws_1, group_id): group_id for group_id in group_ids}
        for future in concurrent.futures.as_completed(future_to_grp_id):
            group_id = future_to_grp_id[future]
            try:
                hosts = future.result()#this is a list
            except Exception as exp:
                print("Got %s" % exp)
            else:
                print("Fetching hosts for {}".format(group_id))
                fetch_property(hosts)

if __name__ == "__main__":
    start = time.time()
    fetch_hosts(['a', 'b', 'c', 'd', 'e'])
    end = time.time()
    print("Total time: {}".format(end-start))
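
As an alternative to chunking by hand, `Executor.map` accepts a `chunksize` argument (Python 3.5+) that, for `ProcessPoolExecutor`, submits items to the worker processes in batches. A rough sketch follows; `fetch_properties` and its `executor_cls` parameter are hypothetical names introduced here, and `call_ws_2` is again a stand-in for the real web service call.

```python
# ThreadPoolExecutor is imported only so it can be passed as executor_cls.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def call_ws_2(host):
    # Stand-in for the real per-host web service call.
    return host, "{} property".format(host)


def fetch_properties(hosts, executor_cls=ProcessPoolExecutor,
                     max_workers=4, chunksize=625):
    """Return (host, property) pairs in the same order as hosts.

    With ProcessPoolExecutor, chunksize sets how many hosts are bundled
    into each task sent to a worker process; with ThreadPoolExecutor the
    argument is accepted but has no effect.
    """
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(call_ws_2, hosts, chunksize=chunksize))
```

With a process pool, call it from under `if __name__ == "__main__":` as in the script above, for example `fetch_properties(list(range(10000)))`. Unlike the `as_completed` loop, `map()` re-raises the first worker exception when results are consumed, so per-chunk error handling would need to live inside the worker.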

Regarding "python : efficiency concerns in parallel async calls to fetch data from web services", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24921758/
