python - Pandas pd.concat() in separate threads shows no speedup

Tags: python pandas multithreading

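I am trying to use pandas in a multithreaded environment. I have several lists of pandas DataFrames that I need to concatenate (long lists: 5000 DataFrames, each of size 300x2500). Since I have several lists, I want to run the concat for each list in its own thread (or use a thread pool, to get at least some parallel processing).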

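For some reason, processing in my multithreaded setup takes just as long as single-threaded processing. I wonder whether I am doing something systematically wrong.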

Here is my code snippet; I use ThreadPoolExecutor for the parallelization:


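from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd


def func_merge(the_list, key):
    # Concatenate one list of frames and tag the result with its key.
    return (key, pd.concat(the_list))


def my_thread_starter():
    # df_1 ... df_5000 are placeholders for the existing 300x2500 frames.
    buffer = {
        'A': [df_1, ..., df_5000],
        'B': [df_a1, ..., df_a5000],
    }
    with ThreadPoolExecutor(max_workers=2) as executor:
        submitted = []

        for key, df_list in buffer.items():
            submitted.append(executor.submit(func_merge, df_list, key=key))

        for future in as_completed(submitted):
            out = future.result()
            # do something with the results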

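Is there a trick to using pandas' concat in separate threads? I would at least expect my CPU utilization to go up when running more threads, but it does not seem to have any effect. Consequently, there is no time benefit either.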

Does anyone know what the problem could be?

Best answer

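Because of the Global Interpreter Lock (GIL), I am not sure your code benefits from multithreading at all. Basically, ThreadPoolExecutor is useful when the workload is IO-bound rather than CPU-bound (for example, making many web API calls at the same time).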
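One way to check this on your own machine is to time the same concat work sequentially and through a thread pool. The following is only a minimal sketch: the helper names (`make_frames`, `concat_sequential`, `concat_threaded`, `timed`) are hypothetical, and the frames are small random stand-ins rather than the 5000 frames of size 300x2500 from the question. The timings will vary by machine and pandas version, so no particular outcome is claimed here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def make_frames(n_frames=50, rows=30, cols=25, seed=0):
    # Small random stand-ins for the real frames (sizes are an assumption).
    rng = np.random.default_rng(seed)
    return [pd.DataFrame(rng.random((rows, cols))) for _ in range(n_frames)]


def concat_sequential(buffer):
    # Concatenate each list one after another in the calling thread.
    return {key: pd.concat(frames) for key, frames in buffer.items()}


def concat_threaded(buffer, max_workers=2):
    # Submit one concat per list to a thread pool, then collect the results.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            key: executor.submit(pd.concat, frames)
            for key, frames in buffer.items()
        }
        return {key: future.result() for key, future in futures.items()}


def timed(fn, *args):
    # Return the function result together with the elapsed wall-clock time.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


buffer = {"A": make_frames(seed=1), "B": make_frames(seed=2)}
seq_result, seq_seconds = timed(concat_sequential, buffer)
thr_result, thr_seconds = timed(concat_threaded, buffer)
print(f"sequential: {seq_seconds:.4f}s, threaded: {thr_seconds:.4f}s")
```

If the two timings come out roughly equal with realistic frame sizes, that would be consistent with the concat work being serialized by the GIL.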

Things may have changed in Python 3.8, but I do not know how to interpret "tasks which release the GIL" in the documentation.

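ProcessPoolExecutor could help, but because it has to serialize the function's inputs and outputs, and the amount of data is huge, it will not be any faster.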
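To get a rough feel for that serialization cost without starting any processes, you can pickle a list of frames yourself, since pickling is how ProcessPoolExecutor passes arguments and results between processes. This is a sketch under assumptions: the frames are small random stand-ins, and a single pickle round trip only approximates one direction of the transfer (the merged result would have to be pickled back as well).

```python
import pickle
import time

import numpy as np
import pandas as pd

# Small random stand-ins for the real frames (sizes are an assumption).
rng = np.random.default_rng(0)
frames = [pd.DataFrame(rng.random((30, 25))) for _ in range(50)]

# Approximate the cost of shipping the input list to a worker process.
start = time.perf_counter()
payload = pickle.dumps(frames)
restored = pickle.loads(payload)
roundtrip_seconds = time.perf_counter() - start

# Time the actual work for comparison.
start = time.perf_counter()
merged = pd.concat(frames)
concat_seconds = time.perf_counter() - start

# Size of the underlying float64 data, without any pickle overhead.
raw_bytes = sum(df.to_numpy().nbytes for df in frames)
print(f"raw data: {raw_bytes} bytes, pickled: {len(payload)} bytes")
print(f"pickle round trip: {roundtrip_seconds:.4f}s, concat: {concat_seconds:.4f}s")
```

If the round trip takes about as long as the concat itself, moving the work into separate processes is unlikely to pay off for this workload.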

Regarding python - Pandas pd.concat() in separate threads shows no speedup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58961916/