pandas - Why does Pandas create multiple threads when its internal operations are single-threaded?

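If I remember correctly, Pandas operations are single-threaded internally. However, I noticed today that running a simple program like the one below creates as many threads as there are CPU cores available on the system. Why does it create these extra threads?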

import threading
import pandas as pd

def use_some_cpu(row):
    print(f'thread id={threading.get_ident()}')
    x = 1.001
    for i in range(100000):
        x *= 1.001

df = pd.DataFrame(list(range(0, 10000)), columns=['foo'])
df.apply(use_some_cpu, axis=1)

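If you run this program, you will see that every printed thread id is the same, which means the actual processing is done by a single thread. However, htop shows that the program has created many threads (as many as there are cores in the system), while only one core is busy.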

The test was done on Ubuntu 18.04 with pandas 1.0.2 and Python 3.7.
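The htop observation can also be checked from inside the process. The following is a minimal sketch, assuming a Linux-style /proc filesystem (as on the Ubuntu system above); the helper names `count_native_threads` and `thread_report` are hypothetical and not part of pandas. It compares the number of OS-level threads with the number of threads the Python `threading` module knows about.

```python
import os
import threading


def count_native_threads():
    """Return the number of OS-level threads in this process, or None if unknown.

    Assumption: a Linux-style /proc filesystem. Other platforms get None.
    """
    task_dir = "/proc/self/task"
    if not os.path.isdir(task_dir):
        return None
    # Each entry under /proc/self/task is one kernel thread of this process.
    return len(os.listdir(task_dir))


def thread_report(action):
    """Run `action` and report native and Python-level thread counts around it."""
    before = count_native_threads()
    action()
    after = count_native_threads()
    return {
        "native_before": before,
        "native_after": after,
        # Only threads started through the `threading` module are counted here.
        "python_threads": threading.active_count(),
    }
```

In a fresh interpreter, `thread_report(lambda: __import__("pandas"))` would show whether the extra threads already exist right after the import, before any `apply` call. Whether the native count rises at all depends on how pandas and its dependencies were built, so treat the numbers as a diagnostic rather than an explanation.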

Best answer

I can't reproduce this with a modern Pandas:

In [2]: import threading
   ...: import pandas as pd
   ...: 
   ...: thread_ids = set()
   ...: 
   ...: def use_some_cpu(row):
   ...:     thread_ids.add(threading.get_ident())
   ...:     x = 1.001
   ...:     for i in range(100000):
   ...:         x *= 1.001
   ...: 
   ...: df = pd.DataFrame(list(range(0, 10000)), columns=['foo'])
   ...: df.apply(use_some_cpu, axis=1)
Out[2]: 
0       None
1       None
2       None
3       None
4       None
        ... 
9995    None
9996    None
9997    None
9998    None
9999    None
Length: 10000, dtype: object

In [3]: thread_ids
Out[3]: {140372742666048}

However, some pandas operations nowadays release the GIL or allow varying degrees of parallelism under the hood; see this GitHub comment.
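The thread-id set used above only records the Python threads that actually execute the callback. As a sanity check on that technique, here is a small sketch showing that the same check does report several ids when callbacks really run on different Python threads. The `ThreadPoolExecutor`, the barrier, and the name `collect_thread_ids` are illustrative assumptions, not something `df.apply` does.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

N_WORKERS = 4  # arbitrary value, for illustration only


def collect_thread_ids(n_workers=N_WORKERS):
    """Run one task per worker and return the set of thread ids that ran them."""
    thread_ids = set()
    lock = threading.Lock()
    # The barrier keeps every task alive until all tasks have started,
    # so each task has to be picked up by a different worker thread.
    barrier = threading.Barrier(n_workers)

    def task(_):
        with lock:
            thread_ids.add(threading.get_ident())
        barrier.wait(timeout=10)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() forces completion and re-raises any exception from a task.
        list(pool.map(task, range(n_workers)))
    return thread_ids
```

With this setup the returned set holds one id per worker, in contrast to the single id recorded for `df.apply` in the session above. A single id therefore says that the callback ran on one Python thread; it says nothing about native threads that never execute Python code.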

Source: a similar question on Stack Overflow, "pandas - Why does Pandas create multiple threads when its internal operations are single-threaded?": https://stackoverflow.com/questions/60843493/
