python - "PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance."

Tags: python pandas dataframe concatenation

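Essentially, I am ranking each column of the df DataFrame and adding the result to the ranking DataFrame. Clearly I am not doing this efficiently, and I would appreciate it if someone could point me in the right direction.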

for x in range(1,num_sims+1):
    ranking[x] = df[x].rank(ascending=False, method='min')

The full warning message is:

PerformanceWarning: DataFrame is highly fragmented.  This is usually
the result of calling `frame.insert` many times, which has poor
performance.  Consider using pd.concat instead.  To get a
de-fragmented frame, use `newframe = frame.copy()`   ranking[x] =
df[x].rank(ascending=False, method='min')"

Best answer

An example that reproduces the warning:

import numpy as np
import pandas as pd

# Sample `df`
np.random.seed(5)
df = pd.DataFrame(np.random.randint(1, 100, (4, 5000)))
df.columns = df.columns + 1

num_sims = len(df.columns)  # Placeholder for `num_sims`
ranking = pd.DataFrame()  # Placeholder for `ranking`

for x in range(1, num_sims + 1):
    ranking[x] = df[x].rank(ascending=False, method='min')

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy() ranking[x] = df[x].rank(ascending=False, method='min')
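To confirm that it is the column-by-column assignment that triggers the warning, the loop can be run inside `warnings.catch_warnings`. This is only a minimal sketch: it assumes a smaller frame (300 columns instead of 5000) to keep it quick, the name `fragmented` is illustrative, and whether the warning appears depends on the installed pandas version.

```python
import warnings

import numpy as np
import pandas as pd

# Smaller stand-in for the sample `df` above (assumption: 300 columns).
np.random.seed(5)
df = pd.DataFrame(np.random.randint(1, 100, (4, 300)))
df.columns = df.columns + 1

ranking = pd.DataFrame()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, not just the first
    for x in df.columns:
        # Each assignment inserts one new column into `ranking`.
        ranking[x] = df[x].rank(ascending=False, method='min')

# Keep only the pandas performance warnings that were recorded.
fragmented = [
    w for w in caught
    if issubclass(w.category, pd.errors.PerformanceWarning)
]
print(len(fragmented), "PerformanceWarning(s) recorded")
```

The values in `ranking` are still correct. The warning only concerns how the frame was assembled.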


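The fix is to rank all the columns in one call with DataFrame.rank and attach them with a single concat, instead of inserting one column per iteration: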

ranking = pd.DataFrame()  # Placeholder for `ranking`
ranking = pd.concat(
    [ranking, df[range(1, num_sims + 1)].rank(ascending=False, method='min')],
    axis=1
)

Output, with no warning:

   1     2     3     4     5     6     ...  4995  4996  4997  4998  4999  5000
0   2.0   2.0   3.0   1.0   4.0   1.0  ...   3.0   3.0   3.0   4.0   4.0   1.0
1   1.0   1.0   4.0   2.0   3.0   2.0  ...   4.0   1.0   2.0   3.0   1.0   3.0
2   3.0   4.0   1.0   4.0   1.0   4.0  ...   1.0   4.0   1.0   1.0   3.0   4.0
3   4.0   3.0   2.0   3.0   2.0   3.0  ...   2.0   2.0   3.0   2.0   2.0   2.0

*Of course, if ranking is empty, we can simply create it directly from df:

ranking = df[range(1, num_sims + 1)].rank(ascending=False, method='min')
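If the per-column computation cannot be written as one call on the whole frame, the advice in the warning (join all columns at once with `pd.concat(axis=1)`) can still be followed by collecting the columns first and concatenating once. A minimal sketch, assuming a smaller version of the sample `df` (200 columns); the names `columns`, `ranking_concat` and `ranking_direct` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the sample `df` above (assumption: 200 columns).
np.random.seed(5)
df = pd.DataFrame(np.random.randint(1, 100, (4, 200)))
df.columns = df.columns + 1
num_sims = len(df.columns)

# Compute each column separately, but collect the results in a dict
# instead of inserting them into a DataFrame one at a time.
columns = {
    x: df[x].rank(ascending=False, method='min')
    for x in range(1, num_sims + 1)
}

# Join all columns in one step; the dict keys become the column labels.
ranking_concat = pd.concat(columns, axis=1)

# Compare with ranking the whole frame in a single call.
ranking_direct = df.rank(ascending=False, method='min')
print(ranking_concat.eq(ranking_direct).all(axis=None))
```

Because the result is assembled in a single concat, no column is inserted into an existing frame, which is the operation the warning complains about.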

A sanity check that both approaches produce the same result:

import numpy as np
import pandas as pd

np.random.seed(5)
df = pd.DataFrame(np.random.randint(1, 100, (4, 5000)))
df.columns = df.columns + 1
ranking = pd.DataFrame()
num_sims = len(df.columns)

for x in range(1, num_sims + 1):
    ranking[x] = df[x].rank(ascending=False, method='min')

print(ranking.eq(pd.concat(
    [pd.DataFrame(),
     df[range(1, num_sims + 1)].rank(ascending=False, method='min')],
    axis=1
)).all(axis=None))  # True

The original question, python - "PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance.", is on Stack Overflow: https://stackoverflow.com/questions/68886155/
