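Essentially, I rank each column of the df DataFrame and add the result to the ranking
DataFrame. This is clearly not efficient, and I would appreciate a pointer in the right direction.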
for x in range(1, num_sims + 1):
    ranking[x] = df[x].rank(ascending=False, method='min')
The full warning message is:
PerformanceWarning: DataFrame is highly fragmented. This is usually
the result of calling `frame.insert` many times, which has poor
performance. Consider using pd.concat instead. To get a
de-fragmented frame, use `newframe = frame.copy()`
  ranking[x] = df[x].rank(ascending=False, method='min')
Best answer
An example that reproduces the warning:
import numpy as np
import pandas as pd
# Sample `df`
np.random.seed(5)
df = pd.DataFrame(np.random.randint(1, 100, (4, 5000)))
df.columns = df.columns + 1
num_sims = len(df.columns) # Placeholder for `num_sims`
ranking = pd.DataFrame() # Placeholder for `ranking`
for x in range(1, num_sims + 1):
    ranking[x] = df[x].rank(ascending=False, method='min')
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  ranking[x] = df[x].rank(ascending=False, method='min')
The fix is to call DataFrame.rank on all the columns at once and combine the result with concat instead:
ranking = pd.DataFrame() # Placeholder for `ranking`
ranking = pd.concat(
    [ranking, df[range(1, num_sims + 1)].rank(ascending=False, method='min')],
    axis=1
)
Output, with no warning:
1 2 3 4 5 6 ... 4995 4996 4997 4998 4999 5000
0 2.0 2.0 3.0 1.0 4.0 1.0 ... 3.0 3.0 3.0 4.0 4.0 1.0
1 1.0 1.0 4.0 2.0 3.0 2.0 ... 4.0 1.0 2.0 3.0 1.0 3.0
2 3.0 4.0 1.0 4.0 1.0 4.0 ... 1.0 4.0 1.0 1.0 3.0 4.0
3 4.0 3.0 2.0 3.0 2.0 3.0 ... 2.0 2.0 3.0 2.0 2.0 2.0
*Of course, if ranking is empty, we can create it directly from df:
ranking = df[range(1, num_sims + 1)].rank(ascending=False, method='min')
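When the per-column work cannot be expressed as a single DataFrame method call, the warning's own suggestion (joining all columns at once with pd.concat(axis=1)) still applies. The following is a minimal sketch under that assumption, not part of the original answer: it keeps the per-column loop but collects the results in a dict and concatenates once. The name ranked_columns and the smaller 50-column sample are illustrative choices.

```python
import numpy as np
import pandas as pd

# Smaller sample than in the answer above (assumption, to keep the demo quick)
np.random.seed(5)
df = pd.DataFrame(np.random.randint(1, 100, (4, 50)))
df.columns = df.columns + 1
num_sims = len(df.columns)

# Compute each column separately, but store the results in a plain dict
# instead of inserting them into a DataFrame one at a time.
ranked_columns = {
    x: df[x].rank(ascending=False, method='min')
    for x in range(1, num_sims + 1)
}

# Join everything in a single call; the dict keys become the column labels.
ranking = pd.concat(ranked_columns, axis=1)

# Reference result from the fully vectorized call, for comparison
expected = df.rank(ascending=False, method='min')
print(ranking.shape)
```

For a plain rank the vectorized call is still simpler; the dict-plus-concat pattern is mainly useful when each column needs custom logic.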
Sanity check that both approaches produce the same result:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame(np.random.randint(1, 100, (4, 5000)))
df.columns = df.columns + 1
ranking = pd.DataFrame()
num_sims = len(df.columns)
for x in range(1, num_sims + 1):
    ranking[x] = df[x].rank(ascending=False, method='min')
print(ranking.eq(pd.concat(
    [pd.DataFrame(),
     df[range(1, num_sims + 1)].rank(ascending=False, method='min')],
    axis=1
)).all(axis=None))  # True
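To confirm programmatically that the vectorized version does not raise the warning, one option is to record warnings with the standard library's warnings module and filter for pd.errors.PerformanceWarning. This is only a sketch; the 500-column sample and the name fragmented are assumptions made for illustration.

```python
import warnings

import numpy as np
import pandas as pd

np.random.seed(5)
df = pd.DataFrame(np.random.randint(1, 100, (4, 500)))
df.columns = df.columns + 1

with warnings.catch_warnings(record=True) as caught:
    # Record every warning instead of letting filters hide repeats
    warnings.simplefilter("always")
    ranking = df.rank(ascending=False, method='min')

# Keep only the fragmentation-related warning category
fragmented = [
    w for w in caught
    if issubclass(w.category, pd.errors.PerformanceWarning)
]
print(len(fragmented))
```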
Source: a similar question about this PerformanceWarning on Stack Overflow: https://stackoverflow.com/questions/68886155/