python-3.x - After winsorizing my sample at 1% and 99% with scipy.stats.mstats.winsorize, the maximum of my sample is still greater than the 99th-percentile value

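I want to winsorize my sample at the 1% and 99% levels, so I used scipy to do it. After winsorizing, the maximum of my sample is still surprisingly larger than the value at the 99th percentile. Why does this happen? My sample is: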

Total Sales         Assets     Market value 
1000                 123        4892  
1232                 12         NaN
125                  1569       156

I used:

import scipy.stats as sp

for col in df.columns: 
     sp.mstats.winsorize(df[col], limits=0.01, inplace=True)

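After winsorizing with my code, the maximum value in the sample is still greater than the 99th-percentile value. I think I made a mistake somewhere, but I can't tell where.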

Best answer

The problem is the in-place operation. Assign the result back to the column instead:

from scipy import stats

for col in df.columns:
    # assign the winsorized result back instead of using inplace=True
    df[col] = stats.mstats.winsorize(df[col], limits=0.01)
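The assign-back pattern can be wrapped in a small helper. The following is only a sketch: `winsorize_columns` is a hypothetical name, not part of scipy or pandas, and it assumes every column is numeric and has no missing values.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats


def winsorize_columns(df, limits=0.01):
    """Return a copy of df with every column winsorized (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        # winsorize returns a masked array; keep the plain data and
        # assign it back instead of relying on inplace=True.
        result = mstats.winsorize(out[col].to_numpy(), limits=limits)
        out[col] = np.ma.getdata(result)
    return out


# Illustrative data: the integers 1..1000 in random order.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.permutation(np.arange(1, 1001)),
    "b": rng.permutation(np.arange(1, 1001)),
})

clean = winsorize_columns(df)
print("max before:", df.max().tolist())
print("max after: ", clean.max().tolist())
```

Working on a copy leaves the original frame untouched, so the maxima before and after can be compared directly.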

Sample data

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randint(1, 10000, (500000, 2)))
print(df.describe())
#                   0              1
#count  500000.000000  500000.000000
#mean     4993.512288    5004.678502
#std      2888.254381    2884.128073
#min         1.000000       1.000000
#25%      2486.000000    2513.000000
#50%      4985.000000    5005.000000
#75%      7492.000000    7502.000000
#max      9999.000000    9999.000000

# inplace doesn't change anything when looping over columns:
for col in df.columns: 
     stats.mstats.winsorize(df[col], limits=0.01, inplace=True)
print(df.describe())
#                   0              1
#count  500000.000000  500000.000000
#mean     4993.512288    5004.678502
#std      2888.254381    2884.128073
#min         1.000000       1.000000
#25%      2486.000000    2513.000000
#50%      4985.000000    5005.000000
#75%      7492.000000    7502.000000
#max      9999.000000    9999.000000

for col in df.columns: 
     df[col] = stats.mstats.winsorize(df[col], limits=0.01)
print(df.describe())
#                   0              1
#count  500000.000000  500000.000000
#mean     4993.505330    5004.690118
#std      2886.521538    2882.414353
#min       101.000000     101.000000
#25%      2486.000000    2513.000000
#50%      4985.000000    5005.000000
#75%      7492.000000    7502.000000
#max      9899.000000    9901.000000
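If the goal is for the maximum to equal the 99th-percentile value exactly, one alternative is to clip each column at its own quantiles using pandas. This is a different technique from `winsorize`: `winsorize` replaces a fixed share of the observations with the nearest remaining value, whereas `quantile` interpolates by default, so the two cutoffs can differ slightly. The column names and data below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data; the column names are assumptions, not the asker's data.
df = pd.DataFrame({
    "sales": np.arange(1, 1001, dtype=float),
    "assets": np.arange(1000, 0, -1, dtype=float),
})
df.loc[5, "assets"] = np.nan  # a missing value, as in the question's sample

# Per-column cutoffs; quantile skips NaN by default.
lower = df.quantile(0.01)
upper = df.quantile(0.99)

# axis=1 aligns the cutoff Series with the columns; NaN entries stay NaN.
clipped = df.clip(lower=lower, upper=upper, axis=1)

print(clipped.max() == upper)
```

Because the cutoffs come from `quantile`, the clipped maximum of each column matches its 99th-percentile value by construction.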

This Q&A is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/55671182/
