python - Removing outliers and computing the trimmed mean of multiple columns with different numbers of actual values in Python

I have a dataset. Suppose it has 10010 rows and 100 columns. Column values may contain NaN, and the number of NaNs can differ from column to column.

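I want to:

Select n columns from this dataset (say 20, in no particular order, e.g. Column1, Column2, and so on).
Trim outliers (the highest 2.5% and the lowest 2.5% of the values in each selected column), excluding NaN values (so if 10 of the 10010 values in Column1 are NaN, I need to trim the 250 highest and the 250 lowest actual values out of the 10000).
However, if Column2 originally has 110 NaNs, I want to trim 2.5% from each side based on its own number of actual values (9900 in this case, not 10000 as in Column1).
Compute the trimmed mean of each selected column.
Have a new dataset after trimming, in which all trimmed outliers have been converted to NaN.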

Best answer

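The simplified example below shows one approach that may be useful; it uses the pandas quantile method (called here as Series.quantile). The code can be adapted to your requirements, in particular the quantile arguments.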

import pandas as pd

df = pd.DataFrame({'col1': [ 1, 2, 3, 4, None, 6, 7, 8, 54],
                   'col2': [3, 5, 13, 14, 2, 16, 17, 18, 19] })

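cols = ['col1', 'col2']
for col in cols:
    # quantile() ignores NaN, so the cut-offs use only the column's real values
    lo = df[col].quantile(0.1)
    hi = df[col].quantile(0.9)
    # keep values strictly between the cut-offs; everything else becomes NaN
    df[col] = df[col].where((df[col] > lo) & (df[col] < hi))
    print(f'mean for {col} is: ', df[col].mean().round(2))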


print(df)

This gives:

mean for col1 is:  5.0
mean for col2 is:  12.29

   col1  col2
0   NaN   3.0
1   2.0   5.0
2   3.0  13.0
3   4.0  14.0
4   NaN   NaN
5   6.0  16.0
6   7.0  17.0
7   8.0  18.0
8   NaN   NaN
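
For the 2.5% cut-offs in the question, the same threshold idea can be applied to all selected columns at once. The following is a minimal sketch, not part of the original answer: the frame `data`, its random contents, and the column names are made up for illustration, and unlike the loop above it keeps values that are exactly equal to a cut-off.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 rows, 3 columns, with a different NaN count per column.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(1000, 3)),
                    columns=['Column1', 'Column2', 'Column3'])
data.loc[:9, 'Column1'] = np.nan      # 10 NaNs (label slices are inclusive)
data.loc[:109, 'Column2'] = np.nan    # 110 NaNs

selected = ['Column1', 'Column2']     # the n columns to process

# quantile() skips NaN, so each column's cut-offs come from its own real values.
lo = data[selected].quantile(0.025)
hi = data[selected].quantile(0.975)

# Keep values inside [lo, hi]; everything else (and every original NaN) is NaN.
inside = data[selected].ge(lo, axis=1) & data[selected].le(hi, axis=1)
trimmed = data[selected].where(inside)

trimmed_means = trimmed.mean()        # mean() also skips NaN
print(trimmed_means.round(3))
```

Because the cut-offs are interpolated quantiles, the number of values removed from each side is close to, but not always exactly, 2.5% of the non-NaN count.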

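The code above uses a value threshold to change outliers to NaN; this would be the usual approach. If the requirement is instead to change a fixed number of values at each extreme, this can be done by saving and manipulating the index: sort by value, change the required proportion of outliers, then use the saved index to restore the original order. The code below assumes that the default numeric index was used originally; if not, the user's index needs to be saved and then reset at the end.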

cut_val = 0.2     # proportion of non-NaN values to remove from each extreme
num_rows = len(df)   # df as originally created above, before any trimming

cols = ['col1', 'col2']
for col in cols:
    num_not_nan = num_rows - df[col].isna().sum()
    cut = int(num_not_nan * cut_val)            # number of values to blank at each end
    dfx = df[col].astype(float).sort_values()   # ascending; NaNs are placed last
    idx = dfx.index.to_list()                   # save sorted index
    dfx.index = range(num_rows)                 # use numerical re-index so .loc can be used
    dfx.loc[0:cut-1] = None                     # lowest values (.loc slices are inclusive)
    dfx.loc[num_not_nan-cut:num_not_nan-1] = None   # highest non-NaN values
    dfx.index = idx                             # impose original index
    df[col] = dfx.sort_index()                  # restore original row order
    print(f'mean for {col} is: ', df[col].mean().round(2))

print(df)
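
The count-based trimming can also be written without re-indexing, by ranking the values. This is a hedged sketch rather than the answer's own method: `trim_by_count` is a hypothetical helper name, and it relies on `Series.rank` leaving NaN entries unranked, so each column's count is taken from its own non-NaN values.

```python
import pandas as pd

def trim_by_count(s, proportion):
    """Set the lowest and highest `proportion` of the non-NaN values of `s` to NaN."""
    n = s.notna().sum()                 # number of real values in this column
    k = int(n * proportion)             # how many values to blank at each end
    ranks = s.rank(method='first')      # 1..n for real values, NaN stays NaN
    keep = (ranks > k) & (ranks <= n - k)
    return s.where(keep)                # original row order is untouched

df = pd.DataFrame({'col1': [1, 2, 3, 4, None, 6, 7, 8, 54],
                   'col2': [3, 5, 13, 14, 2, 16, 17, 18, 19]})

# Extra keyword arguments to apply() are passed on to the helper.
trimmed = df[['col1', 'col2']].apply(trim_by_count, proportion=0.2)

print(trimmed.mean().round(2))
print(trimmed)
```

With `proportion=0.025` and a column holding 10000 actual values, `k` is 250, which matches the count described in the question; with 9900 actual values it is 247.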

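Regarding "python - Removing outliers and computing the trimmed mean of multiple columns with different numbers of actual values in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76899175/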
