python - What is sigma clipping? How do you know when to apply it?

Tags: python pandas numpy statistics data-science

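I'm reading a book on data science with Python, and the author applies a "sigma-clipping operation" to remove outliers caused by typos. However, the procedure is not explained at all.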

What is sigma clipping? Does it apply only to certain kinds of data (in the book, for example, it is used on US birth-rate data)?

According to the text:

quartiles = np.percentile(births['births'], [25, 50, 75])  # the 25th, 50th, and 75th percentiles
mu = quartiles[1]  # mu is set to the 50th percentile (the median)
sig = 0.74 * (quartiles[2] - quartiles[0])  # ???

This final line is a robust estimate of the sample mean, where the 0.74 comes 
from the interquartile range of a Gaussian distribution.

Why 0.74? Is there a proof for this?

Best answer

This final line is a robust estimate of the sample mean, where the 0.74 comes from the interquartile range of a Gaussian distribution.

That's it, really...

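The code estimates sigma from the interquartile range so that the estimate is robust to outliers. Despite the quoted wording, the quantity being estimated is the standard deviation (sigma), not the mean; the median mu plays the role of the mean. 0.74 is a correction factor. It is computed as follows: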

from scipy import stats

p1 = stats.norm.ppf(0.25)  # first quartile of the standard normal distribution
p2 = stats.norm.ppf(0.75)  # third quartile
print(p2 - p1)  # 1.3489795003921634

sig = 1  # standard deviation of the standard normal distribution
factor = sig / (p2 - p1)  # sigma per unit of interquartile range
print(factor)  # 0.7413011092528...
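The same factor can also be written in closed form, which may help answer the "is there a proof" part of the question. This is a short derivation assuming the data follow a normal distribution with mean \mu and standard deviation \sigma; \Phi^{-1} denotes the quantile function of the standard normal distribution (what `norm.ppf` computes), and Q_1, Q_3 are the first and third quartiles.

```latex
\begin{aligned}
Q_1 &= \mu + \sigma\,\Phi^{-1}(0.25) = \mu - \sigma\,\Phi^{-1}(0.75), \\
Q_3 &= \mu + \sigma\,\Phi^{-1}(0.75), \\
\mathrm{IQR} &= Q_3 - Q_1 = 2\,\Phi^{-1}(0.75)\,\sigma \approx 1.349\,\sigma, \\
\sigma &= \frac{\mathrm{IQR}}{2\,\Phi^{-1}(0.75)} \approx 0.7413 \cdot \mathrm{IQR}.
\end{aligned}
```

The second equality in the first line uses the symmetry of the normal distribution about its mean.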

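For the standard normal distribution, where sig == 1, the interquartile range is about 1.35. So 0.74 is the correction factor that converts an interquartile range into sigma. Of course, this holds only for normally distributed data.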
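To make the clipping step itself concrete, here is a minimal sketch of how such a robust sigma estimate might be used to drop outliers. It is not the book's code: the synthetic data, the helper name `robust_sigma_clip`, and the 5-sigma threshold are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 roughly Gaussian values plus a few gross
# outliers standing in for typo-like entries.
data = rng.normal(loc=100.0, scale=10.0, size=1000)
data_with_outliers = np.concatenate([data, [10_000.0, -5_000.0, 99_999.0]])


def robust_sigma_clip(values, n_sigma=5.0):
    """Keep values within n_sigma robust standard deviations of the median.

    n_sigma=5.0 is an arbitrary illustrative threshold, not a recommendation.
    """
    values = np.asarray(values, dtype=float)
    q25, q50, q75 = np.percentile(values, [25, 50, 75])
    mu = q50                  # median as a robust estimate of the centre
    sig = 0.74 * (q75 - q25)  # IQR-based estimate of sigma (assumes Gaussian data)
    mask = (values > mu - n_sigma * sig) & (values < mu + n_sigma * sig)
    return values[mask], mu, sig


clipped, mu, sig = robust_sigma_clip(data_with_outliers)

print(np.std(data_with_outliers))  # ordinary std, heavily inflated by the outliers
print(sig)                         # robust estimate, close to the true scale of 10
print(len(data_with_outliers) - len(clipped))  # number of values removed
```

The point of the comparison is that the ordinary standard deviation is dominated by the few extreme values, while the interquartile-range estimate barely notices them, so the clipping bounds stay sensible. If the data are far from Gaussian (strongly skewed or heavy-tailed, say), the 0.74 factor no longer maps the interquartile range to sigma and the bounds should be treated with caution.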

Source: the Stack Overflow question "python - What is sigma clipping? How do you know when to apply it?", https://stackoverflow.com/questions/45666970/
