python - 为什么 stat_density (R; ggplot2) 和 gaussian_kde (Python; scipy) 不同？

我正在尝试对可能不是正态分布的一系列分布生成基于 KDE 的 PDF 估计。

我喜欢 R 中 ggplot 的 stat_density 似乎可以识别频率中的每一个增量颠簸的方式，但无法通过 Python 的 scipy-stats-gaussian_kde 方法复制它，这似乎过于平滑。

我已按如下方式设置我的 R 代码:

ggplot(test, aes(x=Val, color = as.factor(Class), group=as.factor(Class))) +
             stat_density(geom='line',kernel='gaussian',bw='nrd0' 
                                                            #nrd0='Silverman'
                                                            ,size=1,position='identity')

我的 python 代码是:

kde = stats.gaussian_kde(data.ravel())
kde.set_bandwidth(bw_method='silverman')

统计文档显示 here nrd0 是 bw 调整的 silverman 方法。

基于上面的代码，我使用了相同的内核(高斯)和带宽方法(Silverman)。

谁能解释为什么结果如此不同？

最佳答案

关于什么是西尔弗曼法则似乎存在分歧。 TL;DR - scipy 使用了一个更糟糕的规则版本，它只适用于正态分布的单峰数据。 R 使用了一个更好的版本，它是“两全其美”并且“适用于各种密度”。

scipy docs说 Silverman 的规则是 implemented as :

def silverman_factor(self):
    return power(self.neff*(self.d+2.0)/4.0, -1./(self.d+4))

其中 d 是维数(在您的情况下为 1)，neff 是有效样本大小(点数，假设没有权重)。所以 scipy 带宽是 (n * 3/4) ^ (-1/5)(乘以标准偏差，以不同的方法计算)。

相比之下，R 的 stats package docs将 Silverman 的方法描述为“0.9 乘以标准偏差和四分位间距的最小值除以样本量的 1.34 倍的负五分之一方”，这也可以在 R 代码中验证，键入 bw.nrd0 在控制台中给出:

function (x) 
{
    if (length(x) < 2L) 
        stop("need at least 2 data points")
    hi <- sd(x)
    if (!(lo <- min(hi, IQR(x)/1.34))) 
        (lo <- hi) || (lo <- abs(x[1L])) || (lo <- 1)
    0.9 * lo * length(x)^(-0.2)
}

Wikipedia ，另一方面，将“Silverman 的经验法则”作为估算器的许多可能名称之一:

1.06 * sigma * n ^ (-1 / 5)

维基百科版本相当于scipy版本。

所有三个来源(scipy 文档、维基百科和 R 文档)都引用了相同的原始引用资料:Silverman、B.W. (1986)。 统计和数据分析的密度估计。伦敦:Chapman & Hall/CRC。 p. 48. 国际标准书号 978-0-412-24620-3。维基百科和 R 特别引用了第 48 页，而 scipy 的文档没有提到页码。 (我已经向维基百科提交了一个编辑，以将其页面引用更新为 p.45，见下文。)

阅读 Silverman 论文，第 45 页，方程 3.28 是维基百科文章中使用的:(4/3) ^ (1/5) * sigma * n ^ (-1/5) ~= 1.06 * 西格玛 * n ^ (-1/5)。 Scipy 使用相同的方法，将 (4/3) ^ (1/5) 重写为等效的 (3/4) ^ (-1/5)。 Silverman 描述了这种方法:

While (3.28) will work well if the population really is normally distributed, it may oversmooth somewhat if the population is multimodal... as the mixture becomes more strongly bimodal the formula (3.28) will oversmooth more and more, relative to the optimal choice of smoothing parameter.

scipy 文档 reference this weakness ，说明:

It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.

但是，Silverman 的文章继续改进了 scipy 使用的方法以获取 R 和 Stata 使用的方法。在第 48 页，我们得到等式 3.31:

h = 0.9 * A * n ^ (-1 / 5)
# A defined on previous page, eqn 3.30
A = min(standard deviation, interquartile range / 1.34)

Silverman 将此方法描述为:

The best of both possible worlds... In summary, the choice ([eqn] 3.31) for the smoothing parameter will do very well for a wide range of densities and is trivial to evaluate. For many purposes it will certainly be an adequate choice of window width, and for others it will be a good starting point for subsequent fine-tuning.

因此，Wikipedia 和 Scipy 似乎使用了 Silverman 提出的具有已知弱点的估算器的简单版本。 R 和 Stata 使用更好的版本。

关于python - 为什么 stat_density (R; ggplot2) 和 gaussian_kde (Python; scipy) 不同？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/55366188/

python - 为什么 stat_density (R; ggplot2) 和 gaussian_kde (Python; scipy) 不同？

上一篇：python - 如何使用正则表达式在 Python 3 中的特定字符串之后或之前找到一行？

下一篇：python - 在循环中的不同行上打印值时，如何避免在最后一次迭代后打印换行符？