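I want to plot a density histogram of discrete data in R. Because my dataset is very large, I need to compute the density before plotting.
However, I found that the stats::density function gives different results from ..density.. in ggplot2. Why is that? Moreover, the ..density.. result from ggplot2 is what I expect, while the stats::density result is not.
Please see the reproducible example below.
Many thanks.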
library(tidyverse)
library(patchwork)

df <- data.frame(A = round(rnorm(1000)),
                 B = round(rnorm(1000)),
                 C = round(rnorm(1000))) %>%
  pivot_longer(cols = everything(), names_to = "group")

dens_df <- df %>%
  group_by(group) %>%
  summarise(dens = list(density(value, from = -3, to = 6, n = length(-3:6)))) %>% # compute density and nest into list
  mutate(density.x = map(dens, ~.x[["x"]]),     # extract x values
         density.y = map(dens, ~.x[["y"]])) %>% # extract y values
  select(-dens) %>%
  unnest(cols = c(density.x, density.y))

plot_dens <- dens_df %>%
  ggplot() +
  # the original piped aes() into geom_col(); passing it as the mapping is equivalent
  geom_col(aes(x = density.x, y = density.y)) +
  scale_x_continuous(breaks = seq(-3, 10, 1)) +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", size = 2, col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "line", col = "red") +
  facet_wrap(~group) +
  labs(title = "using stats::density")

plot_geom <- df %>%
  ggplot() +
  # the original piped aes() into geom_histogram(); passing it as the mapping is equivalent
  geom_histogram(aes(x = value, y = ..density..),
                 binwidth = 1,
                 col = "white") +
  scale_x_continuous(breaks = seq(-3, 10, 1)) +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", size = 2, col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "line", col = "red") +
  facet_wrap(~group) +
  labs(title = "using ggplot2 ..density..")

plot_dens + plot_geom

Best answer
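The simple answer is that stats::density is not the right function for discrete data. stats::density uses a smoothing kernel, so even when it is sampled at only 10 points it returns values from a curve that represents a continuous density between -3 and 6. That is not the same as the density of discrete data, which is simply the count in each bin divided by the number of observations (and by the bin width, which is 1 here). The latter is what geom_histogram is plotting.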
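To make the difference concrete, here is a minimal base-R sketch. The seed and the names pmf, kde and comparison are my own choices for illustration, not part of the question. It puts the relative frequencies next to the kernel estimate sampled at the same integers; how far apart the two columns are depends on the bandwidth that density picks by default.

```r
set.seed(42)                      # arbitrary seed, only so the sketch is repeatable
x <- round(rnorm(1000))           # discrete data, as in the question

# Relative frequency of each observed integer (what a unit-width histogram shows)
pmf <- table(x) / length(x)

# Kernel density estimate, sampled at the 10 integer points -3..6
kde <- density(x, from = -3, to = 6, n = 10)

# Side-by-side comparison at the observed values
# (values outside -3..6, if any, get NA in the kde column)
obs <- as.numeric(names(pmf))
comparison <- cbind(value = obs,
                    pmf   = as.numeric(pmf),
                    kde   = approx(kde$x, kde$y, xout = obs)$y)
print(comparison)
```

The pmf column sums to 1 over the observed values, whereas the kde column is a set of point samples from a smooth curve and carries no such guarantee.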
A computationally efficient way to get the discrete density is to use hist, or even just table(x)/length(x).
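As a quick check that the two approaches agree, the following sketch (base R only; the seed and the names breaks, h and rel_freq are assumptions of mine) builds unit-width bins centred on the integers and compares the density from hist with the plain relative frequencies.

```r
set.seed(1)                       # arbitrary seed, only so the sketch is repeatable
x <- round(rnorm(1000))

# Unit-width bins centred on the integers, spanning the observed range
breaks <- seq(min(x) - 0.5, max(x) + 0.5, by = 1)
h <- hist(x, breaks = breaks, plot = FALSE)

# Relative frequencies over the same integers; factor() keeps empty values as zeros
rel_freq <- table(factor(x, levels = h$mids)) / length(x)

# With a bin width of 1, density = count / n, so the two should match
print(all.equal(h$density, as.numeric(rel_freq)))
```

Deriving the breaks from the data, rather than hard-coding them, avoids the error hist raises when some values fall outside the breaks.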
Here is an example that uses the pre-computed density from hist:

dens_df <- df %>%
  group_by(group) %>%
  # breaks must span the full range of value, otherwise hist() stops with an error
  summarise(dens = list(hist(value, breaks = seq(-3.5, 6.5), plot = FALSE))) %>%
  mutate(density.x = map(dens, ~.x[["mids"]]),        # extract x values (bin midpoints)
         density.y = map(dens, ~.x[["density"]])) %>% # extract y values
  select(-dens) %>%
  unnest(cols = c(density.x, density.y))

We can see that this gives the same plot as geom_histogram, using your own plotting code with the modified dens_df:

plot_dens <- dens_df %>%
  ggplot() +
  # the original piped aes() into geom_col(); passing it as the mapping is equivalent
  geom_col(aes(x = density.x, y = density.y)) +
  scale_x_continuous(breaks = seq(-3, 10, 1)) +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1),
                geom = "point", col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1),
                geom = "point", size = 2, col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1),
                geom = "line", col = "red") +
  facet_wrap(~group) +
  labs(title = "using graphics::hist")

plot_geom <- df %>%
  ggplot() +
  # the original piped aes() into geom_histogram(); passing it as the mapping is equivalent
  geom_histogram(aes(x = value, y = ..density..),
                 binwidth = 1,
                 col = "white") +
  scale_x_continuous(breaks = seq(-3, 10, 1)) +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1),
                geom = "point", col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1),
                geom = "point", size = 2, col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1),
                geom = "line", col = "red") +
  facet_wrap(~group) +
  labs(title = "using ggplot2 ..density..")

plot_dens + plot_geom

Regarding "r - Computing density for discrete data in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/72670820/