r - Computing density for discrete data in R

Tags: r ggplot2 kernel-density density-plot

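I want to plot a density histogram of discrete data in R. Because my dataset is very large, I need to compute the density before plotting.

However, I found that the stats::density function gives different results from ..density.. in ggplot2. Why is that? Moreover, the ggplot2 ..density.. results are what I expect, whereas the stats::density results are not.

Please see the reproducible example below.

Many thanks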

library(tidyverse)
library(patchwork)

set.seed(1) # arbitrary seed so the random draws are reproducible

df <- data.frame(A = round(rnorm(1000)),
           B = round(rnorm(1000)),
           C = round(rnorm(1000))) %>% 
  pivot_longer(cols = everything(), names_to = "group")

dens_df <- df %>% 
  group_by(group) %>% 
  summarise(dens = list(density(value, from = -3, to = 6, n = length(-3:6)))) %>% #compute density and nest into list
  mutate(density.x = map(dens, ~.x[["x"]]), #extract x values
         density.y = map(dens, ~.x[["y"]])) %>%  #extract y values
  select(-dens) %>% 
  unnest(cols = c(density.x, density.y))

plot_dens <- dens_df %>% 
  ggplot()+
  # map the precomputed density inside the layer so stat_function() does not inherit it
  geom_col(aes(x = density.x, y = density.y))+
  scale_x_continuous(breaks = seq(-3,10,1))+
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", size = 2, col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "line", col = "red") +
  facet_wrap(~group)+
  labs(title = "using stats::density")

plot_geom <- df %>% 
  ggplot()+
  geom_histogram(
    aes(x = value, y = ..density..), # newer ggplot2 releases prefer after_stat(density)
    binwidth = 1,
    col = "white")+
  scale_x_continuous(breaks = seq(-3,10,1))+
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", size = 2, col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "line", col = "red") +
  facet_wrap(~group)+
  labs(title = "using ggplot2 ..density..")


plot_dens + plot_geom

Best answer

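The simple answer is that stats::density is not the right function for discrete data. stats::density uses a smoothing kernel, so even when it is sampled at only 10 points it returns a curve representing a continuous density between -3 and 6. That is not the same as the density of discrete data, which (for bins of width 1) is simply the frequency in each bin divided by the number of observations. That is what geom_histogram is plotting.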
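To make the difference concrete, here is a minimal sketch (not part of the original answer; the names `x`, `kde`, `prop`, and `comparison` are illustrative) that puts the kernel estimate at each integer next to the observed proportion of that integer. The two columns will generally differ, because the kernel spreads each observation's mass over a continuous range instead of counting it in a single bin.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- round(rnorm(1000))          # discrete (integer-valued) sample

# Kernel density estimate evaluated at the 10 integers -3, ..., 6
kde <- density(x, from = -3, to = 6, n = 10)

# Observed proportion of each integer; factor() keeps unobserved values
# as zero counts (any value outside -3:6 would be dropped here)
prop <- table(factor(x, levels = -3:6)) / length(x)

comparison <- data.frame(x        = kde$x,
                         kernel   = round(kde$y, 3),
                         observed = as.numeric(prop))
print(comparison)
```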

A computationally efficient way to get the discrete density is to use hist, or even just table(x)/length(x).
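As a rough check of that equivalence, the sketch below (an illustration, not from the original answer; `x`, `lo`, `hi`, `prop_table`, and `h` are made-up names) computes the proportions both ways for one integer vector. It assumes unit-width bins centred on the integers; with any other bin width the `hist` density would be the proportion divided by the width.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- round(rnorm(1000))
lo <- min(x)
hi <- max(x)

# Proportions via table(); factor() keeps unobserved integers as zero counts
prop_table <- as.numeric(table(factor(x, levels = lo:hi)) / length(x))

# The same via hist(): unit-width bins centred on each integer
h <- hist(x, breaks = seq(lo - 0.5, hi + 0.5, by = 1), plot = FALSE)

# With bins of width 1, hist's density equals the proportion in each bin
stopifnot(isTRUE(all.equal(prop_table, h$density)))
```

Deriving the breaks from the range of the data, as done here, also avoids the error `hist` raises when some values fall outside fixed breaks.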

Here is an example that uses the precomputed density from hist:

dens_df <- df %>% 
  group_by(group) %>% 
  summarise(dens = list(hist(value, breaks = seq(-3.5, 6.5), plot = FALSE))) %>% 
  mutate(density.x = map(dens, ~.x[["mids"]]), #extract x values
         density.y = map(dens, ~.x[["density"]])) %>%  #extract y values
  select(-dens) %>% 
  unnest(cols = c(density.x, density.y))

We can see that this produces the same plot as geom_histogram, using your own plotting code but with the modified dens_df:

plot_dens <- dens_df %>% 
  ggplot()+
  # map the precomputed density inside the layer so stat_function() does not inherit it
  geom_col(aes(x = density.x, y = density.y))+
  scale_x_continuous(breaks = seq(-3,10,1))+
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1),
                geom = "point", col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), 
                geom = "point", size = 2, col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), 
                geom = "line", col = "red") +
  facet_wrap(~group)+
  labs(title = "using graphics::hist")

plot_geom <- df %>% 
  ggplot()+
  geom_histogram(
    aes(x = value, y = ..density..), # newer ggplot2 releases prefer after_stat(density)
    binwidth = 1,
    col = "white")+
  scale_x_continuous(breaks = seq(-3,10,1))+
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), 
                geom = "point", col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), 
                geom = "point", size = 2, col = "red") +
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), 
                geom = "line", col = "red") +
  facet_wrap(~group)+
  labs(title = "using ggplot2 ..density..")

plot_dens + plot_geom
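For a very large grouped dataset, the table(x)/length(x) idea can also be written directly with dplyr. The following is a sketch under stated assumptions, not the original answerer's code: it assumes the values are already integers, so each distinct value acts as its own bin of width 1, and the names `df_long` and `dens_count` are hypothetical.

```r
library(dplyr)

set.seed(1)                      # arbitrary seed, only for reproducibility
# Long-format data with the same columns as df in the question
df_long <- data.frame(
  group = rep(c("A", "B", "C"), each = 1000),
  value = round(rnorm(3000))
)

dens_count <- df_long %>%
  count(group, value, name = "n") %>%   # frequency of each value per group
  group_by(group) %>%
  mutate(density = n / sum(n)) %>%      # proportion within each group
  ungroup()

head(dens_count)
```

Unlike hist, this produces no rows for values that never occur in a group, which is usually harmless when the result is drawn with geom_col(aes(x = value, y = density)).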


Regarding r - Computing density for discrete data in R, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/72670820/
