r - Changing the distribution of one dataset to match another dataset

Tags: r statistics

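I have two datasets: one is modeled (artificial) data and the other is observed data. Their statistical distributions differ slightly, and I want to force the modeled data to match the distribution of the observed data in terms of spread. In other words, I need the modeled data to better represent the tails of the observed data. Here is an example.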

model <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)

summary(model)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.62   36.98   40.38   40.28   44.91   54.15 

summary(observed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12.62   35.54   42.58   41.10   47.76   59.25 

How can I force the modeled data to have the variability of the observed data in R?

Best answer

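Are you simply trying to model the distribution of observed? If so, you can generate a kernel density estimate from the observations and then resample from that modeled density. For example: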

library(ggplot2)

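First, we generate a density estimate from the observed values. This is our model of the distribution of the observed values. adjust is a parameter that scales the bandwidth; its default is 1. Smaller values give less smoothing (that is, a density estimate that more closely follows small-scale structure in the data):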

dens.obs = density(observed, adjust=0.8)
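
To make the role of adjust concrete, the sketch below compares the bandwidth actually used with and without the adjustment. It assumes the default bandwidth rule of density() (bw.nrd0) and the default Gaussian kernel; dens.default is a name introduced here only for illustration.

```r
observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
              28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
              42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
              52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
              51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
              34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
              34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)

# Density estimate with the default bandwidth (adjust = 1); illustrative name
dens.default <- density(observed)
# Density estimate used in the answer (bandwidth scaled by 0.8)
dens.obs <- density(observed, adjust = 0.8)

# Show both bandwidths side by side
print(c(default = dens.default$bw, adjusted = dens.obs$bw))

# The adjusted bandwidth is adjust times the default rule-of-thumb bandwidth
print(all.equal(dens.obs$bw, 0.8 * bw.nrd0(observed)))
```

A smaller bandwidth means each observation spreads its probability mass over a narrower range, so the estimate follows the data more closely.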

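Now resample from the density estimate to get the modeled values. We set prob=dens.obs$y so that the probability of a value in dens.obs$x being chosen is proportional to its modeled density.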

set.seed(439)
resample.obs = sample(dens.obs$x, 1000, replace=TRUE, prob=dens.obs$y)
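
The resampling above can only return values that lie on the evaluation grid of density() (512 points by default). As a hedged alternative, not part of the original answer, a Gaussian kernel density estimate can also be sampled by drawing observed values with replacement and adding normal noise whose standard deviation equals the bandwidth (a smoothed bootstrap). This sketch assumes the default Gaussian kernel; n.draw and smooth.boot are names introduced here for illustration.

```r
observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
              28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
              42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
              52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
              51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
              34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
              34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)

set.seed(439)
n.draw <- 1000  # number of modeled values to generate (illustrative)

# For density(), bw is the standard deviation of the smoothing kernel
bw <- density(observed, adjust = 0.8)$bw

# Draw observed values with replacement, then add Gaussian kernel noise
smooth.boot <- sample(observed, n.draw, replace = TRUE) +
  rnorm(n.draw, mean = 0, sd = bw)

print(summary(smooth.boot))
```

Both approaches draw from the same estimated density; the difference is that the smoothed bootstrap returns continuous values and is not limited to the range covered by the grid.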

Put the observed and modeled values into a data frame in preparation for plotting:

dat = data.frame(value=c(observed,resample.obs), 
                 group=rep(c("Observed","Modeled"), c(length(observed),length(resample.obs))))

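The ECDF (empirical cumulative distribution function) plot below shows that sampling from the kernel density estimate gives a sample whose distribution is similar to that of the observed data: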

ggplot(dat, aes(value, fill=group, colour=group)) +
  stat_ecdf(geom="step") +
  theme_bw()

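If a numeric check is preferred over a visual one, a few quantiles of the two samples can be compared directly. This is only a rough sketch: the probabilities in probs and the name q.tab are arbitrary choices made here, and the steps that build resample.obs are repeated so the block runs on its own.

```r
observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
              28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
              42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
              52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
              51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
              34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
              34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)

# Rebuild the modeled sample exactly as in the answer
dens.obs <- density(observed, adjust = 0.8)
set.seed(439)
resample.obs <- sample(dens.obs$x, 1000, replace = TRUE, prob = dens.obs$y)

# Compare selected quantiles of the observed and modeled samples
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
q.tab <- rbind(Observed = quantile(observed, probs),
               Modeled  = quantile(resample.obs, probs))
print(round(q.tab, 2))
```

Rows that agree closely in the outer columns indicate that the modeled sample reproduces the tails of the observed data.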

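You can also plot the density distributions of the observed data and of the values sampled from the modeled distribution (using the same value of the adjust parameter that we used above).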

ggplot(dat, aes(value, fill=group, colour=group)) +
  geom_density(alpha=0.4, adjust=0.8) +
  theme_bw()

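The approach above generates new values and does not use the model vector at all. If the goal is instead to keep the ordering of the existing model values while giving them the spread of observed, one possible alternative is empirical quantile mapping. This is a different technique from the one in the answer, sketched here under the assumption that preserving each model value's rank is what is wanted; p and model.mapped are names introduced for illustration.

```r
model <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
           38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
           42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
           49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
           45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
           36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
           38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
              28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
              42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
              52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
              51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
              34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
              34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)

# Relative position of each model value within the model sample (0 < p < 1);
# tied model values share the same position
p <- (rank(model, ties.method = "average") - 0.5) / length(model)

# Replace each model value with the observed quantile at the same position
model.mapped <- quantile(observed, probs = p, names = FALSE)

print(summary(model.mapped))
print(summary(observed))
```

The mapped values stay in the same order as the original model values, but their range and quartiles now follow the observed sample.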

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/39152038/
