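I am trying to use the values in one dataset to create another dataset for model predictions.
My dataset has two sites (A and B), data from different years with a different range at each site, and a large number of individuals (with different proportions per site and year).
I need the final dataset to contain every unique combination of site, each year from that site's minimum to its maximum, and each mass value from the minimum to the maximum in increments of 0.1. For example, site A has 5 years of data and a mass range of 2-5, so there should be 155 combinations (1 site x 5 years x 31 mass values).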
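As a quick sanity check on the expected size, the count for site A can be computed directly from the ranges stated in the example (years 1-5, mass 2-5 in steps of 0.1). The variable names below are illustrative only and do not appear in the question.

```r
# Expected grid size for site A, using the ranges stated in the example
years_A <- 1:5                   # 5 years
mass_A  <- seq(2, 5, by = 0.1)   # 2.0, 2.1, ..., 5.0 (31 values)

# 1 site x number of years x number of mass values
n_expected_A <- 1 * length(years_A) * length(mass_A)
n_expected_A
```

With those ranges the product is 5 x 31 = 155 rows for site A. The count for real data depends on the observed minimum and maximum, which may be narrower than the sampling range.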
# example dataset
df <- data.frame(site = c(rep("A", 20),   # 20 obs for site A
                          rep("B", 30)),  # 30 obs for site B
                 year = c(sample(1:5, 20, replace = TRUE),             # 5 years for site A
                          sample(c(1:4, 6:7), 30, replace = TRUE)),    # 6 years for site B, resulting range should span 1-7 (including 5)
                 mass = c(sample(seq(2, 5, 0.1), 20, replace = TRUE),  # mass range 2-5 for site A
                          sample(seq(1, 6, 0.1), 30, replace = TRUE))) # mass range 1-6 for site B (differs from A)
# I've tried using complete(), but it doesn't recognize mass
library(dplyr)
library(tidyr)
df %>% complete(year, nesting(site),
                fill = list(seq(min(mass), max(mass), 0.1)))
Error in seq(min(mass), max(mass), 0.1) : object 'mass' not found
# I've also tried reframe, but it doesn't cover the full range of masses
df %>% reframe(year = min(year):max(year), .by = c(site, mass))
Best answer
You could use expand.grid on sequences (seq) built along the range of each variable, separately for each site.
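The building blocks of this approach are range() (minimum and maximum), seq.int() (the steps between them) and expand.grid() (all combinations of the resulting vectors). A minimal sketch on made-up toy vectors follows; the names `yr`, `ms`, `yr_seq`, `ms_seq` and `grid` are hypothetical and not taken from the question.

```r
yr <- c(2, 5, 3)         # toy years: range is 2..5
ms <- c(1.0, 1.4, 1.2)   # toy masses: range is 1.0..1.4

# do.call() passes min and max (from range()) plus the step size as the
# three positional arguments from, to, by of seq.int()
yr_seq <- do.call("seq.int", c(as.list(range(yr)), 1))
ms_seq <- do.call("seq.int", c(as.list(range(ms)), 0.1))

# every combination of the two sequences
grid <- expand.grid(year = yr_seq, mass = ms_seq)
nrow(grid)   # equals length(yr_seq) * length(ms_seq)
```

Wrapping this in by() and binding the pieces with rbind, as in the code that follows, repeats the same steps once per site.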
> res <-
+   by(df, df$site, \(x)
+      cbind(site=x$site[1],
+            expand.grid(year=do.call('seq.int', c(as.list(range(x$year)), 1)),
+                        mass=do.call('seq.int', c(as.list(range(x$mass)), .1))))) |>
+   do.call(what='rbind')
>
> by(res, res$site, summary)
res$site: A
     site                year        mass
 Length:130         Min.   :1   Min.   :2.00
 Class :character   1st Qu.:2   1st Qu.:2.60
 Mode  :character   Median :3   Median :3.25
                    Mean   :3   Mean   :3.25
                    3rd Qu.:4   3rd Qu.:3.90
                    Max.   :5   Max.   :4.50
---------------------------------------------------------------------------
res$site: B
     site                year        mass
 Length:336         Min.   :1   Min.   :1.100
 Class :character   1st Qu.:2   1st Qu.:2.275
 Mode  :character   Median :4   Median :3.450
                    Mean   :4   Mean   :3.450
                    3rd Qu.:6   3rd Qu.:4.625
                    Max.   :7   Max.   :5.800
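Since the question asks about reframe() and complete(), a dplyr version of the same grid may also be of interest. The complete() attempt fails because `fill` is an ordinary argument (a list of values used to replace NA) that is evaluated outside the data, so `mass` is not visible there. The following is a sketch rather than part of the original answer: it assumes dplyr >= 1.1.0 (for reframe() and `.by`) and tidyr (for expand_grid()), and it uses a small made-up data frame `toy` so that it runs on its own.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not the question's df
toy <- data.frame(site = c("A", "A", "B", "B"),
                  year = c(1, 3, 2, 5),
                  mass = c(2.0, 2.3, 1.0, 1.4))

res_tidy <- toy %>%
  reframe(
    # an unnamed data frame result is unpacked into separate columns
    expand_grid(year = seq(min(year), max(year), by = 1),
                mass = seq(min(mass), max(mass), by = 0.1)),
    .by = site   # min and max are computed within each site
  )

# rows per site: A = 3 years x 4 masses, B = 4 years x 5 masses
count(res_tidy, site)
```

Because seq() with a step of 0.1 produces floating-point values, rounding the result (for example with round(mass, 1)) may be worthwhile before matching it against other data.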
Data:
> dput(df)
structure(list(site = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B"), year = c(1L, 5L, 1L, 1L, 2L, 4L, 2L, 2L, 1L,
4L, 1L, 5L, 4L, 2L, 2L, 3L, 1L, 1L, 3L, 4L, 6L, 6L, 6L, 4L, 2L,
4L, 3L, 2L, 1L, 2L, 7L, 3L, 7L, 2L, 4L, 4L, 7L, 2L, 6L, 4L, 6L,
4L, 2L, 2L, 3L, 1L, 6L, 2L, 2L, 7L), mass = c(2.5, 2.1, 3.9,
2.2, 4.1, 4, 2.1, 4.2, 2.5, 4.5, 2.9, 2.7, 2.4, 2, 3.6, 2.6,
2.3, 3.2, 2.9, 2.8, 3.8, 2.1, 2.9, 1.8, 5.2, 4.4, 3.8, 2.5, 4.6,
3.7, 5.5, 1.4, 3.7, 1.1, 2.7, 3.3, 5.8, 2.7, 1.4, 5.5, 4.9, 4.9,
3, 4.5, 4.5, 4.8, 5.1, 2.7, 3.6, 2.2)), class = "data.frame", row.names = c(NA,
-50L))
On the topic of "r - Using reframe or complete to generate a dataset from the min/max values in the data", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/77705643/