r - Generating a dataset from the min/max values in the data using reframe or complete

Tags: r dplyr

I'm trying to use the values in one dataset to create another dataset for model predictions.

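My dataset has two sites (A and B), data from different years with a different range of years for each site, and mass measurements for many individuals (which also differ across sites and years).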

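I need the final dataset to contain every unique combination of site, the years from that site's minimum to its maximum, and the mass values from minimum to maximum in increments of 0.1. For example, site A has 5 years of data and a mass range of 2-5, so it should have 155 combinations (1 site x 5 years x 31 mass values).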
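
As a quick check of that arithmetic, the expected number of rows for one site is the number of years in its range times the number of 0.1-step mass values. A minimal sketch using the site A figures from the example (5 years, mass 2 to 5); the variable names are illustrative only:

```r
# Illustrative check of the expected grid size for site A
years_A  <- 1:5                  # 5 consecutive years
masses_A <- seq(2, 5, by = 0.1)  # 2.0, 2.1, ..., 5.0

n_masses <- length(masses_A)            # number of mass steps
n_rows   <- length(years_A) * n_masses  # years x mass values
```

Here `n_masses` is 31 and `n_rows` is 155.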

library(dplyr)  # %>%, reframe
library(tidyr)  # complete, nesting

# example dataset
df <- data.frame(site = c(rep("A", 20),                      # 20 obs for site A
                          rep("B", 30)),                     # 30 obs for site B
                 year = c(sample(1:5, 20, replace = TRUE),           # 5 years for site A
                          sample(c(1:4, 6:7), 30, replace = TRUE)),  # 6 years for site B, resulting range should span 1-7 (including 5)
                 mass = c(sample(seq(2, 5, 0.1), 20, replace = TRUE),    # different range for A than B
                          sample(seq(1, 6, 0.1), 30, replace = TRUE)))   # different range for A than B

# I've tried using complete, but it doesn't recognize mass
df %>% complete(year, nesting(site), 
                fill = list(seq(min(mass), max(mass), 0.1)))
Error in seq(min(mass), max(mass), 0.1) : object 'mass' not found

# I've also tried reframe, but it doesn't cover the full range of masses
df %>% reframe(year = min(year):max(year), .by = c(site, mass))

Best answer

You can use expand.grid on sequences (seq) that span the range of each variable, computed separately for each site.

> res <-
+   by(df, df$site, \(x) 
+      cbind(site=x$site[1], 
+            expand.grid(year=do.call('seq.int', c(as.list(range(x$year)), 1)),
+                        mass=do.call('seq.int', c(as.list(range(x$mass)), .1))))) |>
+   do.call(what='rbind')
> 
> by(res, res$site, summary)
res$site: A
     site                year        mass     
 Length:130         Min.   :1   Min.   :2.00  
 Class :character   1st Qu.:2   1st Qu.:2.60  
 Mode  :character   Median :3   Median :3.25  
                    Mean   :3   Mean   :3.25  
                    3rd Qu.:4   3rd Qu.:3.90  
                    Max.   :5   Max.   :4.50  
--------------------------------------------------------------------------- 
res$site: B
     site                year        mass      
 Length:336         Min.   :1   Min.   :1.100  
 Class :character   1st Qu.:2   1st Qu.:2.275  
 Mode  :character   Median :4   Median :3.450  
                    Mean   :4   Mean   :3.450  
                    3rd Qu.:6   3rd Qu.:4.625  
                    Max.   :7   Max.   :5.800  

Data:

> dput(df)
structure(list(site = c("A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B"), year = c(1L, 5L, 1L, 1L, 2L, 4L, 2L, 2L, 1L, 
4L, 1L, 5L, 4L, 2L, 2L, 3L, 1L, 1L, 3L, 4L, 6L, 6L, 6L, 4L, 2L, 
4L, 3L, 2L, 1L, 2L, 7L, 3L, 7L, 2L, 4L, 4L, 7L, 2L, 6L, 4L, 6L, 
4L, 2L, 2L, 3L, 1L, 6L, 2L, 2L, 7L), mass = c(2.5, 2.1, 3.9, 
2.2, 4.1, 4, 2.1, 4.2, 2.5, 4.5, 2.9, 2.7, 2.4, 2, 3.6, 2.6, 
2.3, 3.2, 2.9, 2.8, 3.8, 2.1, 2.9, 1.8, 5.2, 4.4, 3.8, 2.5, 4.6, 
3.7, 5.5, 1.4, 3.7, 1.1, 2.7, 3.3, 5.8, 2.7, 1.4, 5.5, 4.9, 4.9, 
3, 4.5, 4.5, 4.8, 5.1, 2.7, 3.6, 2.2)), class = "data.frame", row.names = c(NA, 
-50L))

Regarding "r - Generating a dataset from the min/max values in the data using reframe or complete", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/77705643/
