r - Aggregating data while excluding duplicates in one column

Tags: r data.table

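I am trying to reduce an analysis that uses two SQL queries to a single one. To do that, I join the biomass data to the size-class data in one SQL query, which creates duplicates. This happens because biomass is already a sum: it is the total biomass for each site and taxa_name combination, so in my new table it is a one-to-many value.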

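To get away from the two SQL queries, I did this with two data.table operations and a final join. The alternative is to do the calculation and remove duplicates twice. Is there a way to avoid both of these with data.table?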

Sample data

library(data.table)

# dput() output with the non-portable .internal.selfref pointer removed; setDT() restores the data.table internals
testdf <- structure(list(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L), abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L), biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5), size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L), site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L), taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"), lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L)), row.names = c(NA, -15L), class = c("data.table", "data.frame"))
setDT(testdf)

Calculation

# biomass: one row per spcode/taxa_name/biomass/site, total per spcode, then keep spcode and totbm
bm <- testdf
bm <- bm[, .(site = unique(site)),
   by = list(spcode, taxa_name, biomass)][, totbm := sum(biomass), by = list(spcode)][!duplicated(spcode), c(1,5)]

> bm
   spcode totbm
1:  10008   0.5
2:  10002   0.3
3:  10006   0.6
4:  10011   0.5

Next, compute the abundance summary, then join the two on spcode.

# abundance
testdf <- testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
      by = list(spcode, taxa_name)]

# join
testdf[bm, on = 'spcode', bm := i.totbm]

> testdf
   spcode             taxa_name totabn n minlngth maxlngth  bm
1:  10008 Hippoglossina stomata     85 4       20       23 0.5
2:  10002  Symphurus atricaudus     83 7        5       16 0.3
3:  10006 Microstomus pacificus     85 8        9       14 0.6
4:  10011     Parophrys vetulus     17 1       17       17 0.5

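The testdf output above is the output I want. My other attempts rely on two !duplicated calls. Ideally, I would like to be able to use [, totbm := sum(biomass), by = list(unique(site), spcode)] inside the abundance calculation, but that doesn't work.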

testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)), by = list(spcode, taxa_name)][, totbm := sum(biomass), by = list(unique(site), spcode)]
Error in `[.data.table`(testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun),  : The items in the 'by' or 'keyby' list are length (3,15). Each must be length 15; the same length as there are rows in x (after subsetting if i is provided).
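The error comes from the lengths of the items passed to `by`: `unique(site)` returns one value per distinct site, while every item in `by` must have one value per row. A minimal sketch with a made-up four-row table (the names `dt`, `err`, and `ok` are illustrative and not from the question) shows the mismatch and the per-row grouping that does run:

```r
library(data.table)

# Hypothetical toy table: 4 rows, 3 distinct sites
dt <- data.table(site   = c(907L, 907L, 914L, 910L),
                 spcode = c(1L, 1L, 1L, 2L))

n_rows   <- nrow(dt)                  # one grouping value is needed per row
n_unique <- length(unique(dt$site))   # unique() yields fewer values than rows

# Grouping by unique(site) fails because its length differs from nrow(dt)
err <- tryCatch(
  dt[, .N, by = list(unique(site), spcode)],
  error = function(e) conditionMessage(e)
)

# Grouping by the full-length columns works
ok <- dt[, .N, by = .(site, spcode)]
```

Grouping by `site` itself gives one group per site/spcode pair, which is why de-duplication has to happen inside `j` or in a separate step rather than inside `by`.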

Alternative approach:

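# note: bm here is assumed to be a fresh copy of the raw sample data, not the biomass summary computed above
alt <- bm[, .(site = site, taxa_name = taxa_name, biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
by = list(spcode)]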
alt <- alt[!duplicated(alt, by = c("site", "spcode"))]
alt[, totbm := sum(biomass), by = list(spcode)]
alt[!duplicated(alt, by = "spcode"), c(1,3,5:9)]

Best answer

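As I mentioned in the comments, I am not a fan of tables with redundant data, but here is one way to solve the problem. Basically, instead of using some kind of "unique" function, assign an index number within each site/taxa_name group so that you can set every biomass value except the first to 0. Summing by spcode/taxa_name should then work as expected. Of course, this assumes that each site/taxa_name group corresponds to exactly one biomass value.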

testdf <- data.table(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L), 
                         abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L), 
                         biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5), 
                         size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L), 
                         site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L), 
                         taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"), 
                         lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L))

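# number the rows within each site/taxa_name group
testdf[, biomassIdx := seq_len(.N), by = c('site', 'taxa_name')]
# keep biomass only on the first row of each group (this modifies testdf in place)
testdf[biomassIdx > 1, biomass := 0]
testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class), bm = sum(biomass)),
        by = list(spcode, taxa_name)]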
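A non-mutating variant of the same idea, offered as a sketch and not as part of the original answer: rather than zeroing repeated biomass values in place, sum only the biomass found on the first row of each site within every spcode/taxa_name group, using `!duplicated(site)` inside `j`. It relies on the same assumption that each site/taxa_name pair carries exactly one biomass value. The names `dt` and `res` are illustrative, and the data are a subset of the question's sample (spcode 10008 and 10011 only).

```r
library(data.table)

# Subset of the question's sample data
dt <- data.table(
  spcode     = c(10008L, 10008L, 10008L, 10008L, 10011L),
  taxa_name  = c(rep("Hippoglossina stomata", 4), "Parophrys vetulus"),
  site       = c(907L, 907L, 914L, 914L, 914L),
  abundance  = c(1L, 1L, 1L, 1L, 1L),
  biomass    = c(0.2, 0.2, 0.3, 0.3, 0.5),
  size_class = c(21L, 20L, 21L, 23L, 17L),
  lnXabun    = c(21L, 20L, 21L, 23L, 17L)
)

# One pass: ordinary sums for abundance, de-duplicated sum for biomass
res <- dt[, .(
  totabn   = sum(lnXabun),
  n        = sum(abundance),
  minlngth = min(size_class),
  maxlngth = max(size_class),
  bm       = sum(biomass[!duplicated(site)])  # first biomass per site only
), by = .(spcode, taxa_name)]

print(res)
```

Because nothing is assigned with `:=`, the input table keeps its original biomass column, which may matter if the raw data are reused later.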

Regarding r - Aggregating data while excluding duplicates in one column, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56874696/
