r - Aggregating data and excluding duplicates in one column

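I am trying to simplify an analysis that uses two SQL queries into one. To do that, I join the biomass data to the size-class data in a single SQL query, which creates duplicates. This happens because biomass is already a sum: it is the total biomass per taxa_name within each site, so it is a one-to-many value in my new table.

To get rid of the two SQL queries, I did this with two data.table operations and a final join. The alternative is to do the calculation and remove duplicates twice. Is there a way to avoid both of these with data.table?

Sample data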

library(data.table)

# dput() output with the non-parseable .internal.selfref pointer removed; setDT() restores the data.table class
testdf <- setDT(structure(list(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L), abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L), biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5), size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L), site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L), taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"), lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L)), row.names = c(NA, -15L), class = "data.frame"))

Calculation

# biomass
bm <- testdf
bm <- bm[, .(site = unique(site)),
   by = list(spcode, taxa_name, biomass)][, totbm := sum(biomass), by = list(spcode)][!duplicated(spcode), c(1,5)]

> bm
   spcode totbm
1:  10008   0.5
2:  10002   0.3
3:  10006   0.6
4:  10011   0.5

Next, compute the abundance summary, then join the two on spcode.

# abundance
testdf <- testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
      by = list(spcode, taxa_name)]

# join
testdf[bm, on = 'spcode', bm := i.totbm]

> testdf
   spcode             taxa_name totabn n minlngth maxlngth  bm
1:  10008 Hippoglossina stomata     85 4       20       23 0.5
2:  10002  Symphurus atricaudus     83 7        5       16 0.3
3:  10006 Microstomus pacificus     85 8        9       14 0.6
4:  10011     Parophrys vetulus     17 1       17       17 0.5

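The testdf output above is my desired output. My other attempts rely on two !duplicated calls. Ideally, I would like to be able to use [, totbm := sum(biomass), by = list(unique(site), spcode)] within the abundance calculation, but this doesn't work.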

testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)), by = list(spcode, taxa_name)][, totbm := sum(biomass), by = list(unique(site), spcode)]
Error in `[.data.table`(testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun),  : The items in the 'by' or 'keyby' list are length (3,15). Each must be length 15; the same length as there are rows in x (after subsetting if i is provided).
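
The error arises because every item in `by` must supply one value per row, whereas `unique(site)` returns only the distinct sites. A minimal sketch on a hypothetical three-row table (the names `toy`, `res` and `ok` are illustrative and not from the post):

```r
library(data.table)

# Hypothetical toy table: 3 rows, 2 distinct sites
toy <- data.table(site = c(907L, 907L, 914L),
                  spcode = c(10008L, 10008L, 10002L),
                  biomass = c(0.2, 0.2, 0.1))

# unique(site) has length 2, but `by` needs one value per row (3), so this call fails
res <- tryCatch(
  toy[, sum(biomass), by = list(unique(site), spcode)],
  error = function(e) "error"
)

# Grouping by the full-length column works
ok <- toy[, .(totbm = sum(biomass)), by = list(site, spcode)]
```

Grouping by `site` itself is valid, but it still sums the repeated biomass values within each group, so the duplicates have to be handled separately.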

Alternative approach:

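# note: bm here must be a copy of the raw testdf (before the aggregations above), since lnXabun, abundance and size_class are needed
alt <- bm[, .(site = site, taxa_name = taxa_name, biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
          by = list(spcode)]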
alt <- alt[!duplicated(alt, by = c("site", "spcode"))]
alt[, totbm := sum(biomass), by = list(spcode)]
alt[!duplicated(alt, by = "spcode"), c(1,3,5:9)]

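Best answer

As I mentioned in the comments, I don't like tables with redundant data, but here is one way to solve the problem. Basically, instead of using some kind of "unique" function, assign an index number within each site/taxa_name group so that you can set every biomass value except the first to 0. Summing by spcode/taxa_name should then work as expected. Of course, this assumes that each site/taxa_name combination corresponds to exactly one biomass value.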

library(data.table)

testdf <- data.table(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L), 
                         abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L), 
                         biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5), 
                         size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L), 
                         site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L), 
                         taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"), 
                         lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L))

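# number the rows within each site/taxa_name group
testdf[, biomassIdx := seq_len(.N), by = c('site', 'taxa_name')]
# keep biomass only on the first row of each group and zero out the repeats
testdf[biomassIdx > 1, biomass := 0]
# aggregate per species; the zeroed repeats no longer inflate the biomass sum
testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class), bm = sum(biomass)),
       by = list(spcode, taxa_name)]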
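
The same idea can also be sketched without overwriting the biomass column, by summing only the first biomass value per site inside each group. This variant is not part of the original answer. It relies on the same assumption that each site/taxa_name combination carries exactly one biomass value, and the name `res` is illustrative.

```r
library(data.table)

testdf <- data.table(
  spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L,
             10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L),
  abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L),
  biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1,
              0.5, 0.5, 0.5),
  size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L,
                 9L, 10L, 11L),
  site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L,
           910L, 910L, 910L, 910L),
  taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Hippoglossina stomata",
                "Hippoglossina stomata", "Symphurus atricaudus",
                "Symphurus atricaudus", "Parophrys vetulus",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Microstomus pacificus",
                "Microstomus pacificus"),
  lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L,
              40L, 22L)
)

res <- testdf[, .(totabn = sum(lnXabun),
                  n = sum(abundance),
                  minlngth = min(size_class),
                  maxlngth = max(size_class),
                  # keep one biomass value per site within the group, then sum
                  bm = sum(biomass[!duplicated(site)])),
              by = .(spcode, taxa_name)]
```

Because `!duplicated(site)` is evaluated inside each spcode/taxa_name group, the original `testdf` stays unmodified and no helper index column is needed.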

A similar question about r - Aggregating data and excluding duplicates in one column can be found on Stack Overflow: https://stackoverflow.com/questions/56874696/
