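I am trying to simplify an analysis that uses two SQL queries into one. To do so, I join biomass data to size-class data in a single SQL query, which creates duplicates. This happens because biomass is already a sum: it is the total biomass for a taxa_name within each site, so it becomes a one-to-many value in my new table.
To get rid of the two SQL queries, I did the work with two data.table operations and a final join. Another way is to do the calculation and remove duplicates twice. Is there a way to avoid these workarounds using data.table?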
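To make the duplication problem concrete, here is a minimal sketch on made-up data. The table names `sizes` and `biomass` and their values are hypothetical and are not the poster's real schema; the point is only that joining a per-site total onto per-size-class rows repeats the total, so a plain sum over the joined table overcounts.

```r
library(data.table)

# Hypothetical toy tables, not the poster's real schema.
# One row per measured size class.
sizes <- data.table(
  site       = c(1L, 1L, 1L, 2L),
  taxa_name  = c("A", "A", "A", "A"),
  size_class = c(10L, 12L, 15L, 11L)
)

# One row per site/taxon: biomass is already a total.
biomass <- data.table(
  site      = c(1L, 2L),
  taxa_name = c("A", "A"),
  biomass   = c(0.2, 0.3)
)

# The join repeats each biomass total once per size-class row.
joined <- merge(sizes, biomass, by = c("site", "taxa_name"))

naive_total <- joined[, sum(biomass)]   # counts site 1 three times
true_total  <- biomass[, sum(biomass)]  # counts each site once
```

Here `naive_total` is larger than `true_total` because the site 1 value appears on three joined rows.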
Example data
library(data.table)
testdf <- data.table(
  spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L),
  abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L),
  biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5),
  size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L),
  site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L),
  taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"),
  lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L)
)
Calculation
# biomass
bm <- testdf
bm <- bm[, .(site = unique(site)),
by = list(spcode, taxa_name, biomass)][, totbm := sum(biomass), by = list(spcode)][!duplicated(spcode), c(1,5)]
> bm
   spcode totbm
1:  10008   0.5
2:  10002   0.3
3:  10006   0.6
4:  10011   0.5
Next, the abundance is calculated, and then the two are joined on spcode.
# abundance
testdf <- testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
by = list(spcode, taxa_name)]
# join
testdf[bm, on = 'spcode', bm := i.totbm]
> testdf
   spcode             taxa_name totabn n minlngth maxlngth  bm
1:  10008 Hippoglossina stomata     85 4       20       23 0.5
2:  10002  Symphurus atricaudus     83 7        5       16 0.3
3:  10006 Microstomus pacificus     85 8        9       14 0.6
4:  10011     Parophrys vetulus     17 1       17       17 0.5
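The testdf output above is my desired output. My other attempts rely on two !duplicated calls. Ideally, I would like to be able to use [, totbm := sum(biomass), by = list(unique(site), spcode)] within the abundance calculation, but this does not work.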
testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)), by = list(spcode, taxa_name)][, totbm := sum(biomass), by = list(unique(site), spcode)]
Error in `[.data.table`(testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun), : The items in the 'by' or 'keyby' list are length (3,15). Each must be length 15; the same length as there are rows in x (after subsetting if i is provided).
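The error message above says that every item in `by` must supply one value per row, whereas `unique(site)` returns only the distinct sites. The following is a minimal sketch on hypothetical toy data (the table `dt` and its values are made up for illustration) that reproduces the length mismatch and contrasts it with grouping by the full-length column.

```r
library(data.table)

# Hypothetical toy data: 5 rows, 2 distinct sites.
dt <- data.table(
  site    = c(1L, 1L, 2L, 2L, 2L),
  spcode  = c(7L, 7L, 7L, 7L, 7L),
  biomass = c(0.2, 0.2, 0.3, 0.3, 0.3)
)

# unique(site) has length 2 while spcode has length 5, so grouping fails.
# The error is caught and its message stored instead of stopping the script.
err <- tryCatch(
  dt[, sum(biomass), by = list(unique(site), spcode)],
  error = function(e) conditionMessage(e)
)

# Grouping by the full-length columns runs, but it still adds up
# the repeated biomass values within each site.
per_site <- dt[, .(bm = sum(biomass)), by = list(site, spcode)]
```

So `by` can only split the existing rows into groups; it cannot drop the repeated rows, which is why deduplication has to happen in `i` or `j` instead.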
Alternative approach:
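# bm is assumed here to be the original example data (bm <- testdf), not the biomass summary built above
alt <- bm[, .(site = site, taxa_name = taxa_name, biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
by = list(spcode)]
# first de-duplication: keep one row per site/spcode
alt <- alt[!duplicated(alt, by = c("site", "spcode"))]
alt[, totbm := sum(biomass), by = list(spcode)]
# second de-duplication: keep one row per spcode
alt[!duplicated(alt, by = "spcode"), c(1,3,5:9)]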
Best answer
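As I mentioned in the comments, I am not a fan of tables with redundant data, but here is one way to solve the problem. Basically, instead of using some kind of "unique" function, assign an index number within each site/taxa_name group so that you can set every biomass value except the first one to 0. Summing by spcode/taxa_name should then work fine. Of course, this assumes that each set of site/taxa_name values corresponds to exactly one biomass value.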
testdf <- data.table(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L),
abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L),
biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5),
size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L),
site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L),
taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"),
lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L))
testdf[, biomassIdx := 1:.N, by = c('site', 'taxa_name')]
testdf[biomassIdx > 1, biomass := 0]
testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class), bm = sum(biomass)),
by = list(spcode, taxa_name)]
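As a possible variant of the same idea (not part of the answer itself), the repeated biomass values can be skipped inside `j` rather than overwritten with 0, which leaves the input table unchanged. This sketch rebuilds the example data under the name `dt` so that it runs on its own, and it relies on the same assumption as the answer: exactly one biomass value per site/taxa_name. The `stopifnot` line checks that assumption first.

```r
library(data.table)

# Same example data as above, rebuilt so the sketch runs on its own.
dt <- data.table(
  spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L,
             10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L),
  abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L),
  biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1,
              0.5, 0.5, 0.5),
  size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L,
                 9L, 10L, 11L),
  site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L,
           910L, 910L, 910L, 910L),
  taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Hippoglossina stomata",
                "Hippoglossina stomata", "Symphurus atricaudus",
                "Symphurus atricaudus", "Parophrys vetulus",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Microstomus pacificus",
                "Microstomus pacificus"),
  lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L,
              40L, 22L)
)

# Check the assumption: exactly one biomass value per site/taxa_name.
stopifnot(dt[, uniqueN(biomass), by = .(site, taxa_name)][, all(V1 == 1L)])

# Within each species, count biomass once per site by keeping only the
# first row of each site; all other columns use every row.
res <- dt[, .(totabn   = sum(lnXabun),
              n        = sum(abundance),
              minlngth = min(size_class),
              maxlngth = max(size_class),
              bm       = sum(biomass[!duplicated(site)])),
          by = .(spcode, taxa_name)]
```

On this data, `res` has one row per species with the same totals as the desired output shown in the question. If the assumption check fails on real data, neither this variant nor the index-based approach gives a meaningful biomass total.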
Source: the Stack Overflow question "r - Aggregate data and exclude duplicates in one column": https://stackoverflow.com/questions/56874696/