r - Add a column to a data frame based on values in multiple data frames with dplyr

I have a data frame target with columns SNP and value:

target <- data.frame("SNP" = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     "value" = 1:6)

I also have 3 other data frames, each with columns SNP and int, stored in a list:

ref1 <- data.frame("SNP" = c("rs1", "rs2", "rs8"), "int" = c(5, 7, 88))
ref2 <- data.frame("SNP" = c("rs9", "rs4", "rs3"), "int" = c(23, 4, 43))
ref3 <- data.frame("SNP" = c("rs10", "rs6", "rs5"), "int" = c(53, 22, 76))
mylist <- list(ref1, ref2, ref3)

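I want to add a new column int to target, whose values are the int values from ref1/2/3 for the matching SNP. For example, the first int value of target should be 7, because row 2 of ref1 has an SNP of rs2 and an int of 7.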

I tried the following code:

library(dplyr)  # provides %>% and left_join

for (i in 1:3) {
    target <- target %>%
                left_join(mylist[[i]], by = "SNP")
}

The matching was fast and succeeded. However, I got 3 new columns back instead of 1.
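To make the problem concrete, here is a minimal sketch that re-runs the loop on the toy data and inspects the resulting column names. It works on a copy called joined (a name introduced here purely for illustration). Every reference has a column named int, so the second join has to disambiguate with dplyr's default .x/.y suffixes, while the third join's int no longer collides with anything.

```r
library(dplyr)

target <- data.frame(SNP = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     value = 1:6)
ref1 <- data.frame(SNP = c("rs1", "rs2", "rs8"), int = c(5, 7, 88))
ref2 <- data.frame(SNP = c("rs9", "rs4", "rs3"), int = c(23, 4, 43))
ref3 <- data.frame(SNP = c("rs10", "rs6", "rs5"), int = c(53, 22, 76))
mylist <- list(ref1, ref2, ref3)

# Work on a copy ("joined" is an illustrative name) so target stays untouched.
joined <- target
for (i in seq_along(mylist)) {
  joined <- left_join(joined, mylist[[i]], by = "SNP")
}

# Each join contributed its own int column.
names(joined)
# "SNP" "value" "int.x" "int.y" "int"
```

Each of the three int columns holds the matches from one reference only, which is why they still need to be collapsed into a single column.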

Then I used the following code:

target[, "ref"] <- NA
for (i in 1:3) {
    common <- Reduce(intersect, list(target$SNP, mylist[[i]]$SNP))

    tar.pos <- match(common, target$SNP)
    ref.pos <- match(common, mylist[[i]]$SNP)

    target$ref[tar.pos] <- mylist[[i]]$int[ref.pos]
}

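In my real data I have 22 reference data frames, each with 1-6 million rows. I would rather match and join one reference at a time than merge all the references into one big data frame. When I tried the second approach above on the real data, I noticed that the match function ran very slowly, which is why I am looking for a smarter way to do this. I found that left_join runs very fast even on my large data. Unfortunately, its output is not quite what I want.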

I would like to do the above quickly, preferably within the tidyverse. Any suggestions on how to modify the first approach, or on any other smarter method?

Best answer

If binding all the data in mylist together and joining it to target takes too much memory, you can use purrr::reduce to join the references one at a time.

library(tidyverse)

reduce(mylist,
       ~ left_join(.x, .y, by = "SNP") %>%
         mutate(int = coalesce(int.x, int.y)) %>%
         select(-c(int.x, int.y)),
       .init = mutate(target, int = NA_real_))

#    SNP value int
# 1  rs2     1   7
# 2  rs4     2   4
# 3  rs6     3  22
# 4 rs19     4  NA
# 5  rs8     5  88
# 6  rs9     6  23
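If the references are processed one at a time anyway (for example, loaded in turn), the same join-then-coalesce step can also be written as a plain loop. The following is only a sketch of that idea, not the answer's code: add_int_from_ref and int_new are hypothetical names introduced here, and it assumes that each SNP appears at most once within a reference (duplicates would duplicate rows in the result) and that int is numeric in every reference.

```r
library(dplyr)

target <- data.frame(SNP = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     value = 1:6)
ref1 <- data.frame(SNP = c("rs1", "rs2", "rs8"), int = c(5, 7, 88))
ref2 <- data.frame(SNP = c("rs9", "rs4", "rs3"), int = c(23, 4, 43))
ref3 <- data.frame(SNP = c("rs10", "rs6", "rs5"), int = c(53, 22, 76))
mylist <- list(ref1, ref2, ref3)

# Hypothetical helper: fill the still-missing int values of tgt from one reference.
add_int_from_ref <- function(tgt, ref) {
  tgt %>%
    # Rename on the way in so the join never creates int.x / int.y.
    left_join(select(ref, SNP, int_new = int), by = "SNP") %>%
    # Keep an existing int; otherwise take the value from this reference.
    mutate(int = coalesce(int, int_new)) %>%
    select(-int_new)
}

# Start with an all-NA numeric column so coalesce() sees matching types.
result <- mutate(target, int = NA_real_)
for (ref in mylist) {
  result <- add_int_from_ref(result, ref)
}
result
```

With this ordering, an SNP that occurs in more than one reference keeps the value from the earliest reference in the list, because later references only fill entries that are still NA.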

A similar question about adding a column to a data frame based on values in multiple data frames with dplyr can be found on Stack Overflow: https://stackoverflow.com/questions/60465679/
