r - Merging data frames by selecting the correct values

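I have a data frame called "ref" that contains the information needed to map gene Entrez IDs to each gene's start and end positions. I have another data frame, "ori_data", in which each row is a unique mutation from a sample, given as a genomic position. I am trying to match each position in "ori_data" against the information in "ref" so that every mutation is assigned an Entrez ID. I tried a for loop that matches on chromosome and then selects the positions in "ori_data" that fall between the coordinates in "ref", but without success. The "ori_data" data set has over 1 million rows, so I'm not sure a for loop is an efficient solution. Note that in my real data set many positions will map to the same Entrez ID. "final" is what I want to end up with: it simply adds an EntrezID column based on chromosome/position. TIA!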

ref = data.frame("EntrezID" = c(1, 10, 100, 1000), "Chromosome" = c("19", "8", "20", "18"), "txStarts" = c("58345182", "18391281", "44619518", "27950965"), "txEnds" = c("58353492", "18401215", "44651758", "28177130"))

ori_data = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
             "Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"))

final = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
               "Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"), "EntrezID" = c(1,10,100,1000))

I have tried the following code, but I'm not sure why it isn't working.

for (i in 1:dim(ori_data)[1])
{
  for (j in 1:dim(ref)[1])
  {
    ID = which(ori_data[i, "Chromosome"] == ref[j, 
     "Chromosome"])
    if (length(ID) > 0)
    {
      Pos = ori_data[ID, "POS"]
      IDj = which(Pos >= ref[j, "txStarts"] & Pos <= 
           ref[j, "txEnds"])
      print(IDj)
      if (length(IDj) > 0)
       {
        ori_data = cbind("Entrez" = ref[IDj, 
                  "EntrezID"], ori_data)
     }
   }
 }
}

Best answer

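In base R, apply can be used to find, for each row, the entries of ref on the same Chromosome and test whether Pos lies in the range from txStarts to txEnds.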

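# Coordinates are stored as character, so convert them before comparing numerically
starts <- as.numeric(ref$txStarts)
ends <- as.numeric(ref$txEnds)
ori_data$EntrezID <- apply(ori_data[c("Chromosome", "Pos")], 1, \(x) {
  pos <- as.numeric(x["Pos"])
  # First gene on the same chromosome whose [txStarts, txEnds] contains pos
  ref$EntrezID[ref$Chromosome == x["Chromosome"] & pos >= starts & pos <= ends][1]
})
ori_data
#  Chromosome      Pos Sample EntrezID
#1         19 58345186   HCC1        1
#2          8 18401213   HCC2       10
#3         20 44619519   HCC1      100
#4         18 27950966   HCC3     1000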
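One thing worth checking in the example data is that the coordinates are stored as character strings. R compares strings character by character, which happens to agree with numeric order only while all values have the same number of digits. A minimal sketch, reusing the question's example data, that shows the pitfall and converts the coordinate columns once up front (the names as_text and as_number are only for illustration):

```r
# Example data from the question (coordinates stored as character strings)
ref <- data.frame(EntrezID = c(1, 10, 100, 1000),
                  Chromosome = c("19", "8", "20", "18"),
                  txStarts = c("58345182", "18391281", "44619518", "27950965"),
                  txEnds = c("58353492", "18401215", "44651758", "28177130"),
                  stringsAsFactors = FALSE)
ori_data <- data.frame(Chromosome = c("19", "8", "20", "18"),
                       Pos = c("58345186", "18401213", "44619519", "27950966"),
                       Sample = c("HCC1", "HCC2", "HCC1", "HCC3"),
                       stringsAsFactors = FALSE)

# String comparison can disagree with numeric order
as_text <- "9" >= "10"                            # compared as strings
as_number <- as.numeric("9") >= as.numeric("10")  # compared as numbers

# Convert the coordinate columns once, before any range test
ref$txStarts <- as.numeric(ref$txStarts)
ref$txEnds <- as.numeric(ref$txEnds)
ori_data$Pos <- as.numeric(ori_data$Pos)
```

With numeric columns, range tests such as Pos >= txStarts behave as intended regardless of how many digits each coordinate has.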

A possibly faster version:

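# One lookup table per chromosome, stored in an environment for access by name
lup <- list2env(split(ref[c("EntrezID", "txStarts", "txEnds")], ref$Chromosome))
# unlist() turns the list returned by Map() into a plain vector column
ori_data$EntrezID <- unlist(Map(\(x, y) {
  . <- get(x, envir = lup)
  y <- as.numeric(y)
  .$EntrezID[y >= as.numeric(.$txStarts) & y <= as.numeric(.$txEnds)][1]
}, ori_data$Chromosome, ori_data$Pos), use.names = FALSE)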
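With over a million rows, a per-row function call may still be slow. If the gene intervals within each chromosome do not overlap (an assumption the question does not state), the range test can be vectorised with findInterval. The following is only a sketch under that assumption: the helper names lookup_one_chrom and assign_entrez and the toy coordinates are made up for illustration, and positions are assumed to be numeric already.

```r
# Hypothetical helper: Entrez ID for each position on ONE chromosome.
# Assumes the intervals in `genes` do not overlap.
lookup_one_chrom <- function(pos, genes) {
  genes <- genes[order(genes$txStarts), ]
  # Index of the last gene whose start is <= pos (0 if there is none)
  idx <- findInterval(pos, genes$txStarts)
  idx[idx == 0] <- NA
  id <- genes$EntrezID[idx]
  # Drop the candidate if pos lies beyond its end
  id[!is.na(idx) & pos > genes$txEnds[idx]] <- NA
  id
}

# Hypothetical wrapper: handles each chromosome in turn, keeps the row order
assign_entrez <- function(ori_data, ref) {
  out <- rep(NA_real_, nrow(ori_data))
  by_chrom <- split(ref, ref$Chromosome)
  rows <- split(seq_len(nrow(ori_data)), ori_data$Chromosome)
  # Chromosomes missing from ref simply stay NA
  for (chrom in intersect(names(rows), names(by_chrom))) {
    i <- rows[[chrom]]
    out[i] <- lookup_one_chrom(ori_data$Pos[i], by_chrom[[chrom]])
  }
  out
}

# Toy data with made-up coordinates
toy_ref <- data.frame(EntrezID = c(1, 10, 100),
                      Chromosome = c("19", "8", "19"),
                      txStarts = c(100, 200, 500),
                      txEnds = c(199, 299, 599))
toy_ori <- data.frame(Chromosome = c("19", "8", "19", "8", "X", "19", "19"),
                      Pos = c(550, 250, 150, 999, 5, 50, 300))
toy_ori$EntrezID <- assign_entrez(toy_ori, toy_ref)
```

Positions that fall outside every interval, or on a chromosome that is absent from the reference, get NA. If genes can overlap, this sketch would return at most one of the candidates and may miss a containing gene, so the per-row versions above are the safer choice in that case.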

Or another way, which does not keep the original row order. (If the original order matters, have a look at unsplit.)

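# Assuming many rows share the same Chromosome
x <- split(ori_data, ori_data$Chromosome)

# Assuming ref also has many rows per Chromosome
lup <- split(ref[c("EntrezID", "txStarts", "txEnds")], ref$Chromosome)

# Order lup by the names of x; use one of the two methods and test which is faster
# Method 1:
lup <- lup[names(x)]
# Method 2:
lup <- mget(names(x), list2env(lup))

res <- do.call(rbind, Map(\(a, b) {
  pos <- as.numeric(a$Pos)
  starts <- as.numeric(b$txStarts)
  ends <- as.numeric(b$txEnds)
  # For each position, take the first gene whose [txStarts, txEnds] contains it
  # (EntrezID is assumed to be numeric, as in the example data)
  a$EntrezID <- vapply(pos, \(p) b$EntrezID[p >= starts & p <= ends][1], numeric(1))
  a
}, x, lup))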
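The unsplit hint above can be sketched as follows: split the data by chromosome, annotate each piece, then let unsplit put the rows back in their original order. The toy coordinates and the names pieces, genes and annotated are made up for illustration, and positions are assumed to be numeric.

```r
# Toy data with made-up coordinates
ref <- data.frame(EntrezID = c(1, 10, 100),
                  Chromosome = c("19", "8", "19"),
                  txStarts = c(100, 200, 500),
                  txEnds = c(199, 299, 599))
ori_data <- data.frame(Chromosome = c("19", "8", "19", "8"),
                       Pos = c(550, 250, 150, 999),
                       Sample = c("HCC1", "HCC2", "HCC1", "HCC3"))

# One piece of mutations and one table of genes per chromosome
pieces <- split(ori_data, ori_data$Chromosome)
genes <- split(ref, ref$Chromosome)

# Add an EntrezID column to every piece (NA where no gene contains the position)
annotated <- Map(function(a, b) {
  a$EntrezID <- vapply(a$Pos, function(p)
    b$EntrezID[p >= b$txStarts & p <= b$txEnds][1], numeric(1))
  a
}, pieces, genes[names(pieces)])

# unsplit() reverses split(), using the same grouping vector
res <- unsplit(annotated, ori_data$Chromosome)
```

Because unsplit is given the same grouping vector that was passed to split, the rows of res line up with the rows of ori_data, unlike the do.call(rbind, ...) result, which is grouped by chromosome.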

Regarding "r - Merging data frames by selecting the correct values", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/71915346/
