R code for analyzing genotyping data

Tags: r bioinformatics

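I have this data frame (mydf). I need to match the letters (DNA bases) in the REF and ALT columns against the column names ("A", "T", "G", "C") and paste the corresponding values together as "REF,ALT". However, some rows have a TYPE that matches "snp:+[0-9]" or "flat$". For each "flat$" row, I want to sum the ALT values of the "snp:+[0-9]" rows that share its "start" ID and paste that sum as "REF,ALT" (the REF value is the same for the "snp:+[0-9]" and "flat$" rows with the same start ID), so that I get the output shown in the result below. How can I write a function that does this?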

mydf <- structure(c(
  # start
  "chr20:5363934", "chr5:8529759", "chr14:9620689", "chr18:547375",
  "chr8:5952145", "chr14:8694382", "chr16:2530921", "chr16:2530921",
  "chr16:2530921", "chr14:4214117", "chr4:7799768", "chr3:9141263",
  # A
  "95", "24", "65", "94", "27", "68", "49", "49", "49", "73", "36", "27",
  # T
  "29", " 1", "49", " 1", "80", "94", "15", "15", "15", "49", "28", "41",
  # G
  "14", "28", "41", "51", "25", "26", "79", "79", "79", "18", " 1", "93",
  # C
  "59", "41", "96", "67", "96", "30", "72", "72", "72", "77", "16", "90",
  # REF
  "C", "G", "T", "G", "T", "A", "A", "A", "A", "G", "C", "A",
  # ALT
  "T", "C", "G", "C", "T", "A", "T", "G", "T", "A", "A", "A",
  # TYPE
  "snp", "snp", "snp", "snp", "snp", "snp",
  "snp:2530921", "snp:2530921", "snp:flat", "snp", "snp", "snp"),
  .Dim = c(12L, 8L),
  .Dimnames = list(NULL, c("start", "A", "T", "G", "C", "REF", "ALT", "TYPE")))

Expected result

      start           A    T    G    C    REF ALT TYPE          AD     
 [1,] "chr20:5363934" "95" "29" "14" "59" "C" "T" "snp"         "59,29"
 [2,] "chr5:8529759"  "24" " 1" "28" "41" "G" "C" "snp"         "28,41"
 [3,] "chr14:9620689" "65" "49" "41" "96" "T" "G" "snp"         "49,41"
 [4,] "chr18:547375"  "94" " 1" "51" "67" "G" "C" "snp"         "51,67"
 [5,] "chr8:5952145"  "27" "80" "25" "96" "T" "T" "snp"         "80,80"
 [6,] "chr14:8694382" "68" "94" "26" "30" "A" "A" "snp"         "68,68"
 [7,] "chr16:2530921" "49" "15" "79" "72" "A" "T" "snp:2530921" "49,15"
 [8,] "chr16:2530921" "49" "15" "79" "72" "A" "G" "snp:2530921" "49,79"
 [9,] "chr16:2530921" "49" "15" "79" "72" "A" "T" "snp:flat"    "49,94"
[10,] "chr14:4214117" "73" "49" "18" "77" "G" "A" "snp"         "18,73"
[11,] "chr4:7799768"  "36" "28" " 1" "16" "C" "A" "snp"         "16,36"
[12,] "chr3:9141263"  "27" "41" "93" "90" "A" "A" "snp"         "27,27"

Best answer

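# Column index of every REF and ALT letter (all REF entries first, then all ALT)
indx <- sapply(mydf[,c("REF", "ALT")], function(x) match(x, colnames(mydf)))
# Rows whose TYPE contains "flat"
flat <- grepl("flat", mydf[,"TYPE"])
# Look up the matched cells and reshape them into two columns: REF value, ALT value
x <- `dim<-`(mydf[cbind(rep(1:nrow(mydf), 2), indx)], c(nrow(mydf), 2))
# Non-flat rows that share a start ID with a flat row, and their ALT values
add_ids <- mydf[,"start"][mydf[,"start"] %in% mydf[,"start"][flat] & !flat]
toadd <- x[,2][mydf[,"start"] %in% mydf[,"start"][flat] & !flat]
# Sum those ALT values per start ID and store each sum in the matching flat row
x[,2][flat] <-tapply(as.numeric(toadd), factor(add_ids, levels=unique(add_ids)), sum)
cbind(mydf, paste(x[,1], x[,2],sep=","))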
#       start           A    T    G    C    REF ALT TYPE                 
#  [1,] "chr20:5363934" "95" "29" "14" "59" "C" "T" "snp"         "59,29"
#  [2,] "chr5:8529759"  "24" " 1" "28" "41" "G" "C" "snp"         "28,41"
#  [3,] "chr14:9620689" "65" "49" "41" "96" "T" "G" "snp"         "49,41"
#  [4,] "chr18:547375"  "94" " 1" "51" "67" "G" "C" "snp"         "51,67"
#  [5,] "chr8:5952145"  "27" "80" "25" "96" "T" "T" "snp"         "80,80"
#  [6,] "chr14:8694382" "68" "94" "26" "30" "A" "A" "snp"         "68,68"
#  [7,] "chr16:2530921" "49" "15" "79" "72" "A" "T" "snp:2530921" "49,15"
#  [8,] "chr16:2530921" "49" "15" "79" "72" "A" "G" "snp:2530921" "49,79"
#  [9,] "chr16:2530921" "49" "15" "79" "72" "A" "T" "snp:flat"    "49,94"
# [10,] "chr14:4214117" "73" "49" "18" "77" "G" "A" "snp"         "18,73"
# [11,] "chr4:7799768"  "36" "28" " 1" "16" "C" "A" "snp"         "16,36"
# [12,] "chr3:9141263"  "27" "41" "93" "90" "A" "A" "snp"         "27,27"

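We first create an index that matches each REF and ALT letter to the correct column. Next, we create a logical index that locates the rows whose TYPE contains "flat". Then we extract all of the matched values into a single vector (still character strings, because mydf is a character matrix) and give it dimensions: one row per record and two columns, REF and ALT.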
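The indexing step can be tried in isolation. Below is a minimal sketch on a hypothetical two-row matrix; the names `m`, `cells`, and `counts` are made up for illustration and are not part of the answer. It relies on the base R rule that indexing a matrix with a two-column (row, column) matrix returns one cell per pair.

```r
# Toy character matrix (hypothetical data, same layout idea as mydf)
m <- matrix(c("10", "20",    # A
              "30", "40",    # T
              "A",  "T",     # REF
              "T",  "A"),    # ALT
            nrow = 2,
            dimnames = list(NULL, c("A", "T", "REF", "ALT")))

# Flatten REF and ALT column-wise (REF entries first, then ALT) and find
# the column that each letter names
indx <- match(as.vector(m[, c("REF", "ALT")]), colnames(m))

# Each (row, column) pair selects exactly one cell
cells <- m[cbind(rep(seq_len(nrow(m)), 2), indx)]

# Reshape: column 1 holds the REF values, column 2 the ALT values
counts <- matrix(cells, nrow = nrow(m))

paste(counts[, 1], counts[, 2], sep = ",")
```

The row indices are repeated twice because the flattened index lists every REF lookup before any ALT lookup.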

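To sum the values for the IDs whose TYPE is "flat", we first identify the non-flat rows that share a start ID with a flat row, together with their ALT values. We then sum those values per ID, assign the sums to the ALT slots of the flat rows, and bind everything together.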
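Since the question asks for a function, the steps can be wrapped up roughly as follows. This is a sketch rather than the answer's own code: the name `add_ad`, the `toy` matrix, and the use of `vapply()` in place of `tapply()` are assumptions made for illustration. It also converts the values to numbers before pasting, so padded strings such as " 1" would appear as "1" in the AD column.

```r
# Hypothetical helper (add_ad is not from the original answer)
add_ad <- function(m) {
  n    <- nrow(m)
  flat <- grepl("flat$", m[, "TYPE"])

  # Look up REF and ALT values with (row, column) index matrices
  ref <- as.numeric(m[cbind(seq_len(n), match(m[, "REF"], colnames(m)))])
  alt <- as.numeric(m[cbind(seq_len(n), match(m[, "ALT"], colnames(m)))])

  # For each flat row, sum the ALT values of the non-flat rows that share
  # its start ID; the right-hand side is computed before alt is modified
  alt[flat] <- vapply(m[flat, "start"],
                      function(id) sum(alt[!flat & m[, "start"] == id]),
                      numeric(1))

  cbind(m, AD = paste(ref, alt, sep = ","))
}

# Small made-up example with one flat row and two sibling rows
toy <- matrix(c("chr1:10", "chr2:20", "chr2:20", "chr2:20",  # start
                "5",  "49", "49", "49",                      # A
                "7",  "15", "15", "15",                      # T
                "9",  "79", "79", "79",                      # G
                "A",  "A",  "A",  "A",                       # REF
                "T",  "T",  "G",  "T",                       # ALT
                "snp", "snp:20", "snp:20", "snp:flat"),      # TYPE
              nrow = 4,
              dimnames = list(NULL, c("start", "A", "T", "G",
                                      "REF", "ALT", "TYPE")))

res <- add_ad(toy)
res[, "AD"]
```

Looking up each flat row's own start ID avoids depending on the order in which the groups appear, and a flat row with no sibling rows simply gets an ALT sum of 0.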

For more on R code for analyzing genotyping data, see the similar question on Stack Overflow: https://stackoverflow.com/questions/31510440/
