我们有 2 个 DNA 序列(字符串):
>1
ATGCAT
135198
>2
ATCAT
预期输出:首先,我们需要对齐这两个字符串,然后通过索引获取相关注释:
ATGCAT
AT-CAT
13-198
第一部分可以使用Biostrings包完成:
library(Biostrings)
p <- DNAString("ATCAT")
s <- DNAString("ATGCAT")
s_annot <- "135198"
x <- pairwiseAlignment(pattern = p, subject = s)
aligned(x)
# A DNAStringSet instance of length 1
# width seq
# [1] 6 AT-CAT
as.character(x)
# [1] "AT-CAT"
as.matrix(x)
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "A" "T" "-" "C" "A" "T"
第二部分的当前解决方法:
annot <- unlist(strsplit(s_annot, ""))
annot[ which(c(as.matrix(x)) == "-") ] <- "-"
# [1] "1" "3" "-" "1" "9" "8"
工作正常,但我想知道是否有 Biostrings 方法(或任何其他包),也许可以通过将注释保留在元数据槽中,然后在对齐之后,我们得到元数据中匹配碱基的匹配注释,如下所示:
getSlots("DNAString")
# shared offset length elementMetadata metadata
# "SharedRaw" "integer" "integer" "DataTable_OR_NULL" "list"
# just an idea, non-working code
s@metadata <- unlist(strsplit(s_annot , ""))
x <- pairwiseAlignment(pattern = p, subject = s)
metadata(x)
# [[1]]
# [1] "1" "3" "-" "1" "9" "8"
<小时/>
注意:
- 经作者许可,盗自 BioStars - https://www.biostars.org/p/415896/
- 作者想要 biopython 解决方案,因此添加标签,如果可能的话,也发布 python 解决方案。
最佳答案
可能的解决方案:
dna_fun <- function(s, p, a) {
s <- strsplit(s, "")[[1]]
p <- strsplit(p, "")[[1]]
a <- strsplit(a, "")[[1]]
ls <- length(s)
lp <- length(p)
r <- lapply(c(1,seq(lp)), function(x) {
v <- rep(1, 5)
v[x] <- 2
v
})
mat <- sapply(r, rep, x = p)
tfm <- mat == matrix(rep(s, ls), ncol = ls)
m <- which.max(colSums(tfm))
p2 <- mat[, m]
p2[!tfm[,m]] <- "-"
a[!tfm[,m]] <- "-"
p2 <- paste(p2, collapse = "")
a <- paste(a, collapse = "")
return(list(p2, a))
}
与:
dna_fun(s1, s2, annot)
你得到:
<小时/>[[1]] [1] "AT-CAT" [[2]] [1] "13-198"
如果您有相应的向量,则可以将 Map
与 dna_fun
函数一起使用:
s11 <- c("ATGCAT","ATCGAT")
s22 <- c("ATCAT","ATCAT")
annot2 <- c("135198","145892")
lm <- Map(dna_fun, s11, s22, annot2)
data.table::rbindlist(lm, idcol = "dna")
这给出:
<小时/>dna V1 V2 1: ATGCAT AT-CAT 13-198 2: ATCGAT ATC-AT 145-92
数据:
s1 <- "ATGCAT"
s2 <- "ATCAT"
annot <- "135198"
关于python - 序列与字母注释的“位置感知”对齐,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/59682107/