Suppose I have a list of non-overlapping genomic intervals:
chr1 1 100
chr1 101 200
chr1 201 300
chr1 301 400
and a list of genomic positions, each linked to a sample:
chr1 50 sampleA
chr1 60 sampleB
chr1 110 sampleA
chr1 130 sampleB
chr1 160 sampleA
chr1 190 sampleC
chr1 350 sampleB
chr1 360 sampleB
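My goal is to count the number of unique samples per interval. In my real dataset, the interval table has about 400,000 rows and the position/sample table has about 30,000 rows.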
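To make the target concrete, here is a minimal brute-force sketch in base R. It repeats the toy data so the block runs on its own, and it is only meant as a reference for the expected counts (including 0 for an empty interval), not as something fast enough for the real data.

```r
# Toy data from this question, repeated so the block is self-contained.
intervals <- data.frame(chr = rep("chr1", 4),
                        start = c(1, 101, 201, 301),
                        end = c(100, 200, 300, 400),
                        stringsAsFactors = FALSE)
positions <- data.frame(chr = rep("chr1", 8),
                        pos = c(50, 60, 110, 130, 160, 190, 350, 360),
                        sample = c("sampleA", "sampleB", "sampleA", "sampleB",
                                   "sampleA", "sampleC", "sampleB", "sampleB"),
                        stringsAsFactors = FALSE)

# Brute force: for each interval, collect the samples whose position falls
# inside it (inclusive bounds) and count the distinct ones.
intervals$n_sample <- vapply(seq_len(nrow(intervals)), function(i) {
  inside <- positions$chr == intervals$chr[i] &
    positions$pos >= intervals$start[i] &
    positions$pos <= intervals$end[i]
  length(unique(positions$sample[inside]))
}, integer(1))

print(intervals)
```

For the toy data the desired counts are 2, 3, 0 and 1 for the four intervals.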
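This computation is embedded in a simulation, so it should be as fast as possible. I have already tried GenomicRanges, as follows:
library(GenomicRanges)
library(dplyr)  # provides %>%, slice(), mutate(), group_by(), summarise()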
interval.gr <- GRanges(intervals$chr,IRanges(intervals$start,intervals$end))
positions.gr <- GRanges(positions$chr,IRanges(positions$pos,positions$pos))
ov <- findOverlaps(interval.gr,positions.gr)
intervals %>%
slice(queryHits(ov)) %>%
mutate(sample=positions$sample[subjectHits(ov)]) %>%
group_by(chr,start,end) %>%
summarise(n_sample=length(unique(sample)))
Result:
# A tibble: 3 x 4
# Groups:   chr, start [3]
  chr   start   end n_sample
  <fct> <dbl> <dbl>    <int>
1 chr1      1   100        2
2 chr1    101   200        3
3 chr1    301   400        1
However, this drops the interval that contains no samples (201-300), and it is not very fast either. On my real dataset:
Unit: milliseconds
 expr      min      lq     mean   median       uq      max neval
    x 159.3901 161.621 190.1703 164.4879 168.3116 297.8395    10
Is there a better and faster way to do this kind of analysis?
Thanks.
Reproducible dataset:
intervals <- data.frame(chr=c("chr1","chr1","chr1","chr1"),start=c(1,101,201,301),end=c(100,200,300,400))
positions <- data.frame(chr=rep("chr1",8),pos=c(50,60,110,130,160,190,350,360),sample=c("sampleA","sampleB","sampleA","sampleB","sampleA","sampleC","sampleB","sampleB"))
Edit
A reproducible dataset of the same size as my real one:
intervals <- data.frame(chr=paste0("chr",round(runif(400000,min = 1,max = 22))),start=round(runif(n = 400000,min = 1,max = 100000000)))
intervals$end <- intervals$start+100
positions <- data.frame(chr=paste0("chr",round(runif(30000,min = 1,max = 22))),pos=round(runif(n = 30000,min = 1,max = 100000000)),sample=sample(paste0("sample",1:400),size = 30000,replace=TRUE))
Best answer
As @Jon suggested, data.table is a good fit for this problem. Using foverlaps() improved the speed considerably.
library(data.table)
intervals <- data.frame(chr=c("chr1","chr1","chr1","chr1"),
start=c(1,101,201,301),
end=c(100,200,300,400))
positions <- data.frame(chr=rep("chr1",8),
pos=c(50,60,110,130,160,190,350,360),
sample=c("sampleA","sampleB","sampleA","sampleB","sampleA","sampleC","sampleB","sampleB"))
setDT(positions)
setDT(intervals)
positions[, pos_tmp := pos]
setkey(positions,chr, pos, pos_tmp)
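## foverlaps() returns the joined table here (which = FALSE is the default);
## intervals with no matching position are kept, with NA in sample
overlap = foverlaps(intervals, positions, type="any", by.x=c("chr","start","end"))
## count distinct samples per interval (.N would count positions, so sampleB
## would be counted twice in 301-400); empty intervals get 0
overlap[, .(n_sample = uniqueN(sample, na.rm = TRUE)), by = .(chr, start, end)]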
On my machine, @Jon's implementation takes about 6 seconds, whereas the implementation above takes about 180 milliseconds.
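Since the question states that the intervals do not overlap, a sorted lookup with base R's findInterval() is another option worth trying. The following is only a sketch under that assumption and has not been timed against the real data. The function name count_unique_samples is made up for illustration. Note that the random benchmark dataset in the question does not guarantee non-overlapping intervals, so on that data each position would be assigned to at most one interval.

```r
# Hypothetical helper (not from the original answer): counts distinct samples
# per interval, ASSUMING intervals on the same chromosome never overlap.
count_unique_samples <- function(intervals, positions) {
  n_sample <- integer(nrow(intervals))
  for (chrom in unique(as.character(intervals$chr))) {
    iv_rows <- which(intervals$chr == chrom)
    iv_rows <- iv_rows[order(intervals$start[iv_rows])]  # sort by start
    pos_rows <- which(positions$chr == chrom)
    if (length(pos_rows) == 0L) next
    pos <- positions$pos[pos_rows]
    # Index of the last interval whose start is <= pos (0 if there is none).
    idx <- findInterval(pos, intervals$start[iv_rows])
    hit <- idx > 0L
    # Keep only positions that also lie at or before that interval's end.
    hit[hit] <- pos[hit] <= intervals$end[iv_rows][idx[hit]]
    # Deduplicate (interval, sample) pairs, then count pairs per interval.
    pairs <- unique(data.frame(iv = iv_rows[idx[hit]],
                               sample = positions$sample[pos_rows][hit]))
    n_sample[iv_rows] <- tabulate(match(pairs$iv, iv_rows),
                                  nbins = length(iv_rows))
  }
  n_sample
}

# Toy data from the question, repeated so the block is self-contained.
intervals <- data.frame(chr = rep("chr1", 4),
                        start = c(1, 101, 201, 301),
                        end = c(100, 200, 300, 400))
positions <- data.frame(chr = rep("chr1", 8),
                        pos = c(50, 60, 110, 130, 160, 190, 350, 360),
                        sample = c("sampleA", "sampleB", "sampleA", "sampleB",
                                   "sampleA", "sampleC", "sampleB", "sampleB"))

intervals$n_sample <- count_unique_samples(intervals, positions)
print(intervals)
```

Because the result vector is initialized to zero, intervals without any position (201-300 in the toy data) stay in the output with a count of 0.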
Regarding "R: Count unique element per interval", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/58096901/