R : Count unique element per interval

Suppose I have a list of non-overlapping genomic intervals:

chr1    1   100
chr1    101 200
chr1    201 300
chr1    301 400

and a list of genomic positions, each linked to a sample:

chr1    50  sampleA
chr1    60  sampleB
chr1    110 sampleA
chr1    130 sampleB
chr1    160 sampleA
chr1    190 sampleC
chr1    350 sampleB
chr1    360 sampleB

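My goal is to count the number of unique samples per interval. In my real dataset, the interval table has about 400,000 rows and the position/sample table has about 30,000 rows.

This computation is embedded in a simulation, so it should be as fast as possible. I have already tried GenomicRanges: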

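library(GenomicRanges)
library(dplyr)  # provides %>%, slice(), mutate(), group_by(), summarise()

# Each position is represented as a width-1 range
interval.gr <- GRanges(intervals$chr, IRanges(intervals$start, intervals$end))
positions.gr <- GRanges(positions$chr, IRanges(positions$pos, positions$pos))
ov <- findOverlaps(interval.gr, positions.gr)

# One row per (interval, position) hit, then count distinct samples per interval
intervals %>%
  slice(queryHits(ov)) %>%
  mutate(sample = positions$sample[subjectHits(ov)]) %>%
  group_by(chr, start, end) %>%
  summarise(n_sample = n_distinct(sample))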

Result:

# A tibble: 3 x 4
# Groups:   chr, start [3]
  chr   start   end n_sample
  <fct> <dbl> <dbl>    <int>
1 chr1      1   100        2
2 chr1    101   200        3
3 chr1    301   400        1

However, this drops the interval that has no samples (201-300), and it is not very fast either. With my dataset:

Unit: milliseconds
 expr      min      lq     mean   median       uq      max neval
    x 159.3901 161.621 190.1703 164.4879 168.3116 297.8395    10

Is there a better and faster way to do this kind of analysis?

Thanks

Reproducible dataset:

intervals <- data.frame(chr=c("chr1","chr1","chr1","chr1"),start=c(1,101,201,301),end=c(100,200,300,400))

positions <- data.frame(chr=rep("chr1",8),pos=c(50,60,110,130,160,190,350,360),sample=c("sampleA","sampleB","sampleA","sampleB","sampleA","sampleC","sampleB","sampleB"))

Edit

A reproducible dataset of the same size as my real dataset:

intervals <- data.frame(chr=paste0("chr",round(runif(400000,min = 1,max = 22))),start=round(runif(n = 400000,min = 1,max = 100000000)))
intervals$end <- intervals$start+100

positions <- data.frame(chr=paste0("chr",round(runif(30000,min = 1,max = 22))),pos=round(runif(n = 30000,min = 1,max = 100000000)),sample=sample(paste0("sample",1:400),size = 30000,replace=T))

Best answer

As @Jon suggested, data.table is a good fit for this problem. Using the foverlaps() function improves the speed considerably.

library(data.table)
intervals <- data.frame(chr=c("chr1","chr1","chr1","chr1"),
                        start=c(1,101,201,301),
                        end=c(100,200,300,400))

positions <- data.frame(chr=rep("chr1",8),
                        pos=c(50,60,110,130,160,190,350,360),
                        sample=c("sampleA","sampleB","sampleA","sampleB","sampleA","sampleC","sampleB","sampleB"))
setDT(positions)
setDT(intervals)

## foverlaps() needs an interval in both tables, so each position becomes a width-1 range
positions[, pos_tmp := pos]
setkey(positions, chr, pos, pos_tmp)
## returns the joined rows, one per (interval, position) overlap
overlap = foverlaps(intervals, positions, type="any", by.x=c("chr","start","end"))
## uniqueN() counts distinct samples; .N would count positions instead
overlap[!is.na(sample), .(n_sample = uniqueN(sample)), by = .(chr, start, end)]
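
The question also asks to keep intervals that contain no positions. Because `intervals` is passed as the first argument, foverlaps() keeps unmatched intervals by default (`nomatch = NA`) and fills the `positions` columns with NA. Here is a minimal sketch of one way to report those intervals with a count of zero, assuming the same example data; the names `pos_end` and `counts` are only illustrative:

```r
library(data.table)

intervals <- data.table(chr = "chr1",
                        start = c(1, 101, 201, 301),
                        end = c(100, 200, 300, 400))
positions <- data.table(chr = "chr1",
                        pos = c(50, 60, 110, 130, 160, 190, 350, 360),
                        sample = c("sampleA", "sampleB", "sampleA", "sampleB",
                                   "sampleA", "sampleC", "sampleB", "sampleB"))

# Treat each position as a width-1 range so it can serve as the keyed table
positions[, pos_end := pos]
setkey(positions, chr, pos, pos_end)

# Intervals without any position come back as a single row with NA in `sample`
overlap <- foverlaps(intervals, positions,
                     by.x = c("chr", "start", "end"), type = "any")

# Count distinct non-NA samples, so empty intervals get 0 instead of being dropped
counts <- overlap[, .(n_sample = uniqueN(sample, na.rm = TRUE)),
                  by = .(chr, start, end)]
print(counts)
```

Skipping the `!is.na(sample)` filter and passing `na.rm = TRUE` to uniqueN() is what keeps the 201-300 interval in the result.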

On my machine, the foverlaps() implementation takes about 180 ms, compared with about 6 seconds for @Jon's implementation.
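
Since the question states that the intervals do not overlap, a base R approach with findInterval() may also be worth trying. This is only a sketch under stated assumptions: a single chromosome, intervals sorted by start, and no overlaps. With several chromosomes, the same step would have to run once per chromosome. The helper name `count_samples` is hypothetical.

```r
# Hypothetical helper: distinct samples per interval on one chromosome.
# Assumes `intervals` is sorted by start and its intervals do not overlap.
count_samples <- function(intervals, positions) {
  # Index of the last interval whose start is <= pos (0 if there is none)
  idx <- findInterval(positions$pos, intervals$start)
  # Keep only positions that also lie at or before that interval's end
  ok <- idx > 0
  ok[ok] <- positions$pos[ok] <= intervals$end[idx[ok]]
  # One row per distinct (interval, sample) pair, then count pairs per interval
  pairs <- unique(data.frame(idx = idx[ok], sample = positions$sample[ok]))
  tabulate(pairs$idx, nbins = nrow(intervals))
}

intervals <- data.frame(chr = "chr1",
                        start = c(1, 101, 201, 301),
                        end = c(100, 200, 300, 400))
positions <- data.frame(chr = "chr1",
                        pos = c(50, 60, 110, 130, 160, 190, 350, 360),
                        sample = c("sampleA", "sampleB", "sampleA", "sampleB",
                                   "sampleA", "sampleC", "sampleB", "sampleB"))

intervals$n_sample <- count_samples(intervals, positions)
print(intervals)
```

On the example data this gives counts of 2, 3, 0 and 1, so the empty interval 201-300 is kept. It has not been timed against the data.table version.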

Regarding "R : Count unique element per interval", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/58096901/
