Python - overlapping ranges - determining unique positions

Tags: python list range bigdata overlapping

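I have a large dataset in which each value has three parts [chromosome, start, end]. What is the fastest way to compute all unique positions per chromosome, given that I have many overlapping ranges?

For example:
[['chr1:10:60', 'chr1:5:70', 'chr3:50:80', 'chr1:54:90', 'chr1:120:180', 'chr3:50:90']]

should result in:
['chr1:5:90', 'chr1:120:180', 'chr3:50:90']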

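I don't know whether there is a simple way to compute this, but I figured it was worth asking here. Below is a subset of my data.

Thanks in advance,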

[['chr9:95149330:95149362', 'chr9:95149330:95149362', 'chr17:70386266:70386304', 'chr17:70386256:70386304', 'chr2:44672786:44672833', 'chr2:44672785:44672833', 'chr2:141966446:141966479', 'chr2:141966446:141966488', 'chr19:18126909:18126938', 'chr19:18126909:18127027', 'chr3:145082003:145082051', 'chr3:145082014:145082121', 'chr6:38835529:38835560', 'chr6:38835529:38835560', 'chr4:120372932:120372986', 'chr4:120372932:120372994', 'chr2:141014019:141014057', 'chr2:141014014:141014057', 'chr18:3445722:3445761', 'chr18:3445722:3445793', 'chr17:72329982:72330015', 'chr17:72329982:72330015', 'chr5:169911920:169911962', 'chr5:169911917:169911962', 'chr4:146482176:146482219', 'chr4:146482176:146482219', 'chr9:104285900:104285935', 'chr9:104285879:104285935', 'chr12:32941976:32942016', 'chr12:32941976:32942028', 'chrX:127923156:127923189', 'chrX:127923156:127923189', 'chr2:9535703:9535755', 'chr2:9535701:9535755', 'chr8:86476618:86476684', 'chr8:86476554:86476642', 'chr9:135756650:135756696', 'chr9:135756650:135756706', 'chr6:103004873:103004932', 'chr6:103004861:103004918', 'chr8:86476618:86476684', 'chr8:86476556:86476648', 'chr1:52280846:52280876', 'chr1:52280845:52280876', 'chr8:86476635:86476685', 'chr8:86476553:86476645', 'chr5:116046573:116046620', 'chr5:116046564:116046615', 'chrX:68039214:68039252', 'chrX:68039214:68039252', 'chr4:181491919:181491953', 'chr4:181491919:181491960', 'chr18:68050122:68050166', 'chr18:68050122:68050166', 'chr2:233985816:233985860', 'chr2:233985808:233985860', 'chr6:17020712:17020750', 'chr6:17020712:17020759', 'chr7:21950625:21950666', 'chr7:21950625:21950666', 'chr12:93292486:93292536', 'chr12:93292481:93292537', 'chr1:246515439:246515472', 'chr1:246515440:246515486', 'chr12:57084093:57084130', 'chr12:57084093:57084134', 'chr1:174801431:174801474', 'chr1:174801431:174801485', 'chr7:92499684:92499734', 'chr7:92499924:92499960', 'chr17:40328527:40328560', 'chr17:40328518:40328560', 'chr8:42944072:42944110', 'chr8:42944073:42944120', 
'chr17:29890450:29890499']

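Best answer

I agree with jonrsharpe on the general approach, but I think there is a more elegant way to do it.

First, we collect the ranges for each chromosome (almost identical to jonrsharpe's version, although I prefer tuples to lists for the ranges).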

from collections import defaultdict

processed = defaultdict(list)

for s in data:
    chr_, start, end = s.split(":")
    processed[chr_].append((int(start), int(end)))
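
To make the intermediate structure concrete, here is a small sketch that runs the same grouping step on the short example from the question. The names `example` and `grouped` are introduced only for this illustration, and the input is assumed to be a flat list of strings.

```python
from collections import defaultdict

# Sample input: the short example from the question, as a flat list.
example = ['chr1:10:60', 'chr1:5:70', 'chr3:50:80',
           'chr1:54:90', 'chr1:120:180', 'chr3:50:90']

grouped = defaultdict(list)
for s in example:
    chr_, start, end = s.split(":")
    grouped[chr_].append((int(start), int(end)))

print(dict(grouped))
# {'chr1': [(10, 60), (5, 70), (54, 90), (120, 180)], 'chr3': [(50, 80), (50, 90)]}
```

Each chromosome now maps to an unsorted list of (start, end) tuples, which is the input for the merge step.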

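Now we can make the merging much simpler by sorting each chromosome's list by the start of each range. This gives us a nice property: if the previous range does not overlap the current one, we know that any merging we did on the earlier values is final, and we never have to go back to it.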

for vals in processed.values():
    vals.sort()
    current = 1
    while current < len(vals):
        prev_start, prev_end = vals[current - 1]
        cur_start, cur_end = vals[current]
        if prev_end > cur_start:
            # The current and previous ranges overlap, so merge them.
            # Take the larger end so that a range nested inside the
            # previous one cannot shrink it.
            vals[current - 1:current + 1] = [(prev_start, max(prev_end, cur_end))]
            # The list is now one item shorter, so current already
            # points at the next range to examine.
        else:
            current += 1  # No merge, so move on to the next range.
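
As an alternative to merging in place, the same idea can be sketched as a function that builds a new list, which avoids shifting the tail of the list on every slice assignment. This is only a sketch, not the answer's own code: `merge_ranges` is a hypothetical name, and it keeps the strict `>` comparison, so ranges that merely touch (for example 5-10 and 10-20) stay separate.

```python
def merge_ranges(ranges):
    """Merge overlapping (start, end) tuples and return a new sorted list."""
    merged = []
    for start, end in sorted(ranges):
        if merged and merged[-1][1] > start:
            # Overlaps the last merged range: keep its start and
            # extend its end only if the new range reaches further.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # No overlap: everything merged so far is final.
            merged.append((start, end))
    return merged


print(merge_ranges([(10, 60), (5, 70), (54, 90), (120, 180)]))
# [(5, 90), (120, 180)]
```

Using `max` for the end matters when one range is fully contained in the previous one, as with (5, 70) followed by (10, 60).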

Now we can put it back together the way jonrsharpe did:

final = []
for key, vals in processed.items():
    for start, end in vals:
        # Rebuild the original "chromosome:start:end" string format.
        final.append("%s:%d:%d" % (key, start, end))

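For the example in the question, this also gives final == ['chr3:50:90', 'chr1:5:90', 'chr1:120:180'], up to the order of the chromosomes, which follows the dictionary's iteration order (insertion order on Python 3.7 and later, so chr1 comes first there).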

For more on "Python - overlapping ranges - determining unique positions", see the similar question on Stack Overflow: https://stackoverflow.com/questions/21937784/
