python - Is there a faster way to find matching features in two arrays (Python)?

Tags: python optimization comparison bioinformatics

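I am trying to iterate over every feature in one file (one per line) and, based on one column of that line, find all matching features in a second file. I have this solution, which does what I want on small files but is very slow on large ones (my files have >20,000,000 lines). Here's a sample of the two input files.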

My (slow) code:

FEATUREFILE = 'S2_STARRseq_rep1_vsControl_peaks.bed'
CONSERVATIONFILEDIR = './conservation/'
with open(FEATUREFILE, 'r') as peakFile, open('featureConservation.td', 'w+') as outfile:
    for line in peakFile:
        # Split each line once instead of once per column
        fields = line.rstrip('\n').split('\t')
        chrom = fields[0]
        startPos = int(fields[1])
        endPos = int(fields[2])
        peakName = fields[3]
        enrichVal = float(fields[4])

        #Reject negative peak starts, if they exist (sometimes this can happen w/ MACS)
        if startPos > 0:
            # The whole per-chromosome file is re-read for every peak: this is the slow part
            with open(CONSERVATIONFILEDIR + chrom + '.bed', 'r') as conservationFile:
                cumulConserv = 0.
                n = 0
                for conservLine in conservationFile:
                    conservFields = conservLine.split('\t')
                    position = int(conservFields[1])
                    conservScore = float(conservFields[3])
                    if startPos <= position <= endPos:
                        cumulConserv += conservScore
                        n += 1
            # Mean score over the scored positions; NaN if the peak has none
            featureConservation = cumulConserv / n if n else float('nan')
            outfile.write('\t'.join([chrom, str(startPos), str(endPos), peakName, str(enrichVal), str(featureConservation)]) + '\n')

Best answer

For my purposes, the best solution seemed to be rewriting the code above for pandas. Here is what worked best for me on some very large files:

# Python 3; on Python 2 also add "from __future__ import division"
import pandas as pd

FEATUREFILE = 'S2_STARRseq_rep1_vsControl_peaks.bed'
CONSERVATIONFILEDIR = './conservation/'

peakDF = pd.read_csv(FEATUREFILE, sep='\t', header=None,
                     names=['chrom', 'start', 'end', 'name', 'enrichmentVal'])
#Reject negative peak starts, if they exist (sometimes this can happen w/ MACS)
peakDF = peakDF[peakDF.start > 0].reset_index(drop=True)
peakDF['conservation'] = 1.0 #placeholder

for chromosome in peakDF.chrom.unique():
    chromSubset = peakDF[peakDF.chrom == chromosome]
    # Load each per-chromosome conservation file once, not once per peak
    chromDF = pd.read_csv(CONSERVATIONFILEDIR + str(chromosome) + '.bed', sep='\t', header=None,
                          names=['chrom', 'start', 'end', 'conserveScore'])

    for idx in chromSubset.index:
        start = chromSubset.at[idx, 'start']
        end = chromSubset.at[idx, 'end']
        # Conservation rows whose start lies in [start, end)
        featureSubset = chromDF[(chromDF.start >= start) & (chromDF.start < end)]
        # Sum of scores divided by the peak length
        peakDF.at[idx, 'conservation'] = featureSubset.conserveScore.sum() / (end - start)

peakDF.to_csv('featureConservation.td', sep='\t')
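
The per-peak loop above still scans the whole chromosome table for every peak. One possible further speed-up, not part of the original answer, is to sort the conservation positions once and answer every interval with a binary search plus prefix sums. The sketch below is only an illustration under stated assumptions: the function name mean_conservation and the toy arrays are hypothetical, the positions must already be sorted, and it computes the mean over scored positions with inclusive bounds (as in the question's code) rather than the answer's sum divided by peak length.

```python
import numpy as np

def mean_conservation(positions, scores, starts, ends):
    """Mean score over positions p with start <= p <= end, per interval.

    positions must be sorted ascending. Intervals that contain no
    scored position get NaN.
    """
    positions = np.asarray(positions)
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    # Prefix sums: csum[k] is the sum of the first k scores.
    csum = np.concatenate(([0.0], np.cumsum(np.asarray(scores, dtype=float))))
    lo = np.searchsorted(positions, starts, side="left")   # first index with p >= start
    hi = np.searchsorted(positions, ends, side="right")    # first index with p > end
    counts = hi - lo
    totals = csum[hi] - csum[lo]
    out = np.full(len(starts), np.nan)
    # Divide only where at least one position fell inside the interval.
    np.divide(totals, counts, out=out, where=counts > 0)
    return out

# Toy data, made up for illustration only
positions = np.array([10, 11, 12, 20, 21])
scores = np.array([0.5, 1.0, 0.0, 0.2, 0.4])
starts = np.array([10, 19, 30])
ends = np.array([12, 21, 40])
result = mean_conservation(positions, scores, starts, ends)
print(result)
```

With the real files this would presumably be called once per chromosome, passing chromDF.start.values and chromDF.conserveScore.values as positions and scores and the matching peak starts and ends, so each chromosome costs one sort-order check and two binary searches per peak instead of a full scan.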

Regarding "python - Is there a faster way to find matching features in two arrays (Python)?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37929772/
