python - How to make parsing thousands of lines across hundreds of files more efficient

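I wrote a script, but it is too slow, and I am wondering whether anyone can suggest how to speed it up. I think the slow part of the script is the following: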

I have a list of 1,000 human gene names (each gene name is a number), read into a list called "ListOfHumanGenes"; for example, the start of the list looks like this:

[2314,2395,10672,8683,5075]

I have 100 files like this one, all with the extension ".HumanHomologs":

HumanGene   OriginalGene    Intercept    age    pval 
2314       14248            5.3e-15      0.99   3.5e-33 
2395       14297            15.76       -0.05   0.59 
10672      14674            7.25         0.19   0.58 
8683       108014           21.63       -1.74   0.43 
5075       18503            -6.34        1.58   0.19

The algorithm for this part of the script goes like this (in plain English, not code):

for each gene in ListOfHumanGenes:
    open each of the 100 files labelled ".HumanHomologs"
      if the gene name is present:
           NumberOfTrials +=1
           if the p-val is <0.05: 
                 if the "Age" column < 0:
                       UnderexpressedSuccess +=1
                 elif "Age" column > 0:
                       OverexpressedSuccess +=1
print each_gene + "\t" + NumberOfTrials + "\t" + UnderexpressedSuccess
print each_gene + "\t" + NumberOfTrials + "\t" + OverexpressedSuccess

The code for this part is:

for each_item in ListOfHumanGenes:
    OverexpressedSuccess = 0
    UnderexpressedSuccess = 0
    NumberOfTrials = 0
    for each_file in glob.glob("*.HumanHomologs"):
        open_each_file = open(each_file).readlines()[1:]
        for line in open_each_file:
            line = line.strip().split()
            if each_item == line[0]:
                NumberOfTrials +=1    #i.e if the gene is in the file, NumberOfTrials +=1. Not every gene is guaranteed to be in every file
                if line[-1] != "NA":
                    if float(line[-1]) < float(0.05):
                        if float(line[-2]) < float(0):
                            UnderexpressedSuccess +=1
                        elif float(line[-2]) > float(0):
                            OverexpressedSuccess +=1

    underexpr_output_file.write(each_item + "\t" + str(UnderexpressedSuccess) + "\t" + str(NumberOfTrials) + "\t" + str(UnderProbability) +"\n") #Note: the "Underprobabilty" float is obtained earlier in the script
    overexpr_output_file.write(each_item + "\t" + str(OverexpressedSuccess) + "\t" + str(NumberOfTrials) + "\t" + str(OverProbability) +"\n") #Note: the "Overprobability" float is obtained earlier in the script
overexpr_output_file.close()
underexpr_output_file.close()

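This produces two output files (one for overexpressed, one for underexpressed) that look like this; the columns are GeneName, #Overexpressed/#Underexpressed, #NumberTrials, and the last column can be ignored: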

2314    8   100 0.100381689982
2395    14  90  0.100381689982
10672   10  90  0.100381689982
8683    8   98  0.100381689982
5075    5   88  0.100381689982

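Each ".HumanHomologs" file has more than 8,000 lines, and the gene list is about 20,000 genes long. So I understand why this is slow: for each of the 20,000 genes, it opens 100 files and searches for that gene among the more than 8,000 genes in each file. Can anyone suggest edits that would make this script faster or more efficient?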

Best answer

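Your algorithm opens all 100 of those files 1,000 times. The optimization that comes to mind immediately is to make the files the outermost loop, which ensures that each file is opened only once. Then check for the presence of each gene in that file and record whatever else you need.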
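A minimal sketch of that reordering, in case it helps. It assumes the gene names are held as strings (which the `each_item + "\t"` concatenation in the question suggests), and the function name `count_gene_hits` and the layout of the returned tallies are made up for illustration, not taken from the question. Each file is read once, and a set lookup replaces the inner scan over the gene list:

```python
from collections import defaultdict


def count_gene_hits(file_paths, genes, p_cutoff=0.05):
    """Read every file exactly once and tally per-gene counts.

    Returns a dict mapping gene -> [trials, underexpressed, overexpressed].
    Genes that never appear in any file are simply absent from the dict.
    """
    wanted = set(genes)  # average O(1) membership test per line
    counts = defaultdict(lambda: [0, 0, 0])
    for path in file_paths:
        with open(path) as handle:
            next(handle, None)  # skip the header row
            for line in handle:
                fields = line.split()
                if not fields or fields[0] not in wanted:
                    continue
                tally = counts[fields[0]]
                tally[0] += 1  # one matching row counts as one trial
                if fields[-1] == "NA":
                    continue  # no p-value, so no success can be recorded
                if float(fields[-1]) < p_cutoff:
                    age = float(fields[-2])
                    if age < 0:
                        tally[1] += 1
                    elif age > 0:
                        tally[2] += 1
    return counts
```

It could be called as `counts = count_gene_hits(glob.glob("*.HumanHomologs"), ListOfHumanGenes)`, after which the two output files can be written in a single loop over `ListOfHumanGenes`, using `counts.get(gene, [0, 0, 0])` for genes that were never seen.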

In addition, the pandas module is very handy for dealing with this kind of CSV-like file. Take a look at pandas.
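For the pandas route, something along these lines might work. This is only a sketch under assumptions: the files are whitespace-delimited with the header shown in the question, gene names are compared as strings, and the function name `summarise_with_pandas` and the output column names are invented here. It loads every file once, filters to the genes of interest, and lets `groupby` do the counting:

```python
import pandas as pd


def summarise_with_pandas(file_paths, genes, p_cutoff=0.05):
    """Return one row per gene with trial and success counts."""
    frames = [
        # Whitespace-separated columns; "NA" becomes NaN; keep gene IDs as text.
        pd.read_csv(path, sep=r"\s+", na_values=["NA"], dtype={"HumanGene": str})
        for path in file_paths
    ]
    table = pd.concat(frames, ignore_index=True)
    table = table[table["HumanGene"].isin(genes)]

    # Comparisons against NaN are False, so rows with a missing p-value
    # are never counted as successes.
    significant = table["pval"] < p_cutoff
    per_row = pd.DataFrame({
        "HumanGene": table["HumanGene"],
        "NumberOfTrials": 1,
        "UnderexpressedSuccess": (significant & (table["age"] < 0)).astype(int),
        "OverexpressedSuccess": (significant & (table["age"] > 0)).astype(int),
    })
    return per_row.groupby("HumanGene").sum()
```

The result is indexed by gene name, so `summary.loc["2314"]` gives the three counts for that gene. Genes from the list that appear in no file will not be in the index.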

Regarding python - How to make parsing thousands of lines across hundreds of files more efficient, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41588261/
