python - A more efficient way to perform this search algorithm?

Tags: python algorithm performance search

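I'm wondering whether there is a better way to perform this algorithm. I need to do this kind of operation fairly often, and the way I currently do it takes hours, since I believe it counts as an n^2 algorithm. The code is attached below.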

import csv

with open("location1", 'r') as main:
    csvMain = csv.reader(main)
    mainList = list(csvMain)

with open("location2", 'r') as anno:
    csvAnno = csv.reader(anno)
    annoList = list(csvAnno)

output = []

# Nested loops: every main row is compared against every annotation row.
for full in mainList:
    geneName = full[2].lower()
    for annot in annoList:
        if geneName == annot[2].lower():
            # Build a fresh row for each match: the main row followed by
            # annotation columns 3 through 6. (Reusing one shared list and
            # "del i" in a loop would not reset it between matches.)
            output.append(full + annot[3:7])

with open("location3", 'w') as final:
    a = csv.writer(final, delimiter=',')
    a.writerows(output)

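I have two csv files with 15,000 strings each. I want to compare a column from each file and, where they match, concatenate the end of the second csv's row onto the end of the first csv's row. Any help would be greatly appreciated!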

Thanks!

Best answer

This should be much more efficient:

import csv
from collections import defaultdict

with open("location1", 'r') as main:
  csvMain = csv.reader(main)
  mainList = list(csvMain)

with open("location2", 'r') as anno:
  csvAnno = csv.reader(anno)
  annoList = list(csvAnno)

output = []
annoMap = defaultdict(list)

# Index the annotation rows by lower-cased gene name (column 2).
for annot in annoList:
  tempList = annot[3:]  # adapt this to the needed columns
  annoMap[annot[2].lower()].append(tempList)  # store these columns under the key taken from the column of interest

for full in mainList:
  geneName = full[2].lower()
  if geneName in annoMap:  # check if a matching key exists
    # Append each matching annotation to the main row, one output row per match.
    for extra in annoMap[geneName]:
      output.append(full + extra)

with open("location3", 'w') as final:
  a = csv.writer(final, delimiter=',')
  a.writerows(output)

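It is more efficient because you only need to iterate over each list once. Lookups in a dictionary are O(1) on average, so you essentially get a linear algorithm.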
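To make the idea easier to try without any files, here is a minimal sketch of the same dictionary-based join on small in-memory lists. The function name `join_on_column`, its parameters, and the sample rows are made up for illustration. The column positions (key in column 2, extra columns 3 through 6) are assumptions taken from the question's code.

```python
from collections import defaultdict


def join_on_column(main_rows, anno_rows, key_col=2, extra=slice(3, 7)):
    """Append annotation columns to every main row whose key matches.

    Hypothetical helper: key_col and extra mirror the column positions
    used in the question and may need adapting to real data.
    """
    # Pass 1: index annotation rows by lower-cased key.
    index = defaultdict(list)
    for anno in anno_rows:
        index[anno[key_col].lower()].append(anno[extra])

    # Pass 2: one average-O(1) lookup per main row.
    joined = []
    for row in main_rows:
        # .get() avoids inserting empty entries into the defaultdict
        for cols in index.get(row[key_col].lower(), ()):
            joined.append(row + cols)
    return joined


# Made-up sample data, only to show the shape of the result.
main_rows = [
    ["1", "x", "GeneA"],
    ["2", "y", "geneB"],
    ["3", "z", "GeneC"],
]
anno_rows = [
    ["p", "q", "genea", "a3", "a4", "a5", "a6"],
    ["r", "s", "GENEB", "b3", "b4", "b5", "b6"],
]

joined = join_on_column(main_rows, anno_rows)
for row in joined:
    print(row)
```

Main rows without a match (here the "GeneC" row) are simply skipped, which matches the behaviour of the nested-loop version in the question. If a key occurs several times in the annotation file, the main row is emitted once per match.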

Regarding "python - A more efficient way to perform this search algorithm?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42256304/
