python - Optimizing key lookups in a hierarchical Python dictionary

Tags: python optimization dictionary defaultdict

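I'm trying to optimize my code because it becomes very slow when I load huge dictionaries. I think the cause is the key lookups in the dictionary. I've been reading about Python's defaultdict and I think it could be a good improvement, but I haven't managed to apply it here. As you can see, this is a hierarchical dictionary structure. Any hints would be appreciated.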

class Species:
    '''This structure contains all the information needed for all genes.
    One species has several genes; one gene has several proteins.'''
    def __init__(self, name):
        self.name = name  # name of the species
        self.genes = {}
    def addProtein(self, gene, protname, len):
        #Converting a line from the input file into a protein and/or an exon
        if gene in self.genes:
            #Gene in the structure
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
        else:
            self.genes[gene] = Gene(gene) 
            self.updateNgenes()
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
    def updateNgenes(self):
        # Updating the number of genes
        self.ngenes = len(self.genes.keys())    

Gene and Protein are defined as:

class Protein:
    # The Protein class holds the length of the protein and a list of its exons (each with its own attributes)
    def __init__(self, name, len):
        self.name = name
        self.len = len

class Gene:
    # The Gene class holds information about the gene and a dict of its proteins (each with its own attributes)
    def __init__(self, name):
        self.name = name
        self.proteins = {}
        self.updateProts()
    def updateProts(self):
        #Update number of proteins
        self.nproteins = len(self.proteins)

Best answer

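You can't use defaultdict here, because your __init__ method requires an argument: defaultdict calls its factory with no arguments, so it has no way to pass the gene name to Gene.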
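If automatic creation of missing entries is still wanted, one possible workaround (not part of the original answer) is a dict subclass that defines __missing__, which does receive the missing key. This is only a sketch: the class name GeneDict is hypothetical, Gene is a trimmed-down stand-in for the class in the question, and the gene names, protein names and lengths are made up.

```python
class Gene:
    """Trimmed-down stand-in for the Gene class from the question."""

    def __init__(self, name):
        self.name = name
        self.proteins = {}


class GeneDict(dict):
    """Hypothetical helper: a dict that builds a Gene from the missing key."""

    def __missing__(self, key):
        # dict.__getitem__ calls __missing__(key) when the key is absent,
        # so the key can be handed to the Gene constructor.
        gene = self[key] = Gene(key)
        return gene


genes = GeneDict()
genes["geneA"].proteins["protA1"] = 120   # "geneA" is created on first access
genes["geneA"].proteins["protA2"] = 95    # the existing entry is reused
```

Note that only item access (genes[key]) triggers __missing__; membership tests such as "geneB" in genes do not create an entry.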

This may be one of your bottlenecks:

def updateNgenes(self):
#Updating the number of genes
    self.ngenes = len(self.genes.keys()) 

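In Python 2, len(self.genes.keys()) builds a list of all the keys before computing its length. That means every time you add a gene, you create a list and throw it away. The more genes you have, the more expensive building that list becomes. To avoid the intermediate list, just call len(self.genes). (In Python 3, keys() returns a view rather than a list, so the cost is smaller, but len(self.genes) is still the direct way to get the size.)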
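The cost of the intermediate list can be measured roughly with timeit. This is a sketch under assumptions: the dictionary size and the repeat count are arbitrary, and list(d.keys()) stands in for the list that keys() builds on interpreters where it returns a list rather than a view. Timings vary by machine, so none are shown here.

```python
import timeit

# Arbitrary test dictionary; the size is chosen only for illustration.
d = {i: None for i in range(100_000)}


def len_direct():
    # Asks the dict for its size directly; no intermediate object is built.
    return len(d)


def len_of_key_list():
    # Builds a throwaway list of all keys first, then measures it.
    return len(list(d.keys()))


for fn in (len_direct, len_of_key_list):
    seconds = timeit.timeit(fn, number=100)
    print(f"{fn.__name__}: {seconds:.4f} s for 100 calls")
```

Both functions return the same number; only the amount of work done to get it differs.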

A better approach is to make ngenes a property, so it is computed only when you actually need it.

@property
def ngenes(self):
    return len(self.genes)

The same can be done for nproteins in the Gene class.

Here is the refactored code:

class Species:
    '''This structure contains all the information needed for all genes.
    One species has several genes; one gene has several proteins.'''

    def __init__(self, name):
        self.name = name  # name of the species
        self.genes = {}

    def addProtein(self, gene, protname, len):
        #Converting a line from the input file into a protein and/or an exon
        if gene not in self.genes:
            self.genes[gene] = Gene(gene) 
        self.genes[gene].proteins[protname] = Protein(protname, len)

    @property
    def ngenes(self):
        return len(self.genes)

class Protein:
    # The Protein class holds the length of the protein and a list of its exons (each with its own attributes)
    def __init__(self, name, len):
        self.name = name
        self.len = len

class Gene:
    # The Gene class holds information about the gene and a dict of its proteins (each with its own attributes)
    def __init__(self, name):
        self.name = name
        self.proteins = {}

    @property
    def nproteins(self):
        return len(self.proteins)
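A short usage sketch shows how the property-based counts behave. The classes are repeated in compact form only so the block runs on its own; the parameter is called length here instead of len to avoid shadowing the built-in (a deviation from the code above), and the species, gene and protein names and lengths are made up for illustration.

```python
class Protein:
    def __init__(self, name, length):
        self.name = name
        self.len = length


class Gene:
    def __init__(self, name):
        self.name = name
        self.proteins = {}

    @property
    def nproteins(self):
        # Computed on access, so it never goes stale.
        return len(self.proteins)


class Species:
    def __init__(self, name):
        self.name = name
        self.genes = {}

    def addProtein(self, gene, protname, length):
        # Create the gene on first sight, then attach the protein.
        if gene not in self.genes:
            self.genes[gene] = Gene(gene)
        self.genes[gene].proteins[protname] = Protein(protname, length)

    @property
    def ngenes(self):
        return len(self.genes)


# Made-up records in the form (gene, protein, length).
species = Species("example_species")
for gene, protname, length in [
    ("geneA", "protA1", 120),
    ("geneA", "protA2", 95),
    ("geneB", "protB1", 300),
]:
    species.addProtein(gene, protname, length)

print(species.ngenes)                     # 2
print(species.genes["geneA"].nproteins)   # 2
print(species.genes["geneB"].nproteins)   # 1
```

Because the counts are derived from the dictionaries on demand, addProtein no longer does any bookkeeping beyond the insert itself.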

This question, python - Optimizing key lookups in a hierarchical Python dictionary, corresponds to a similar question found on Stack Overflow: https://stackoverflow.com/questions/13630906/
