python - Optimizing code that populates a database using Python (Django)

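I am trying to populate a SQLite database through Django with data from a file containing 6 million records. However, the code I have written is far too slow, even with only 50,000 records.

This is the code I am using to populate the database: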

import os

def populate():
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]

            dun_add = add_c_duns(duns)
            add_contact(c_duns=dun_add, fn=name, job=job)

def add_contact(c_duns, fn, job):
    c = Contact.objects.get_or_create(duns=c_duns, fullName=fn, title=job)
    return c

def add_c_duns(duns):
    cd = Contact_DUNS.objects.get_or_create(duns=duns)[0]
    return cd

if __name__ == '__main__':
    print("Populating Contact db....")
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate()
    print("Done!!")

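The code works correctly: I tested it with dummy records and it gives the desired results. I would like to know whether there is a way to reduce its execution time. Thanks.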

Best answer

I don't have enough reputation to comment, so here is a speculative answer.

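Basically, the only way to do this efficiently through Django's ORM is to use bulk_create. So the first thing to consider is your use of get_or_create. If records already in your database may be duplicated in the input file, your only option is to write the SQL yourself. If you are using it only to avoid duplicates within the input file, preprocess the file to remove the duplicate lines.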
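
The preprocessing step mentioned above could look roughly like the following. This is only a minimal sketch: the helper names unique_duns and unique_contacts are hypothetical, the column positions (1, 8, 12) and the "|" separator are taken from the question's code, and it assumes the set of seen values fits in memory.

```python
def unique_duns(lines, sep="|", duns_index=1):
    """Yield each DUNS value once, in first-seen order."""
    seen = set()
    for line in lines:
        duns = line.strip().split(sep)[duns_index]
        if duns not in seen:
            seen.add(duns)
            yield duns


def unique_contacts(lines, sep="|"):
    """Yield each (duns, name, job) triple once, in first-seen order."""
    seen = set()
    for line in lines:
        col = line.strip().split(sep)
        key = (col[1], col[8], col[12])
        if key not in seen:
            seen.add(key)
            yield key
```

Both helpers accept any iterable of lines, so an open file object can be passed directly, for example unique_duns(open("filename")).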

So, if you can live without the get part of get_or_create, you can follow this strategy:

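1. Go through each line of the input file and instantiate a Contact_DUNS for each entry (this does not create the row; just write Contact_DUNS(duns=duns)), saving all the instances in a list. Pass the list to bulk_create to actually create the rows.
2. Use values_list to generate a list of DUNS-id pairs and convert it to a dict with the DUNS number as the key and the row id as the value.
3. Repeat step 1, but with Contact instances. Before creating each instance, use the DUNS number to look up the Contact_DUNS id in the dict from step 2. Instantiate each contact like this: Contact(duns_id=c_duns_id, fullName=fn, title=job). Again, after collecting the Contact instances, pass them to bulk_create to create the rows.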

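This should improve performance drastically, because you will no longer execute queries for every input line (the original code runs at least two, one per get_or_create call). But as I said above, this only works if you can be certain that there are no duplicates in the database or in the input file.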

Edit: here is the code:

import os

def populate_duns():
    # Will only work if there are no DUNS duplicates
    # (both in the DB and within the file)
    duns_instances = []   
    with open("filename") as f:
        for line in f:
            duns = line.strip().split("|")[1]        
            duns_instances.append(Contact_DUNS(duns=duns))

    # Run a single INSERT query for all DUNS instances
    # (actually it will be run in batches, but it's still quite fast)
    Contact_DUNS.objects.bulk_create(duns_instances)

def get_duns_dict():
    # This is basically a SELECT query for these two fields
    duns_id_pairs = Contact_DUNS.objects.values_list('duns', 'id')
    return dict(duns_id_pairs)

def populate_contacts():
    # Repeat the same process for Contacts
    contact_instances = []
    duns_dict = get_duns_dict()

    with open("filename") as f:
        for line in f:  
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]

            ci = Contact(duns_id=duns_dict[duns],
                         fullName=name,
                         title=job)
            contact_instances.append(ci)

    # Again, run only a single INSERT query
    Contact.objects.bulk_create(contact_instances)

if __name__ == '__main__':
    print("Populating Contact db....")
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate_duns()
    populate_contacts()
    print("Done!!")
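
With 6 million records, holding every model instance in one list may use a lot of memory. One possible variation, sketched below under assumptions, is to feed bulk_create one chunk at a time. The helper names chunked and bulk_insert_in_chunks are hypothetical, the chunk size of 10000 is an arbitrary guess that would need tuning, and the create argument stands in for a callable such as Contact_DUNS.objects.bulk_create.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_insert_in_chunks(instances, create, size=10000):
    """Pass `instances` to `create` one chunk at a time; return the count.

    `create` is assumed to be a callable that takes a list of objects,
    e.g. Contact_DUNS.objects.bulk_create (an assumption, not tested here).
    """
    total = 0
    for chunk in chunked(instances, size):
        create(chunk)
        total += len(chunk)
    return total
```

Because instances can be a generator, only one chunk of objects needs to exist at any moment. bulk_create also accepts a batch_size argument, but that still requires the full list to be built first.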

Regarding python - Optimizing code that populates a database using Python (Django), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32927925/
