python - Speeding up a large process that runs over data fetched from a database

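I'm working on a project where I have to read a large database of 10 million records (large for me, at least). I can't really filter them, because I have to treat each one individually. For each record I have to apply a formula and then write the result to several files, depending on certain conditions of the record.

I've implemented a few algorithms, and the whole run takes about 2-3 days. That's a problem, because I'm trying to optimize a process that already takes that long. One day would be acceptable.

So far I've tried database indexes and threading (threads for the per-record processing, not for the I/O operations). I can't get the time any lower.

I'm using Django, and because of its lazy behaviour I can't measure how long it takes before the data actually starts being processed. I'd also like to know whether I can start processing the data as soon as it arrives, instead of waiting for all of it to be loaded into memory. The problem could also be my understanding of write operations in Python. Finally, it may be that I need a better machine (I doubt it; I have 4 cores and 4GB RAM, which should be able to deliver better speed).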
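To illustrate what processing rows as they arrive can look like, here is a minimal sketch that uses the standard-library `sqlite3` module rather than Django, so it runs on its own. The table layout, the `stream_rows` helper, and the batch size are assumptions made for the demo; in Django the closest counterpart is `QuerySet.iterator()`, which reads results in chunks instead of caching the whole queryset.

```python
import sqlite3


def stream_rows(conn, query, batch_size=1000):
    """Yield rows one at a time, fetching them from the cursor in batches."""
    cursor = conn.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row


def build_demo_db(n_rows):
    """Create an in-memory table standing in for the real passport table (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE passport (nationality TEXT, id_passport TEXT)")
    conn.executemany(
        "INSERT INTO passport VALUES (?, ?)",
        (("C%d" % (i % 3), "P%05d" % i) for i in range(n_rows)),
    )
    return conn


conn = build_demo_db(10)
query = "SELECT nationality, id_passport FROM passport ORDER BY rowid"

# Each row is handled as soon as its batch arrives; the full result set
# is never held in a Python list by stream_rows itself.
processed = [
    nationality + "<" + id_passport
    for nationality, id_passport in stream_rows(conn, query, batch_size=4)
]
```

Timing the loop body separately from the first `fetchmany` call is one way to see how much of the total is query latency and how much is per-record work.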

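Any ideas? I'd really appreciate feedback. :)

Edit: code

Explanation:

The records I mentioned are customer IDs (passports), and the conditions are whether there are agreements between the different terminals of the company (countries). The process is a hash.

The first strategy tries to process the whole database. At the start we do some preparation for the conditional part of the algorithm (the agreements between countries). After that comes a large amount of checking by set membership.

Since I've been trying to improve this on my own, for the second strategy I tried breaking the problem up and processing the query in parts (fetch the records that belong to one country and write them to the files of the countries that have an agreement with it).

The threading strategy isn't shown, because it was designed for a single country and gave worse results than the version without threads. Honestly, my gut feeling is that the cause is memory and SQL.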

import os

# PROJECT_ROOT, Passport, get_agreements, sorted_nicely and generate_hash
# are defined elsewhere in the project.


def create_all_files(strategy=0):
    if strategy == 0:
        # Strategy 0: walk the whole table and test every passport
        # against every agreement set.
        set_countries_agreements = set()
        with open(os.path.join(PROJECT_ROOT, 'list_countries')) as file_countries:
            set_countries_temp = set(line.strip() for line in file_countries)
        set_countries = sorted_nicely(set_countries_temp)

        for each_country in set_countries:
            set_agreements = frozenset(get_agreements(each_country))
            set_countries_agreements.add(set_agreements)

        print("All agreements obtained")

        # Lazy queryset: no rows are fetched until the loop below starts.
        set_passports = Passport.objects.all()

        print("All passports obtained")

        for each_passport in set_passports:
            for each_agreement in set_countries_agreements:
                for each_country in each_agreement:
                    if each_passport.nationality == each_country:
                        # next(iter(...)) replaces the Python 2 only iter(...).next().
                        # The file is reopened for every matching record.
                        with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % next(iter(each_agreement))), "a") as f:
                            f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                    print(".")
                print("_")
            print("-")
        print("~")

    if strategy == 1:
        # Strategy 1: one query per country, written to the files of the
        # countries it has an agreement with.
        with open(os.path.join(PROJECT_ROOT, 'list_countries')) as file_countries:
            set_countries_temp = set(line.strip() for line in file_countries)
        set_countries = sorted_nicely(set_countries_temp)

        while set_countries:
            country = set_countries.pop()
            list_countries = get_agreements(country)
            list_passports = Passport.objects.filter(nationality=country)
            for each_passport in list_passports:
                for each_country in list_countries:
                    # The file is reopened for every (passport, country) pair.
                    with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % each_country), "a") as f:
                        f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                        print("r")
                print("c")
            print("p")
        print("P")

Best answer

What you describe in your question is an ETL (extract, transform, load) process. I suggest you use an ETL tool.
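As a rough illustration of the extract/transform/load split, here is a minimal plain-Python sketch, not a substitute for a real ETL tool. The `extract`, `transform`, and `load` names, the sample data, and the use of SHA-256 as a stand-in for the question's `generate_hash` are all assumptions. The one deliberate design choice is that each output file is opened once and kept open, instead of being reopened for every record.

```python
import hashlib
import os
import tempfile
from contextlib import ExitStack


def extract(records):
    """Extract: yield (nationality, id_passport) pairs one at a time."""
    for record in records:
        yield record


def transform(nationality, id_passport, country):
    """Transform: stand-in for generate_hash (assumed here to be SHA-256)."""
    payload = "%s<%s|%s" % (nationality, id_passport, country)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load(records, agreements, out_dir):
    """Load: append each hash to the file of every country with an agreement."""
    with ExitStack() as stack:
        handles = {}  # country -> open file handle, opened on first use
        for nationality, id_passport in extract(records):
            for country in agreements.get(nationality, ()):
                if country not in handles:
                    path = os.path.join(out_dir, country)
                    handles[country] = stack.enter_context(open(path, "a"))
                line = transform(nationality, id_passport, country)
                handles[country].write(line + "\n")
    # Leaving the ExitStack closes (and flushes) every file opened above.


# Hypothetical sample data: three passports, agreements keyed by nationality.
demo_records = [("AA", "001"), ("BB", "002"), ("AA", "003")]
demo_agreements = {"AA": ["BB", "CC"], "BB": ["AA"]}
out_dir = tempfile.mkdtemp()
load(demo_records, demo_agreements, out_dir)
```

With a few hundred countries the number of simultaneously open files stays small; if the destination count were much larger, buffering lines per country and flushing periodically would be an alternative.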

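To name a Python ETL tool, I can point to pygrametl, written by Christian Thomsen; in my opinion it works well and its performance is impressive. Test it and report back with your results.

I can't post this answer without mentioning MapReduce. That programming model may fit your requirements if you plan to distribute the task across nodes.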
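The shape of the MapReduce model can be sketched on a single machine as follows. This is only an illustration under assumptions: the function names and sample data are hypothetical, the hashing step is left out to keep it short, and the default `mapper` is Python's built-in `map`, so nothing here actually runs in parallel. Because the map step is a plain module-level function bound with `functools.partial`, a parallel mapper such as `multiprocessing.Pool.map` could be passed in instead.

```python
from collections import defaultdict
from functools import partial, reduce


def map_chunk(chunk, agreements):
    """Map step: turn one chunk of records into {country: [lines]}."""
    out = defaultdict(list)
    for nationality, id_passport in chunk:
        for country in agreements.get(nationality, ()):
            out[country].append(nationality + "<" + id_passport)
    return dict(out)


def merge(left, right):
    """Reduce step: combine two partial results, concatenating per-country lists."""
    merged = {country: list(lines) for country, lines in left.items()}
    for country, lines in right.items():
        merged.setdefault(country, []).extend(lines)
    return merged


def chunked(records, size):
    """Split a list of records into consecutive chunks of at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def map_reduce(records, agreements, chunk_size, mapper=map):
    """Run the map step over every chunk, then fold the partial results together."""
    partials = mapper(partial(map_chunk, agreements=agreements),
                      chunked(records, chunk_size))
    return reduce(merge, partials, {})


# Hypothetical sample data.
demo_records = [("AA", "001"), ("BB", "002"), ("AA", "003")]
demo_agreements = {"AA": ["BB", "CC"], "BB": ["AA"]}
result = map_reduce(demo_records, demo_agreements, chunk_size=2)
```

Since each chunk is mapped independently, the chunks are the natural unit to hand to separate workers or nodes; only the merge step needs to see more than one partial result.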

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/25115431/
