Python for loop slows down as the number of iterations increases

Tags: python pandas

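I'm trying to understand why a loop slows down as the number of iterations grows. The code below is only a simulation of real code that copies data from an API. I have to download the data in batches, because downloading everything at once exhausts memory. However, my batch-loop implementation is far from ideal. I suspect that using pandas adds overhead, but what else could be causing the problem?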

import timeit
import pandas as pd
from tqdm import tqdm


def some_generator():
    for i in range(1_000_000):
        yield {
            'colA': 'valA',
            'colB': 'valA',
            'colC': 'valA',
            'colD': 'valA',
            'colE': 'valA',
            'colF': 'valA',
            'colG': 'valA',
            'colH': 'valA',
            'colI': 'valA',
            'colJ': 'valA'
        }


def main():
    batch_size = 10_000
    generator = some_generator()
    output = pd.DataFrame()
    batch_round = 1

    while True:

        for _ in tqdm(range(batch_size), desc=f"Batch {batch_round}"):

            try:
                row = next(generator)
                row.pop('colA')
                output = pd.concat([output, pd.DataFrame(row, index=[0])], ignore_index=True)

            except StopIteration:
                break

        if output.shape[0] != batch_size * batch_round:
            break
        else:
            batch_round += 1

    print(output)


main()

This code simulates a DataFrame of 1M rows. Downloading it in batches of 10k rows, this is the performance I get for the first 20 batches:

Batch 1: 100%|██████████| 10000/10000 [00:21<00:00, 460.89it/s]
Batch 2: 100%|██████████| 10000/10000 [00:28<00:00, 349.16it/s]
Batch 3: 100%|██████████| 10000/10000 [00:38<00:00, 263.12it/s]
Batch 4: 100%|██████████| 10000/10000 [00:43<00:00, 228.76it/s]
Batch 5: 100%|██████████| 10000/10000 [00:53<00:00, 187.44it/s]
Batch 6: 100%|██████████| 10000/10000 [01:02<00:00, 159.92it/s]
Batch 7: 100%|██████████| 10000/10000 [01:09<00:00, 144.79it/s]
Batch 8: 100%|██████████| 10000/10000 [01:18<00:00, 127.59it/s]
Batch 9: 100%|██████████| 10000/10000 [01:25<00:00, 116.92it/s]
Batch 10: 100%|██████████| 10000/10000 [01:34<00:00, 105.96it/s]
Batch 11: 100%|██████████| 10000/10000 [01:40<00:00, 99.81it/s]
Batch 12: 100%|██████████| 10000/10000 [01:46<00:00, 93.92it/s]
Batch 13: 100%|██████████| 10000/10000 [01:55<00:00, 86.49it/s]
Batch 14: 100%|██████████| 10000/10000 [02:03<00:00, 80.92it/s]
Batch 15: 100%|██████████| 10000/10000 [02:10<00:00, 76.46it/s]
Batch 16: 100%|██████████| 10000/10000 [02:18<00:00, 71.99it/s]
Batch 17: 100%|██████████| 10000/10000 [02:25<00:00, 68.69it/s]
Batch 18: 100%|██████████| 10000/10000 [02:32<00:00, 65.57it/s]
Batch 19: 100%|██████████| 10000/10000 [02:42<00:00, 61.53it/s]
Batch 20: 100%|██████████| 10000/10000 [02:39<00:00, 62.84it/s]

Best answer

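pd.concat is expensive: every call builds a new DataFrame and copies all the rows accumulated so far, so each appended row costs more than the previous one.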
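To make that cost concrete, here is a rough model rather than a measurement of pandas internals. It assumes the k-th single-row concat copies about k rows and counts the rows copied within one batch. The function name `rows_copied_in_batch` is hypothetical.

```python
def rows_copied_in_batch(batch_round: int, batch_size: int = 10_000) -> int:
    """Estimate rows copied during one batch of single-row concats.

    Simplifying assumption: the concat that produces a k-row frame
    copies k rows, because the result is a newly allocated frame.
    """
    first = (batch_round - 1) * batch_size + 1  # frame size after the first append
    last = batch_round * batch_size             # frame size after the last append
    return sum(range(first, last + 1))


print(rows_copied_in_batch(1))   # 50005000
print(rows_copied_in_batch(20))  # 1950005000
```

Under this model the copying work per batch grows linearly with the batch number. Together with a fixed per-call overhead, that is consistent with the timings in the question, where a batch takes 21 seconds at first and about 2 minutes 40 seconds by batch 20.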

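What you can do instead: start with an empty list and append each row dictionary to it. Once all rows are collected, convert the list to a pandas DataFrame a single time at the end. That is much faster :)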

import timeit
import pandas as pd
from tqdm import tqdm


def some_generator():
    for _ in range(1_000_000):
        yield {
            'colA': 'valA',
            'colB': 'valA',
            'colC': 'valA',
            'colD': 'valA',
            'colE': 'valA',
            'colF': 'valA',
            'colG': 'valA',
            'colH': 'valA',
            'colI': 'valA',
            'colJ': 'valA'
        }


def main():
    batch_size = 10_000
    generator = some_generator()
    output = []
    batch_round = 1

    while True:

        for _ in tqdm(range(batch_size), desc=f"Batch {batch_round}"):

            try:
                row = next(generator)
                row.pop('colA')
                output.append(row)

            except StopIteration:
                break

        # A short batch means the generator is exhausted.
        if len(output) != batch_size * batch_round:
            break
        batch_round += 1

    # Build the DataFrame once, after all rows are collected.
    # print(pd.DataFrame(output))

main()

Output:

Batch 1: 100%|██████████| 10000/10000 [00:00<00:00, 826724.48it/s]
Batch 2: 100%|██████████| 10000/10000 [00:00<00:00, 978765.55it/s]
Batch 3: 100%|██████████| 10000/10000 [00:00<00:00, 1072629.72it/s]
Batch 4: 100%|██████████| 10000/10000 [00:00<00:00, 1267237.90it/s]
Batch 5: 100%|██████████| 10000/10000 [00:00<00:00, 1351301.27it/s]
Batch 6: 100%|██████████| 10000/10000 [00:00<00:00, 1402918.02it/s]
Batch 7: 100%|██████████| 10000/10000 [00:00<00:00, 1374370.54it/s]
Batch 8: 100%|██████████| 10000/10000 [00:00<00:00, 1435520.57it/s]
Batch 9: 100%|██████████| 10000/10000 [00:00<00:00, 1499947.79it/s]
Batch 10: 100%|██████████| 10000/10000 [00:00<00:00, 1458381.08it/s]
Batch 11: 100%|██████████| 10000/10000 [00:00<00:00, 1366178.30it/s]
Batch 12: 100%|██████████| 10000/10000 [00:00<00:00, 1396844.17it/s]
Batch 13: 100%|██████████| 10000/10000 [00:00<00:00, 1376309.76it/s]
Batch 14: 100%|██████████| 10000/10000 [00:00<00:00, 1453881.94it/s]
Batch 15: 100%|██████████| 10000/10000 [00:00<00:00, 1373245.59it/s]
Batch 16: 100%|██████████| 10000/10000 [00:00<00:00, 1470756.72it/s]
Batch 17: 100%|██████████| 10000/10000 [00:00<00:00, 1450964.82it/s]
Batch 18: 100%|██████████| 10000/10000 [00:00<00:00, 1495882.16it/s]
Batch 19: 100%|██████████| 10000/10000 [00:00<00:00, 1477960.46it/s]
Batch 20: 100%|██████████| 10000/10000 [00:00<00:00, 1479733.29it/s]
Batch 21: 100%|██████████| 10000/10000 [00:00<00:00, 1383528.17it/s]
Batch 22: 100%|██████████| 10000/10000 [00:00<00:00, 1361521.78it/s]
Batch 23: 100%|██████████| 10000/10000 [00:00<00:00, 1420594.07it/s]
Batch 24: 100%|██████████| 10000/10000 [00:00<00:00, 1468850.99it/s]
Batch 25: 100%|██████████| 10000/10000 [00:00<00:00, 1477960.46it/s]
Batch 26: 100%|██████████| 10000/10000 [00:00<00:00, 1055755.13it/s]
Batch 27: 100%|██████████| 10000/10000 [00:00<00:00, 952104.06it/s]
Batch 28: 100%|██████████| 10000/10000 [00:00<00:00, 1260231.96it/s]
Batch 29: 100%|██████████| 10000/10000 [00:00<00:00, 1433705.01it/s]
Batch 30: 100%|██████████| 10000/10000 [00:00<00:00, 1404703.44it/s]
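Since the question is about batching, a variation on the same idea may be worth sketching: build one DataFrame per batch and call pd.concat only once at the end. This is an illustration under assumptions, not part of the original answer. The helper `load_in_batches` and the `n_rows` parameter are hypothetical, and `itertools.islice` replaces the manual `next()` / `StopIteration` handling.

```python
from itertools import islice

import pandas as pd


def some_generator(n_rows=1_000_000):
    # Stand-in for the API, as in the question; n_rows is added for illustration.
    for _ in range(n_rows):
        yield {f"col{c}": "valA" for c in "ABCDEFGHIJ"}


def load_in_batches(rows, batch_size=10_000):
    """Collect rows batch by batch and concatenate once at the end."""
    frames = []
    while True:
        # islice takes at most batch_size rows and returns fewer
        # (or none) once the generator is exhausted.
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        frames.append(pd.DataFrame(batch).drop(columns="colA"))
    if not frames:
        return pd.DataFrame()
    # One concat over all batches copies each row once,
    # instead of once per appended row.
    return pd.concat(frames, ignore_index=True)


output = load_in_batches(some_generator(25_000))
print(output.shape)  # (25000, 9)
```

The last batch here holds only 5,000 rows, which shows that a short final batch needs no special handling.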

A similar question about Python for loops slowing down as the iteration count grows can be found on Stack Overflow: https://stackoverflow.com/questions/67292197/
