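I am trying to understand why a loop gets slower as the number of iterations increases. The code is a mock-up of real code that copies data from an API. I have to download the data in batches, because downloading everything at once runs out of memory. However, my batch loop implementation is far from ideal. I suspect that using pandas adds overhead, but what else could be causing the problem?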
import timeit
import pandas as pd
from tqdm import tqdm

def some_generator():
    for i in range(1_000_000):
        yield {
            'colA': 'valA',
            'colB': 'valA',
            'colC': 'valA',
            'colD': 'valA',
            'colE': 'valA',
            'colF': 'valA',
            'colG': 'valA',
            'colH': 'valA',
            'colI': 'valA',
            'colJ': 'valA'
        }

def main():
    batch_size = 10_000
    generator = some_generator()
    output = pd.DataFrame()
    batch_round = 1
    while True:
        for _ in tqdm(range(batch_size), desc=f"Batch {batch_round}"):
            try:
                row = next(generator)
                row.pop('colA')
                output = pd.concat([output, pd.DataFrame(row, index=[0])], ignore_index=True)
            except StopIteration:
                break
        if output.shape[0] != batch_size * batch_round:
            break
        else:
            batch_round += 1
    print(output)
This code simulates a 1M-row DataFrame. Downloading in batches of 10k rows, this is the performance I get for the first 20 batches.
Batch 1: 100%|██████████| 10000/10000 [00:21<00:00, 460.89it/s]
Batch 2: 100%|██████████| 10000/10000 [00:28<00:00, 349.16it/s]
Batch 3: 100%|██████████| 10000/10000 [00:38<00:00, 263.12it/s]
Batch 4: 100%|██████████| 10000/10000 [00:43<00:00, 228.76it/s]
Batch 5: 100%|██████████| 10000/10000 [00:53<00:00, 187.44it/s]
Batch 6: 100%|██████████| 10000/10000 [01:02<00:00, 159.92it/s]
Batch 7: 100%|██████████| 10000/10000 [01:09<00:00, 144.79it/s]
Batch 8: 100%|██████████| 10000/10000 [01:18<00:00, 127.59it/s]
Batch 9: 100%|██████████| 10000/10000 [01:25<00:00, 116.92it/s]
Batch 10: 100%|██████████| 10000/10000 [01:34<00:00, 105.96it/s]
Batch 11: 100%|██████████| 10000/10000 [01:40<00:00, 99.81it/s]
Batch 12: 100%|██████████| 10000/10000 [01:46<00:00, 93.92it/s]
Batch 13: 100%|██████████| 10000/10000 [01:55<00:00, 86.49it/s]
Batch 14: 100%|██████████| 10000/10000 [02:03<00:00, 80.92it/s]
Batch 15: 100%|██████████| 10000/10000 [02:10<00:00, 76.46it/s]
Batch 16: 100%|██████████| 10000/10000 [02:18<00:00, 71.99it/s]
Batch 17: 100%|██████████| 10000/10000 [02:25<00:00, 68.69it/s]
Batch 18: 100%|██████████| 10000/10000 [02:32<00:00, 65.57it/s]
Batch 19: 100%|██████████| 10000/10000 [02:42<00:00, 61.53it/s]
Batch 20: 100%|██████████| 10000/10000 [02:39<00:00, 62.84it/s]
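Best answer
pd.concat is expensive. Each call allocates a new DataFrame and copies every existing row into it, so the cost of adding one row grows with the number of rows already accumulated. That is why each batch runs slower than the one before it.
What you can do here: start with an empty list and append each row dictionary to it. Once all the rows have been collected, convert the list to a pandas DataFrame in a single step. That is far faster :)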
import timeit
import pandas as pd
from tqdm import tqdm

def some_generator():
    for _ in range(1_000_000):
        yield {
            'colA': 'valA',
            'colB': 'valA',
            'colC': 'valA',
            'colD': 'valA',
            'colE': 'valA',
            'colF': 'valA',
            'colG': 'valA',
            'colH': 'valA',
            'colI': 'valA',
            'colJ': 'valA'
        }

def main():
    batch_size = 10_000
    generator = some_generator()
    output = []  # plain list: appending a row does not copy earlier rows
    batch_round = 1
    while True:
        for _ in tqdm(range(batch_size), desc=f"Batch {batch_round}"):
            try:
                row = next(generator)
                row.pop('colA')
                output.append(row)
            except StopIteration:
                break
        shape = len(output)
        # A short batch means the generator is exhausted.
        if shape != batch_size * batch_round:
            break
        else:
            batch_round += 1
    # print(pd.DataFrame(output))

main()
Output:
Batch 1: 100%|██████████| 10000/10000 [00:00<00:00, 826724.48it/s]
Batch 2: 100%|██████████| 10000/10000 [00:00<00:00, 978765.55it/s]
Batch 3: 100%|██████████| 10000/10000 [00:00<00:00, 1072629.72it/s]
Batch 4: 100%|██████████| 10000/10000 [00:00<00:00, 1267237.90it/s]
Batch 5: 100%|██████████| 10000/10000 [00:00<00:00, 1351301.27it/s]
Batch 6: 100%|██████████| 10000/10000 [00:00<00:00, 1402918.02it/s]
Batch 7: 100%|██████████| 10000/10000 [00:00<00:00, 1374370.54it/s]
Batch 8: 100%|██████████| 10000/10000 [00:00<00:00, 1435520.57it/s]
Batch 9: 100%|██████████| 10000/10000 [00:00<00:00, 1499947.79it/s]
Batch 10: 100%|██████████| 10000/10000 [00:00<00:00, 1458381.08it/s]
Batch 11: 100%|██████████| 10000/10000 [00:00<00:00, 1366178.30it/s]
Batch 12: 100%|██████████| 10000/10000 [00:00<00:00, 1396844.17it/s]
Batch 13: 100%|██████████| 10000/10000 [00:00<00:00, 1376309.76it/s]
Batch 14: 100%|██████████| 10000/10000 [00:00<00:00, 1453881.94it/s]
Batch 15: 100%|██████████| 10000/10000 [00:00<00:00, 1373245.59it/s]
Batch 16: 100%|██████████| 10000/10000 [00:00<00:00, 1470756.72it/s]
Batch 17: 100%|██████████| 10000/10000 [00:00<00:00, 1450964.82it/s]
Batch 18: 100%|██████████| 10000/10000 [00:00<00:00, 1495882.16it/s]
Batch 19: 100%|██████████| 10000/10000 [00:00<00:00, 1477960.46it/s]
Batch 20: 100%|██████████| 10000/10000 [00:00<00:00, 1479733.29it/s]
Batch 21: 100%|██████████| 10000/10000 [00:00<00:00, 1383528.17it/s]
Batch 22: 100%|██████████| 10000/10000 [00:00<00:00, 1361521.78it/s]
Batch 23: 100%|██████████| 10000/10000 [00:00<00:00, 1420594.07it/s]
Batch 24: 100%|██████████| 10000/10000 [00:00<00:00, 1468850.99it/s]
Batch 25: 100%|██████████| 10000/10000 [00:00<00:00, 1477960.46it/s]
Batch 26: 100%|██████████| 10000/10000 [00:00<00:00, 1055755.13it/s]
Batch 27: 100%|██████████| 10000/10000 [00:00<00:00, 952104.06it/s]
Batch 28: 100%|██████████| 10000/10000 [00:00<00:00, 1260231.96it/s]
Batch 29: 100%|██████████| 10000/10000 [00:00<00:00, 1433705.01it/s]
Batch 30: 100%|██████████| 10000/10000 [00:00<00:00, 1404703.44it/s]
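The same idea can also be written so that the batching is explicit. The following is only a minimal sketch, not part of the original answer: `iter_batches` is a hypothetical helper name, the generator is shortened to 25,000 rows so it runs quickly, and `itertools.islice` is used to pull at most `batch_size` rows at a time. Each batch becomes one DataFrame, and `pd.concat` is called once at the end rather than once per row.

```python
from itertools import islice

import pandas as pd


def some_generator(n_rows=25_000):
    # Stand-in for the API: same row layout as in the question,
    # but fewer rows (an assumption made to keep the sketch fast).
    for _ in range(n_rows):
        yield {f"col{c}": "valA" for c in "ABCDEFGHIJ"}


def iter_batches(rows, batch_size):
    """Yield one DataFrame per batch of at most batch_size rows.

    Hypothetical helper, introduced here for illustration only.
    """
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return  # generator exhausted
        # One DataFrame construction per batch, not per row.
        yield pd.DataFrame(batch).drop(columns="colA")


def main():
    frames = list(iter_batches(some_generator(), batch_size=10_000))
    # A single concat at the end copies each row only once.
    return pd.concat(frames, ignore_index=True)


if __name__ == "__main__":
    print(main().shape)
```

If the complete result does not fit in memory, which was the original reason for batching, each batch could instead be written out as soon as it is produced (for example by appending to a file with `DataFrame.to_csv(path, mode="a")`) rather than being kept in `frames`. Whether that suits the real API workload is an assumption to check.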
Source: a similar question on Stack Overflow about a Python for loop slowing down as the number of iterations increases: https://stackoverflow.com/questions/67292197/