python - How do I process a DataFrame with millions of records using Python Pandas?

Tags: python pandas dataframe csv

I have a csv file containing roughly 8 million records, but the process takes more than an hour to complete. Could you please help me solve this?

Note: there is nothing wrong with the Python code itself; it runs fine without any errors. The only problem is that loading and processing the 8M records takes far too long.

Here is the code:

import pandas as pd
import numpy as np
import ipaddress
from pathlib import Path
import shutil
import os
from time import time

start = time()

inc_path = 'C:/Users/phani/OneDrive/Desktop/pandas/inc'
arc_path = 'C:/Users/phani/OneDrive/Desktop/pandas/arc'
dropZone_path = 'C:/Users/phani/OneDrive/Desktop/pandas/dropZone'

for src_file in Path(dropZone_path).glob('XYZ*.csv*'):
    process_file = shutil.copy(os.path.join(dropZone_path, src_file), arc_path)

for sem_file in Path(dropZone_path).glob('XYZ*.sem'):
    semaphore_file = shutil.copy(os.path.join(dropZone_path, sem_file), inc_path)

# archive a copy of each original file under a new name
for file in os.listdir(dropZone_path):
    file_path = os.path.join(dropZone_path, file)
    shutil.copy(file_path, os.path.join(arc_path, "Original_" + file))

for sema_file in Path(arc_path).glob('Original_XYZ*.sem*'):
    os.remove(sema_file)

# read the CSV file from the archive folder
df = pd.read_csv(process_file)
df = df.sort_values(["START_IP_ADDRESS"], ascending=True)

i = 0
while i < len(df) - 1:
    i += 1
    line = df.iloc[i:i + 1].copy(deep=True)
    curr_START_IP_NUMBER = line.START_IP_NUMBER.values[0]
    curr_END_IP_NUMBER = line.END_IP_NUMBER
    prev_START_IP_NUMBER = df.loc[i - 1, 'START_IP_NUMBER']
    prev_END_IP_NUMBER = df.loc[i - 1, 'END_IP_NUMBER']
    # if there is no gap, continue
    if curr_START_IP_NUMBER == (prev_END_IP_NUMBER + 1):
        continue
    # otherwise fill the gap
    # new line start ip number
    line.START_IP_NUMBER = prev_END_IP_NUMBER + 1
    line.START_IP_ADDRESS = ipaddress.ip_address(int(line.START_IP_NUMBER))
    # new line end ip number
    line.END_IP_NUMBER = curr_START_IP_NUMBER - 1
    line.END_IP_ADDRESS = ipaddress.ip_address(int(line.END_IP_NUMBER))
    line.COUNTRY_CODE = ''
    line.LATITUDE_COORDINATE = ''
    line.LONGITUDE_COORDINATE = ''
    line.ISP_NAME = ''
    line.AREA_CODE = ''
    line.CITY_NAME = ''
    line.METRO_CODE = ''
    line.ORGANIZATION_NAME = ''
    line.ZIP_CODE = ''
    line.REGION_CODE = ''
    # insert the line between the current index and the previous index
    df = pd.concat([df.iloc[:i], line, df.iloc[i:]]).reset_index(drop=True)

df.to_csv(process_file, index=False)

for process_file in Path(arc_path).glob('XYZ*.csv*'):
    EREFile_CSV = shutil.copy(os.path.join(arc_path, process_file), inc_path)

Best answer

You can use the Pandas library to read the .csv file in chunks, then either process each chunk separately or concatenate all the chunks into a single DataFrame (if you have enough RAM to hold all the data):

#read data in chunks of 1 million rows at a time
chunks = pd.read_csv(process_file, chunksize=1000000)

# Process each chunk
for chunk in chunks:
    # Process the chunk
    print(len(chunk))
    
# or concat the chunks in a single dataframe
#pd_df = pd.concat(chunks)
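As a concrete sketch of the chunked approach, using a small in-memory CSV with made-up columns in place of the real file:

```python
import pandas as pd
from io import StringIO

# Toy CSV standing in for the real 8M-row file (hypothetical data)
csv_data = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# Read in chunks of 4 rows; each chunk is an ordinary DataFrame
chunks = pd.read_csv(StringIO(csv_data), chunksize=4)

processed = []
for chunk in chunks:
    chunk["b"] = chunk["b"] + 1  # per-chunk processing
    processed.append(chunk)

# Reassemble the processed chunks into one DataFrame
result = pd.concat(processed, ignore_index=True)
print(len(result))  # 10
```

Because only one chunk is in memory at a time during processing, peak memory usage is bounded by the chunk size rather than the file size.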

Alternatively, you can use the Dask library, which can handle large datasets by chunking the DataFrame internally and processing the chunks in parallel:

from dask import dataframe as dd

dask_df = dd.read_csv(process_file)
# Dask is lazy; call .compute() to materialize a pandas DataFrame
result = dask_df.compute()
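Beyond chunked reading, the dominant cost in the question's code is the row-by-row `while` loop with a `pd.concat` inside it, which rebuilds the whole DataFrame on every insertion. The gap-filling step can instead be vectorized with `shift`. A minimal sketch, assuming the `START_IP_NUMBER`/`END_IP_NUMBER` columns from the question and toy data (the other columns would be filled with empty strings the same way):

```python
import pandas as pd

# Toy IP ranges with one gap between rows 1 and 2 (hypothetical data)
df = pd.DataFrame({
    "START_IP_NUMBER": [0, 10, 30],
    "END_IP_NUMBER":   [9, 19, 39],
})

df = df.sort_values("START_IP_NUMBER", ignore_index=True)

# A gap exists wherever a row's start exceeds the previous end + 1
prev_end = df["END_IP_NUMBER"].shift()
mask = df["START_IP_NUMBER"] > prev_end + 1

# Build all filler rows in one vectorized step
fillers = pd.DataFrame({
    "START_IP_NUMBER": (prev_end[mask] + 1).astype(int),
    "END_IP_NUMBER": df.loc[mask, "START_IP_NUMBER"] - 1,
})

# Append the fillers and re-sort, instead of inserting row by row
out = (pd.concat([df, fillers], ignore_index=True)
         .sort_values("START_IP_NUMBER", ignore_index=True))
print(out)
```

This replaces an O(n²) insert-per-row loop with a constant number of whole-column operations, which is usually the difference between hours and seconds at 8M rows.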

Regarding "python - How do I process a DataFrame with millions of records using Python Pandas?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70719146/
