I have a CSV file with roughly 8 million records, and the process takes over an hour to complete, so could you please help me solve this problem?
Note: there is nothing wrong with the Python code; it runs fine without any errors. The only problem is that loading and processing the 8M records takes far too long.
Here is the code:
import pandas as pd
import numpy as np
import ipaddress
from pathlib import Path
import shutil
import os
from time import time
start = time()
inc_path = 'C:/Users/phani/OneDrive/Desktop/pandas/inc'
arc_path = 'C:/Users/phani/OneDrive/Desktop/pandas/arc'
dropZone_path = 'C:/Users/phani/OneDrive/Desktop/pandas/dropZone'
for src_file in Path(dropZone_path).glob('XYZ*.csv*'):
    process_file = shutil.copy(os.path.join(dropZone_path, src_file), arc_path)

for sem_file in Path(dropZone_path).glob('XYZ*.sem'):
    semaphore_file = shutil.copy(os.path.join(dropZone_path, sem_file), inc_path)

# copy the original files to the archive under an "Original_" prefix
for file in os.listdir(dropZone_path):
    file_path = os.path.join(dropZone_path, file)
    shutil.copy(file_path, os.path.join(arc_path, "Original_" + file))

# remove the archived semaphore copies
for sema_file in Path(arc_path).glob('Original_XYZ*.sem*'):
    os.remove(sema_file)
## Read the CSV file copied into the archive folder
df = pd.read_csv(process_file)
# sort_values returns a new dataframe, so assign the result back;
# the gap-filling loop below relies on the rows being in order
df = df.sort_values(["START_IP_ADDRESS"], ascending=True).reset_index(drop=True)
i = 0
while i < len(df) - 1:
    i += 1
    line = df.iloc[i:i + 1].copy(deep=True)
    curr_START_IP_NUMBER = line.START_IP_NUMBER.values[0]
    curr_END_IP_NUMBER = line.END_IP_NUMBER.values[0]
    prev_START_IP_NUMBER = df.loc[i - 1, 'START_IP_NUMBER']
    prev_END_IP_NUMBER = df.loc[i - 1, 'END_IP_NUMBER']
    # if no gap - continue
    if curr_START_IP_NUMBER == (prev_END_IP_NUMBER + 1):
        continue
    # else fill the gap
    # new line start ip number
    line.START_IP_NUMBER = prev_END_IP_NUMBER + 1
    line.START_IP_ADDRESS = (ipaddress.ip_address(int(line.START_IP_NUMBER)))
    # new line end ip number
    line.END_IP_NUMBER = curr_START_IP_NUMBER - 1
    line.END_IP_ADDRESS = (ipaddress.ip_address(int(line.END_IP_NUMBER)))
    line.COUNTRY_CODE = ''
    line.LATITUDE_COORDINATE = ''
    line.LONGITUDE_COORDINATE = ''
    line.ISP_NAME = ''
    line.AREA_CODE = ''
    line.CITY_NAME = ''
    line.METRO_CODE = ''
    line.ORGANIZATION_NAME = ''
    line.ZIP_CODE = ''
    line.REGION_CODE = ''
    # insert the new line between the previous index and the current index
    df = pd.concat([df.iloc[:i], line, df.iloc[i:]]).reset_index(drop=True)
df.to_csv(process_file, index=False)
for process_file in Path(arc_path).glob('XYZ*.csv*'):
    EREFile_CSV = shutil.copy(os.path.join(arc_path, process_file), inc_path)
Best answer
You can use the Pandas library to read the .csv file in chunks and then either process each chunk individually or concatenate all of the chunks into a single dataframe (if you have enough RAM to hold all of the data):
# read the data in chunks of 1 million rows at a time
chunks = pd.read_csv(process_file, chunksize=1000000)
# Process each chunk
for chunk in chunks:
    # process the chunk here
    print(len(chunk))
# ...or concatenate the chunks into a single dataframe instead of looping
#pd_df = pd.concat(chunks)
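If each chunk can be processed on its own, the results can also be appended to the output file as each chunk is finished, so the full 8 million rows never have to be held in memory at once. A minimal sketch of that pattern (the output path and the per-chunk transform are placeholders, not from the original post):
out_path = 'XYZ_processed.csv'  # hypothetical output path
first = True
for chunk in pd.read_csv(process_file, chunksize=1000000):
    # ... transform the chunk here ...
    # append to the output file, writing the header only for the first chunk
    chunk.to_csv(out_path, mode='w' if first else 'a', header=first, index=False)
    first = False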
Alternatively, you can use the Dask library, which handles large datasets by partitioning the dataframe internally and processing the partitions in parallel:
from dask import dataframe as dd
dask_df = dd.read_csv(process_file)
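Keep in mind that Dask is lazy: dd.read_csv only builds a task graph, and no data is actually read until you ask for a result, for example by calling .compute() (which returns an ordinary pandas DataFrame) or by writing the partitions back to disk:
# materialize the result as a regular pandas DataFrame (needs enough RAM)
result = dask_df.compute()
# ...or write each partition to its own CSV without materializing everything
dask_df.to_csv('XYZ_processed-*.csv', index=False)
That said, chunked or parallel reading only speeds up the load step. In the code above, the dominant cost is more likely the while loop, which calls pd.concat once per gap and therefore rebuilds the entire dataframe on every insertion. The same gap-filling can be done with a handful of vectorized operations and a single concat at the end; this is only a sketch, assuming the column names from the question and rows ordered by START_IP_NUMBER:
df = df.sort_values('START_IP_NUMBER').reset_index(drop=True)
prev_end = df['END_IP_NUMBER'].shift(1)           # previous row's end, aligned with each row
has_gap = df['START_IP_NUMBER'] > (prev_end + 1)  # rows preceded by a gap
gaps = pd.DataFrame({
    'START_IP_NUMBER': (prev_end[has_gap] + 1).astype('int64'),
    'END_IP_NUMBER': df.loc[has_gap, 'START_IP_NUMBER'] - 1,
})
gaps['START_IP_ADDRESS'] = gaps['START_IP_NUMBER'].map(lambda n: str(ipaddress.ip_address(int(n))))
gaps['END_IP_ADDRESS'] = gaps['END_IP_NUMBER'].map(lambda n: str(ipaddress.ip_address(int(n))))
for col in ['COUNTRY_CODE', 'LATITUDE_COORDINATE', 'LONGITUDE_COORDINATE',
            'ISP_NAME', 'AREA_CODE', 'CITY_NAME', 'METRO_CODE',
            'ORGANIZATION_NAME', 'ZIP_CODE', 'REGION_CODE']:
    gaps[col] = ''  # gap rows carry no geo data, matching the original loop
# one concat for all gaps instead of one per gap, then restore the order
df = pd.concat([df, gaps]).sort_values('START_IP_NUMBER').reset_index(drop=True)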
Regarding "python - How to process a DataFrame with millions of records using Python Pandas?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70719146/