Python: How can I optimize this CSV parsing loop?

Tags: python parsing optimization

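I wrote this loop to parse a .csv file with 1 million lines. It works, but it only processes about 7k lines per minute. Is there a reasonable way to make it run faster?

The loop currently turns each block of data into a single row, strips the extra characters, and writes each row to a new .csv file.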

import csv
import re

pattern = re.compile(r",{2,}")

with open("OceanData.csv") as infile, open("OceanParsed.csv", "w", newline="") as fout:
    outfile = csv.writer(fout)
    data = []
    for line in infile:
        if line.startswith("#--------------------------------------------------------------------------------"):
            outfile.writerow(data)
            continue
        for ch in ["[", "]", "'", " ", "\n"]:
            if ch in line:
                line = line.replace(ch, "")
        for i in line:
            line = re.sub(pattern, ",", line)
            continue

        if not line: continue
        data.append(line)

Sample data: http://www.sharecsv.com/s/674dc42035c29eb4f250b5c2365c8dc6/OceanParseTest.csv

Best answer

Don't reinvent the wheel to read a csv file.

You can use pandas.

import pandas as pd

df = pd.read_csv('file.csv')

Or use the csv module from the standard library.
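As a minimal sketch of the standard-library route, the snippet below feeds two lines shaped like the sample metadata rows to csv.reader and drops the padding and the empty fields. The in-memory sample and the variable names are assumptions made for illustration, not part of the original answer.

```python
import csv
import io

# Hypothetical two-line sample in the same shape as the metadata rows.
sample = io.StringIO(
    "Latitude                    ,,-76.477,decimal degrees,,,\n"
    "Year                        ,,1997,,,,\n"
)

rows = []
for fields in csv.reader(sample):
    # Strip the padding and drop the empty fields left by consecutive commas.
    cleaned = [f.strip() for f in fields if f.strip()]
    rows.append(cleaned)

print(rows)
```

With a real file, the StringIO object would be replaced by a file opened with newline="".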

If the approaches above are not enough for a large csv file, you can split the file into smaller files and create one process to read each of them.
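One way to prepare such a split is to cut the input at section boundaries, so that no section is torn between two workers. The sketch below only shows the splitting step. The function names iter_sections and batches are hypothetical, and it assumes every section starts with a line beginning with "#-", as in the sample data.

```python
from itertools import islice

SEPARATOR = "#-"  # assumption: every section starts with a line beginning "#-"


def iter_sections(lines):
    """Yield each section as a list of lines, separator line included."""
    section = []
    for line in lines:
        if line.startswith(SEPARATOR) and section:
            yield section
            section = []
        section.append(line)
    if section:
        yield section


def batches(sections, size):
    """Group sections into lists of at most `size` sections each."""
    it = iter(sections)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Tiny hypothetical input: three sections of unequal length.
lines = ["#-----\n", "a,1\n", "b,2\n", "#-----\n", "c,3\n", "#-----\n", "d,4\n"]
sections = list(iter_sections(lines))
groups = list(batches(sections, 2))
print(len(sections), [len(g) for g in groups])
```

Each batch could then be written to its own file or handed to a worker process, for example through multiprocessing.Pool. Whether that pays off depends on how much work is done per section.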

Your data sample.

I don't think your file is really in csv format. Let's assume it is made of sections like this one:

#--------------------------------------------------------------------------------,,,,,,
CAST                        ,,9285001,WOD Unique Cast Number,WOD code,,
NODC Cruise ID              ,,US-10209       ,,,,
Originators Station ID      ,,82,,,integer,
Originators Cruise ID       ,,               ,,,,
Latitude                    ,,-76.477,decimal degrees,,,
Longitude                   ,,166.3137,decimal degrees,,,
Year                        ,,1997,,,,
Month                       ,,1,,,,
Day                         ,,1,,,,
Time                        ,,3.9931,decimal hours (UT),,,
METADATA,,,,,,
Country                     ,,             US,NODC code,UNITED STATES,,
Accession Number            ,,520,NODC code,,,
Project                     ,,406,NODC code,RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPONSE IN ROSS SEA,,
Platform                    ,,3596,OCL code,NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c.s.KUS1475;IMO900725,,
Institute                   ,,431,NODC code,US DOC NOAA NESDIS,,
Cast/Tow Number             ,,1,,,,
High resolution CTD - Bottle,,9182488,,,,
probe_type                  ,,7,OCL_code,bottle/rossette/net,,
scale            ,Temperature,103,WOD code,Temperature: ITS-90,,
Instrument       ,Temperature,411,WOD code,CTD: SBE 911plus (Sea-Bird Electronics, Inc.),
VARIABLES ,Depth     ,F,O,Temperatur ,F,O
UNITS     ,m         , , ,degrees C ,, 
Prof-Flag ,          ,0, ,          ,0, 
1,0,0, ,-1.591,0, 
2,5,0, ,-1.668,0, 
3,10,0, ,-1.702,0, 
4,15,0, ,-1.733,0, 
5,20,0, ,-1.746,0, 
6,25,0, ,-1.76,0, 
7,30,0, ,-1.773,0, 
8,35,0, ,-1.785,0, 
9,40,0, ,-1.796,0, 
10,45,0, ,-1.805,0, 
11,50,0, ,-1.813,0, 
12,55,0, ,-1.823,0, 
13,60,0, ,-1.832,0, 
14,65,0, ,-1.84,0, 
15,70,0, ,-1.848,0, 
16,75,0, ,-1.855,0, 
17,80,0, ,-1.861,0, 
18,85,0, ,-1.867,0, 
19,90,0, ,-1.873,0, 
20,95,0, ,-1.878,0, 
21,100,0, ,-1.882,0, 
22,125,0, ,-1.892,0, 
23,150,0, ,   ---0---,0, 
24,175,0, ,   ---0---,0, 
25,200,0, ,   ---0---,0, 
26,225,0, ,   ---0---,0, 
27,250,0, ,   ---0---,0, 
28,275,0, ,   ---0---,0, 
29,300,0, ,   ---0---,0, 
30,325,0, ,   ---0---,0, 
31,350,0, ,   ---0---,0, 
32,375,0, ,   ---0---,0, 
33,400,0, ,   ---0---,0, 
34,425,0, ,   ---0---,0, 
35,450,0, ,   ---0---,0, 
36,475,0, ,   ---0---,0, 
37,500,0, ,   ---0---,0, 
38,550,0, ,-1.898,0, 
END OF VARIABLES SECTION,,,,,,

Clean up this section with the following script:

format.sh:

#!/usr/bin/env bash
# usage: bash format.sh pathname

cat "$1" | \
    grep -v '^#\|^END' | \
    sed 's/,/ /g' | tr -s " " | sed 's/ /,/' 

This gives:

CAST,9285001 WOD Unique Cast Number WOD code 
NODC,Cruise ID US-10209 
Originators,Station ID 82 integer 
Originators,Cruise ID 
Latitude,-76.477 decimal degrees 
Longitude,166.3137 decimal degrees 
Year,1997 
Month,1 
Day,1 
Time,3.9931 decimal hours (UT) 
METADATA,
Country,US NODC code UNITED STATES 
Accession,Number 520 NODC code 
Project,406 NODC code RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPONSE IN ROSS SEA 
Platform,3596 OCL code NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c.s.KUS1475;IMO900725 
Institute,431 NODC code US DOC NOAA NESDIS 
Cast/Tow,Number 1 
High,resolution CTD - Bottle 9182488 
probe_type,7 OCL_code bottle/rossette/net 
scale,Temperature 103 WOD code Temperature: ITS-90 
Instrument,Temperature 411 WOD code CTD: SBE 911plus (Sea-Bird Electronics Inc.) 
VARIABLES,Depth F O Temperatur F O
UNITS,m degrees C 
Prof-Flag,0 0 
1,0 0 -1.591 0 
2,5 0 -1.668 0 
3,10 0 -1.702 0 
4,15 0 -1.733 0 
5,20 0 -1.746 0 
6,25 0 -1.76 0 
7,30 0 -1.773 0 
8,35 0 -1.785 0 
9,40 0 -1.796 0 
10,45 0 -1.805 0 
11,50 0 -1.813 0 
12,55 0 -1.823 0 
13,60 0 -1.832 0 
14,65 0 -1.84 0 
15,70 0 -1.848 0 
16,75 0 -1.855 0 
17,80 0 -1.861 0 
18,85 0 -1.867 0 
19,90 0 -1.873 0 
20,95 0 -1.878 0 
21,100 0 -1.882 0 
22,125 0 -1.892 0 
23,150 0 ---0--- 0 
24,175 0 ---0--- 0 
25,200 0 ---0--- 0 
26,225 0 ---0--- 0 
27,250 0 ---0--- 0 
28,275 0 ---0--- 0 
29,300 0 ---0--- 0 
30,325 0 ---0--- 0 
31,350 0 ---0--- 0 
32,375 0 ---0--- 0 
33,400 0 ---0--- 0 
34,425 0 ---0--- 0 
35,450 0 ---0--- 0 
36,475 0 ---0--- 0 
37,500 0 ---0--- 0 
38,550 0 -1.898 0 
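For readers who prefer to stay in Python, the same cleanup can be sketched as a small function that mirrors each stage of the format.sh pipeline. The name clean_line is hypothetical, and this is an illustration of the steps rather than a tuned replacement for the shell tools.

```python
import re


def clean_line(line):
    """Mirror format.sh for one line; return None for lines that grep drops."""
    if line.startswith(("#", "END")):            # grep -v '^#\|^END'
        return None
    line = line.rstrip("\n").replace(",", " ")   # sed 's/,/ /g'
    line = re.sub(r" {2,}", " ", line)           # tr -s " "
    return line.replace(" ", ",", 1)             # sed 's/ /,/' (first space only)


raw = [
    "#--------------------,,,,,,\n",
    "Latitude                    ,,-76.477,decimal degrees,,,\n",
    "1,0,0, ,-1.591,0, \n",
    "END OF VARIABLES SECTION,,,,,,\n",
]
cleaned = [c for c in map(clean_line, raw) if c is not None]
for row in cleaned:
    print(repr(row))
```

As in the shell output above, each cleaned line keeps a trailing space, and only the first separator is turned back into a comma.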

If you have 1M lines, I figure you have about 15,000 sections.

So I built a test file by repeating one section 15,000 times:

for _ in `seq 1 15000`; do cat one_section.txt >> data.txt; done

Check:

grep -n ^# data.txt | cut -d : -f1 | wc -l
wc -l data.txt
ls -sh data.txt   

This gives 15,000 sections, 960,000 lines (64 lines per section), and 34 MB.
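The same test setup can be reproduced without the shell. The sketch below repeats one section into a temporary file and counts sections and lines, which is roughly what the seq loop, grep and wc do above. The function names and the three-line stand-in section are assumptions for illustration; the real section has 64 lines.

```python
import os
import tempfile


def build_test_file(section_lines, copies, path):
    """Write `copies` repetitions of one section to `path`."""
    with open(path, "w") as out:
        for _ in range(copies):
            out.writelines(section_lines)


def count_sections_and_lines(path):
    """Count lines starting with '#' (one per section) and total lines."""
    sections = lines = 0
    with open(path) as f:
        for line in f:
            lines += 1
            if line.startswith("#"):
                sections += 1
    return sections, lines


# Hypothetical three-line stand-in for one section.
section = [
    "#-----,,,,,,\n",
    "CAST ,,9285001,,,,\n",
    "END OF VARIABLES SECTION,,,,,,\n",
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    build_test_file(section, 100, path)
    result = count_sections_and_lines(path)
print(result)
```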

...

Regarding "Python: How can I optimize this CSV parsing loop?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44809434/
