python - Cleaning a messy CSV with Python, saving in Excel

Tags: python pandas csv encoding utf-8

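I'm fairly new to Python and I'm trying to read a load of CSV files (about 100) into one DataFrame. However, the CSV files are quite messy: they use multiple delimiters, among other things. I searched this site, but nothing I found worked. I tried readlines and pd.read with many options, but all I get is errors or empty DataFrames. When I open a CSV in Excel it looks fine, and when I save it as a UTF-8 CSV everything works. However, doing this for every file is a lot of work, even with a macro. Is there a way to replicate this process with Python code (for example in2csv)? Below I have provided part of a CSV file I need to use, and part of a CSV file saved from Excel (which works). To me, the main difference is the space versus comma delimiter, but changing that in pd.read did not help. Many thanks!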

Messy CSV:

"Device name:UU-WGB-JV_1  Device type:SUN2000  Device address:IP Address=62.72.193.88   Device No.=2  Date:2018-01-23 08:51:23  "   
"Generated On"  "Device Status" "Energy Yield of Current Day (kWh)" "Inv. efficiency"(%)    "Total Energy Yield (kWh)"  "Input Power (kW)"  "Active Power (kW)" "Reactive Power (kVar)" "Power Factor"  "Grid Frequency (Hz)"   "Grid A Current (A)"    "Grid B Current (A)"    "Grid C Current (A)"    "Grid A Phase Voltage (V)"  "Grid B Phase Voltage (V)"  "Grid C Phase Voltage (V)"  "PV1 Input Current (A)" "PV2 Input Current (A)" "PV3 Input Current (A)" "PV4 Input Current (A)" "PV5 Input Current (A)" "PV6 Input Current (A)" "PV1 Input Voltage (V)" "PV2 Input Voltage (V)" "PV3 Input Voltage (V)" "PV4 Input Voltage (V)" "PV5 Input Voltage (V)" "PV6 Input Voltage (V)" "Cabinet Temperature (℃)"   
"2017-12-22 00:00:00    "   "Idle: No irradiation"  "0.00"  "0.00"  "45803.34"  "0.000" "0.000" "0.000" "0.000" "0.00"  "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"
"2017-12-22 00:15:00    "   "Idle: No irradiation"  "0.00"  "0.00"  "45803.34"  "0.000" "0.000" "0.000" "0.000" "0.00"  "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"   "0.0"

Good CSV:

Device name:UU-CB_1  Device type:SUN2000  Device address:IP Address=62.140.137.136   Device No.=1  Date:2018-01-22 13:31:51  ,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Generated On,Device Status,Energy Yield of Current Day (kWh),Inv. efficiency(%),Total Energy Yield (kWh),Input Power (kW),Active Power (kW),Reactive Power (kVar),Power Factor,Grid Frequency (Hz),Grid A Current (A),Grid B Current (A),Grid C Current (A),Grid A Phase Voltage (V),Grid B Phase Voltage (V),Grid C Phase Voltage (V),PV1 Input Current (A),PV2 Input Current (A),PV3 Input Current (A),PV4 Input Current (A),PV5 Input Current (A),PV6 Input Current (A),PV1 Input Voltage (V),PV2 Input Voltage (V),PV3 Input Voltage (V),PV4 Input Voltage (V),PV5 Input Voltage (V),PV6 Input Voltage (V),Cabinet Temperature (℃)
"2017-11-01 00:00:00    ",Idle: No irradiation,0,-,36670.07,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

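Best answer

It seems the first line (the title) cannot be salvaged as CSV, since it contains spaces and unquoted fields. That could be fixed with a specific regular expression. I would skip it.

The remaining lines are not CSV either, but they consist of quoted tokens separated by whitespace, which is a piece of cake for shlex.split: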
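If the device information in the title line is needed after all, the regex idea could look roughly like the sketch below. It is only an illustration, not part of the original answer: the helper name parse_title_line is hypothetical, and it assumes that the entries are separated by two or more spaces and that each entry splits into key and value at its first ":" or "=", as in the sample above.

```python
import re


def parse_title_line(line):
    """Return the key/value pairs found in the device-information line."""
    # Drop the surrounding quotes and the padding around them.
    text = line.strip().strip('"').strip()
    info = {}
    # Assumption: entries are separated by runs of two or more spaces.
    for part in re.split(r"\s{2,}", text):
        # Split at the first ":" or "=" only, so timestamps stay intact.
        pieces = re.split(r"[:=]", part, maxsplit=1)
        if len(pieces) == 2:
            info[pieces[0].strip()] = pieces[1].strip()
    return info


sample = (
    '"Device name:UU-WGB-JV_1  Device type:SUN2000  '
    'Device address:IP Address=62.72.193.88   Device No.=2  '
    'Date:2018-01-23 08:51:23  "   \n'
)
info = parse_title_line(sample)
print(info["Device name"])  # UU-WGB-JV_1
print(info["Date"])         # 2018-01-23 08:51:23
```

Files whose title line is laid out differently would need a different pattern.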

import csv
import shlex

with open("input.csv") as f:
    title = next(f)  # discard the device-information title line

    with open("output.csv", "w", newline="", encoding="utf-8") as fw:
        cw = csv.writer(fw, delimiter=";")  # may be changed to ","
        # every remaining line is a run of quoted tokens separated by whitespace
        cw.writerows(shlex.split(line) for line in f)

Output:

Generated On;Device Status;Energy Yield of Current Day (kWh);Inv. efficiency(%);Total Energy Yield (kWh);Input Power (kW);Active Power (kW);Reactive Power (kVar);Power Factor;Grid Frequency (Hz);Grid A Current (A);Grid B Current (A);Grid C Current (A);Grid A Phase Voltage (V);Grid B Phase Voltage (V);Grid C Phase Voltage (V);PV1 Input Current (A);PV2 Input Current (A);PV3 Input Current (A);PV4 Input Current (A);PV5 Input Current (A);PV6 Input Current (A);PV1 Input Voltage (V);PV2 Input Voltage (V);PV3 Input Voltage (V);PV4 Input Voltage (V);PV5 Input Voltage (V);PV6 Input Voltage (V);Cabinet Temperature (℃)
2017-12-22 00:00:00    ;Idle: No irradiation;0.00;0.00;45803.34;0.000;0.000;0.000;0.000;0.00;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0
2017-12-22 00:15:00    ;Idle: No irradiation;0.00;0.00;45803.34;0.000;0.000;0.000;0.000;0.00;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0

The file now opens properly in Excel (note that, depending on the version, Excel expects either a comma or a semicolon as the default separator).
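Since the question is really about getting the data into a DataFrame, the same shlex.split idea can also produce rows in memory rather than an intermediate file. The following is a minimal sketch under a few assumptions: the helper name parse_quoted_rows is hypothetical, the sample lines are shortened versions of the messy export, and the title line is assumed to have been skipped already. Stripping each field removes the trailing padding inside the timestamp column.

```python
import io
import shlex


def parse_quoted_rows(lines):
    """Split whitespace-separated, double-quoted tokens into stripped fields."""
    rows = []
    for line in lines:
        fields = [field.strip() for field in shlex.split(line)]
        if fields:  # skip blank lines
            rows.append(fields)
    return rows


# Two shortened lines in the same layout as the messy export.
sample = io.StringIO(
    '"Generated On"\t"Device Status"\t"Inv. efficiency"(%)\n'
    '"2017-12-22 00:00:00    "\t"Idle: No irradiation"\t"0.00"\n'
)
header, *records = parse_quoted_rows(sample)
print(header)   # ['Generated On', 'Device Status', 'Inv. efficiency(%)']
print(records)  # [['2017-12-22 00:00:00', 'Idle: No irradiation', '0.00']]
```

With pandas installed, pd.DataFrame(records, columns=header) would turn one file's rows into a DataFrame, and pd.concat could combine the frames from all files. Note that shlex.split follows shell quoting rules, so a stray apostrophe or backslash in the data would need separate handling.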


Regarding "python - Cleaning a messy CSV with Python, saving in Excel", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48404294/
