python - The simplest way to fix extra commas in one column of a CSV file

Tags: python excel pandas csv

I have a very large CSV file that looks like this:

rownum, id, first, last, age, ADDRESS, weight, hair, pet, food
1, 123450, John, Bingo, 47, 123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US, 299, red, cat, lasagna
2, 125379, Joe, Durante, 61, 19345 S. 1st Ave., Seattle, WA, 16748, 180, blonde, dog, hotdogs
3, 197572, James, Gringo, 39, 123 Maypole St., Northside Castle, upper east side, NY, NY 30594, 202, brown, dog, lo-mein
4, 129358, Jim, Dingus, 22, 0985 Martyr Ave, Fancytown, MA 49436, USA, 163, brown, goldfish, hamburgers
5, 987543, Dwayne 'The Rock', Johnson, 42, 555 Fitness Ln, Los Angeles, CA, 90210, 260, black, dog, steak
6, 048573, Jean, Grey, 33, 987 X-Men Rd., Rm. 3F, outside boston?, MA 34972, 130, red, <null>, salad
7, 756432, Jose, Cuervo, 59, 444 Jalisco Rd., agave_town, Mexico, not, sure, what, their zipcode system is?, 145, black, dog, margaritas
8, 845384, Junebug, Messerschmit, 2, 22nd Ave N, Boston, MA 45678, 130, blonde, turtle, lollipops
9, 634839, Jimbo, Humboldt, 99, 111 1st Street Kansas City KS 84638, 220, brown, ferrets, tacos
10, 483629, Julius, Caesar, 30, Emperors Estate in Ancient Rome, 145, brown, servants, grapes

Because of the extra commas, I am having trouble parsing the ADDRESS column. My desired output would look like this:

rownum| id| first| last| age| ADDRESS| weight| hair| pet| food
1| 123450| John| Bingo| 47| 123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US| 299| red| cat| lasagna
2| 125379| Joe| Durante| 61| 19345 S. 1st Ave., Seattle, WA, 16748| 180| blonde| dog| hotdogs
3| 197572| James| Gringo| 39| 123 Maypole St., Northside Castle, upper east side, NY, NY 30594| 202| brown| dog| lo-mein
4| 129358| Jim| Dingus| 22| 0985 Martyr Ave, Fancytown, MA 49436, USA| 163| brown| goldfish| hamburgers
5| 987543| Dwayne 'The Rock'| Johnson| 42| 555 Fitness Ln, Los Angeles, CA, 90210| 260| black| dog| steak
6| 048573| Jean| Grey| 33| 987 X-Men Rd., Rm. 3F, outside boston?, MA 34972| 130| red| <null>| salad
7| 756432| Jose| Cuervo| 59| 444 Jalisco Rd., agave_town, Mexico, not, sure, what, their zipcode system is?| 145| black| dog| margaritas
8| 845384| Junebug| Messerschmit| 2| 22nd Ave N, Boston, MA 45678| 130| blonde| turtle| lollipops
9| 634839| Jimbo| Humboldt| 99| 111 1st Street Kansas City KS 84638| 220| brown| ferrets| tacos
10| 483629| Julius| Caesar| 30| Emperors Estate in Ancient Rome| 145| brown| servants| grapes

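It doesn't have to be pipe-delimited; I just need a format that Excel can read correctly. I could not do this in Excel with the Text Import Wizard, but maybe I am missing something? Am I overlooking a simple solution?

At first, I thought I could simply do a regex find-and-replace in Notepad++ (e.g., replace the first "," with "|", go to the next line, repeat; run that 5 times, then do the same from the end of each line and run it 4 times). I couldn't get it to work.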

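Now I am trying to do this with Python's re module or pandas, but I haven't gotten very far because I am still new to Python. I don't have much experience with Python ETL, reading/writing CSV/text files, iterating a regex over each line, and so on.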
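For illustration, the find-and-replace idea could be expressed as a single regular expression in Python. This is only a sketch: the name `to_pipes` and the pattern are hypothetical, and it assumes every line has exactly 5 comma-free fields before the address and 4 comma-free fields after it.

```python
import re

# Group 1: the first 5 fields with their trailing commas.
# Group 2: the address, which may itself contain commas.
# Group 3: the last 4 fields with their leading commas.
PATTERN = re.compile(r"^((?:[^,]*,){5})(.*)((?:,[^,]*){4})$")

def to_pipes(line):
    """Turn only the 'real' delimiters of one line into pipes."""
    match = PATTERN.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"line has too few fields: {line!r}")
    before, address, after = match.groups()
    # Commas inside the address are left untouched.
    return before.replace(",", "|") + address + after.replace(",", "|")

print(to_pipes("2, 125379, Joe, Durante, 61, 19345 S. 1st Ave., Seattle, WA, 16748, 180, blonde, dog, hotdogs"))
```

Applied line by line (for example in a `for line in f:` loop over an open file), this would keep the spacing of the original data.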

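I'm sure there are different ways to solve this (e.g., adding double-quote escape characters after the 5th comma from the start and before the 4th comma from the end).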
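The quoting idea can be sketched with `str.split` and `str.rsplit`, which accept a maximum number of splits. The function name `quote_middle` and its parameters are hypothetical, and the sketch assumes each row has exactly 5 clean fields before the address and 4 after it.

```python
def quote_middle(line, n_before=5, n_after=4, delimiter=","):
    """Wrap the over-split middle column of one CSV line in double quotes."""
    # Peel n_before fields off the left; the remainder keeps its commas.
    *head, rest = line.rstrip("\n").split(delimiter, n_before)
    # Peel n_after fields off the right of what is left.
    middle, *tail = rest.rsplit(delimiter, n_after)
    if len(head) != n_before or len(tail) != n_after:
        raise ValueError(f"line has too few fields: {line!r}")
    # Double any embedded quotes, as CSV quoting conventions require.
    middle = '"' + middle.strip().replace('"', '""') + '"'
    fields = [f.strip() for f in head] + [middle] + [f.strip() for f in tail]
    return delimiter.join(fields)

print(quote_middle("4, 129358, Jim, Dingus, 22, 0985 Martyr Ave, Fancytown, MA 49436, USA, 163, brown, goldfish, hamburgers"))
```

The sketch also strips the blank after each comma, since a quote that follows a space is not always treated as a field quote by CSV readers.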

Here is the confused state of my Jupyter notebook so far:

## Idea01a: The user defines the delimiter character and proper qty per line

delimiter=input("Type delimiter example here, then press enter: ")
l=input("paste 1st line of csv here (e.g. column headers only), then press enter: ")
d={}
## print(l)
for i in l:
 if i not in d:
  d[i]=l.count(i)
 else:
  pass

qty_proper_delimiters_total = (d[delimiter])
print("Delimiter character chosen:")
print(delimiter)
print("Proper number of delimiters:")
print(qty_proper_delimiters_total)

## Idea01b: User defines problem column

bad_column=input("""Enter afflicted column number, then press enter: 
(e.g., A=1, B=2, C=3, D=4, E=5, F=6, G=7, etc.)""")
print(bad_column)
qty_proper_delimiters_before_bad_column = (int(bad_column)-1)
print("Proper qty commas BEFORE bad column:")
print(qty_proper_delimiters_before_bad_column)
qty_proper_delimiters_after_bad_column = (qty_proper_delimiters_total-(int(bad_column)-1))
print("Proper qty commas AFTER bad column:")
print(qty_proper_delimiters_after_bad_column)

## Idea02: Insert escape character just right/left of flanking commas 

## (iterate n1 times from line start, then n2 times backwards from line end)
n1 = qty_proper_delimiters_before_bad_column
n2 = qty_proper_delimiters_after_bad_column

txt = input("Copy/Paste entire CSV here:")
## need to figure out how to iterate line by line

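Thanks for your help.

Best answer

With the new data, I'll post an answer based on the idea I wrote in the comments: you can identify the address part by stripping off a fixed number of fields from the start and from the end. So try the following, and double-check that it does what you want.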

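import csv

# Keep the first 5 and last 4 fields; re-join everything in between as the address.
# Fields keep the blank that follows each comma in the input.
with open('foo.csv', newline='') as f:
    for record in csv.reader(f):
        print(*record[:5], ', '.join(record[5:-4]), *record[-4:], sep='|')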

Output:

rownum| id| first| last| age| ADDRESS| weight| hair| pet| food
1| 123450| John| Bingo| 47| 123 Odd St.,  Waverly Place Apts,  PO Box 12345,  Apt#5E,  Upper-Ontario,  Eastern Province A12-E765,  Not Puerto Rico,  US| 299| red| cat| lasagna
2| 125379| Joe| Durante| 61| 19345 S. 1st Ave.,  Seattle,  WA,  16748| 180| blonde| dog| hotdogs
3| 197572| James| Gringo| 39| 123 Maypole St.,  Northside Castle,  upper east side,  NY,  NY 30594| 202| brown| dog| lo-mein
4| 129358| Jim| Dingus| 22| 0985 Martyr Ave,  Fancytown,  MA 49436,  USA| 163| brown| goldfish| hamburgers
5| 987543| Dwayne 'The Rock'| Johnson| 42| 555 Fitness Ln,  Los Angeles,  CA,  90210| 260| black| dog| steak
6| 048573| Jean| Grey| 33| 987 X-Men Rd.,  Rm. 3F,  outside boston?,  MA 34972| 130| red| <null>| salad
7| 756432| Jose| Cuervo| 59| 444 Jalisco Rd.,  agave_town,  Mexico,  not,  sure,  what,  their zipcode system is?| 145| black| dog| margaritas
8| 845384| Junebug| Messerschmit| 2| 22nd Ave N,  Boston,  MA 45678| 130| blonde| turtle| lollipops
9| 634839| Jimbo| Humboldt| 99| 111 1st Street Kansas City KS 84638| 220| brown| ferrets| tacos
10| 483629| Julius| Caesar| 30| Emperors Estate in Ancient Rome| 145| brown| servants| grapes
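Since the question says the result need not be pipe-delimited, a variation on the same idea could write a regular, quoted CSV with `csv.writer` instead of printing. This is a sketch under assumptions, not part of the original answer: the helper name `repair_rows` is hypothetical, and `skipinitialspace=True` is used to drop the blank after each comma, which the code above keeps.

```python
import csv
import io

def repair_rows(lines, n_before=5, n_after=4):
    """Yield rows with the over-split middle fields merged back into one."""
    # skipinitialspace drops the blank that follows each comma in the sample data.
    for record in csv.reader(lines, skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        end = len(record) - n_after
        middle = ", ".join(record[n_before:end])
        yield record[:n_before] + [middle] + record[end:]

# A two-line excerpt of the question's data stands in for the real file.
sample = io.StringIO(
    "rownum, id, first, last, age, ADDRESS, weight, hair, pet, food\n"
    "2, 125379, Joe, Durante, 61, 19345 S. 1st Ave., Seattle, WA, 16748, 180, blonde, dog, hotdogs\n"
)
out = io.StringIO()
# The writer quotes any field that contains the delimiter.
csv.writer(out, lineterminator="\n").writerows(repair_rows(sample))
print(out.getvalue())
```

For a real file, the two `io.StringIO` objects would be replaced by file handles opened with `newline=''`, as the `csv` module documentation recommends.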

Regarding "python - The simplest way to fix extra commas in one column of a CSV file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/73464196/
