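I need to create a pandas.DataFrame from a CSV file. To do so, I am using pandas.read_csv(...). The problem with this file is that one or more columns contain commas in their values (I do not control the file format).
I have been trying to implement the solution from this question, but I get the following error: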
pandas.errors.EmptyDataError: No columns to parse from file
For some reason, after applying that solution, the CSV file I was trying to fix ends up blank.
Here is the code I am using:
import csv
import os
import pandas as pd

# fix csv file
with open("/Users/username/works/test.csv", 'rb') as f, \
        open("/Users/username/works/test.csv", 'wb') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        row = line.split(',', 4)
        writer.writerow(row)

# Manipulate csv file
data = pd.read_csv(os.path.expanduser("/Users/username/works/test.csv"),
                   error_bad_lines=False)
Any ideas?
Data overview:
Id0 Id 1 Id 2 Country Company Title Email
23 123 456 AR name cargador email@email.com
24 123 456 AR name Executive assistant email@email.com
25 123 456 AR name Asistente Administrativo email@email.com
26 123 456 AR name Atención al cliente vía telefónica vía online email@email.com
39 123 456 AR name Asesor de ventas email@email.com
40 123 456 AR name inc. International company representative email@email.com
41 123 456 AR name Vendedor de campo email@email.com
42 123 456 AR name PUBLICIDAD ATENCIÓN AL CLIENTE email@email.com
43 123 456 AR name Asistente de Marketing email@email.com
44 123 456 AR name SOLDADOR email@email.com
217 123 456 AR name Se requiere vendedores Loja Quevedo Guayas) email@email.com
218 123 456 AR name Ing. Civil recién graduado Yaruquí email@email.com
219 123 456 AR name ayudantes enfermeria email@email.com
220 123 456 AR name Trip Leader for International Youth Exchange email@email.com
221 123 456 AR name COUNTRY MANAGER / DIRECTOR COMERCIAL email@email.com
250 123 456 AR name Ayudante de Pasteleria email@email.com Asesor email@email.com email@email.com
The CSV before parsing:
#,Id 1,Id 2,Country,Company,Title,Email,,,,
23,123,456,AR,name,cargador,email@email.com,,,,
24,123,456,AR,name,Executive assistant,email@email.com,,,,
25,123,456,AR,name,Asistente Administrativo,email@email.com,,,,
26,123,456,AR,name,Atención al cliente vía telefónica , vía online,email@email.com,,,
39,123,456,AR,name,Asesor de ventas,email@email.com,,,,
40,123,456,AR,name, inc.,International company representative,email@email.com,,,
41,123,456,AR,name,Vendedor de campo,email@email.com,,,,
42,123,456,AR,name,PUBLICIDAD, ATENCIÓN AL CLIENTE,email@email.com,,,
43,123,456,AR,name,Asistente de Marketing,email@email.com,,,,
44,123,456,AR,name,SOLDADOR,email@email.com,,,,
217,123,456,AR,name,Se requiere vendedores,, Loja , Quevedo, Guayas),email@email.com
218,123,456,AR,name,Ing. Civil recién graduado, Yaruquí,email@email.com,,,
219,123,456,AR,name,ayudantes enfermeria,email@email.com,,,,
220,123,456,AR,name,Trip Leader for International Youth Exchange,email@email.com,,,,
221,123,456,AR,name,COUNTRY MANAGER / DIRECTOR COMERCIAL,email@email.com,,,,
250,123,456,AR,name,Ayudante de Pasteleria,email@email.com, Asesor,email@email.com,email@email.com,
251,123,456,AR,name,Ejecutiva de Ventas,email@email.com,,,,
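Best answer
If you can assume that any comma inside the Company value is followed by a space, and that all remaining stray commas fall in the column just before the e-mail address (Title), then a small parser can be written to handle the file.
Code: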
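To see what such a parser has to undo, here is a minimal sketch of how csv.reader splits one of the affected rows from the question. The two-line sample and the variable names are only for illustration; the point is that a stray comma shifts every later field one position to the right of its header.

```python
import csv
import io

# Header plus one affected row, copied from the CSV in the question.
raw = (
    '#,Id 1,Id 2,Country,Company,Title,Email,,,,\n'
    '40,123,456,AR,name, inc.,International company representative,'
    'email@email.com,,,\n'
)
header, row = list(csv.reader(io.StringIO(raw)))

title_column = header.index('Title')
email_column = header.index('Email')

# The comma in "name, inc." pushes the tail of Company into the Title slot
# and the real Title into the Email slot.
shifted_title = row[title_column]   # ' inc.' (note the leading space)
shifted_email = row[email_column]   # 'International company representative'
print(repr(shifted_title), repr(shifted_email))
```

The leading space in the shifted Title value and the missing "@" in the shifted Email value are the two signals the assumptions above rely on.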
import csv
import re

VALID_EMAIL = re.compile(r'[^@]+@[^@]+\.[^@]+')


def read_my_csv(file_handle):
    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and find the e-mail and title columns
    header = next(reader)
    email_column = header.index('Email')
    title_column = header.index('Title')

    # yield the header up to the e-mail column
    yield header[:email_column+1]

    # for each row, rebuild the columns that stray commas split apart
    for row in reader:

        # put the Company column back together
        while row[title_column].startswith(' '):
            row[title_column-1] += ',' + row[title_column]
            del row[title_column]

        # put the Title column back together
        while not VALID_EMAIL.match(row[email_column]):
            row[email_column-1] += ',' + row[email_column]
            del row[email_column]

        yield row[:email_column+1]
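Test code:
import pandas as pd

# newline='' is what the csv module expects; the old 'rU' mode no longer exists in Python 3.11+
with open("test.csv", newline='') as f:
    generator = read_my_csv(f)
    columns = next(generator)
    df = pd.DataFrame(generator, columns=columns)

print(df)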
Result:
      #  Id 1  Id 2  Country     Company  \
0    23   123   456       AR        name
1    24   123   456       AR        name
2    25   123   456       AR        name
3    26   123   456       AR        name
4    39   123   456       AR        name
5    40   123   456       AR  name, inc.
6    41   123   456       AR        name
7    42   123   456       AR        name
8    43   123   456       AR        name
9    44   123   456       AR        name
10  217   123   456       AR        name
11  218   123   456       AR        name
12  219   123   456       AR        name
13  220   123   456       AR        name
14  221   123   456       AR        name
15  250   123   456       AR        name
16  251   123   456       AR        name

                                               Title            Email
0                                           cargador  email@email.com
1                                Executive assistant  email@email.com
2                           Asistente Administrativo  email@email.com
3    Atención al cliente vía telefónica , vía online  email@email.com
4                                   Asesor de ventas  email@email.com
5               International company representative  email@email.com
6                                  Vendedor de campo  email@email.com
7                    PUBLICIDAD, ATENCIÓN AL CLIENTE  email@email.com
8                             Asistente de Marketing  email@email.com
9                                           SOLDADOR  email@email.com
10  Se requiere vendedores,, Loja , Quevedo, Guayas)  email@email.com
11               Ing. Civil recién graduado, Yaruquí  email@email.com
12                              ayudantes enfermeria  email@email.com
13      Trip Leader for International Youth Exchange  email@email.com
14              COUNTRY MANAGER / DIRECTOR COMERCIAL  email@email.com
15                            Ayudante de Pasteleria  email@email.com
16                               Ejecutiva de Ventas  email@email.com
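As for why the file in the question ends up blank: the code there opens the same path for reading and for writing, and opening a file in 'wb' (or 'w') mode truncates it immediately, so there is nothing left to read. If a repaired file on disk is wanted, one option is to write to a different path. The sketch below is only an illustration under the same assumptions as the parser above; repair_csv, its parameters, and the test_fixed.csv name are hypothetical and not part of the original answer.

```python
import csv
import os
import re
import tempfile

# Same loose e-mail pattern as in the answer above.
VALID_EMAIL = re.compile(r'[^@]+@[^@]+\.[^@]+')


def repair_csv(src_path, dst_path, encoding='utf-8'):
    """Hypothetical helper: copy src_path to dst_path, merging split fields.

    Assumes the header contains 'Title' and 'Email', that every data row has
    an e-mail-like field, and that src_path and dst_path are different files.
    """
    with open(src_path, newline='', encoding=encoding) as src, \
            open(dst_path, 'w', newline='', encoding=encoding) as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)  # quotes fields that contain commas
        header = next(reader)
        title = header.index('Title')
        email = header.index('Email')
        writer.writerow(header[:email + 1])
        for row in reader:
            # A Company continuation starts with a space.
            while row[title].startswith(' '):
                row[title - 1] += ',' + row[title]
                del row[title]
            # Fields before the first e-mail-like value belong to Title.
            while not VALID_EMAIL.match(row[email]):
                row[email - 1] += ',' + row[email]
                del row[email]
            writer.writerow(row[:email + 1])


# Small demonstration on two rows taken from the question.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'test.csv')
    dst = os.path.join(tmp, 'test_fixed.csv')
    with open(src, 'w', newline='', encoding='utf-8') as f:
        f.write('#,Id 1,Id 2,Country,Company,Title,Email,,,,\n'
                '40,123,456,AR,name, inc.,International company '
                'representative,email@email.com,,,\n'
                '42,123,456,AR,name,PUBLICIDAD, ATENCIÓN AL CLIENTE,'
                'email@email.com,,,\n')
    repair_csv(src, dst)
    with open(dst, newline='', encoding='utf-8') as f:
        fixed_rows = list(csv.reader(f))
```

Because csv.writer quotes any field that contains a comma, the repaired file has seven fields on every line and should be readable with a plain pd.read_csv call.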
Regarding "python - Parsing a CSV file with commas in pandas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44122091/