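python - Parsing a CSV file with pandas when values contain commas

Tags: python python-3.x pandas

I need to create a pandas.DataFrame from a CSV file. For that I am using pandas.read_csv(...). The problem with this file is that one or more columns contain commas in their values (I don't control the file format). I have been trying to implement the solution from this question, but I get the following error: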

pandas.errors.EmptyDataError: No columns to parse from file 

For some reason, after applying that solution, the CSV file I was trying to fix is blank.

Here is the code I am using:

# fix csv file
with open ("/Users/username/works/test.csv",'rb') as f,\
open("/Users/username/works/test.csv",'wb') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        row = line.split(',', 4)
        writer.writerow(row)
# Manipulate csv file
data = pd.read_csv(os.path.expanduser\
("/Users/username/works/test.csv"),error_bad_lines=False)

Any ideas?

Data overview:

 Id0    Id 1    Id 2 Country Company Title       Email
  23    123     456   AR     name    cargador   email@email.com
  24    123     456   AR     name    Executive assistant    email@email.com
  25    123     456   AR     name   Asistente Administrativo    email@email.com
  26    123     456   AR     name   Atención al cliente vía telefónica   vía online email@email.com
  39    123     456   AR     name   Asesor de ventas    email@email.com
  40    123     456   AR     name    inc.   International company representative    email@email.com
  41    123     456   AR     name   Vendedor de campo   email@email.com
  42    123     456   AR     name   PUBLICIDAD   ATENCIÓN AL CLIENTE    email@email.com
  43    123     456   AR     name   Asistente de Marketing  email@email.com
  44    123     456   AR     name   SOLDADOR    email@email.com
  217   123     456   AR     name   Se requiere vendedores       Loja    Quevedo     Guayas)    email@email.com
  218   123     456   AR     name   Ing. Civil recién graduado   Yaruquí    email@email.com
  219   123     456   AR     name   ayudantes enfermeria    email@email.com
  220   123     456   AR     name   Trip Leader for International Youth Exchange    email@email.com
  221   123     456   AR     name   COUNTRY MANAGER / DIRECTOR COMERCIAL    email@email.com
  250   123     456   AR     name   Ayudante de Pasteleria  email@email.com  Asesor email@email.com email@email.com

Pre-parsed CSV:

#,Id 1,Id 2,Country,Company,Title,Email,,,,
23,123,456,AR,name,cargador,email@email.com,,,,
24,123,456,AR,name,Executive assistant,email@email.com,,,,
25,123,456,AR,name,Asistente Administrativo,email@email.com,,,,
26,123,456,AR,name,Atención al cliente vía telefónica , vía online,email@email.com,,,
39,123,456,AR,name,Asesor de ventas,email@email.com,,,,
40,123,456,AR,name, inc.,International company representative,email@email.com,,,
41,123,456,AR,name,Vendedor de campo,email@email.com,,,,
42,123,456,AR,name,PUBLICIDAD, ATENCIÓN AL CLIENTE,email@email.com,,,
43,123,456,AR,name,Asistente de Marketing,email@email.com,,,,
44,123,456,AR,name,SOLDADOR,email@email.com,,,,
217,123,456,AR,name,Se requiere vendedores,, Loja , Quevedo, Guayas),email@email.com
218,123,456,AR,name,Ing. Civil recién graduado, Yaruquí,email@email.com,,,
219,123,456,AR,name,ayudantes enfermeria,email@email.com,,,,
220,123,456,AR,name,Trip Leader for International Youth Exchange,email@email.com,,,,
221,123,456,AR,name,COUNTRY MANAGER / DIRECTOR COMERCIAL,email@email.com,,,,
250,123,456,AR,name,Ayudante de Pasteleria,email@email.com, Asesor,email@email.com,email@email.com,
251,123,456,AR,name,Ejecutiva de Ventas,email@email.com,,,,

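Best answer

If you can assume that any comma inside Company is followed by a space, and that all remaining stray commas sit in the column immediately before the e-mail address (Title), then a small parser can be written to handle it.

Code: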

import csv
import re

VALID_EMAIL = re.compile(r'[^@]+@[^@]+\.[^@]+')

def read_my_csv(file_handle):
    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and find the e-mail and title columns
    header = next(reader)
    email_column = header.index('Email')
    title_column = header.index('Title')

    # yield the header up to the e-mail column
    yield header[:email_column+1]

    # for each row, rebuild the columns that were split on stray commas
    for row in reader:

        # for each row, put the Company column back together
        while row[title_column].startswith(' '):
            row[title_column-1] += ',' + row[title_column]
            del row[title_column]

        # for each row, put the Title column back together
        while not VALID_EMAIL.match(row[email_column]):
            row[email_column-1] += ',' + row[email_column]
            del row[email_column]
        yield row[:email_column+1]
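
One caveat with the loop above: it assumes every row eventually reaches a cell that looks like an e-mail address. A row without one would keep merging cells until it runs past the end of the list and raises IndexError. Below is a minimal sketch of a guarded variant of the same two merging steps. The helper name rebuild_row and the choice to return None for unrepairable rows are assumptions for illustration, not part of the original answer.

```python
import re

# Same loose e-mail pattern as in the answer above.
VALID_EMAIL = re.compile(r'[^@]+@[^@]+\.[^@]+')


def rebuild_row(row, title_column, email_column):
    """Merge wrongly split Company and Title cells back together.

    Hypothetical helper, not part of the original answer. Returns the
    repaired row cut off after the e-mail column, or None when no
    cell that looks like an e-mail address can be found.
    """
    row = list(row)  # work on a copy so the caller's list is untouched

    # Company fragments: cells at the Title position that start with a space.
    while title_column < len(row) and row[title_column].startswith(' '):
        row[title_column - 1] += ',' + row[title_column]
        del row[title_column]

    # Title fragments: every cell before the first one that looks like an e-mail.
    while email_column < len(row) and not VALID_EMAIL.match(row[email_column]):
        row[email_column - 1] += ',' + row[email_column]
        del row[email_column]

    if email_column >= len(row):
        return None  # ran out of cells without finding an e-mail
    return row[:email_column + 1]


# Row 40 from the sample data, as csv.reader would split it.
sample = ['40', '123', '456', 'AR', 'name', ' inc.',
          'International company representative', 'email@email.com', '', '', '']
repaired = rebuild_row(sample, title_column=5, email_column=6)
```

Inside read_my_csv, the caller could skip or log rows for which the helper returns None instead of letting the generator crash partway through the file.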

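Test code:

import pandas as pd

# newline='' is what the csv module expects; the old 'rU' mode no longer exists in Python 3.11+
with open("test.csv", newline='') as f:
    generator = read_my_csv(f)
    columns = next(generator)
    df = pd.DataFrame(generator, columns=columns)

print(df)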

Result:

      # Id 1 Id 2 Country     Company  \
0    23  123  456      AR        name
1    24  123  456      AR        name
2    25  123  456      AR        name
3    26  123  456      AR        name
4    39  123  456      AR        name
5    40  123  456      AR  name, inc.
6    41  123  456      AR        name
7    42  123  456      AR        name
8    43  123  456      AR        name
9    44  123  456      AR        name
10  217  123  456      AR        name
11  218  123  456      AR        name
12  219  123  456      AR        name
13  220  123  456      AR        name
14  221  123  456      AR        name
15  250  123  456      AR        name
16  251  123  456      AR        name

                                               Title            Email
0                                           cargador  email@email.com
1                                Executive assistant  email@email.com
2                           Asistente Administrativo  email@email.com
3    Atención al cliente vía telefónica , vía online  email@email.com
4                                   Asesor de ventas  email@email.com
5               International company representative  email@email.com
6                                  Vendedor de campo  email@email.com
7                    PUBLICIDAD, ATENCIÓN AL CLIENTE  email@email.com
8                             Asistente de Marketing  email@email.com
9                                           SOLDADOR  email@email.com
10  Se requiere vendedores,, Loja , Quevedo, Guayas)  email@email.com
11               Ing. Civil recién graduado, Yaruquí  email@email.com
12                              ayudantes enfermeria  email@email.com
13      Trip Leader for International Youth Exchange  email@email.com
14              COUNTRY MANAGER / DIRECTOR COMERCIAL  email@email.com
15                            Ayudante de Pasteleria  email@email.com
16                               Ejecutiva de Ventas  email@email.com
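
On the blank file from the question: the fix-up snippet opens the same path for reading and for writing. Opening a file in 'w' or 'wb' mode truncates it right away, so the loop has nothing left to read, the file stays empty, and read_csv then reports EmptyDataError. Under Python 3 the csv module also expects text-mode handles opened with newline='' rather than 'rb'/'wb'. Here is a minimal sketch of the effect and of the usual workaround, which is to write to a second file. The temporary directory, the two-line sample content and the name test_fixed.csv are made up for the demo.

```python
import csv
import os
import tempfile

# Hypothetical throwaway file standing in for test.csv.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "test.csv")
with open(src_path, "w", newline="") as f:
    f.write("a,b\n1,2\n")

# Problem: the second open() truncates the file before anything is read.
with open(src_path, newline="") as f, open(src_path, "w", newline="") as g:
    rows_seen = list(csv.reader(f))   # nothing left to read
size_after = os.path.getsize(src_path)  # the file is now empty

# Workaround: recreate the input, then read from one path and write to another.
with open(src_path, "w", newline="") as f:
    f.write("a,b\n1,2\n")
fixed_path = os.path.join(tmp_dir, "test_fixed.csv")
with open(src_path, newline="") as f, open(fixed_path, "w", newline="") as g:
    writer = csv.writer(g)
    for row in csv.reader(f):
        writer.writerow(row)
```

With the generator approach above no intermediate file is needed at all, since the rows are repaired in memory and passed straight to the DataFrame constructor.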

Regarding "python - Parsing a CSV file with pandas when values contain commas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44122091/
