python - ValueError: import data via chunks into pandas.csv_reader()

Tags: python pandas chunking

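I have a large gzip file that I would like to import into a pandas DataFrame. Unfortunately, the file has an uneven number of columns. The data looks roughly like this: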

.... Col_20: 25    Col_21: 23432    Col22: 639142
.... Col_20: 25    Col_22: 25134    Col23: 243344
.... Col_21: 75    Col_23: 79876    Col25: 634534    Col22: 5    Col24: 73453
.... Col_20: 25    Col_21: 32425    Col23: 989423
.... Col_20: 25    Col_21: 23424    Col22: 342421    Col23: 7    Col24: 13424    Col 25: 67
.... Col_20: 95    Col_21: 32121    Col25: 111231

As a test, I tried the following:

import pandas as pd
filename = 'path/to/filename.gz'

for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python'):
    print(chunk)

This is the error I received:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 795, in __next__
    return self.get_chunk()
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 836, in get_chunk
    return self.read(nrows=size)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 1761, in read
    alldata = self._rows_to_cols(content)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 2166, in _rows_to_cols
    raise ValueError(msg)
ValueError: Expected 18 fields in line 28, saw 22

How can I tell pandas.read_csv() to allocate a fixed number of columns?

Best answer

You can also try this:

for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python', error_bad_lines=False):
    print(chunk)

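error_bad_lines=False skips the malformed lines instead of raising (in pandas 1.3 and later the equivalent option is on_bad_lines='skip', and error_bad_lines was removed in pandas 2.0). I will see whether I can find a better alternative.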
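One possible alternative that keeps every row is to measure the widest row first and then give read_csv that many column names. The sketch below covers only the measuring step and uses just the standard library. The helper name max_field_count is hypothetical, and the code assumes the file is gzip-compressed, tab-separated text without unusual quoting.

```python
import csv
import gzip


def max_field_count(path, sep="\t"):
    """Return the largest number of fields found on any line of a gzip text file."""
    widest = 0
    # "rt" reads the compressed file as text; newline="" is what the csv module expects
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter=sep):
            widest = max(widest, len(row))
    return widest
```

With n = max_field_count(filename), a call such as pd.read_csv(filename, sep='\t', names=range(n), chunksize=10**5) should pad the shorter rows with NaN instead of raising. The price is one extra pass over the file.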

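Edit: To keep track of the lines that error_bad_lines would skip, we can catch the parser error and record each offending line number, so those rows can be added back to the DataFrame afterwards: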

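import pandas as pd

line     = []    # 0-based row numbers to skip on the next attempt
expected = []    # field counts the parser expected
saw      = []    # field counts the parser actually found
cont     = True

while cont:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        message = str(e)    # e.message does not exist in Python 3
        errortype = message.split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            # e.g. "Error tokenizing data. C error: Expected 18 fields in line 28, saw 22"
            cerror = message.split(':')[1].strip().replace(',', '')
            nums = [n for n in cerror.split(' ') if str.isdigit(n)]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror = 'Unknown'
            print('Unknown Error - 222')
            raise    # re-raise so an unrecognized error cannot loop forever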
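The string splitting above depends on the exact wording and punctuation of the message. As a slightly more robust sketch, the three numbers can be pulled out with a regular expression. The pattern is based only on the message shown in the traceback in the question ("Expected 18 fields in line 28, saw 22"), and the names FIELD_ERROR and parse_field_error are hypothetical.

```python
import re

# Matches the part of the message shown in the traceback, e.g.
# "Expected 18 fields in line 28, saw 22"
FIELD_ERROR = re.compile(r"Expected (\d+) fields in line (\d+), saw (\d+)")


def parse_field_error(message):
    """Return (expected, line, saw) as ints, or None if the message does not match."""
    match = FIELD_ERROR.search(message)
    if match is None:
        return None
    expected, line, saw = (int(group) for group in match.groups())
    return expected, line, saw
```

Inside the except branch, parse_field_error(str(e)) could replace the split-based parsing, and a None result would take the place of the "Unknown" case.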

Regarding python - ValueError: import data via chunks into pandas.csv_reader(), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39391597/
