python - Exception when reading a large tab-delimited file in chunks

Tags: python python-3.x

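I have a 350MB tab-delimited text file. If I try to read it into memory all at once, I get an out-of-memory exception. So I am trying something along these lines (i.e. reading only a few columns):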

import pandas as pd

input_file_and_path = r'C:\Christian\ModellingData\X.txt'

column_names = [
    'X1'
    # , 'X2'
]
raw_data = pd.DataFrame()
for chunk in pd.read_csv(input_file_and_path, names=column_names, chunksize=1000, sep='\t'):
    raw_data = pd.concat([raw_data, chunk], ignore_index=True)

print(raw_data.head())

Unfortunately, I get:

Traceback (most recent call last):
  File "pandas\_libs\parsers.pyx", line 1134, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/xxxx/EdaDataPrepRange1.py", line 17, in <module>
    for chunk in pd.read_csv(input_file_and_path, header=None, names=column_names, chunksize=1000, sep='\t'):
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1141, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte

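Any ideas? By the way, how do I generally deal with large files and impute, for example, missing values? Ultimately, I need to read everything to determine, e.g., the median to impute.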

Best answer

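Pass an explicit encoding to pd.read_csv. The default, encoding="utf-8", is what fails here: byte 0xae is not a valid UTF-8 start byte, so the file was most likely written in a single-byte encoding.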

The question referenced below used this encoding; see if it works for you: open(file_path, encoding='windows-1252')

Reference: 'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
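
To illustrate why the default decoding fails, here is a minimal sketch. The sample bytes are hypothetical and are not taken from the asker's file; try_decode is an illustrative helper name. It decodes a byte string containing 0xae under each encoding mentioned in this answer:

```python
# Hypothetical sample data: 0xae is the registered-trademark sign in
# single-byte Latin encodings, but is not a valid start byte in UTF-8.
raw = b"Acme\xae Corp"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_result = try_decode(raw, "utf-8")           # None: decoding raises
cp1252_result = try_decode(raw, "windows-1252")  # 'Acme® Corp'
latin1_result = try_decode(raw, "ISO-8859-1")    # 'Acme® Corp'
```

For this particular byte, windows-1252 and ISO-8859-1 agree, so either may work on a file like the one in the question.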

Working solution

Use encoding="ISO-8859-1".
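
Applied to the chunked read from the question, this might look like the sketch below. It assumes the file has no header row and that the wanted columns are selected by position. The helper names read_selected_columns and impute_median are made up for illustration. The median step is one possible approach to the imputation part of the question, which the original answer does not cover, and it assumes the selected columns fit in memory once combined.

```python
import pandas as pd


def read_selected_columns(path, positions, names,
                          encoding="ISO-8859-1", chunksize=1000):
    """Read chosen columns of a headerless tab-delimited file in chunks."""
    reader = pd.read_csv(
        path,
        sep="\t",
        header=None,        # assumption: the file has no header row
        usecols=positions,  # zero-based column positions to keep
        names=names,        # one name per kept column
        encoding=encoding,
        chunksize=chunksize,
    )
    # Concatenate once at the end instead of growing a frame inside the loop.
    return pd.concat(list(reader), ignore_index=True)


def impute_median(frame):
    """Fill missing values in numeric columns with each column's median."""
    return frame.fillna(frame.median(numeric_only=True))
```

Usage would be along the lines of raw_data = impute_median(read_selected_columns(input_file_and_path, [0], ['X1'])). Note that ISO-8859-1 maps every possible byte to a character, so it never raises a decode error; if the file was actually written in a different encoding, some characters may come out wrong without any warning.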

For "python - Exception when reading a large tab-delimited file in chunks", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51762885/
