I want to load the values in the "category" column into a pandas df. This is my TSV file:
Tagname text category
j245qzx_8 hamburger toppings f
h833uio_7 side of fries f
d423jin_2 milkshake combo d
This is my code:
with open(filename, 'r') as f:
    df = pd.read_csv(f, sep='\t')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    # append to the list, not to the pandas column
    categoryList.append(line)
However, I get a UnicodeDecodeError on the line df = pd.read_csv(f, sep='\t'), and my code stops there:
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
self._make_engine(self.engine)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2101, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 898: invalid start byte
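Any ideas on what causes this or how to fix it? My TSV doesn't seem to contain any special characters, so I'm not sure what is causing the problem or what to do about it.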
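Best answer
The fix
After reading this SO post, I think I understand what went wrong.
You are getting a file handle with Python's open() and passing it to pandas' read_csv(). That means open() determines the file's encoding.
So, try setting the encoding in open(), like this: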
import pandas as pd

with open(filename, 'r', encoding='windows-1252') as f:
    df = pd.read_csv(f, sep='\t')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    # append to the list, not to the pandas column
    categoryList.append(line)
Or, skip open() altogether and let read_csv() open the file:
import pandas as pd

df = pd.read_csv(filename, sep='\t', encoding='windows-1252')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    # append to the list, not to the pandas column
    categoryList.append(line)
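To try this without your real file, here is a minimal sketch. It assumes a made-up temporary TSV that reuses the three sample rows and has one stray 0x89 byte placed in the text column; the name category_list and the position of the byte are my own choices for illustration, not something taken from your data. It also uses Series.tolist() in place of the explicit loop.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample: the three rows from the question, with a stray 0x89
# byte added to one text field (assumed placement, for illustration only).
raw = (
    b"Tagname\ttext\tcategory\n"
    b"j245qzx_8\thamburger toppings\tf\n"
    b"h833uio_7\tside of fries \x89\tf\n"
    b"d423jin_2\tmilkshake combo\td\n"
)

# Write the bytes to a temporary file so read_csv has a real path to open.
with tempfile.NamedTemporaryFile(suffix=".tsv", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Tell pandas the file is Windows-1252 instead of the default UTF-8.
    df = pd.read_csv(path, sep="\t", encoding="windows-1252")
    # Series.tolist() turns the column into a plain Python list.
    category_list = df["category"].tolist()
finally:
    os.remove(path)

print(category_list)
```

If your file turns out to use a different encoding, only the encoding argument needs to change.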
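Some backstory
I echoed the byte \x89 onto the end of the sample and then ran the chardetect utility (installed with the Python chardet package), which reported Windows-1252: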
% echo -e '\x89' >> sample.csv
% cat sample.csv
Tagname text category
j245qzx_8 hamburger toppings f
h833uio_7 side of fries f
d423jin_2 milkshake combo d
�
% which chardetect
/Library/Frameworks/Python.framework/Versions/3.9/bin/chardetect
% chardetect sample.csv
sample.csv: Windows-1252 with confidence 0.73
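The same byte can be inspected from Python with only the standard library. This is a rough sketch on made-up bytes (the header line followed by a lone 0x89), not your actual file. It shows that UTF-8 rejects the byte with the same "invalid start byte" reason as in the traceback, that the exception reports the byte's offset, and that Windows-1252 decodes the byte without error.

```python
# Hypothetical bytes: the sample header followed by a stray 0x89 byte.
data = b"Tagname\ttext\tcategory\n\x89\n"

bad_offset = None
reason = None
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    bad_offset = exc.start  # offset of the first undecodable byte
    reason = exc.reason     # short description of why decoding failed

# Decode just the offending byte as Windows-1252 (cp1252 in Python).
as_cp1252 = data[bad_offset:bad_offset + 1].decode("cp1252")

print(bad_offset, reason, repr(as_cp1252))
```

On a real file, reading it with open(filename, 'rb') and running the same try/except on the bytes would point at the offset of the first problem byte, which helps when the file looks like it has no special characters.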
Source: "python - Load a column from a TSV file into a Python list", a similar question found on Stack Overflow: https://stackoverflow.com/questions/69950117/