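I have a URL pointing to a text file that I want to load into a Pandas DataFrame. However, there is some metadata at the top of the file that I can't get past when parsing, and the call fails with this error: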
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
Here is my code:
import pandas as pd
data = pd.read_csv('https://fred.stlouisfed.org/data/PERMIT.txt')
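This code usually works for me when there is no metadata at the top. How can I skip the metadata while loading?
The beginning of the txt file looks like this: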
Title: New Private Housing Units Authorized by Building Permits
Series ID: PERMIT
Source: U.S. Bureau of the Census, U.S. Department of Housing and Urban Development
Release: New Residential Construction
Seasonal Adjustment: Seasonally Adjusted Annual Rate
Frequency: Monthly
Units: Thousands of Units
Date Range: 1960-01-01 to 2018-03-01
Last Updated: 2018-04-24 7:01 AM CDT
Notes: Starting with the 2005-02-16 release, the series reflects an increase
in the universe of permit-issuing places from 19,000 to 20,000 places.
DATE VALUE
1960-01-01 1092
1960-02-01 1088
1960-03-01 955
1960-04-01 1016
1960-05-01 1052
1960-06-01 958
1960-07-01 999
1960-08-01 994
Best answer
Use the skiprows parameter to skip the metadata. In your case, there are 12 lines to skip:
data = pd.read_csv('https://fred.stlouisfed.org/data/PERMIT.txt', skiprows=12, sep=r'\s+')
>>> data.head()
DATE VALUE
0 1960-01-01 1092
1 1960-02-01 1088
2 1960-03-01 955
3 1960-04-01 1016
4 1960-05-01 1052
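Alternatively, use the header parameter to tell read_csv where the header is (row 11; header counts rows from zero and, by default, ignores blank lines):
data = pd.read_csv('https://fred.stlouisfed.org/data/PERMIT.txt', header=11, sep=r'\s+')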
>>> data.head()
DATE VALUE
0 1960-01-01 1092
1 1960-02-01 1088
2 1960-03-01 955
3 1960-04-01 1016
4 1960-05-01 1052
If you don't know how many rows to skip, you can implement the strategy used in this answer.
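One possible way to handle an unknown number of metadata lines is to scan the text for the header line first and pass its position to skiprows. The sketch below is only an illustration, not necessarily the strategy from the linked answer: the helper name find_header_row is hypothetical, it assumes the header line always starts with DATE, and it uses a short in-memory sample abridged from the question instead of downloading the real file.

```python
import io

import pandas as pd


def find_header_row(lines, first_column="DATE"):
    """Return the 0-based index of the first line whose first token is first_column.

    Hypothetical helper: assumes the column header starts with a known name.
    """
    for index, line in enumerate(lines):
        tokens = line.split()
        if tokens and tokens[0] == first_column:
            return index
    raise ValueError(f"no header line starting with {first_column!r}")


# In-memory stand-in for the downloaded file (abridged; the layout is an assumption).
sample_text = """Title: New Private Housing Units Authorized by Building Permits
Series ID: PERMIT
Frequency: Monthly

DATE VALUE
1960-01-01 1092
1960-02-01 1088
1960-03-01 955
"""

# skiprows counts physical lines (blank ones included), which matches
# the index returned by enumerate() over the raw lines.
header_row = find_header_row(sample_text.splitlines())
data = pd.read_csv(io.StringIO(sample_text), skiprows=header_row, sep=r"\s+")
print(data.head())
```

For the URL in the question, the text would have to be downloaded first (for example with urllib.request) before scanning it this way; that step is left out here.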
Regarding "python - How to ignore junk data when extracting data from a text file into Pandas?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50355400/