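I am trying to use pandas.read_csv to read a bz2-compressed CSV file from a URL (in this case a GitHub raw URL) in an IPython notebook, but I get the following error: "Python cannot read bz2 from open file handle".
I have done some research, but I can't figure out what is going wrong. I have decompressed the bz2 files manually, so I know they are not corrupted, and I know the URLs are well-formed: if I paste them into a browser, the files download correctly.
Here is the code from my IPython notebook: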
import pandas, bz2, urllib2
datafiles = {}
for filename in csvData['Filename']:
    csvUrl = 'https://raw.github.com/hawkw/traverse/master/data/' + filename
    try:
        datafiles[filename] = pandas.read_csv(csvUrl, compression='bz2')
        print (datafiles[filename])
    except Exception as e:
        print("Caught error \"{error}\" at {url}".format(error=e, url=csvUrl))
Output:
Caught error "Python cannot read bz2 from open file handle" at https://raw.github.com/hawkw/traverse/master/data/-Users-hawk-_2014-02-05.csv.bz2
Caught error "Python cannot read bz2 from open file handle" at https://raw.github.com/hawkw/traverse/master/data/-Users-Owner_2014-02-05.csv.bz2
Caught error "Python cannot read bz2 from open file handle" at https://raw.github.com/hawkw/traverse/master/data/-Users-will_2014-02-05.csv.bz2
Caught error "Python cannot read bz2 from open file handle" at https://raw.github.com/hawkw/traverse/master/data/-Users-hawk_2014-02-06.csv.bz2
Caught error "Python cannot read bz2 from open file handle" at https://raw.github.com/hawkw/traverse/master/data/-Users-hawk-Documents_2014-02-06.csv.bz2
Caught error "Python cannot read bz2 from open file handle" at https://raw.github.com/hawkw/traverse/master/data/-home-w-weismanm_2014-02-05.csv.bz2
Caught error "Python cannot read bz2 from open file handle" at https://raw.github.com/hawkw/traverse/master/data/-home-w-weismanm_2014-02-06.csv.bz2
Does anyone know what I am doing wrong?
Edit: I tried opening the file with urllib2 like this, as @edchum suggested:
datafiles = {}
for filename in csvData['Filename']:
    url = 'https://raw.github.com/hawkw/traverse/master/data/' + filename
    try:
        response = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        print ("Caught HTTPError", e)
    else:
        try:
            datafiles[filename] = pandas.read_csv(response, compression='bz2')
            print (datafiles[filename])
        except Exception as e:
            print("Caught error \"{0}\" at {1}".format(e,url))
But it still does not work and fails with the same error. As a side note, the documentation for pandas.read_csv() says it can open files from URLs.
Best answer
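If for some reason you absolutely cannot simply download the files and have to keep pulling them from the remote source (which is not very neighborly), you can do it all in memory if you want: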
>>> import pandas as pd
>>> import io
>>> import urllib2
>>> import bz2
>>>
>>> url = "https://github.com/hawkw/traverse/blob/master/data/-Users-hawk_2014-02-06.csv.bz2?raw=true"
>>> raw_data = urllib2.urlopen(url).read()
>>> data = bz2.decompress(raw_data)
>>> df = pd.read_csv(io.BytesIO(data))
>>> df.head()
                                                path  st_mode    st_ino  \
0  /Users/hawk/Library/minecraft/bin/minecraft/mo...    33261  59612469
1  /Users/hawk/Library/Application Support/Google...    16832  91818463
2  /Users/hawk/Library/Caches/Metadata/Safari/His...    33188  95398522
3  /Users/hawk/Documents/Minecraft1.6.4/assets/so...    33188  90620503
4  /Users/hawk/Library/Caches/Metadata/Safari/His...    33188  96129272

     st_dev  st_nlink  st_uid  st_gid  st_size    st_atime    st_mtime  \
0  16777219         1     501      20     2626  1370201925  1366983504
1  16777219         3     501      20      102  1391697388  1384638452
2  16777219         1     501      20    36758  1389032348  1389032363
3  16777219         1     501      20    12129  1387000073  1384114141
4  16777219         1     501      20      170  1390545751  1390545751

     st_ctime
0  1368736019
1  1384638452
2  1389032363
3  1384114141
4  1390545751

[5 rows x 11 columns]
You will want to pass the appropriate encoding argument to pd.read_csv here.
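On Python 3, where urllib2 no longer exists, the same in-memory approach might look roughly like the sketch below. The helper names read_bz2_csv_bytes and read_bz2_csv_url are hypothetical, the default encoding of "utf-8" is an assumption, and the sample rows are made up so that the round trip runs without network access; they are not taken from the real files.

```python
import bz2
import io
import urllib.request

import pandas as pd


def read_bz2_csv_bytes(raw_data, encoding="utf-8"):
    """Decompress bz2 bytes in memory and parse the result as CSV."""
    data = bz2.decompress(raw_data)
    # Wrap the decompressed bytes in a file-like object for read_csv.
    return pd.read_csv(io.BytesIO(data), encoding=encoding)


def read_bz2_csv_url(url, encoding="utf-8"):
    """Download a bz2-compressed CSV and parse it without writing to disk."""
    with urllib.request.urlopen(url) as response:
        raw_data = response.read()
    return read_bz2_csv_bytes(raw_data, encoding=encoding)


# Offline round trip with made-up sample data (assumption: UTF-8 content).
sample_csv = "path,st_size\n/tmp/a,10\n/tmp/b,20\n"
sample = bz2.compress(sample_csv.encode("utf-8"))
df = read_bz2_csv_bytes(sample)
print(df)
```

To read one of the remote files, pass its raw URL to read_bz2_csv_url. Keeping the download separate from the parsing step means the decompression logic can be checked locally before any request is made.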
Regarding "python - Error reading a compressed CSV file from a URL with pandas.read_csv", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21609299/