Python urllib2 reads only part of the document

OK, this is driving me nuts.

I am trying to read from the Crunchbase API using Python's urllib2 library. Relevant code:

import urllib2

api_url = "http://api.crunchbase.com/v/1/financial-organization/venrock.js"
len(urllib2.urlopen(api_url).read())

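The result is either 73493 or 69397. The actual document is much longer. When I try this on a different computer, the length is either 44821 or 40725. I have tried changing the user-agent, using urllib instead, increasing the timeout to a very large number, and reading small chunks at a time. The result is always the same.

I assumed it was a server problem, but my browser reads the whole thing.

The ~40k lengths come from Python 2.7.2 on OS X 10.6.8. The ~70k lengths come from Python 2.7.1 running under iPython on OS X 10.7.3. Thoughts?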

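Best answer

There is something kooky about that server. It might work if you request the file with gzip encoding, as your browser does. Here is some code that should do the trick: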

import urllib2, gzip

api_url='http://api.crunchbase.com/v/1/financial-organization/venrock.js'
req = urllib2.Request(api_url)
req.add_header('Accept-encoding', 'gzip')
resp = urllib2.urlopen(req)
data = resp.read()

>>> print len(data)
26610

The next step is to decompress the data.

from StringIO import StringIO

if resp.info().get('Content-Encoding') == 'gzip':
    g = gzip.GzipFile(fileobj=StringIO(data))
    data = g.read()

>>> print len(data)
183159
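
The check-and-decompress step above can be wrapped in a small helper. This is only a sketch: the name decode_body is hypothetical, it uses io.BytesIO in place of StringIO so that it also runs on Python 3, and the sample payload is made up so the helper can be tried without contacting the server.

```python
import gzip
import io


def decode_body(raw, content_encoding):
    """Gunzip raw if the Content-Encoding header says gzip; otherwise return it unchanged."""
    # decode_body is a hypothetical helper name, not part of urllib2.
    if content_encoding and content_encoding.strip().lower() == 'gzip':
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw


# Offline self-check: gzip a made-up payload in memory instead of hitting the network.
sample = b'{"name": "Venrock"}' * 100
buf = io.BytesIO()
gz = gzip.GzipFile(fileobj=buf, mode='wb')
gz.write(sample)
gz.close()
compressed = buf.getvalue()

restored = decode_body(compressed, 'gzip')
```

With the request code above, the call would presumably look like data = decode_body(resp.read(), resp.info().get('Content-Encoding')). Passing the header value in, rather than assuming gzip, keeps the helper safe if the server ignores the Accept-encoding header and sends plain text.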

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/10890426/
