python - Unicode characters replaced with question marks when scraping a web page in Python 2.7.10

I have some simple Python code that scrapes a web page using urllib2:

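import urllib2

response = urllib2.urlopen(url)
# Take the charset from the Content-Type header, e.g. "text/html; charset=utf-8"
charset = response.headers.getheader("Content-Type")
charset = charset[charset.index("charset=") + 8:]
html = response.read()
# Collapse every run of whitespace into a single space
html = " ".join(html.split())
html = html.decode(charset)
# Undo the HTML entity escaping for ampersands and apostrophes
html = html.replace("&amp;", "&").replace("&#039;", "'")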
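The slice in the snippet above raises ValueError when the Content-Type header has no "charset=" parameter, and it keeps any parameters that follow the charset. A minimal sketch of a more forgiving parser follows. The function name charset_from_content_type and the ISO-8859-1 fallback are assumptions for illustration and are not part of the original code.

```python
def charset_from_content_type(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper: falls back to `default` when the header is
    missing or carries no charset parameter.
    """
    if not content_type:
        return default
    # Parameters follow the media type and are separated by semicolons.
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.strip().lower() == "charset" and value:
            # Drop optional quotes and normalise the case.
            return value.strip().strip("\"'").lower()
    return default
```

With this helper, the two charset lines in the snippet could become charset = charset_from_content_type(response.headers.getheader("Content-Type")).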

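My problem is that the page I am scraping contains Te Reo Māori words, so it has a lot of words with macrons in them, e.g. "Pūtaiao". When I print the HTML, all of the macronised letters are replaced with question marks, even though I am not using any of the "replace" decoding options. It happens even without any of the decoding, splitting or joining.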

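There is another page on the same site that contains some of the same words, and there the macrons display perfectly fine in Python. I also noticed that the charset in that page's response header is utf-8, while the page with the question marks reports ISO-8859-1, so that may have something to do with it.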
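One plausible reading of the charset difference, which nothing in this thread confirms, is that the server converts the page to ISO-8859-1 before sending it. That encoding has no code point for "ū" (U+016B), so a lossy conversion would put a literal "?" into the bytes. The sketch below only illustrates that effect locally. It assumes the substitution happens on the server and makes no request.

```python
# -*- coding: utf-8 -*-
word = u"P\u016btaiao"  # "Pūtaiao", the example word from the question

# UTF-8 can represent the macron, so the round trip is lossless.
utf8_bytes = word.encode("utf-8")
assert utf8_bytes.decode("utf-8") == word

# ISO-8859-1 cannot, so a "replace" conversion substitutes "?".
latin1_bytes = word.encode("iso-8859-1", "replace")  # b"P?taiao"

# Decoding on the client cannot bring the original letter back.
recovered = latin1_bytes.decode("iso-8859-1")  # u"P?taiao"
```

If this is what happens, no choice of decoding on the client side can restore the macrons, which would match the observation that the question marks appear even without any decoding.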

The link to the page with the question marks is http://www.nzqa.govt.nz/ncea/assessment/search.do?query=reo+maori&view=all&level=01 .

The other page is http://www.nzqa.govt.nz/qualifications-standards/qualifications/ncea/subjects/

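Best answer

It seems the server responds with the wrong content type when it does not recognise the user agent the request comes from. When I tried it on my machine, I got a similar result.

After adding a valid User-Agent to the request headers, I was able to get the response correctly encoded as utf-8. I am not sure this is the best way to handle the situation, but it should get your code working. For example: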

req = urllib2.Request(url, headers={"Connection": "keep-alive", "User-Agent": "Mozilla/5.0"})
response = urllib2.urlopen(req)
# After this, the rest of your original code follows unchanged.
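To check whether the User-Agent header really changes what the server declares, the request can be wrapped in small helpers. This is a sketch, not the answerer's code: build_request and fetch_declared_content_type are hypothetical names, and the import fallback is only there so the same lines also run on Python 3, where urllib2 became urllib.request.

```python
try:
    from urllib2 import Request, urlopen         # Python 2, as in the question
except ImportError:
    from urllib.request import Request, urlopen  # Python 3

def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

def fetch_declared_content_type(url):
    """Fetch url and return (Content-Type header value, raw body bytes)."""
    response = urlopen(build_request(url))
    try:
        return response.headers.get("Content-Type"), response.read()
    finally:
        response.close()
```

Calling fetch_declared_content_type on the search page and comparing the returned header with the one from a plain urlopen(url) would show whether the charset switches from ISO-8859-1 to utf-8, as the answer reports.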

This question about Unicode characters being replaced with question marks when scraping a web page in Python 2.7.10 comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/32155000/
