Python urllib.request and UTF-8 decoding problem

I'm writing a simple Python CGI script that fetches a web page and displays the HTML in a web browser (acting as a proxy). Here is the script:

#!/usr/bin/env python3.0

import urllib.request

site = "http://reddit.com/"
site = urllib.request.urlopen(site)
site = site.read()
site = site.decode('utf8')

print("Content-type: text/html\n\n")
print(site)

The script works fine when run from the command line, but when I view it in a web browser it shows a blank page. This is the error I get in Apache's error_log:

Traceback (most recent call last):
  File "/home/public/projects/proxy/script.cgi", line 11, in <module>
    print(site)
  File "/usr/local/lib/python3.0/io.py", line 1491, in write
    b = encoder.encode(s)
  File "/usr/local/lib/python3.0/encodings/ascii.py", line 22, in encode
    return codecs.ascii_encode(input, self.errors)[0]
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 33777: ordinal not in range(128)

Best answer

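When you print it at the command line, you are printing a Unicode string to the terminal. The terminal has an encoding, so Python encodes your Unicode string into that encoding. That works fine.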

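When you run it as CGI, you end up printing to a stdout that has no encoding defined, so Python tries to encode the string as ASCII. That fails because ASCII does not contain all the characters you are trying to print (here '\u2019', a right single quotation mark), which is why you get the error above.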
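The difference between the two environments can be reproduced without Apache. The sketch below is an illustration and not part of the original answer: it wraps an in-memory byte buffer in `io.TextIOWrapper`, once with ASCII (the fallback Python used in the CGI environment here) and once with UTF-8. The helper name `try_write` and the sample string are made up for this example.

```python
import io

def try_write(text, encoding):
    """Write text through a text stream with the given encoding.

    Returns the resulting bytes, or the UnicodeEncodeError if encoding fails.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return exc
    return raw.getvalue()

sample = "It\u2019s a test"  # contains U+2019, the character from the traceback

ascii_result = try_write(sample, "ascii")   # a UnicodeEncodeError instance
utf8_result = try_write(sample, "utf-8")    # the UTF-8 encoded bytes

print(type(ascii_result).__name__)
print(utf8_result)
```

The ASCII stream fails on the same character as in the traceback, while the UTF-8 stream encodes it as three bytes.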

The fix is to encode the string yourself (why not UTF-8?) and to declare that encoding in the Content-Type header.

So something like this:

import sys
# The header parameter that declares the encoding is "charset".
sys.stdout.buffer.write(b"Content-Type: text/html; charset=UTF-8\n\n")
sys.stdout.buffer.write(site.encode('UTF8'))

Under Python 2, this also works:

print("Content-Type: text/html; charset=UTF-8\n\n") # The parameter name is "charset".
print(site.encode('UTF8'))

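But under Python 3 the encoded data is bytes, and print does not handle bytes well: it outputs the b'...' representation instead of the raw bytes.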
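To make the Python 3 behaviour concrete: what `print` shows for a `bytes` object is its `b'...'` representation, not the bytes themselves. If you would rather keep using `print`, one alternative (not from the original answer) is to wrap the byte stream in an `io.TextIOWrapper` with an explicit encoding. The sketch below uses an in-memory `io.BytesIO` as a stand-in for `sys.stdout.buffer` so that it runs anywhere; `fake_stdout` and `page` are illustrative names.

```python
import io

page = "It\u2019s a test"
encoded = page.encode("utf-8")

# The text print(encoded) would show: the repr, not the raw bytes.
as_printed = repr(encoded)

# Stand-in for sys.stdout.buffer; a real CGI script would wrap that instead.
fake_stdout = io.BytesIO()
out = io.TextIOWrapper(fake_stdout, encoding="utf-8", newline="\n")

print("Content-Type: text/html; charset=UTF-8", file=out)
print(file=out)        # blank line that ends the headers
print(page, file=out)  # encoded as UTF-8 by the wrapper
out.flush()

body = fake_stdout.getvalue()
```

With this approach the rest of the script can keep calling `print(..., file=out)` on `str` values and the wrapper does the encoding.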

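Of course, you will notice that you now first decode from UTF-8 and then encode again. Strictly speaking, you don't need to do that. But if you want to modify the HTML in between, it is actually a good idea to do so and to keep all your modifications in Unicode.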
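As a rough illustration of that last point, here is a sketch of a decode, modify, re-encode step. The function name `rewrite_page` and the inserted notice are hypothetical; the original answer only describes the idea. All edits happen on `str`, and bytes exist only at the edges.

```python
def rewrite_page(raw, encoding="utf-8"):
    """Decode fetched bytes, modify the HTML as text, and re-encode for output."""
    html = raw.decode(encoding)
    # Hypothetical modification: add a notice right after the first <body> tag.
    html = html.replace("<body>", "<body><p>Proxied copy</p>", 1)
    return html.encode(encoding)

# Sample bytes standing in for the result of urlopen(...).read().
fetched = "<html><body><p>It\u2019s here</p></body></html>".encode("utf-8")
result = rewrite_page(fetched)
```

If no modification is needed, the bytes returned by `read()` could be written straight to `sys.stdout.buffer`, assuming the page really is in the encoding your header declares.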

A similar question about the Python urllib.request and UTF-8 decoding problem can be found on Stack Overflow: https://stackoverflow.com/questions/4601912/
