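I am trying to write a Python script that behaves like Ctrl + S in the Chrome web browser: it saves the HTML page, downloads every resource the page links to, and finally replaces the links' URIs with local paths on disk.

The code posted below tries to replace the URIs of CSS files with local paths on my computer.

I ran into a problem while trying to parse different websites, and it is giving me a bit of a headache.

My original error is UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 13801: ordinal not in range(128)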

url = 'http://www.s1jobs.com/job/it-telecommunications/support/edinburgh/620050561.html'

response = urllib2.urlopen(url)
webContent = response.read()
dest_dir = 'C:/Users/Stuart/Desktop/' + title
for f in glob.glob(r'./*.css'):
    newContent = webContent.replace(cssUri, "./" + title + '/' + cssFilename)
    shutil.move(f, dest_dir)

The problem persists when I try to print newContent or write it to a file. I tried to follow the top answer to the Stack Overflow question "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)" and changed my line to

newContent = webContent.decode('utf-8').replace(cssUri, "./" + title + '/' + cssFilename)

I also tried .decode('utf-16') and .decode('utf-32'). The three attempts failed with these errors, respectively: "invalid start byte" at position 13801, "can't decode byte 0x0a in position 44442: truncated data", and "can't decode bytes in position 0-3: code point not in range(0x110000)".

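Does anyone know how I should solve this? I should add that when I print the variable webContent there is output (though I did notice Chinese writing at the bottom).

Best answer

This will solve your problem

Use webContent.decode('utf-8', errors='ignore') or webContent.decode('latin-1')

webContent[13801:13850] has some strange characters. Just ignore them.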
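As a rough illustration of the two options, here is a minimal sketch. The sample bytes are hypothetical and only stand in for the downloaded page; the one thing taken from the question is the offending byte 0xa3, which is the pound sign in latin-1. Whether the real page is actually latin-1 is an assumption.

```python
# Hypothetical sample standing in for webContent: raw bytes containing 0xa3.
raw = b'Salary: \xa330,000 per year'

# Strict UTF-8 decoding fails, because 0xa3 cannot start a UTF-8 sequence.
try:
    raw.decode('utf-8')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc.reason

# Option 1: decode as UTF-8 and silently drop bytes that do not fit.
ignored = raw.decode('utf-8', 'ignore')

# Option 2: decode as latin-1, which maps every byte to a character.
latin = raw.decode('latin-1')
```

The trade-off is that 'ignore' loses the character entirely, while latin-1 never fails but only yields the right text if the page really uses that encoding.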

Ignore everything below


This is a bit of a shot in the dark, but try this:

At the top of the file,

from __future__ import unicode_literals
from builtins import str

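What appears to be happening is that you are trying to decode a Python object that is probably a Python 2.7 str, when in principle it should already be some decoded text object.

Brief explanation

In a default python 2.7 kernel:

(iPython session)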

In [1]: type("é") # By default, quotes in py2 create py2 strings, which is the same thing as a sequence of bytes that given some encoding, can be decoded to a character in that encoding.
Out[1]: str

In [2]: type("é".decode("utf-8")) # We can get to the actual text data by decoding it if we know what encoding it was initially encoded in, utf-8 is a safe guess in almost every country but Myanmar.
Out[2]: unicode

In [3]: len("é") # Note that the py2 `str` representation has a length of 2, because "é" is encoded in utf-8 as two bytes (0xC3 0xA9).
Out[3]: 2

In [4]: len("é".decode("utf-8")) # the py2 `unicode` representation has length 1, since an accented e is a single character
Out[4]: 1
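The same observation can be reproduced without relying on the terminal's encoding by spelling out the bytes explicitly. This is only a sketch; the variable names are made up for illustration, and the literals are chosen so that it runs on Python 2.7 and Python 3 alike.

```python
# The UTF-8 encoding of "é", written out byte by byte.
encoded = b'\xc3\xa9'

# Decoding turns the two bytes into one text character.
decoded = encoded.decode('utf-8')

byte_count = len(encoded)   # length of the byte sequence
char_count = len(decoded)   # length of the decoded text
```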

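Some other things of note in Python 2.7:

"é" is the same as str("é")
u"é" is the same as "é".decode('utf-8') or unicode("é", 'utf-8')
u"é".encode('utf-8') is the same as str("é")
You typically call decode on a py2 str, and encode on a py2 unicode.
- Due to early design issues, you can call both on either type, even though that does not really make sense.
- In python3, str, which is the same as python2 unicode, can no longer be decoded, because a string is by definition a decoded sequence of bytes. When you encode it, the default encoding is utf-8.
In python 2.7, a sequence of bytes encoded with the ascii codec behaves exactly like the decoded text.
- In python 2.7 without the future import: type("a".decode('ascii')) gives a unicode object, but it behaves almost identically to str("a"). This is not the case in python3.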
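For comparison, the points about Python 3 in the list above can be checked with a short sketch. It assumes a Python 3 interpreter; on Python 2.7 the last two values would come out as True instead.

```python
text = u'\xe9'                       # the single character é as text
data = text.encode('utf-8')          # text -> bytes
round_trip = data.decode('utf-8')    # bytes -> text

# On Python 3, text objects have no decode method at all.
text_has_decode = hasattr(text, 'decode')

# On Python 3, ASCII bytes and the matching text are no longer equal.
ascii_bytes_equal_text = (b'a' == u'a')
```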

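That said, here is what the snippet above does:

__future__ is a module maintained by the core python team that backports python3 features to python2, so that you can use python3 idioms in python2.
from __future__ import unicode_literals has the following effect:
- Without the future import, "é" is the same as str("é")
- With the future import, "é" is functionally the same as unicode("é")
builtins, as used here, is a module from the third-party future package (it is not part of the python2 standard library) that contains safe aliases for using python3 idioms in python2 through the python3 api.
- For reasons beyond my understanding, the package itself is named "future", so to install the builtins module you run: pip install future
from builtins import str has the following effect:
- The str constructor now gives you what you think it does, i.e. text data in the form of a python2 unicode object. So it is functionally the same as str = unicode.
- Note: Python3 str is functionally the same as Python2 unicode
- Note: To get bytes, you can use the "bytes" prefix, e.g. b'é'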

The takeaway is this:

Decode on read / decode early, encode on write / encode last
Use str objects for bytes and unicode objects for text
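Applied to the link-rewriting step from the question, the rule above might look like the following sketch. The helper localize_link, the sample HTML, the example.com URI and the output file name are all hypothetical, and latin-1 is only an assumed page encoding; none of them come from the original code.

```python
import io
import os
import tempfile


def localize_link(raw_html, css_uri, local_path, encoding='latin-1'):
    """Decode the downloaded bytes once, then do the replacement on text."""
    text = raw_html.decode(encoding)
    return text.replace(css_uri, local_path)


# Hypothetical stand-in for response.read(): bytes, including a 0xa3 byte.
raw_html = b'<link href="http://example.com/a.css"><p>\xa330,000</p>'

# All arguments to replace are text, so no implicit ASCII decoding happens.
new_content = localize_link(raw_html, u'http://example.com/a.css',
                            u'./page/a.css')

# Encode only at the last moment, when writing the result to disk.
out_path = os.path.join(tempfile.mkdtemp(), 'page.html')
with io.open(out_path, 'w', encoding='utf-8') as handle:
    handle.write(new_content)
```

Writing the file as UTF-8 is a design choice in this sketch; if the page declares another charset in a meta tag, the declaration and the output encoding should be made to agree.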

This question about Unicode and ASCII problems when parsing HTML in Python is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/37488901/
