python - How to get around unsupported characters when web scraping?

I am scraping a web page with lxml. At one point I read the contents of a table cell.

# get last name
lastNameContainer = tableRow.xpath('./td[@class="lastName"]')
lastName = lastNameContainer[0].text

Unfortunately, one table cell contains characters outside the ASCII range, which produces this error.

UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-7: ordinal not in range(128)

I tried adding this to the top of my Python file, but it did not help.

#!/usr/bin/python
# -*- coding: utf-8 -*-

How can I solve this? I still want to keep the character. By the way, the character is either ♀ or ♂, depending on the table row.


Update: I realized that the error is triggered when I write the data to a file:

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName + '\n')

Oddly, this still produces the above error.

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.decode('utf-8') + '\n')

Best answer

lxml expects unicode strings. When I get that exception, I fix it with decode('utf-8').

For example: E.doc('♀'.decode('utf-8'))
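The difference between encoded bytes and a unicode string can be sketched as follows. This is a minimal illustration rather than the asker's code: the names raw, text and ascii_error are hypothetical, and the unicode_literals import is only there so the same lines behave alike on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# Hypothetical stand-in for bytes that arrive from a UTF-8 encoded page.
raw = "♀".encode("utf-8")

# decode() turns the encoded bytes into a text (unicode) string.
text = raw.decode("utf-8")

# Encoding that text as ASCII fails, because ♀ lies outside range(128).
# This is the same kind of error shown in the question.
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
```

The coding line at the top of a file only tells Python how to read literals in the source file. It does not change which codec is used when a string is encoded later on.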

Update, regarding this part of the question:

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName + '\n')

Oddly, this still produces the above error.

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.decode('utf-8') + '\n')

Also note that if lastName is unicode and you want to write a UTF-8 encoded file, you need to convert it back to bytes with lastName.encode('utf-8'):

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.encode('utf-8') + '\n')
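One alternative to calling encode() by hand, not part of the original answer, is to let the file object do the encoding. The sketch below assumes io.open from the standard library, which accepts an encoding argument on both Python 2.6+ and Python 3. The helpers write_line and read_text, the sample name and the temporary path are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_line(path, text):
    """Write one line of unicode text; the file object encodes it as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as my_file:
        my_file.write(text + u"\n")


def read_text(path):
    """Read the file back as unicode text, decoding from UTF-8."""
    with io.open(path, "r", encoding="utf-8") as my_file:
        return my_file.read()


# Round trip with a made-up last name containing the ♂ character (U+2642).
demo_path = os.path.join(tempfile.mkdtemp(), "myData.txt")
write_line(demo_path, u"Smith \u2642")
round_trip = read_text(demo_path)
```

With this approach the string stays unicode throughout the program and is converted to bytes only at the file boundary, so no explicit encode() or decode() call is needed around write().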

Regarding "python - How to get around unsupported characters when web scraping?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/9630637/