python - lxml.html5parser : not working for arabic/persian html5s

I'm using lxml's html5parser. It works fine with ASCII characters, but if I download an HTML file that contains Persian and Russian characters, I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 418: ordinal not in range(128)

Here is the response text: http://paste.ubuntu.com/23552349/

Here is my code (as you can see, I only strip out all invalid XML characters):

f = requests.post('http://www.example.com/getHtml.php?', headers=headers, cookies=cookies, data=data)
resp = f.text
if resp == "":
    return []
resp = resp.encode("utf-8")
resp = ''.join(c for c in resp if valid_xml_char_ordinal(c))
doc = html5parser.fragment_fromstring(resp.encode("utf-8"), guess_charset=False, create_parent='div')

If I remove the line resp = resp.encode("utf-8"), I get this error instead:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

Best answer

I also ran into some odd inconsistencies when using html5parser directly (TypeError: __init__() got an unexpected keyword argument 'useChardet', and so on).

If you already have lxml installed, the BeautifulSoup wrapper is a pleasure to use.

First install BeautifulSoup (pip install beautifulsoup4). Then:

import requests
from bs4 import BeautifulSoup

# (initialize headers, cookies and data)

f = requests.post('http://www.example.com/getHtml.php?', headers=headers, cookies=cookies, data=data)
resp = f.text
if not resp:
    return []
doc = BeautifulSoup(resp, 'lxml')

You can then use BeautifulSoup's clean API to work with the HTML tree. Under the hood, it still uses lxml for parsing.
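
As a rough illustration of working with the tree, here is a minimal sketch that parses a small hand-written fragment containing Persian and Russian text and reads the text back out. The sample markup, the `title` class, and the variable names are assumptions made for this example; they stand in for the real response, which is not reproduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical sample text standing in for the server response (f.text).
greeting = u'سلام دنیا'
html = u'<div><p class="title">{0}</p><p>Привет, мир</p></div>'.format(greeting)

# 'lxml' selects the lxml-backed tree builder, as in the snippet above.
doc = BeautifulSoup(html, 'lxml')

# Look up one element by tag and class, then collect the text of every <p>.
title = doc.find('p', class_='title').get_text()
paragraphs = [p.get_text() for p in doc.find_all('p')]

print(title)
print(len(paragraphs))
```

Note that the input is passed as text (a Unicode string), so no manual encode step is needed before parsing.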

BeautifulSoup API reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
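
If you still want the filtering step from the question, it may help to run it on the decoded text and encode at most once afterwards. The question's code encodes to UTF-8, filters, and then calls encode again; on Python 2, calling encode on a byte string first decodes it implicitly as ASCII, which is the likely source of the UnicodeDecodeError. The sketch below is only an assumption about what valid_xml_char_ordinal looks like (its body is not shown in the question); it follows the character ranges allowed by XML 1.0. The helper clean_for_xml and the sample string are hypothetical.

```python
def valid_xml_char_ordinal(c):
    """Return True if character c is allowed by the XML 1.0 Char production."""
    codepoint = ord(c)
    return (
        0x20 <= codepoint <= 0xD7FF
        or codepoint in (0x9, 0xA, 0xD)
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


def clean_for_xml(text):
    """Drop characters XML 1.0 forbids; input and output are both text."""
    return u''.join(c for c in text if valid_xml_char_ordinal(c))


# Hypothetical response text with a NULL byte and a vertical-tab control character.
raw = u'<p>سلام\x00 دنیا\x0b</p>'
cleaned = clean_for_xml(raw)

# Encode exactly once, and only if the consumer really needs bytes.
payload = cleaned.encode('utf-8')
print(cleaned)
```

The cleaned text can then be handed to a parser directly, without a second encode call.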

Source: the original question on Stack Overflow, "python - lxml.html5parser : not working for arabic/persian html5s": https://stackoverflow.com/questions/40861834/
