python - Which is better for parsing malformed HTML in Python: lxml or libxml2?

Which one is better and more useful for malformed HTML?
I can't find out how to use libxml2.

Thanks.

Best answer

On the libxml2 page you can see this note:

Note that some of the Python purist dislike the default set of Python bindings, rather than complaining I suggest they have a look at lxml the more pythonic bindings for libxml2 and libxslt and check the mailing-list.

And on the lxml page there is this one:

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.

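Essentially, with lxml you get exactly the same functionality, but through a Pythonic API that is compatible with the ElementTree library in the standard library (which means the standard library documentation will help you learn lxml). That is why lxml is preferable to libxml2, even though the underlying implementation is the same.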
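To illustrate the ElementTree compatibility mentioned above, here is a minimal sketch that runs unchanged against either library. The sample document and variable names are made up for illustration, and the fallback import assumes only that the standard library is available when lxml is not installed.

```python
# Use lxml if it is installed; otherwise fall back to the standard library.
# Both expose the ElementTree-style calls used below.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical sample document, well-formed XML.
doc = etree.fromstring("<root><item id='1'>a</item><item id='2'>b</item></root>")

# findall(), get(), iter() and .text behave the same in both libraries.
ids = [item.get("id") for item in doc.findall("item")]
texts = [item.text for item in doc.iter("item")]
print(ids, texts)
```

The same calls work with both imports, which is why the standard library's ElementTree documentation is a reasonable starting point for lxml.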

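Edit: That said, as other answers explain, your best option for parsing malformed HTML is BeautifulSoup. One interesting thing to note is that if lxml is installed, BeautifulSoup will use it, as explained in the documentation for the new version: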

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
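The ranking in that quote can also be spelled out explicitly instead of relying on the default. The following is only a sketch: `make_soup` and `PREFERRED_PARSERS` are hypothetical names, and it assumes `bs4` is installed, while lxml and html5lib are optional.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Same preference order as in the quoted documentation.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")

def make_soup(markup):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in PREFERRED_PARSERS:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not available.
            continue
    raise RuntimeError("no HTML parser available")

# Malformed input: neither <b> nor <p> is closed.
soup, parser_used = make_soup("<p>Unclosed <b>bold")
print(parser_used, soup.b.get_text())
```

Naming the parser explicitly keeps results consistent across machines, since different parsers may repair the same broken markup in different ways.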

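In any case, even though BeautifulSoup uses lxml under the hood, it lets you parse broken HTML that you cannot parse directly with lxml's XML parser (lxml.etree). For example: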

>>> import lxml.etree
>>> lxml.etree.fromstring('<html>')
...
XMLSyntaxError: Premature end of data in tag html line 1, line 1, column 7

However:

>>> import bs4
>>> bs4.BeautifulSoup('<html>', 'lxml')
<html></html>
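For comparison, the failure shown earlier comes from the strict XML parser in lxml.etree. lxml also ships an HTML parser in lxml.html, which recovers from malformed markup. A minimal sketch, assuming lxml is installed and using a made-up fragment:

```python
import lxml.html
from lxml import etree

# Hypothetical broken fragment: the <p> elements are never closed.
broken = "<div><p>first<p>second</div>"

# The XML parser rejects it.
try:
    etree.fromstring(broken)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser repairs it and returns the <div> element.
root = lxml.html.fromstring(broken)
paragraphs = [p.text for p in root.iter("p")]
print(xml_ok, root.tag, paragraphs)
```

So the choice between lxml and BeautifulSoup here is mostly about which API you prefer, not about whether broken HTML can be handled at all.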

Finally, note that lxml also provides an interface to the old version of BeautifulSoup, as follows:

>>> import lxml.html.soupparser
>>> lxml.html.soupparser.fromstring('<html>')
<Element html at 0x13bd230>

So at the end of the day, you will probably end up using both lxml and BeautifulSoup anyway. The only thing you need to choose is the API you like best.

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/9324389/
