python - Why can't Beautiful Soup correctly parse elements named "area"?

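I am writing a Python script that uses Beautiful Soup to parse XML documents. Some of the documents contain elements named "area". For some reason, I can't for the life of me get these elements to parse correctly. They always show up as empty <area/> elements.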

Here is a minimal example of what is happening:

#!/usr/bin/python3.5
from bs4 import BeautifulSoup

xml = """""
<?xml version = '1.0' encoding = 'UTF-8' standalone = 'yes'?>

<root>
    <areax>
        foo
    </areax>
    <area>
        bar
    </area>
</root>
"""""
soup = BeautifulSoup (xml, "lxml")

print ("\n#### soup ####\n")
print (soup)

print ("\n#### areax ####\n")
areaxs = soup.find_all ("areax")
for areax in areaxs:
    print (areax)

print ("\n### area ###\n")
areas = soup.find_all ("area")
for area in areas:
    print (area)

Output:

#### soup ####

<html><body><p>""
<?xml version = '1.0' encoding = 'UTF-8' standalone = 'yes'?>
<root>
<areax>
        foo
    </areax>
<area/>
        bar

</root>
</p></body></html>

#### areax ####

<areax>
        foo
    </areax>

### area ###

<area/>

Is the element name "area" reserved in some way, or is there something else wrong with how I am parsing it?

Best answer

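Your document is being parsed as HTML, and the area element is a void HTML element (it cannot have any children). That is why the parser closes <area> immediately and leaves "bar" outside it.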

To parse it as XML, use BeautifulSoup(xml, "xml") instead (see the docs):

By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in “xml” as the second argument to the BeautifulSoup constructor:
soup = BeautifulSoup(markup, "xml")
You’ll need to have lxml installed.
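As a minimal sketch of that fix, the snippet below parses a trimmed-down version of the document from the question with the "xml" parser. It assumes both bs4 and lxml are installed. The variable name area_texts is only illustrative, and the XML declaration is left out for brevity.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the question's document, using exactly three quotes.
xml = """
<root>
    <areax>foo</areax>
    <area>bar</area>
</root>
"""

# "xml" selects lxml's XML parser, so <area> is treated as an ordinary
# element rather than as the void HTML <area> element.
soup = BeautifulSoup(xml, "xml")

# Collect the text of every <area> element (illustrative helper variable).
area_texts = [area.get_text(strip=True) for area in soup.find_all("area")]
print(area_texts)
```

With the XML parser, the area element should keep its text content instead of being emitted as an empty <area/>.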

Another problem is that there are too many quotes around your xml string, so it actually starts with "" (try printing it). Exactly three quotes (""") are enough.
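The quoting issue can be checked without any parser. In this small sketch (the variable names are made up for illustration), Python reads the first three quote characters as the opening delimiter, so the two extra ones become part of the string. At the end, the first three quotes close the string and the remaining two form an empty string literal that is concatenated onto it.

```python
# Five quotes on each side: the two extra opening quotes end up in the value.
five_quotes = """""
<root/>
"""""

# Exactly three quotes on each side: the value contains only the markup.
three_quotes = """
<root/>
"""

print(repr(five_quotes))   # '""\n<root/>\n'
print(repr(three_quotes))  # '\n<root/>\n'
```

This matches the stray "" that appears right after <p> in the question's output.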

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/47457069/
