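Context
The Python 2.7 function below uses etree and XPath to walk the DOM and build a flat list representation of it. At each node it checks whether the current element has a class that should be ignored. If it does, the element and its children are skipped.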
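import re
from lxml import etree

ignore_classes = ['ignore']

def flatten_tree(element):
    # Direct child elements only.
    children = element.findall('*')
    elements = []
    if len(children) > 0:
        for child in children:
            if child.attrib.get('class') in ignore_classes:
                # Skip this element and its whole subtree.
                continue
            else:
                # Recurse into the child (the posted code called an
                # undefined get_children() here).
                for el in flatten_tree(child):
                    elements.append(el)
    # The element itself comes before its descendants.
    elements.insert(0, element)
    return elements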
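Question
How can I improve this? There must be a more elegant and more efficient way. If I am writing a nested for loop, I must be doing something wrong.
Example
This document: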
<html>
<body>
<header class="ignore">
<h1>Gerbils</h1>
</header>
<main>
<p>They like almonds. That's pretty much all I know.</p>
</main>
</body>
</html>
would become:
[ <html>,
<body>,
<main>,
<p> ]
Thanks in advance!
Best answer
You could use XPath, for example:
In [24]: root.xpath('descendant-or-self::*[not(ancestor-or-self::*[@class="ignore"])]')
Out[24]:
[<Element html at 0x7f4d5e1c1548>,
<Element body at 0x7f4d5e1dba48>,
<Element main at 0x7f4d5024e6d8>,
<Element p at 0x7f4d5024e728>]
The XPath descendant-or-self::*[not(ancestor-or-self::*[@class="ignore"])]
means
descendant-or-self::*      select the current node and all its descendants
  [                        such that
    not(                   it is not true that
      ancestor-or-self::*  it itself or an ancestor
      [@class="ignore"]    has an attribute, class, equal to "ignore"
    )]
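One caveat worth illustrating: `@class="ignore"` compares the whole attribute value, so an element with `class="ignore banner"` would not be skipped. The sketch below is not part of the original answer. It contrasts the exact comparison with the common `contains(concat(...))` idiom for matching a single class token. The sample markup and the variable names `exact` and `token` are made up for illustration.

```python
import lxml.html as LH

html = """
<html>
  <body>
    <header class="ignore banner"><h1>Gerbils</h1></header>
    <main><p>They like almonds.</p></main>
  </body>
</html>"""

root = LH.fromstring(html)

# Exact comparison: "ignore banner" is not equal to "ignore",
# so the header and its children are NOT filtered out.
exact = root.xpath(
    'descendant-or-self::*[not(ancestor-or-self::*[@class="ignore"])]')

# Token comparison: pad the normalized class value with spaces and look
# for " ignore " so that only a whole class name matches.
token = root.xpath(
    'descendant-or-self::*[not(ancestor-or-self::*'
    '[contains(concat(" ", normalize-space(@class), " "), " ignore ")])]')

print([el.tag for el in exact])
print([el.tag for el in token])
```

If your documents only ever carry a single class per element, the simpler exact comparison is enough.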
To handle a list of class names to ignore, you could build the XPath with a bit of code.
For example, if ignore_classes = ['A', 'B']
then you could define
conditions = ' or '.join([
    'ancestor-or-self::*[@class="{}"]'.format(cls) for cls in ignore_classes])
xpath = 'descendant-or-self::*[not({})]'.format(conditions)
so that xpath
equals
'descendant-or-self::*[not(ancestor-or-self::*[@class="A"] or ancestor-or-self::*[@class="B"])]'
Although this looks verbose, using lxml's XPath engine should be significantly faster than traversing the tree in Python.
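The speed claim is easy to check on your own data. The following is only a rough sketch: the synthetic document, the repeat counts, and the helper names `flatten_python` and `flatten_xpath` are assumptions made for illustration. The result will depend on document size and nesting depth, so treat the numbers as indicative at best.

```python
import timeit
import lxml.html as LH

def flatten_python(element, ignore_classes):
    """Recursive traversal in Python that skips ignored subtrees."""
    if element.get('class') in ignore_classes:
        return []
    elements = [element]
    for child in element.findall('*'):
        elements.extend(flatten_python(child, ignore_classes))
    return elements

def flatten_xpath(element, ignore_classes):
    """Same filter expressed as a single XPath query."""
    conditions = ' or '.join(
        'ancestor-or-self::*[@class="{}"]'.format(cls)
        for cls in ignore_classes)
    return element.xpath('descendant-or-self::*[not({})]'.format(conditions))

# Synthetic document: 500 ignored blocks interleaved with 500 kept blocks.
big_html = ('<html><body>'
            + '<div class="ignore"><p>x</p></div><div><p>y</p></div>' * 500
            + '</body></html>')
big_root = LH.fromstring(big_html)
classes = ['ignore']

py_tags = [el.tag for el in flatten_python(big_root, classes)]
xp_tags = [el.tag for el in flatten_xpath(big_root, classes)]

print('python: %.4fs' % timeit.timeit(
    lambda: flatten_python(big_root, classes), number=20))
print('xpath:  %.4fs' % timeit.timeit(
    lambda: flatten_xpath(big_root, classes), number=20))
```

Both functions return the elements in document order, so their results can be compared directly before trusting the timings.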
import lxml.html as LH
html = """
<html>
<body>
<header class="ignore">
<h1>Gerbils</h1>
</header>
<main class="ignore2">
<p>They like almonds. That's pretty much all I know.</p>
</main>
</body>
</html>"""
def flatten_element(element, ignore_classes):
    conditions = ' or '.join([
        'ancestor-or-self::*[@class="{}"]'.format(cls) for cls in ignore_classes])
    xpath = 'descendant-or-self::*[not({})]'.format(conditions)
    return element.xpath(xpath)
root = LH.fromstring(html)
ignore_classes = ['ignore']
flattened = flatten_element(root, ignore_classes)
print(flattened)
which prints
[<Element html at 0x7f30af3459a8>, <Element body at 0x7f30af367ea8>, <Element main at 0x7f30af2fbdb8>, <Element p at 0x7f30af2fbae8>]
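Building the XPath with `str.format` breaks if a class name contains a double quote, and an empty `ignore_classes` list would produce the invalid expression `not()`. A possible variation, sketched here and not taken from the original answer, passes the class names as XPath variables, which lxml's `xpath()` method accepts as keyword arguments. The name `flatten_element_safe` and the variable names `cls0`, `cls1`, ... are hypothetical.

```python
import lxml.html as LH

def flatten_element_safe(element, ignore_classes):
    # Map each class name to an XPath variable: $cls0, $cls1, ...
    variables = {'cls{}'.format(i): cls
                 for i, cls in enumerate(ignore_classes)}
    if not variables:
        # Nothing to ignore: return the element and all its descendants.
        return element.xpath('descendant-or-self::*')
    conditions = ' or '.join(
        'ancestor-or-self::*[@class=${}]'.format(name)
        for name in sorted(variables))
    xpath = 'descendant-or-self::*[not({})]'.format(conditions)
    # lxml substitutes the variables itself, so no manual quoting is needed.
    return element.xpath(xpath, **variables)

doc = LH.fromstring(
    """<html><body><div class='a"b'><p>x</p></div><span>y</span></body></html>""")

kept = [el.tag for el in flatten_element_safe(doc, ['a"b'])]
everything = [el.tag for el in flatten_element_safe(doc, [])]
print(kept)
print(everything)
```

For plain class names such as 'ignore' this behaves like the string-formatting version.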
Regarding "python - What is the most efficient way to flatten a DOM with lxml?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43927782/