python - An efficient way to iterate over XML elements

I have an XML document like this:

<a>
    <b>hello</b>
    <b>world</b>
</a>
<x>
    <y></y>
</x>
<a>
    <b>first</b>
    <b>second</b>
    <b>third</b>
</a>

I need to iterate over all <a> and <b> tags, but I don't know how many of them the document contains. So I use XPath to handle that:

from lxml import etree

doc = etree.fromstring(xml)

atags = doc.xpath('//a')
for a in atags:
    btags = a.xpath('b')
    for b in btags:
        print(b)

It works, but I have fairly large files, and cProfile shows me that the XPath calls are very expensive.

I wonder, is there a more efficient way to iterate over an indefinite number of XML elements?

Best answer

XPath should be fast. You can reduce the number of XPath calls to one:

doc = etree.fromstring(xml)
btags = doc.xpath('//a/b')
for b in btags:
    print(b.text)
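
As a runnable illustration of the single-query idea, here is a minimal sketch. It makes two assumptions that are not in the original post: the sample is wrapped in a hypothetical <root> element (as posted, it has several top-level elements and would not parse as one document), and the query uses findall with the path './/a/b', which works both with lxml and with the standard library's ElementTree. With lxml, doc.xpath('//a/b') as shown above selects the same elements for this sample.

```python
try:
    from lxml import etree  # the library used in the question and answer
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback

# Sample data from the question, wrapped in a hypothetical <root> element
# so that the document has a single top-level element.
SAMPLE = b"""<root>
<a><b>hello</b><b>world</b></a>
<x><y></y></x>
<a><b>first</b><b>second</b><b>third</b></a>
</root>"""


def b_texts(xml_bytes):
    """Return the text of every <b> that is a direct child of an <a>."""
    doc = etree.fromstring(xml_bytes)
    # One path query for the whole document instead of one query per <a>.
    return [b.text for b in doc.findall('.//a/b')]


print(b_texts(SAMPLE))
```

The <b> elements come back in document order, so the nested loop over <a> elements is not needed just to visit them all.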

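If that is not fast enough, you could try Liza Daly's fast_iter. It has the advantage of not requiring the whole XML to be processed with etree.fromstring first, and parent nodes are discarded after their children have been visited. Both of these help reduce the memory requirements. Below is a modified version of fast_iter that is more aggressive about removing other elements that are no longer needed.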

import io

from lxml import etree


def fast_iter(context, func, *args, **kwargs):
    """
    fast_iter is useful if you need to free memory while iterating through a
    very large XML file.

    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elt):
    print(elt.text)

# xml must be a bytes object for io.BytesIO
context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)

Liza Daly's article on parsing large XML files may also be useful to you. According to the article, lxml with fast_iter can be faster than cElementTree's iterparse (see Table 1).
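
For comparison, the standard library's iterparse can be used in a similar memory-conscious way. The following is only a sketch of the commonly described "keep the root and clear it" pattern, not a port of fast_iter: the standard library's elements have no getprevious() or getparent(), and its iterparse has no tag argument, so the tag is checked by hand. The <root> wrapper and the function name iter_b_texts are assumptions made for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in a hypothetical <root> element.
SAMPLE = b"""<root>
<a><b>hello</b><b>world</b></a>
<x><y></y></x>
<a><b>first</b><b>second</b><b>third</b></a>
</root>"""


def iter_b_texts(source):
    """Yield the text of each <b> element while keeping the tree small."""
    context = ET.iterparse(source, events=('start', 'end'))
    # The first event is the start of the root element; keep a reference
    # to it so that finished children can be dropped later.
    _, root = next(context)
    for event, elem in context:
        if event != 'end':
            continue
        if elem.tag == 'b':
            # At the 'end' event the element's text is complete.
            yield elem.text
        elif elem.tag == 'a':
            # This <a> is finished: drop everything parsed so far from root.
            root.clear()


for text in iter_b_texts(io.BytesIO(SAMPLE)):
    print(text)
```

Because the function is a generator, elements are handled one at a time as the parser reaches them, and the root is emptied each time an <a> element has been fully read.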

Regarding python - An efficient way to iterate over XML elements, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4695826/
