python - lxml's iterparse tries to load the whole file into memory

I am trying to parse a very large XML file, so I decided to use lxml.iterparse, as explained here.

So my code looks like this:

import sys
from lxml import etree

def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def launchArticleProcessing(elem):
    print elem

context = etree.iterparse(sys.argv[1], events=('end',), tag='text')

fast_iter(context, launchArticleProcessing)

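I call it like this: python lxmlwtf.py "/path/to/my/file.xml"

Memory just fills up (until I kill the process, because the file will never fit in memory) and nothing is printed. What am I missing here?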

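Best answer

I answered a very similar question here: lxml and fast_iter eating all the memory. The main reason is that lxml.etree still keeps in memory every element that is not explicitly caught. Those elements therefore have to be cleared manually.

What I did was to not filter the events for the tag you are looking for: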

context = etree.iterparse(open(filename, 'rb'), events=('end',))

but instead to check the tag manually and clear everything else:

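# progress.bar(...) is an optional progress-bar wrapper that is not defined in
# this snippet; iterating over context directly visits the same events
for (event, elem) in progress.bar(context):
    if elem.tag == 'text':
        pass  # do things here

    # every element gets cleared here
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context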
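
To make the difference concrete, here is a minimal sketch that runs both loops on a tiny in-memory document and counts how many children are still attached to the root afterwards. The sample data, the `<page>` wrapper element and the function names are all assumptions made up for illustration; they are not taken from the question's file.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample: five <page> wrappers, each holding one <text> element
SAMPLE = b"<root>" + b"<page><text>hello</text></page>" * 5 + b"</root>"

def pages_left_with_tag_filter(data):
    """Loop as in the question: only 'text' end events reach the loop."""
    context = etree.iterparse(BytesIO(data), events=("end",), tag="text")
    for event, elem in context:
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    # context.root holds the root element once parsing has finished
    return len(context.root)

def process_and_clear_all(data):
    """Loop as in the answer: see every end event, clear every element."""
    context = etree.iterparse(BytesIO(data), events=("end",))
    texts = []
    for event, elem in context:
        if elem.tag == "text":
            # read the content before the element is cleared
            texts.append(elem.text)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return texts, len(context.root)

print(pages_left_with_tag_filter(SAMPLE))
print(process_and_clear_all(SAMPLE))
```

In the filtered loop the `<page>` wrappers never reach the loop body, so nothing removes them from the root. In the unfiltered loop each wrapper drops its previous siblings when it ends, and the root itself is cleared on its own end event. Note that children are cleared before their parent's end event fires, so anything you need from an element has to be read at that element's own end event, as done for `text` here.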

For "python - lxml's iterparse tries to load the whole file into memory", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23443026/
