python - lxml parser eats all memory

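I'm writing some spiders in Python, using the lxml library to parse HTML and the gevent library for asynchronous processing. I found that after running for some time, the lxml parser starts eating memory, up to 8GB (all of the server's memory). But I only have 100 async threads, and each of them parses documents of at most 300kb.

I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it.

The problem is in this line of code: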

HTML = lxml.html.fromstring(htmltext)

Maybe someone knows what it could be, or how to fix it?

Thanks for your help.

P.S.

Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64    GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

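UPD:

I've set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.

Now, after some time, they start writing this to the error log:

"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored".

I can't catch this exception, so it keeps writing this message to the log recursively until there is no free space left on disk.

How can I catch this exception to kill the process, so that the daemon can create a new one?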

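Best answer

You might be keeping some references that keep the documents alive. Be careful with string results from xpath evaluation, for example: by default they are "smart" strings, which provide access to the containing element, so the tree stays in memory for as long as you keep a reference to them. See the docs on xpath return values: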

There are certain cases where the smart string behaviour is undesirable. For example, it means that the tree will be kept alive by the string, which may have a considerable memory impact in the case that the string value is the only thing in the tree that is actually of interest. For these cases, you can deactivate the parental relationship using the keyword argument smart_strings.

(I have no idea whether this is the problem in your case, but it's a candidate. I've been bitten by this myself once ;-))
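
To illustrate the smart string behaviour described above, here is a minimal sketch. The sample markup and the variable names are made up for illustration and are not taken from the question's code; it only shows the difference between the default results and those obtained with smart_strings=False, and does not claim to be the fix for the leak in the question.

```python
import lxml.html
from lxml import etree

# Hypothetical sample document standing in for a crawled page.
htmltext = "<html><body><p>hello</p><p>world</p></body></html>"
doc = lxml.html.fromstring(htmltext)

# Default behaviour: each text result is a "smart" string that keeps a
# reference to its parent element, and through it to the whole tree.
smart = doc.xpath("//p/text()")
smart_has_parent = hasattr(smart[0], "getparent")

# With smart_strings=False the results carry no back reference, so
# holding on to them does not keep the tree alive.
find_text = etree.XPath("//p/text()", smart_strings=False)
plain = find_text(doc)
plain_has_parent = hasattr(plain[0], "getparent")

# Copying a smart string into a plain str also drops the back reference.
copied = [str(s) for s in smart]

print(smart_has_parent, plain_has_parent, plain)
```

If the spider stores extracted strings in long-lived structures (queues, caches, result lists), either of these two approaches should let each parsed tree be freed once the parsing function returns, provided nothing else still references it.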

Regarding "python - lxml parser eats all memory", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/5260261/
