python - Why does pickle eat memory?

I am trying to write a large amount of pickled data to disk in small pieces. Here is the example code:

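from cPickle import dumps  # Python 2; in Python 3 use the pickle module
from gc import collect

PATH = r'd:\test.dat'

# @profile is injected by memory_profiler (run: python -m memory_profiler script.py)
@profile
def func(item):
    for e in item:
        # Reopen the file unbuffered for every element, to rule out buffering
        f = open(PATH, 'a', 0)
        f.write(dumps(e))
        f.flush()
        f.close()
        del f
        collect()

if __name__ == '__main__':
    k = [x for x in xrange(9999)]
    func(k)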

open() and close() are placed inside the loop to rule out data accumulating in memory as a possible cause.

To illustrate the problem, here are the memory profiling results obtained with the third-party Python module memory_profiler:

   Line #    Mem usage  Increment   Line Contents
==============================================
    14                           @profile
    15      9.02 MB    0.00 MB   def func(item):
    16      9.02 MB    0.00 MB       path= r'd:\test.dat'
    17
    18     10.88 MB    1.86 MB       for e in item:
    19     10.88 MB    0.00 MB           f = open(path, 'a', 0)
    20     10.88 MB    0.00 MB           f.write(dumps(e))
    21     10.88 MB    0.00 MB           f.flush()
    22     10.88 MB    0.00 MB           f.close()
    23     10.88 MB    0.00 MB           del f
    24                                   collect()

Memory usage grows strangely while the loop runs. How can this be eliminated? Any ideas?

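As the amount of input data increases, the volume of this additional data can grow to much more than the size of the input (update: in the real task I get 300+ MB).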

There is also a broader question: what are the proper ways to handle large amounts of IO data in Python?

Update: I rewrote the code, leaving only the loop body, to see exactly where the growth happens. Here are the results:

Line #    Mem usage  Increment   Line Contents
==============================================
    14                           @profile
    15      9.00 MB    0.00 MB   def func(item):
    16      9.00 MB    0.00 MB       path= r'd:\test.dat'
    17
    18                               #for e in item:
    19      9.02 MB    0.02 MB       f = open(path, 'a', 0)
    20      9.23 MB    0.21 MB       d = dumps(item)
    21      9.23 MB    0.00 MB       f.write(d)
    22      9.23 MB    0.00 MB       f.flush()
    23      9.23 MB    0.00 MB       f.close()
    24      9.23 MB    0.00 MB       del f
    25      9.23 MB    0.00 MB       collect()

It seems that dumps() is what eats the memory (although I actually expected it to be write()).

Best answer

Pickle consumes a lot of RAM; see the explanation here: http://www.shocksolution.com/2010/01/storing-large-numpy-arrays-on-disk-python-pickle-vs-hdf5adsf/

Why does Pickle consume so much more memory? The reason is that HDF is a binary data pipe, while Pickle is an object serialization protocol. Pickle actually consists of a simple virtual machine (VM) that translates an object into a series of opcodes and writes them to disk. To unpickle something, the VM reads and interprets the opcodes and reconstructs an object. The downside of this approach is that the VM has to construct a complete copy of the object in memory before it writes it to disk.
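The opcode stream described in the quote can be inspected with the standard-library pickletools module. The following is a minimal sketch, written for Python 3 (the question itself uses Python 2's cPickle); the helper name opcode_names is hypothetical and exists only for illustration.

```python
import pickle
import pickletools


def opcode_names(obj, protocol=2):
    """Return the names of the pickle opcodes produced for obj.

    opcode_names is a hypothetical helper, not part of any library.
    """
    data = pickle.dumps(obj, protocol=protocol)
    # genops yields (opcode, argument, position) for each opcode in the stream
    return [op.name for op, arg, pos in pickletools.genops(data)]


if __name__ == '__main__':
    print(opcode_names([1, 2, 3]))
    # Human-readable disassembly of the same pickle
    pickletools.dis(pickle.dumps([1, 2, 3], protocol=2))
```

Running it on a small list shows that even simple data is encoded as a sequence of instructions for the unpickling machine rather than as raw bytes.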

Pickle works well for small use cases or testing, because in most cases memory consumption does not matter much.

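For intensive work, where you have to dump and load many files and/or large files, you should consider another way to store your data (e.g. HDF, writing your own serialization/deserialization methods for your objects, ...).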
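As one possible illustration of the "write your own serialization" option, here is a rough sketch in Python 3 that stores integers as fixed-size binary records with the standard struct module and writes them in chunks. It assumes the data are integers that fit in 64 bits, as in the question's test list; the names RECORD, write_ints and read_ints are made up for this example and are not from the original answer.

```python
import os
import struct
import tempfile

# Assumption: every value fits in a signed 64-bit integer.
RECORD = struct.Struct('<q')  # one little-endian signed 64-bit integer


def write_ints(path, values, chunk_size=1024):
    """Append integers to path as fixed-size binary records, chunk by chunk."""
    with open(path, 'ab') as f:
        chunk = []
        for v in values:
            chunk.append(RECORD.pack(v))
            if len(chunk) >= chunk_size:
                f.write(b''.join(chunk))
                chunk = []
        if chunk:
            # Write whatever is left over after the last full chunk
            f.write(b''.join(chunk))


def read_ints(path):
    """Yield the integers back one at a time without loading the whole file."""
    with open(path, 'rb') as f:
        while True:
            raw = f.read(RECORD.size)
            if len(raw) < RECORD.size:
                break
            yield RECORD.unpack(raw)[0]


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'test.dat')
    write_ints(path, range(9999))
    print(sum(1 for _ in read_ints(path)))
```

Because each record has a known size, only one chunk is held in memory while writing, and the reader can stream values back lazily. Whether this beats pickle for a given workload would need to be measured.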

The original question, "python - Why does pickle eat memory?", is on Stack Overflow: https://stackoverflow.com/questions/13871152/
