python - Reducing memory usage when slicing numpy arrays

Tags: python linux numpy memory-leaks garbage-collection

I'm unable to free memory in Python. The situation is basically this: I have a large dataset split across 4 files. Each file contains a list of 5000 numpy arrays of shape (3072, 412). I'm trying to extract columns 10 through 20 of each array into a new list.

What I want to do is read each file in turn, extract the data I need, and free the memory I'm using before moving on to the next file. However, deleting the object, setting it to None, setting it to 0, and then calling gc.collect() don't seem to work. Here is the snippet of code I'm using:

import gc

import joblib
import psutil

num_files = 4
start = 10
end = 20
fields = []
for j in range(num_files):
    print("Working on file ", j)
    source_filename = base_filename + str(j) + ".pkl"
    print("Memory before: ", psutil.virtual_memory())
    partial_db = joblib.load(source_filename)
    print("GC tracking for partial_db is ",gc.is_tracked(partial_db))
    print("Memory after loading partial_db:",psutil.virtual_memory())
    for x in partial_db:
        fields.append(x[:,start:end])
    print("Memory after appending to fields: ",psutil.virtual_memory())
    print("GC Counts before del: ", gc.get_count())
    partial_db = None
    print("GC Counts after del: ", gc.get_count())
    gc.collect()
    print("GC Counts after collection: ", gc.get_count())
    print("Memory after freeing partial_db: ", psutil.virtual_memory())

Here is the output after a couple of files:

Working on file  0
Memory before:  svmem(total=67509161984, available=66177449984, percent=2.0, used=846712832, free=33569669120, active=27423051776, inactive=5678043136, buffers=22843392, cached=33069936640, shared=15945728)
GC tracking for partial_db is  True
Memory after loading partial_db:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
Memory after appending to fields:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
GC Counts before del:  (0, 7, 3)
GC Counts after del:  (0, 7, 3)
GC Counts after collection:  (0, 0, 0)
Memory after freeing partial_db:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
Working on file  1
Memory before:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
GC tracking for partial_db is  True
Memory after loading partial_db:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)
Memory after appending to fields:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)
GC Counts before del:  (0, 4, 2)
GC Counts after del:  (0, 4, 2)
GC Counts after collection:  (0, 0, 0)
Memory after freeing partial_db:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)

If I keep letting it run, it will exhaust all the memory and trigger a MemoryError exception.

Does anyone know what I can do to make sure the data used by partial_db actually gets freed?

Best answer

The problem is this:

for x in partial_db:
    fields.append(x[:,start:end])

The reason slicing a numpy array (unlike slicing a plain Python list) takes almost no time and wastes no space is that it doesn't make a copy: it just creates another view into the same array memory. Normally, that's great. But here it means you keep x's memory alive even after you release x itself, because you never release those slices.
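You can see the view relationship directly: a basic slice's .base attribute points at the parent array, and np.shares_memory confirms they use the same buffer. A minimal sketch (assuming float32 data, which would make each (3072, 412) array about 5 MB, so a 5000-array file pins roughly 25 GB, consistent with the jumps in the logs above):

import numpy as np

x = np.zeros((3072, 412), dtype=np.float32)  # stands in for one array from partial_db
s = x[:, 10:20]                              # basic slice: a view, not a copy

print(s.base is x)             # True: the slice references x's buffer
print(np.shares_memory(x, s))  # True: same underlying memory

# Dropping the name x does not free the buffer: the view's .base
# still holds a reference to the full parent array.
del x
print(s.base.nbytes)           # 5062656, i.e. the whole (3072, 412) array is still alive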

There are other ways around it, but the simplest is to append only a copy of the slice:

for x in partial_db:
    fields.append(x[:,start:end].copy())
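Since every array in partial_db reportedly has the same shape, another option (a sketch under that assumption, not from the original answer) is to build one contiguous array per file instead of 5000 small copies; np.stack copies its inputs, so no views into partial_db survive:

import numpy as np

# One contiguous (5000, 3072, 10) float array per file; np.stack copies
# its inputs, so nothing here keeps partial_db's full buffers alive.
fields.append(np.stack([x[:,start:end] for x in partial_db]))

Either way, each retained slice costs 3072 × 10 × 4 bytes (about 120 KB at float32) instead of pinning a ~5 MB parent array.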

On python - reducing memory usage when slicing numpy arrays, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50195197/
