python - numpy memmap memory usage - want to iterate once

Let's say I have some big matrix saved on disk. Storing it all in memory is not really feasible, so I use memmap to access it:

A = np.memmap(filename, dtype='float32', mode='r', shape=(3000000,162))

Now let's say I want to iterate over this matrix (not necessarily in an ordered fashion) such that each row is accessed exactly once.

p = some_permutation_of_0_to_2999999()

I would like to do something like this:

start = 0
end = 3000000
num_rows_to_load_at_once = some_size_that_will_fit_in_memory()
while start < end:
    indices_to_access = p[start:start+num_rows_to_load_at_once]
    do_stuff_with(A[indices_to_access, :])
    start = min(end, start+num_rows_to_load_at_once)

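As this process goes on, my computer becomes slower and slower, and my RAM and virtual memory usage explodes.

Is there some way to force np.memmap to use only up to a certain amount of memory? (I know I won't need more rows than the number I plan to read at one time, and caching won't really help me since I access each row exactly once.)

Alternatively, is there some other way to iterate (generator-like) over an np array in a custom order? I could write it manually using file.seek, but that happens to be much slower than the np.memmap implementation.

do_stuff_with() does not keep any reference to the array it receives, so there are no "memory leaks" in that respect.

Thanks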

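Best answer

This is a problem I have been struggling with for a while. I work with large image datasets, and numpy.memmap offers a convenient way of handling these large sets.

However, as you've pointed out, if I need to access each frame (or row, in your case) to perform some operation, RAM usage will eventually max out.

Fortunately, I recently found a solution that lets you iterate through the entire memmap array while capping RAM usage.

Solution: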

import numpy as np

# create a memmap array
input = np.memmap('input', dtype='uint16', shape=(10000,800,800), mode='w+')

# create a memmap array to store the output
output = np.memmap('output', dtype='uint16', shape=(10000,800,800), mode='w+')

def iterate_efficiently(input, output, chunk_size):
    # create an empty array to hold each chunk
    # the size of this array determines the amount of RAM used
    holder = np.zeros((chunk_size,) + input.shape[1:], dtype=input.dtype)

    # iterate through the input one chunk at a time, add 5, and write to output
    for i in range(0, input.shape[0], chunk_size):
        n = min(chunk_size, input.shape[0] - i) # the last chunk may be shorter
        holder[:n] = input[i:i+n] # read in chunk from input
        holder[:n] += 5 # perform some operation
        output[i:i+n] = holder[:n] # write chunk to output

def iterate_inefficiently(input, output):
    output[:] = input[:] + 5

Timing results:

In [11]: %timeit iterate_efficiently(input,output,1000)
1 loop, best of 3: 1min 48s per loop

In [12]: %timeit iterate_inefficiently(input,output)
1 loop, best of 3: 2min 22s per loop

The array is about 12GB on disk. Using the iterate_efficiently function keeps memory usage at 1.28GB, whereas the iterate_inefficiently function eventually reaches 12GB of RAM.
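As a quick sanity check on those figures, the sizes can be worked out from the shapes and dtype used above. This is only arithmetic on the example's parameters (uint16 elements, a 10000x800x800 array, chunk_size of 1000), not a measurement of actual process memory:

```python
import numpy as np

itemsize = np.dtype('uint16').itemsize        # 2 bytes per element

# full array on disk: 10000 frames of 800x800 uint16
full_bytes = 10000 * 800 * 800 * itemsize

# the in-RAM holder when chunk_size is 1000
holder_bytes = 1000 * 800 * 800 * itemsize

print(full_bytes / 1e9, holder_bytes / 1e9)   # sizes in GB (10**9 bytes)
```

The holder comes out to a tenth of the full array, which matches the 1.28GB figure; the "about 12GB" for the whole file is the same calculation rounded.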

This was tested on Mac OS.
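The example above reads contiguous chunks, while the question asks for rows in a permuted order. A minimal sketch of how the same "copy one bounded chunk into RAM" idea might be adapted is shown below. The names iter_rows_in_order, path, order and chunk_rows are hypothetical and not from the original post. The sketch re-opens the memmap for every chunk and drops it afterwards, on the assumption that this lets the operating system release the mapped pages; whether and when that happens depends on the OS, so treat it as something to measure rather than a guarantee.

```python
import numpy as np

def iter_rows_in_order(path, dtype, shape, order, chunk_rows):
    """Yield (indices, rows) for a 2-D array stored in `path`, following `order`.

    Hypothetical helper: each yielded `rows` is a plain in-memory ndarray
    holding at most `chunk_rows` rows.
    """
    order = np.asarray(order)
    for start in range(0, len(order), chunk_rows):
        sel = order[start:start + chunk_rows]
        sorter = np.argsort(sel)            # read in ascending file order

        # assumption: a fresh mapping per chunk lets the old one be released
        mm = np.memmap(path, dtype=dtype, mode='r', shape=shape)
        rows_sorted = np.array(mm[sel[sorter]])   # copy the chunk into RAM
        del mm

        # undo the sort so rows line up with `sel` again
        rows = np.empty_like(rows_sorted)
        rows[sorter] = rows_sorted
        yield sel, rows
```

With the question's setup this would be used roughly as `for idx, rows in iter_rows_in_order(filename, 'float32', (3000000, 162), p, num_rows_to_load_at_once): do_stuff_with(rows)`. Sorting the indices inside each chunk is a design choice meant to make the disk reads more sequential; it does not change which rows end up in which chunk.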

This question, python - numpy memmap memory usage - want to iterate once, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/45132940/
