python - Random access to a numpy array saved on disk

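I have a large numpy array A of shape (2_000_000, 2000) and dtype float64, which takes 32 GB.

(Or, alternatively, the same data split into 10 arrays of shape (200_000, 2000), which might be easier to serialize?)

How can we serialize it to disk so that we get fast random read access to any part of the data?

More precisely, I need to be able to read tens of thousands of windows of shape (16, 2000) from A, each starting at a random index i: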

L = []
for _ in range(10_000):
    i = random.randint(0, 2_000_000 - 16)
    window = A[i:i+16, :]         # window of A of shape (16, 2000) starting at a random index i
    L.append(window)
WINS = np.concatenate(L)   # shape (10_000, 16, 2000) of float64, ie: ~ 2.4 GB

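Let's say I only have 8 GB of RAM available for this task; loading the whole 32 GB of A into RAM is out of the question.

How can we read such windows from a numpy array serialized on disk? (.h5 format or any other format)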

Note: the fact that the reads happen at random starting indices is important.
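
To put numbers on the memory constraint described above, here is a minimal back-of-the-envelope sketch. It only uses the sizes given in the question; the variable names are illustrative, and float64 is taken to be 8 bytes per value.

```python
# Sizes taken from the question; names are illustrative only.
rows, cols = 2_000_000, 2_000
win, n_windows = 16, 10_000
itemsize = 8  # bytes per float64 value

full_bytes = rows * cols * itemsize      # the whole array A
window_bytes = win * cols * itemsize     # one (16, 2000) window
wins_bytes = n_windows * window_bytes    # all 10_000 windows together
two_copies = 2 * wins_bytes              # e.g. a list of windows plus a concatenated array

for name, nbytes in [("A", full_bytes), ("one window", window_bytes),
                     ("all windows", wins_bytes), ("two copies", two_copies)]:
    print(f"{name:>12}: {nbytes / 1e9:.3f} GB ({nbytes / 2**30:.3f} GiB)")
```

One set of windows fits comfortably in 8 GB, but holding it twice already uses more than half of that budget.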

Best answer

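This example shows how to use an HDF5 file for the process you describe.

First, create an HDF5 file with a dataset of shape (2_000_000, 2000) and dtype=float64. I used variables for the dimensions so you can modify them.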

import numpy as np
import h5py
import random

h5_a0, h5_a1 = 2_000_000, 2_000

with h5py.File('SO_68206763.h5','w') as h5f:
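    # dtype must be given explicitly: h5py creates float32 datasets when it is omitted
    dset = h5f.create_dataset('test', shape=(h5_a0, h5_a1), dtype='float64')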
    
    incr = 1_000
    a0 = h5_a0//incr
    for i in range(incr):
        arr = np.random.random(a0*h5_a1).reshape(a0,h5_a1)
        dset[i*a0:i*a0+a0, :] = arr       
    print(dset[-1,0:10])  # quick dataset check of values in last row
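
As a possible refinement that is not part of the original answer, the dataset could be created with an explicit chunk shape matched to the window height. The sketch below uses small stand-in dimensions and a hypothetical file name; the `(16, cols)` chunk shape is an assumption of mine, and whether it actually speeds up random reads depends on the storage device and should be measured.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in sizes (assumption) so the sketch runs quickly;
# the real dataset would be 2_000_000 x 2_000.
rows, cols, win = 1_000, 20, 16

path = os.path.join(tempfile.mkdtemp(), "chunk_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as h5f:
    # chunks=(win, cols): each chunk holds 16 complete rows, so a 16-row
    # window starting at any index overlaps at most two chunks.
    dset = h5f.create_dataset("test", shape=(rows, cols),
                              dtype="float64", chunks=(win, cols))
    dset[:] = np.random.random((rows, cols))

with h5py.File(path, "r") as h5f:
    dset = h5f["test"]
    chunk_shape = dset.chunks        # chunk layout stored in the file
    stored_dtype = dset.dtype        # confirms the on-disk dtype
    window = dset[100:100 + win, :]  # one window, read from disk only

print(chunk_shape, stored_dtype, window.shape)
```

Checking `dset.dtype` after creation is a cheap way to confirm that the file really holds float64 values.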

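Next, open the file in read mode, read 10_000 random array slices of shape (16, 2_000), and append them to list L. Finally, convert the list to the array WINS. Note that by default the array will have 2 axes; you need .reshape() if you want 3 axes as in your comment (the reshape is also shown).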

with h5py.File('SO_68206763.h5','r') as h5f:
    dset = h5f['test']
    L = []
    ds0, ds1 = dset.shape[0], dset.shape[1]
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        window = dset[ir:ir+16, :]  # window from dset of shape (16, 2000) starting at random index ir
        L.append(window)
    WINS = np.concatenate(L)   # shape (160_000, 2_000) of float64,
    print(WINS.shape, WINS.dtype)
    WINS = WINS.reshape(10_000, 16, ds1)   # reshaped to (10_000, 16, 2_000) of float64
    print(WINS.shape, WINS.dtype)
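
The 2-axis versus 3-axis behaviour described above can be checked on a toy list. This is a small sketch with stand-in sizes (assumption); it also shows `np.stack`, which is not used in the answer but produces the 3-axis layout directly.

```python
import numpy as np

# Small stand-in sizes (assumption): 5 windows of shape (16, 8).
n_windows, win, cols = 5, 16, 8
L = [np.full((win, cols), float(k)) for k in range(n_windows)]

flat = np.concatenate(L)                        # joins along the existing axis 0
reshaped = flat.reshape(n_windows, win, cols)   # splits axis 0 back into windows
stacked = np.stack(L)                           # adds a new leading axis instead

print(flat.shape, reshaped.shape, stacked.shape)
```

Both routes give the same 3-axis array; `reshape` on the concatenated result returns a view, so it does not add another copy of the data.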

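The above process is not memory-efficient. You end up with 2 copies of the randomly sliced data: one in list L and one in array WINS. If memory is limited, this can be a problem. To avoid the intermediate copy, read the random slices of data directly into a preallocated array. Doing so simplifies the code and reduces the memory footprint. The approach is shown below (WINS2 is a 2-axis array, WINS3 is a 3-axis array).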

with h5py.File('SO_68206763.h5','r') as h5f:
    dset = h5f['test']
    ds0, ds1 = dset.shape[0], dset.shape[1]
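    # Both layouts are filled here for illustration; in practice allocate only
    # the one you need, since each array holds a full copy of all windows.
    WINS2 = np.empty((10_000*16, ds1), dtype=dset.dtype)
    WINS3 = np.empty((10_000, 16, ds1), dtype=dset.dtype)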
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        WINS2[i*16:(i+1)*16,:] = dset[ir:ir+16, :]
        WINS3[i,:,:] = dset[ir:ir+16, :]
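
Since the question also allows "any other format", a plain `.npy` file opened as a memory map is another option worth trying. This is a sketch under stated assumptions rather than part of the original answer: the sizes are small stand-ins, the file name is hypothetical, and for the real 32 GB array the data would have to be written block by block rather than in one assignment.

```python
import os
import random
import tempfile

import numpy as np

# Small stand-in sizes (assumption); the real array would be 2_000_000 x 2_000.
rows, cols, win, n_windows = 1_000, 20, 16, 50

path = os.path.join(tempfile.mkdtemp(), "A_demo.npy")  # hypothetical file name

# Write: open_memmap creates a standard .npy file backed by disk.
A = np.lib.format.open_memmap(path, mode="w+", dtype=np.float64, shape=(rows, cols))
A[:] = np.arange(rows * cols, dtype=np.float64).reshape(rows, cols)
A.flush()
del A

# Read: mmap_mode="r" maps the file instead of loading it into RAM.
A = np.load(path, mmap_mode="r")
WINS = np.empty((n_windows, win, cols), dtype=A.dtype)
starts = [random.randint(0, rows - win) for _ in range(n_windows)]
for k, ir in enumerate(starts):
    WINS[k] = A[ir:ir + win, :]   # only this window's rows are copied into memory

print(WINS.shape, WINS.dtype)
```

Because the array is stored row-major, each (16, cols) window is one contiguous byte range in the file, which suits this access pattern. How it compares with HDF5 in speed would need to be timed on the actual hardware.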

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/68206763/
