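python - Adding numpy arrays to HDF5 is slow

This is my first time using hdf5. Could you help me figure out why adding 3D numpy arrays is so slow? Preprocessing takes 3 s, while adding one 3D numpy array (100x512x512) takes 30 s, and that time goes up with every sample.

First I create the hdf file: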

import h5py
import numpy as np

def create_h5(fname_):
  """
  Run only once
  to create h5 file for dicom images
  """
  f = h5py.File(fname_, 'w', libver='latest') 

  dtype_ = h5py.special_dtype(vlen=bytes)


  num_samples_train = 1397
  num_samples_test = 1595 - 1397
  num_slices = 100

  f.create_dataset('X_train', (num_samples_train, num_slices, 512, 512), 
    dtype=np.int16, maxshape=(None, None, 512, 512), 
    chunks=True, compression="gzip", compression_opts=4)
  f.create_dataset('y_train', (num_samples_train,), dtype=np.int16, 
    maxshape=(None, ), chunks=True, compression="gzip", compression_opts=4)
  f.create_dataset('i_train', (num_samples_train,), dtype=dtype_, 
    maxshape=(None, ), chunks=True, compression="gzip", compression_opts=4)          
  f.create_dataset('X_test', (num_samples_test, num_slices, 512, 512), 
    dtype=np.int16, maxshape=(None, None, 512, 512), chunks=True, 
    compression="gzip", compression_opts=4)
  f.create_dataset('y_test', (num_samples_test,), dtype=np.int16, maxshape=(None, ), chunks=True, 
    compression="gzip", compression_opts=4)
  f.create_dataset('i_test', (num_samples_test,), dtype=dtype_, 
    maxshape=(None, ), 
    chunks=True, compression="gzip", compression_opts=4)

  f.flush()
  f.close()
  print('HDF5 file created')

Then I run this code to update the hdf file:

import os
import time

import h5py
import pandas as pd

num_samples_train = 1397
num_samples_test = 1595 - 1397

lbl = pd.read_csv(lbl_fldr + 'stage1_labels.csv')

patients = os.listdir(dicom_fldr)
patients.sort()

f = h5py.File(h5_fname, 'a')  # also tried mode 'r+'

train_counter = -1
test_counter = -1

for sample in range(0, len(patients)):    

    sw_start = time.time()

    pat_id = patients[sample]
    print('id: %s sample: %d \t train_counter: %d test_counter: %d' %(pat_id, sample, train_counter+1, test_counter+1), flush=True)

    sw_1 = time.time()
    patient = load_scan(dicom_fldr + patients[sample])        
    patient_pixels = get_pixels_hu(patient)       
    patient_pixels = select_slices(patient_pixels)

    if patient_pixels.shape[0] != 100:
        raise ValueError('Slices != 100: ', patient_pixels.shape[0])



    row = lbl.loc[lbl['id'] == pat_id]

    if row.shape[0] > 1:
        raise ValueError('Found duplicate ids: ', row.shape[0])

    print('Time preprocessing: %0.2f' %(time.time() - sw_1), flush=True)



    sw_2 = time.time()
    #found test patient
    if row.shape[0] == 0:
        test_counter += 1

        f['X_test'][test_counter] = patient_pixels
        f['i_test'][test_counter] = pat_id
        f['y_test'][test_counter] = -1


    #found train
    else: 
        train_counter += 1

        f['X_train'][train_counter] = patient_pixels
        f['i_train'][train_counter] = pat_id
        f['y_train'][train_counter] = row.cancer.iloc[0]  # scalar label, not a one-row Series

    print('Time saving: %0.2f' %(time.time() - sw_2), flush=True)

    sw_el = time.time() - sw_start
    sw_rem = sw_el* (len(patients) - sample)
    print('Elapsed: %0.2fs \t rem: %0.2fm %0.2fh ' %(sw_el, sw_rem/60, sw_rem/3600), flush=True)


f.flush()
f.close()

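Best answer

The slowness is almost certainly caused by compression and chunking. These are hard to get right. In past projects I have often had to turn compression off because it was too slow, although I have not given up on the idea of compression in HDF5 in general.

First, you should confirm that compression and chunking really are the cause of the performance problem. Turn off chunking and compression (that is, leave out the chunks=True, compression="gzip", compression_opts=4 parameters) and try again. I suspect it will be a lot faster.

If you want to use compression, you have to understand how chunking works, because HDF compresses the data chunk by chunk. Google it, but at the very least read the section on chunking in the h5py docs. The following quote is crucial: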
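The comparison suggested above could be sketched roughly as follows. This is a minimal timing sketch, not the asker's pipeline: the helper name `time_writes`, the small array shape, and the random data are assumptions made so the example runs quickly. Note that the uncompressed variant also leaves out `maxshape`, because HDF5 requires chunked storage for resizable datasets, so h5py would otherwise still chunk the data.

```python
import os
import tempfile
import time

import h5py
import numpy as np


def time_writes(path, n_samples=4, shape=(10, 64, 64), **dataset_kwargs):
    """Write n_samples arrays one by one and return the elapsed seconds."""
    rng = np.random.default_rng(0)
    # Stand-in for one preprocessed scan (hypothetical data).
    sample = rng.integers(-1000, 1000, size=shape, dtype=np.int16)
    with h5py.File(path, "w") as f:
        ds = f.create_dataset("X", (n_samples,) + shape, dtype=np.int16,
                              **dataset_kwargs)
        start = time.perf_counter()
        for i in range(n_samples):
            ds[i] = sample
        f.flush()
        elapsed = time.perf_counter() - start
    return elapsed


with tempfile.TemporaryDirectory() as tmp:
    # No chunks, no compression, no maxshape: contiguous layout.
    t_plain = time_writes(os.path.join(tmp, "plain.h5"))
    # Same settings as in the question: auto-chunking plus gzip level 4.
    t_gzip = time_writes(os.path.join(tmp, "gzip.h5"), chunks=True,
                         compression="gzip", compression_opts=4)

print(f"contiguous, uncompressed: {t_plain:.4f}s")
print(f"chunked + gzip level 4:   {t_gzip:.4f}s")
```

To get numbers that mean something for the real data, the same comparison would have to be run with the actual (100, 512, 512) samples.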

Chunking has performance implications. It’s recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.

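By setting chunks=True you let h5py determine the chunk sizes for you automatically (print the chunks property of the dataset to see what they are). Let's say the chunk size in the first dimension (your sample dimension) is 5. This means that when you add one sample, the underlying HDF library reads from disk every chunk that contains that sample (so in total it reads all 5 samples completely). For each chunk, HDF reads it, decompresses it, adds the new data, compresses it, and writes it back to disk. Needless to say, this is slow. HDF has a chunk cache, so that uncompressed chunks can stay in memory, which mitigates the problem. However, the chunk cache seems to be rather small (see here), so I think all the chunks are swapped in and out of the cache on every iteration of your for loop. I couldn't find any setting in h5py to change the chunk cache size.

You can set the chunk size explicitly by passing a tuple as the chunks keyword argument. With all of this in mind, you can experiment with different chunk sizes. My first experiment would be to set the chunk size in the first (sample) dimension to 1, so that individual samples can be accessed without reading other samples into the cache. Let me know if this helps, I'm curious to know.

Even if you find a chunk size that works well for writing the data, reading may still be slow, depending on which slices you read. When choosing the chunk size, keep in mind how your application typically reads the data. You may have to adapt your file-creation routine to these chunk sizes (for example, by filling the dataset chunk by chunk). Or you may decide that it is simply not worth the effort and create uncompressed HDF5 files.

Finally, I would set shuffle=True in the create_dataset calls. This may give you a better compression ratio. It shouldn't affect performance, though.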
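To see what h5py actually picks, the chunks property can be printed right after creating the dataset, as in the sketch below. It also passes `rdcc_nbytes` to `h5py.File`; this keyword is an addition that is not part of the original answer. As far as I know, h5py 2.9 and later accept it (together with `rdcc_nslots` and `rdcc_w0`) to size the raw data chunk cache. The 64 MiB value is an arbitrary assumption.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunks_demo.h5")

    # Assumption: rdcc_nbytes (chunk cache size in bytes) is supported by the
    # installed h5py (2.9 or later); 64 MiB is an arbitrary illustrative value.
    with h5py.File(path, "w", rdcc_nbytes=64 * 1024**2) as f:
        # Same dataset definition as 'X_train' in the question.
        ds = f.create_dataset("X_train", (1397, 100, 512, 512),
                              dtype=np.int16,
                              maxshape=(None, None, 512, 512), chunks=True,
                              compression="gzip", compression_opts=4)
        auto_chunks = ds.chunks  # chunk shape guessed by h5py
        chunk_bytes = int(np.prod(auto_chunks)) * ds.dtype.itemsize

print("auto-selected chunk shape:", auto_chunks)
print("uncompressed bytes per chunk:", chunk_bytes)
```

If the first entry of the printed chunk shape is larger than 1, every single-sample write touches chunks shared with neighbouring samples, which is the situation described above.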
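A minimal sketch of that first experiment might look like this. The chunk shape `(1, 1, 512, 512)` is one possible choice, not a tested recommendation: it makes each chunk a single 512x512 int16 slice of one sample, which is 512 KiB and so falls inside the 10 KiB to 1 MiB range quoted from the h5py docs. The dataset is kept to 4 samples, and the zero-filled sample is placeholder data.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "explicit_chunks.h5")
    with h5py.File(path, "w") as f:
        # Chunk size 1 along the sample axis: writing sample i never has to
        # read or recompress data that belongs to another sample.
        ds = f.create_dataset("X_train", (4, 100, 512, 512), dtype=np.int16,
                              maxshape=(None, None, 512, 512),
                              chunks=(1, 1, 512, 512),
                              compression="gzip", compression_opts=4)
        sample = np.zeros((100, 512, 512), dtype=np.int16)  # placeholder scan
        ds[0] = sample  # touches only the 100 chunks of sample 0
        chunk_shape = ds.chunks
        chunk_kib = int(np.prod(chunk_shape)) * ds.dtype.itemsize / 1024

print("chunk shape:", chunk_shape)
print("KiB per uncompressed chunk:", chunk_kib)
```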
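Filling a dataset chunk by chunk could be sketched as below, for the case where a chunk spans several samples. The constant `SAMPLES_PER_CHUNK`, the small sample shape, and the synthetic data are all hypothetical. The idea is to buffer samples in memory and write them in one slice assignment that lines up with a chunk boundary, so each chunk is compressed and written once.

```python
import os
import tempfile

import h5py
import numpy as np

SAMPLES_PER_CHUNK = 5        # hypothetical chunk length along the sample axis
SAMPLE_SHAPE = (10, 64, 64)  # small stand-in for (100, 512, 512)
N_SAMPLES = 12

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "batched.h5")
    with h5py.File(path, "w") as f:
        ds = f.create_dataset("X", (N_SAMPLES,) + SAMPLE_SHAPE,
                              dtype=np.int16,
                              chunks=(SAMPLES_PER_CHUNK,) + SAMPLE_SHAPE,
                              compression="gzip", compression_opts=4)
        buffer = []
        start = 0  # index of the first sample held in the buffer
        for i in range(N_SAMPLES):
            # Stand-in for the preprocessed scan of patient i.
            buffer.append(np.full(SAMPLE_SHAPE, i, dtype=np.int16))
            last = i == N_SAMPLES - 1
            if len(buffer) == SAMPLES_PER_CHUNK or last:
                # One write per chunk-aligned block of samples.
                ds[start:start + len(buffer)] = np.stack(buffer)
                start += len(buffer)
                buffer = []
        first_values = ds[:, 0, 0, 0].tolist()

print(first_values)
```

The cost is memory: the buffer holds up to `SAMPLES_PER_CHUNK` samples at once, which matters for full-size scans.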
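A quick way to check what the shuffle filter does for a given kind of data is to write the same array with and without it and compare file sizes. The sketch below does this on small synthetic data; the helper name `written_size` and the random values are assumptions, and the result on real scans may differ.

```python
import os
import tempfile

import h5py
import numpy as np


def written_size(path, data, shuffle):
    """Write data with gzip, optionally shuffled; return file size in bytes."""
    with h5py.File(path, "w") as f:
        f.create_dataset("X", data=data, chunks=(1,) + data.shape[1:],
                         compression="gzip", compression_opts=4,
                         shuffle=shuffle)
    return os.path.getsize(path)


rng = np.random.default_rng(0)
# Synthetic int16 values in a limited range (hypothetical stand-in for scans).
data = rng.integers(0, 1000, size=(4, 10, 64, 64), dtype=np.int16)

with tempfile.TemporaryDirectory() as tmp:
    size_plain = written_size(os.path.join(tmp, "plain.h5"), data, False)
    size_shuffled = written_size(os.path.join(tmp, "shuffled.h5"), data, True)

print("gzip only:      ", size_plain, "bytes")
print("gzip + shuffle: ", size_shuffled, "bytes")
```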

This Q&A on "python - Adding numpy arrays to HDF5 is slow" comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/41771992/
