python - Access time for a numpy array is affected much more by the last index than by the second-to-last

This is a follow-up to this answer to my previous question, Fastest approach to read thousands of images into one big numpy array.

In chapter 2.3, "Memory allocation of the ndarray", Travis Oliphant writes the following about how indexes are accessed in memory for C-ordered numpy arrays.

...to move through computer memory sequentially, the last index is incremented first, followed by the second-to-last index and so forth.
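
To make the quoted rule concrete, here is a minimal sketch that walks a small C-ordered shape and records where each index tuple lands in the flat buffer. The shape (2, 2, 3) is only an illustrative assumption, not something taken from the benchmark below.

```python
import numpy as np

shape = (2, 2, 3)  # small illustrative shape (an assumption, not the benchmark size)

# np.ndindex yields index tuples in C order: the last index changes fastest.
offsets = []
for idx in np.ndindex(shape):
    # Position of this element in the flat buffer, counted in elements.
    flat = int(np.ravel_multi_index(idx, shape))
    offsets.append(flat)
    print(idx, "->", flat)
```

Stepping the last index by one moves one element forward in memory, stepping the second-to-last moves 3 elements, and stepping the first moves 6, so `offsets` ends up as the integers 0 through 11 in order.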

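This can be confirmed by benchmarking the time it takes to store 2-D arrays along either the two first or the two last indexes (for my purposes, this simulates loading 500 images of 512x512 pixels each):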

import numpy as np

N = 512
n = 500
a = np.random.randint(0,255,(N,N))

def last_and_second_last():
    '''Store along the two last indexes'''
    imgs = np.empty((n,N,N), dtype='uint16')
    for num in range(n):
        imgs[num,:,:] = a
    return imgs

def second_and_third_last():
    '''Store along the two first indexes'''
    imgs = np.empty((N,N,n), dtype='uint16')
    for num in range(n):
        imgs[:,:,num] = a
    return imgs

Benchmark

In [2]: %timeit last_and_second_last()
136 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit second_and_third_last()
1.56 s ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

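So far so good. However, when I load the arrays along the last and third-to-last dimensions, it is almost as fast as loading them into the last two dimensions.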

def last_and_third_last():
    '''Store along the last and first indexes'''
    imgs = np.empty((N,n,N), dtype='uint16')
    for num in range(n):    
        imgs[:,num,:] = a
    return imgs

Benchmark

In [4]: %timeit last_and_third_last()
149 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

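Why is last_and_third_last() so much closer in speed to last_and_second_last() than to second_and_third_last()?
What is a good way to visualize why the last index matters so much more for access speed than the second-to-last index?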

Best answer

I'll try to illustrate the indexing without getting into details such as processor caching.

Let's make a small 3d array with distinctive element values:

In [473]: X = np.mgrid[100:300:100,10:30:10,1:4:1].sum(axis=0)
In [474]: X
Out[474]: 
array([[[111, 112, 113],
        [121, 122, 123]],

       [[211, 212, 213],
        [221, 222, 223]]])
In [475]: X.shape
Out[475]: (2, 2, 3)

ravel views it as a 1d array, and shows us how the values are laid out in memory. (This, incidentally, is with the default C ordering.)

In [476]: X.ravel()
Out[476]: array([111, 112, 113, 121, 122, 123, 211, 212, 213, 221, 222, 223])

When I index on the first dimension I get 2*3 values, a contiguous block of the list above:

In [477]: X[0,:,:].ravel()
Out[477]: array([111, 112, 113, 121, 122, 123])

Indexing on the last dimension instead gives 4 values, picked from across the whole array - I've added .. to highlight that:

In [478]: X[:,:,0].ravel()
Out[478]: array([111,.. 121,.. 211,.. 221])

Indexing on the middle dimension gives me 2 contiguous sub-blocks, i.e. 2 rows of X:

In [479]: X[:,0,:].ravel()
Out[479]: array([111, 112, 113,.. 211, 212, 213])

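With the strides and shape calculations, numpy can access any one element of X in (about) the same time. In the X[:,:,i] case that is exactly what it has to do: the 4 values are "scattered" across the data buffer.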

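But if it can access contiguous blocks, as in X[i,:,:], it can delegate more of the work to low-level compiled and processor code. With X[:,i,:] those blocks aren't as big, but they may still be big enough to make a large difference.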
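
As a rough illustration of the point about strides and contiguous blocks, the sketch below prints the strides of X and of the three slices, converted from bytes to elements, together with numpy's own contiguity flag. The call to np.ascontiguousarray is an added safeguard to pin down C order; it is not part of the original session.

```python
import numpy as np

# Same small array as above; ascontiguousarray is an added safeguard for C order.
X = np.ascontiguousarray(np.mgrid[100:300:100, 10:30:10, 1:4:1].sum(axis=0))
itemsize = X.itemsize

# Strides in elements rather than bytes: how far each index step jumps in the buffer.
elem_strides = tuple(s // itemsize for s in X.strides)
print("X", elem_strides)  # (6, 3, 1)

views = {
    "X[0,:,:]": X[0, :, :],  # one contiguous block of 6 elements
    "X[:,0,:]": X[:, 0, :],  # 2 runs of 3 elements, 6 apart
    "X[:,:,0]": X[:, :, 0],  # 4 single elements, none adjacent
}
for name, view in views.items():
    view_strides = tuple(s // itemsize for s in view.strides)
    print(name, view_strides, "C-contiguous:", view.flags["C_CONTIGUOUS"])
```

Only X[0,:,:] is reported as C-contiguous. X[:,0,:] still has a stride of 1 along its last axis, which is why it can be handled row by row, while X[:,:,0] has no axis with stride 1 at all.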

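In your test case, [n,:,:] iterates 500 times, each time over one contiguous block of 512*512 elements.

[:,n,:] has to split each of those accesses into 512 blocks of 512 elements.

[:,:,n] has to make 500 x 512 x 512 individual accesses in total, one per element.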
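
The block counts above can be checked from the strides alone. The following is a minimal sketch; contiguous_run is a hypothetical helper written for this illustration, and n is reduced to 4 to save memory, on the assumption that the run length per assignment does not depend on n as long as n > 1.

```python
import numpy as np

def contiguous_run(view):
    """Elements per contiguous block of `view` (hypothetical helper, not a numpy API)."""
    run = 1
    expected = view.itemsize  # byte stride that a packed last axis would have
    for size, stride in zip(reversed(view.shape), reversed(view.strides)):
        if size == 1:
            continue  # length-1 axes never break contiguity
        if stride != expected:
            break     # a gap in memory: the contiguous run ends here
        run *= size
        expected *= size
    return run

N, n = 512, 4  # n reduced from 500 to keep the arrays small (an assumption)
targets = {
    "[num,:,:]": np.empty((n, N, N), dtype="uint16")[0, :, :],
    "[:,num,:]": np.empty((N, n, N), dtype="uint16")[:, 0, :],
    "[:,:,num]": np.empty((N, N, n), dtype="uint16")[:, :, 0],
}

results = {}
for name, view in targets.items():
    run = contiguous_run(view)
    results[name] = (view.size // run, run)  # (blocks per assignment, elements per block)
    print(name, "blocks:", view.size // run, "elements per block:", run)
```

Per assignment this gives 1 block of 262144 elements, 512 blocks of 512 elements, and 262144 blocks of a single element, matching the three cases described above.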

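I wonder whether working with uint16 exaggerates the effect. In another question we just showed that calculations with float16 are much slower (up to 10x), because the processor (and compiler) are tuned to work with 32-bit and 64-bit numbers. If the processor is tuned to move blocks of 64-bit numbers around, then moving an isolated 16-bit number could require a lot of extra processing. It would be like copying and pasting from a document word by word, when copying line by line takes fewer keystrokes per copy.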

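The exact details are buried in the processor, operating system and compiler, as well as in the numpy code, but hopefully this gives a feel for why your middle case falls much closer to the optimum than to the worst case.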

On testing - giving imgs the same dtype as a (a.dtype) slowed things down in all cases. So 'uint16' isn't causing any special problems.
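
For anyone who wants to repeat that dtype comparison, here is a hedged sketch of a small timing harness. The fill helper and the reduced sizes (N = 128, n = 20) are assumptions made so that it runs quickly; absolute timings will differ by machine, so no expected numbers are given.

```python
import timeit
import numpy as np

N, n = 128, 20  # reduced from 512 and 500 so the sketch runs quickly (an assumption)
a = np.random.randint(0, 255, (N, N))

def fill(axis, dtype):
    """Store n copies of `a`, with the image counter on the given axis (hypothetical helper)."""
    shape = [N, N]
    shape.insert(axis, n)          # axis 0 -> (n,N,N), 1 -> (N,n,N), 2 -> (N,N,n)
    imgs = np.empty(shape, dtype=dtype)
    index = [slice(None)] * 3
    for num in range(n):
        index[axis] = num          # e.g. imgs[:, num, :] for axis 1
        imgs[tuple(index)] = a
    return imgs

repeats = 5
for dtype in ("uint16", a.dtype):
    for axis in (0, 1, 2):
        total = timeit.timeit(lambda: fill(axis, dtype), number=repeats)
        print(f"dtype={np.dtype(dtype).name} axis={axis}: {total / repeats * 1e3:.2f} ms")
```

Running it with the full sizes from the question should show whether the gap between the three layouts changes with the dtype on a given machine.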

Why does `numpy.einsum` work faster with `float32` than `float16` or `uint16`?

The original question and answer are on Stack Overflow: https://stackoverflow.com/questions/44115571/
