python - NumPy 函数标准差的内存消耗

我目前正在使用 GDAL 的 Python 绑定(bind)来处理相当大的栅格数据集(> 4 GB)。由于一次将它们加载到内存中对我来说不是可行的解决方案，所以我将它们读入较小的 block 并逐个进行计算。为避免对每个 block 读取进行新分配，我使用 buf_obj 参数 ( here ) 将值读入预分配的 NumPy 数组。有一次我必须计算整个栅格的均值和标准差。当然，我使用 np.std 进行计算。然而，通过分析我的程序的内存消耗，我意识到每次调用 np.std 都会额外分配和释放内存。

演示此行为的最小工作示例:

In [1]  import numpy as np
In [2]  a = np.random.rand(20e6)  # Approx. 150 MiB of memory
In [3]  %memit np.mean(a)
peak memory: 187.30 MiB, increment: 0.48 MiB
In [4]  %memit np.std(a)
peak memory: 340.24 MiB, increment: 152.91 MiB

在 GitHub 上的 NumPy 源代码树中搜索显示，np.std 函数在内部调用 _methods.py 中的 _var 函数(here)。在某一时刻，_var 计算与平均值的偏差并将它们相加。因此创建了输入数组的临时副本。该函数基本上按如下方式计算标准偏差:

mu = sum(arr) / len(arr)
tmp = arr - mu
tmp = tmp * tmp
sd = np.sum(tmp) / len(arr)

虽然这种方法适用于较小的输入数组，但绝对不能用于较大的输入数组。由于我使用的是如前所述的较小内存块，因此从我程序的内存角度来看，这个额外的副本并不是一个破坏游戏的问题。然而，让我感到困扰的是，对于每个 block ，都会在读取下一个 block 之前进行并释放新的分配。

NumPy 或 SciPy 中是否有一些其他函数利用像 Welford 算法 (Wikipedia) 这样的内存消耗恒定的方法一次性计算均值和标准差？

另一种方法是使用预分配缓冲区(如 NumPy ufuncs)的可选 out 参数实现自定义版本的 _var 函数。使用这种方法，不会消除额外的副本，但至少内存消耗会保持不变，并且每个 block 中的分配的运行时间得以节省。

编辑:按照 kezzos 的建议测试了 Welford 算法的 Cython 实现。

Cython 实现(从 kezzos 修改而来):

cimport cython
cimport numpy as np
from libc.math cimport sqrt

@cython.boundscheck(False)
def iterative_approach(np.ndarray[np.float32_t, ndim=1] a):
    cdef long n = 0
    cdef float mean = 0
    cdef float M2 = 0
    cdef long i
    cdef float delta
    cdef float a_min = 10000000  # Must be set to Inf and -Inf for real cases
    cdef float a_max = -10000000
    for i in range(len(a)):
        n += 1
        delta = a[i] - mean
        mean += delta / n
        M2 += delta * (a[i] - mean)
        if a[i] < a_min:
            a_min = a[i]
        if a[i] > a_max:
            a_max = a[i]
    return a_min, a_max, mean, sqrt(M2 / (n - 1))

NumPy 实现(均值和标准差可以在一个函数中计算):

def vector_approach(a):
    return np.min(a), np.max(a), np.mean(a), np.std(a, ddof=1)

使用随机数据集的测试结果(时间以毫秒为单位，25 次中最好):

----------------------------------
| Size |  Iterative |     Vector |
----------------------------------
|  1e2 |    0.00529 |    0.17149 |
|  1e3 |    0.02027 |    0.16856 |
|  1e4 |    0.17850 |    0.23069 |
|  1e5 |    1.93980 |    0.77727 |
|  1e6 |   18.78207 |    8.83245 |
|  1e7 |  180.04069 |  101.14722 |
|  1e8 | 1789.60228 | 1086.66737 |
----------------------------------

似乎使用 Cython 的迭代方法对于较小的数据集更快，而 NumPy 向量(可能是 SIMD 加速)方法对于具有 10000 多个元素的较大数据集。所有测试均使用 Python 2.7.9 和 NumPy 1.9.2 版进行。

请注意，在实际情况下，上层函数将用于计算单个栅格 block 的统计数据。所有 block 的标准偏差和均值将与维基百科 (here) 中建议的方法相结合。它的优点是不需要对栅格的所有元素求和，从而避免了 float 溢出问题(至少在某种程度上)。

最佳答案

我怀疑您会在 numpy 中找到任何此类函数。 numpy 存在的理由是它利用了 vector processor。指令集——执行大量数据的相同指令。基本上 numpy 以内存效率换取速度效率。然而，由于 Python 的内存密集型特性，numpy 也能够通过将数据类型与整个数组相关联而不是每个单独的元素来实现一定的内存效率。

提高速度但仍然牺牲一些内存开销的一种方法是计算 block 中的标准偏差，例如。

import numpy as np

def std(arr, blocksize=1000000):
    """Written for py3, change range to xrange for py2.
    This implementation requires the entire array in memory, but it shows how you can
    calculate the standard deviation in a piecemeal way.
    """
    num_blocks, remainder = divmod(len(arr), blocksize)
    mean = arr.mean()
    tmp = np.empty(blocksize, dtype=float)
    total_squares = 0
    for start in range(0, blocksize*num_blocks, blocksize):
        # get a view of the data we want -- views do not "own" the data they point to
        # -- they have minimal memory overhead
        view = arr[start:start+blocksize]
        # # inplace operations prevent a new array from being created
        np.subtract(view, mean, out=tmp)
        tmp *= tmp
        total_squares += tmp.sum()
    if remainder:
        # len(arr) % blocksize != 0 and need process last part of array
        # create copy of view, with the smallest amount of new memory allocation possible
        # -- one more array *view*
        view = arr[-remainder:]
        tmp = tmp[-remainder:]
        np.subtract(view, mean, out=tmp)
        tmp *= tmp
        total_squares += tmp.sum()
        
    var = total_squares / len(arr)
    sd = var ** 0.5
    return sd

a = np.arange(20e6)
assert np.isclose(np.std(a), std(a))

显示加速 --- blocksize 越大，加速越大。并且显着降低内存开销。较低的内存开销并非 100% 准确。

In [70]: %timeit np.std(a)
10 loops, best of 3: 105 ms per loop

In [71]: %timeit std(a, blocksize=4096)
10 loops, best of 3: 160 ms per loop

In [72]: %timeit std(a, blocksize=1000000)
10 loops, best of 3: 105 ms per loop

In [75]: %memit np.std(a)
peak memory: 512.70 MiB, increment: 152.59 MiB

In [73]: %memit std(a, blocksize=4096)
peak memory: 360.11 MiB, increment: 0.00 MiB

In [74]: %memit std(a, blocksize=1000000)
peak memory: 360.11 MiB, increment: 0.00 MiB

关于python - NumPy 函数标准差的内存消耗，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/41501891/

python - NumPy 函数标准差的内存消耗

上一篇：python - 如何用 Python 编写一个简单的电子邮件监听器

下一篇：python - 在 AWS Lambda 上使用 Pyenchant，可以加载包但不能加载提供者字典，从 EC2 上的源代码编译并提取 .so 文件