python - Numpy normalization code is strangely slow

Tags: python optimization numpy normalization

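I'm putting together some basic Python code that takes a dictionary mapping labels to lists of matrices (the matrices represent categorized images). I'm just trying to subtract the mean image from everything and then center the data on a 0 - 1 scale. For some reason this code runs embarrassingly slowly. Iterating over just 500 48x48 images takes about 10 seconds, which won't scale to the number of images I'm working with. Looking at the cProfile results, most of the time appears to be spent in the _center function.

I feel I'm probably not taking full advantage of numpy here, and I was wondering whether someone more experienced than me has tricks to speed this up, or can point out something stupid I'm doing. The code is below: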

def __init__(self, master_dict, normalization = lambda x: math.exp(x)):
    """
    master_dict should be a dictionary mapping classes to lists of matrices

    example = {
        "cats": [[[]...], [[]...]...],
        "dogs": [[[]...], [[]...]...]
    }

    have to be python lists, not numpy arrays

    normalization represents the 0-1 normalization scheme used. Defaults to simple linear
    """
    normalization = np.vectorize(normalization)
    full_tensor = np.array(reduce(operator.add, master_dict.values()))
    centering = np.sum(np.array(reduce(operator.add, master_dict.values())), axis=0)/len(full_tensor)
    self.data = {key: self._center(np.array(value), centering, normalization) for key,value in master_dict.items()}
    self.normalization = normalization

def _center(self, list_of_arrays, centering_factor, normalization_scheme):
    """
    Centering scheme for arrays
    """
    arrays = list_of_arrays - centering_factor
    normalize = lambda a: (a - np.min(a)) / (np.max(a) - np.min(a))
    return normalization_scheme([normalize(array) for array in arrays])

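Also, before you ask: I don't have much control over the input format, but I could probably figure something out if that really is the limiting factor here.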

Best answer

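Starting from @sethMMorton's changes, I was able to get nearly another factor of two in speed. Most of it comes from vectorizing your normalize function (inside _center), so that _center can process the whole list_of_arrays at once instead of looping over it in a list comprehension. This also gets rid of an extra conversion from numpy array to list and back.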

def normalize(a):
    a -= a.min(1, keepdims=True).min(2, keepdims=True)
    a /= a.max(1, keepdims=True).max(2, keepdims=True)
    return a

Note that I wouldn't define normalize inside the _center call; I'd keep it separate, as shown in this answer. Then, in _center, simply call normalize on the entire list_of_arrays:

def _center(self, list_of_arrays, centering_factor, normalization_scheme):
    """
    Centering scheme for arrays
    """
    list_of_arrays -= centering_factor
    return normalization_scheme(normalize(list_of_arrays))
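As a sanity check, here is a minimal sketch (not part of the original answer) showing that the two-step reduction `min(1, keepdims=True).min(2, keepdims=True)` gives the same result as normalizing each image in a loop. The names `normalize_loop` and `normalize_batch` and the random test data are illustrative assumptions. Unlike the in-place version above, `normalize_batch` works on a copy so the input can be reused for the comparison.

```python
import numpy as np

def normalize_loop(arrays):
    # Per-image min-max scaling, one image at a time (the slow form).
    return np.array([(a - a.min()) / (a.max() - a.min()) for a in arrays])

def normalize_batch(arrays):
    # Same scaling applied to a whole (n, H, W) stack at once.
    a = np.array(arrays, dtype=float)  # copy, so the caller's data is untouched
    a -= a.min(1, keepdims=True).min(2, keepdims=True)  # per-image minimum, shape (n, 1, 1)
    a /= a.max(1, keepdims=True).max(2, keepdims=True)  # per-image range after the shift
    return a

rng = np.random.RandomState(0)
stack = rng.rand(5, 4, 4)  # five hypothetical 4x4 "images"
same = np.allclose(normalize_loop(stack), normalize_batch(stack))
print(same)
```

Because `keepdims=True` leaves the reduced result with shape (n, 1, 1), it broadcasts against the (n, H, W) stack, so every image is shifted and scaled by its own minimum and range.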

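In fact, you could call normalize and _center on the entire full_tensor at the very beginning, without looping at all, but the tricky part is splitting it back up into the dictionary of lists of arrays again. I'll work on that next :P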
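One possible way to do that splitting, offered only as a sketch and not as the answerer's eventual solution, is to record how many images each class contributes and cut the processed stack at the cumulative counts with `np.split`. The function name `center_and_normalize_all` is hypothetical, and the sketch assumes every image has the same HxW shape and every class has at least one image.

```python
import numpy as np

def center_and_normalize_all(master_dict, normalization=np.exp):
    keys = list(master_dict)
    counts = [len(master_dict[k]) for k in keys]
    # One (total_images, H, W) float array holding every image of every class.
    full = np.concatenate([np.asarray(master_dict[k], dtype=float) for k in keys])
    full -= full.mean(0)                                       # subtract the mean image
    full -= full.min(1, keepdims=True).min(2, keepdims=True)   # per-image minimum becomes 0
    full /= full.max(1, keepdims=True).max(2, keepdims=True)   # per-image maximum becomes 1
    full = normalization(full)
    # Cut the stack back into per-class blocks at the cumulative image counts.
    blocks = np.split(full, np.cumsum(counts)[:-1])
    return dict(zip(keys, blocks))

rng = np.random.RandomState(0)
example = {"cats": rng.rand(3, 4, 4).tolist(), "dogs": rng.rand(2, 4, 4).tolist()}
result = center_and_normalize_all(example)
print({k: v.shape for k, v in result.items()})
```

As in the original code, an image that is constant after centering would cause a division by zero here, so real data may need a guard for that case.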


As mentioned in my comment, you could replace:

full_tensor = np.array(reduce(operator.add, master_dict.values()))

with:

full_tensor = np.concatenate(master_dict.values())

This may not be any faster, but it is clearer and is the standard way to do it.
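The equivalence of the two forms can be checked with a small sketch. The toy `master_dict` below is an assumption made for illustration. On Python 3, `reduce` has to be imported from `functools`, and the dict values are wrapped in `list(...)` because `np.concatenate` expects a sequence of arrays.

```python
import operator
from functools import reduce  # a builtin on Python 2
import numpy as np

# Hypothetical input: two classes holding 2x2 "images" as nested lists.
master_dict = {
    "cats": [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]],
    "dogs": [[[9.0, 10.0], [11.0, 12.0]]],
}

# List concatenation via reduce: builds one long Python list, then converts it.
via_reduce = np.array(reduce(operator.add, master_dict.values()))
# np.concatenate joins the per-class lists along axis 0 directly.
via_concat = np.concatenate(list(master_dict.values()))

print(via_reduce.shape, via_concat.shape)
print(np.array_equal(via_reduce, via_concat))
```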

Finally, here are the timings:

>>> timeit slater_init(example)
1 loops, best of 3: 1.42 s per loop

>>> timeit seth_init(example)
1 loops, best of 3: 489 ms per loop

>>> timeit my_init(example)
1 loops, best of 3: 281 ms per loop

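Below is my full timing code. Note that I replaced self.data = ... with return ... so that I could save and compare the outputs to make sure all of our code returns the same data :) Of course, you should test your version against mine as well!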

import operator
import math
from functools import reduce  # a builtin on Python 2; must be imported on Python 3
import numpy as np

#example dict has N keys (integers), each value is a list of n random HxW 'arrays', in list form:
test_shape = 10, 2, 4, 4          # small example for testing
timing_shape = 100, 5, 48, 48     # bigger example for timing
N, n, H, W = timing_shape
example = dict(enumerate(np.random.rand(N, n, H, W).tolist()))

def my_init(master_dict, normalization=np.exp):
    # list(...) lets np.concatenate accept the dict values on Python 3 as well
    full_tensor = np.concatenate(list(master_dict.values()))
    centering = np.mean(full_tensor, 0)
    return {key: my_center(np.array(value), centering, normalization)
                     for key,value in master_dict.items()} #on Python 2, iteritems() avoids building a list
    #self.normalization = normalization

def my_normalize(a):
    a -= a.min(1, keepdims=True).min(2, keepdims=True)
    a /= a.max(1, keepdims=True).max(2, keepdims=True)
    return a

def my_center(arrays, centering_factor, normalization_scheme):
    """
    Centering scheme for arrays
    """
    arrays -= centering_factor
    return normalization_scheme(my_normalize(arrays))

#### sethMMorton's original improvement ####

def seth_init(master_dict, normalization = np.exp):
    """
    master_dict should be a dictionary mapping classes to lists of matrices

    example = {
        "cats": [[[]...], [[]...]...],
        "dogs": [[[]...], [[]...]...]
    }

    have to be python lists, not numpy arrays

    normalization represents the 0-1 normalization scheme used. Defaults to simple linear
    """
    full_tensor = np.array(reduce(operator.add, master_dict.values()))
    centering = np.sum(full_tensor, axis=0)/len(full_tensor)
    return {key: seth_center(np.array(value), centering, normalization) for key,value in master_dict.items()}
    #self.normalization = normalization

def seth_center(list_of_arrays, centering_factor, normalization_scheme):
    """
    Centering scheme for arrays
    """
    def seth_normalize(a):
        a_min = np.min(a)
        return (a - a_min) / (np.max(a) - a_min)
    arrays = list_of_arrays - centering_factor
    return normalization_scheme([seth_normalize(array) for array in arrays])

#### Original code, by slater ####

def slater_init(master_dict, normalization = lambda x: math.exp(x)):
    """
    master_dict should be a dictionary mapping classes to lists of matrices

    example = {
        "cats": [[[]...], [[]...]...],
        "dogs": [[[]...], [[]...]...]
    }

    have to be python lists, not numpy arrays

    normalization represents the 0-1 normalization scheme used. Defaults to simple linear
    """
    normalization = np.vectorize(normalization)
    full_tensor = np.array(reduce(operator.add, master_dict.values()))
    centering = np.sum(np.array(reduce(operator.add, master_dict.values())), axis=0)/len(full_tensor)
    return {key: slater_center(np.array(value), centering, normalization) for key,value in master_dict.items()}
    #self.normalization = normalization

def slater_center(list_of_arrays, centering_factor, normalization_scheme):
    """
    Centering scheme for arrays
    """
    arrays = list_of_arrays - centering_factor
    slater_normalize = lambda a: (a - np.min(a)) / (np.max(a) - np.min(a))
    return normalization_scheme([slater_normalize(array) for array in arrays])
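The output comparison mentioned above could look roughly like the following sketch. The helper name `same_output` and the toy dictionaries are assumptions made for illustration, not part of the original timing code.

```python
import numpy as np

def same_output(result_a, result_b):
    # True when both dicts have the same keys and numerically matching arrays.
    if set(result_a) != set(result_b):
        return False
    return all(np.allclose(result_a[k], result_b[k]) for k in result_a)

# Toy stand-ins for the return values of two *_init variants.
a = {0: np.array([[0.0, 0.5], [1.0, 0.25]]), 1: np.array([[1.0, 0.0], [0.5, 0.5]])}
b = {k: v + 1e-12 for k, v in a.items()}  # differs only by rounding-level noise
c = {k: v + 0.1 for k, v in a.items()}    # genuinely different values

print(same_output(a, b), same_output(a, c))
```

With the functions defined above, the check would be called as same_output(my_init(example), slater_init(example)), and likewise for seth_init.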

Regarding python - Numpy normalization code is strangely slow, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20308305/
