python - Speeding up counting the number of distinct columns

I need to count the number of distinct columns in relatively large arrays.

import numpy as np

def nodistinctcols(M):
    # Store the repr of every column in a set; duplicates collapse.
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)

X = np.array([np.random.randint(2, size=16) for i in range(2**16)])

print("nodistinctcols(X.T)", nodistinctcols(X.T))

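The last line takes 20 seconds on my computer, which seems far too slow. By comparison, constructing X itself takes 216 ms. Can nodistinctcols be sped up?

Best answer

You can use view to change the dtype of M so that an entire row (or column) is viewed as a single array of bytes. np.unique can then be applied to find the unique values: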

import numpy as np

def asvoid(arr):
    """
    View the array as dtype np.void (bytes).

    This views the last axis of ND-arrays as np.void (bytes) so 
    comparisons can be performed on the entire row.
    http://stackoverflow.com/a/16840350/190597 (Jaime, 2013-05)

    Some caveats:
        - `asvoid` will work for integer dtypes, but be careful if using asvoid on float
        dtypes, since float zeros may compare UNEQUALLY:
        >>> asvoid([-0.]) == asvoid([0.])
        array([False], dtype=bool)

        - `asvoid` works best on contiguous arrays. If the input is not contiguous,
        `asvoid` will copy the array to make it contiguous, which will slow down the
        performance.

    """
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def nodistinctcols(M):
    MT = asvoid(M.T)
    uniqs = np.unique(MT)
    return len(uniqs)

X = np.array([np.random.randint(2, size=16) for i in range(2**16)])

print("nodistinctcols(X.T) {}".format(nodistinctcols(X.T)))
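
To make the effect of the void view concrete, here is a minimal sketch on a small hand-written array. The array `small` and the variable names are illustrative assumptions, not part of the answer; `asvoid` is repeated (without its docstring) so the snippet runs on its own.

```python
import numpy as np

def asvoid(arr):
    # Same helper as above: view each run along the last axis as one
    # opaque block of bytes (dtype np.void).
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

# Illustrative 4x3 array: rows 0 and 2 are equal, rows 1 and 3 are equal.
small = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0],
                  [1, 1, 1]])

rows_as_bytes = asvoid(small)

# Each row has collapsed into a single void element, so the last axis
# now has length 1 and np.unique compares whole rows at once.
print(rows_as_bytes.shape)
num_distinct = len(np.unique(rows_as_bytes))
print(num_distinct)
```

Because every row becomes one element, np.unique sorts and compares 4 items here instead of 12, which is where the speedup over building a repr string per column comes from.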

Benchmark:

In [20]: %timeit nodistinctcols(X.T)
10 loops, best of 3: 63.6 ms per loop

In [21]: %timeit nodistinctcols_orig(X.T)
1 loops, best of 3: 17.4 s per loop

where nodistinctcols_orig is defined as:

def nodistinctcols_orig(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)

The sanity check passes:

In [24]: assert nodistinctcols(X.T) == nodistinctcols_orig(X.T)

By the way, it might make more sense to define

def num_distinct_rows(M):
    return len(np.unique(asvoid(M)))

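Then, when you wish to count the number of distinct columns, simply pass M.T to the function. That way, the function is not slowed down by an unnecessary transpose if you wish to use it to count the number of distinct rows.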
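
As a side note, and not part of the answer above: NumPy 1.13 and later accept an `axis` argument in np.unique, which may let you count distinct rows without writing the void view yourself. The sketch below assumes such a NumPy version; the function name `num_distinct_rows_axis` and the test array are hypothetical, and no timing is claimed for it relative to the benchmark above.

```python
import numpy as np

def num_distinct_rows_axis(M):
    # Hypothetical alternative: let np.unique deduplicate along axis 0
    # (whole rows) and count how many rows remain.
    return np.unique(M, axis=0).shape[0]

# Illustrative data: 2**10 rows of 4 random bits each.
rng = np.random.default_rng(0)
X = rng.integers(2, size=(2**10, 4))

# Distinct rows of X.
print(num_distinct_rows_axis(X))
# Distinct columns of X: pass the transpose, as suggested above.
print(num_distinct_rows_axis(X.T))
```

If you try this, it is worth timing it against the void-view version on your own data before choosing one.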

Regarding python - speeding up counting the number of distinct columns, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22750318/
