I need to count the number of distinct columns in relatively large arrays.
import numpy as np

def nodistinctcols(M):
    # Collect the repr of every column in a set, then count the entries.
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)

X = np.array([np.random.randint(2, size = 16) for i in range(2**16)])
print("nodistinctcols(X.T)", nodistinctcols(X.T))
The last line takes 20 seconds on my computer, which seems too slow. By contrast, X = np.array([np.random.randint(2, size = 16) for i in range(2**16)]) takes 216 ms. Can nodistinctcols be sped up?
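Best answer
You can use view to change the dtype of M so that an entire row (or column) is viewed as a single array of bytes. np.unique can then be applied to find the unique values: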
import numpy as np

def asvoid(arr):
    """
    View the array as dtype np.void (bytes).
    This views the last axis of ND-arrays as np.void (bytes) so
    comparisons can be performed on the entire row.
    http://stackoverflow.com/a/16840350/190597 (Jaime, 2013-05)
    Some caveats:
    - `asvoid` will work for integer dtypes, but be careful if using asvoid on float
      dtypes, since float zeros may compare UNEQUALLY:
      >>> asvoid([-0.]) == asvoid([0.])
      array([False], dtype=bool)
    - `asvoid` works best on contiguous arrays. If the input is not contiguous,
      `asvoid` will copy the array to make it contiguous, which will slow down the
      performance.
    """
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def nodistinctcols(M):
    MT = asvoid(M.T)
    uniqs = np.unique(MT)
    return len(uniqs)

X = np.array([np.random.randint(2, size = 16) for i in range(2**16)])
print("nodistinctcols(X.T) {}".format(nodistinctcols(X.T)))
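To make the effect of the void view concrete, here is a minimal sketch on a small, made-up array. The 3x4 matrix `M` is an illustrative assumption and does not come from the question. After the view, each column of `M` is a single opaque item, so np.unique compares whole columns at once.

```python
import numpy as np

def asvoid(arr):
    # Same helper as above: view each row of a 2-D array as one np.void item.
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

# Hypothetical example data: columns 0 and 2 are identical.
M = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 0]])

cols = asvoid(M.T)                 # one void item per column of M
n_distinct = len(np.unique(cols))  # duplicates collapse into one entry
print(cols.shape, n_distinct)
```

The view has one item per column, and the two identical columns are counted once.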
Benchmark:
In [20]: %timeit nodistinctcols(X.T)
10 loops, best of 3: 63.6 ms per loop
In [21]: %timeit nodistinctcols_orig(X.T)
1 loops, best of 3: 17.4 s per loop
where nodistinctcols_orig is defined as:
def nodistinctcols_orig(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)
Sanity check passes:
In [24]: assert nodistinctcols(X.T) == nodistinctcols_orig(X.T)
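The same kind of check can be made reproducible with a fixed seed and a smaller array. The sketch below is an assumption-laden variant, not part of the original answer: the helper name `count_distinct_cols_reference`, the seed, and the array size are all made up for illustration. It compares the void-view count against a slow reference that hashes each column's raw bytes.

```python
import numpy as np

def asvoid(arr):
    # View each row of a 2-D array as one np.void item (as in the answer).
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def count_distinct_cols_reference(M):
    # Hypothetical slow reference: put each column's raw bytes in a set.
    return len({column.tobytes() for column in M.T})

rng = np.random.RandomState(0)        # arbitrary fixed seed for repeatability
X = rng.randint(2, size=(2**10, 16))  # smaller than the question's array

fast = len(np.unique(asvoid(X)))      # distinct columns of X.T == distinct rows of X
slow = count_distinct_cols_reference(X.T)
assert fast == slow
print(fast, slow)
```

The reference uses tobytes() rather than repr because NumPy's default print options abbreviate arrays with more than 1000 elements, so repr could make distinct long columns look identical.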
By the way, it might make more sense to define
def num_distinct_rows(M):
    return len(np.unique(asvoid(M)))
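When you want to count the number of distinct columns, simply pass M.T to the function. That way, the function is not slowed down by an unnecessary transpose when you want to count distinct rows.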
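As an alternative to the void view, NumPy 1.13 and later let np.unique take an `axis` argument, which treats each row or column as one item. The sketch below swaps in that technique and is not the method from the answer above. The function names and the small matrix are illustrative assumptions, and no timing comparison is implied.

```python
import numpy as np

def num_distinct_rows_axis(M):
    # axis=0: each row is one item; the result keeps one copy of each row.
    return np.unique(M, axis=0).shape[0]

def num_distinct_cols_axis(M):
    # axis=1: each column is one item; the result keeps one copy of each column.
    return np.unique(M, axis=1).shape[1]

# Hypothetical example data: rows 0 and 2 match, columns 0/2 and 1/3 match.
M = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])

n_rows = num_distinct_rows_axis(M)
n_cols = num_distinct_cols_axis(M)
print(n_rows, n_cols)
```

This version needs no transpose in either direction, since the axis is chosen by the caller.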
Source: a similar question on Stack Overflow about speeding up the count of distinct columns in Python: https://stackoverflow.com/questions/22750318/