python - 2D vectorization of unique values per row with a condition

Consider the array and function definition shown below:

import numpy as np

a = np.array([[2, 2, 5, 6, 2, 5],
              [1, 5, 8, 9, 9, 1],
              [0, 4, 2, 3, 7, 9],
              [1, 4, 1, 1, 5, 1],
              [6, 5, 4, 3, 2, 1],
              [3, 6, 3, 6, 3, 6],
              [0, 2, 7, 6, 3, 4],
              [3, 3, 7, 7, 3, 3]])

def grpCountSize(arr, grpCount, grpSize):    
    count = [np.unique(row, return_counts=True)  for row in arr]
    valid = [np.any(np.count_nonzero(row[1] == grpSize) == grpCount) for row in count]
    return valid

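The purpose of the function is to select the rows of array a that contain exactly grpCount groups of identical elements, where each group holds exactly grpSize elements. It returns a boolean mask, one entry per row, that can be used to index a.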

For example:

# which rows have exactly 1 group that holds exactly 2 identical elements?
out = a[grpCountSize(a, 1, 2)]

As expected, the code outputs out = [[2, 2, 5, 6, 2, 5], [3, 3, 7, 7, 3, 3]]. The first output row has exactly 1 group of 2 (i.e., 5,5), and the second output row also has exactly 1 group of 2 (i.e., 7,7).
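To see where those groups come from, a minimal sketch can print the per-row counts that np.unique produces for the first row of a. The names row, values, counts and n_groups are illustrative and not part of the original code.

```python
import numpy as np

# First row of the example array `a`
row = np.array([2, 2, 5, 6, 2, 5])

# return_counts=True returns the sorted unique values and how often each occurs
values, counts = np.unique(row, return_counts=True)
print(values)  # [2 5 6]
print(counts)  # [3 2 1]

# Number of groups whose size is exactly 2 (here: only the value 5)
n_groups = np.count_nonzero(counts == 2)
print(n_groups)  # 1
```

The value 2 occurs three times, so it does not count as a group of 2; only 5 does, which is why the row matches grpCount=1, grpSize=2.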

Similarly:

# which rows have exactly 2 groups that each hold exactly 3 identical elements?
out = a[grpCountSize(a, 2, 3)]

This yields out = [[3, 6, 3, 6, 3, 6]], because only this row has exactly 2 groups that each contain exactly 3 elements (i.e., 3,3,3 and 6,6,6).

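Question: My actual arrays have only 6 columns, but they can have millions of rows. The code works exactly as expected, but it is very slow for long arrays. Is there any way to speed it up?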

Best answer

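np.unique sorts the array, which makes it less efficient for your purpose. Using np.bincount instead will most likely save you some time (depending on the shape of your array and the values in it). You also no longer need np.any: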

def grpCountSize(arr, grpCount, grpSize):    
    count = [np.bincount(row) for row in arr]
    valid = [np.count_nonzero(row == grpSize) == grpCount for row in count]
    return valid
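As a rough illustration of the difference, the sketch below runs np.bincount on the first row of a. Two caveats that the code above does not check: np.bincount only accepts non-negative integers, and its result contains a zero for every value that does not occur, so the comparison is only meaningful for grpSize >= 1.

```python
import numpy as np

row = np.array([2, 2, 5, 6, 2, 5])

# counts[v] is the number of times the value v occurs in the row;
# values that never occur (0, 1, 3, 4) get a count of 0
counts = np.bincount(row)
print(counts)  # [0 0 3 0 0 2 1]

# Exactly one value occurs exactly twice
is_valid = np.count_nonzero(counts == 2) == 1
print(is_valid)  # True
```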

Another approach that can save even more time is to use the same number of bins for every row and build a single 2D array of counts:

def grpCountSize(arr, grpCount, grpSize):
    m = arr.max()
    count = np.stack([np.bincount(row, minlength=m+1) for row in arr])
    return (count == grpSize).sum(1)==grpCount

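A further improvement is to use the vectorized 2D bin count from this post. (Note that the Numba solution tested in that post is faster; I am only providing the numpy solution as an example. You can replace the function with any of those suggested in the linked post.)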

def grpCountSize(arr, grpCount, grpSize):
    count = bincount2D_vectorized(arr)
    return (count == grpSize).sum(1)==grpCount
#from the post above
def bincount2D_vectorized(a):    
    N = a.max()+1
    a_offs = a + np.arange(a.shape[0])[:,None]*N
    return np.bincount(a_offs.ravel(), minlength=a.shape[0]*N).reshape(-1,N)
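The offset trick in bincount2D_vectorized can be traced by hand on a tiny array. In this sketch (the array small and the intermediate names are made up for illustration), each row is shifted into its own range of N bins so that a single np.bincount call counts all rows at once.

```python
import numpy as np

small = np.array([[0, 1, 1],
                  [2, 2, 0]])

N = small.max() + 1                                # 3 bins per row (values 0..2)
offsets = np.arange(small.shape[0])[:, None] * N   # [[0], [3]]
shifted = small + offsets                          # row i uses bins i*N .. i*N + N - 1
print(shifted)
# [[0 1 1]
#  [5 5 3]]

# One flat bincount over all shifted values
flat_counts = np.bincount(shifted.ravel(), minlength=small.shape[0] * N)
print(flat_counts)  # [1 2 0 1 0 2]

# Reshape so that each input row gets its own row of counts
counts = flat_counts.reshape(-1, N)
print(counts)
# [[1 2 0]
#  [1 0 2]]
```

The counts array has shape (number of rows, N), so its memory use grows with the largest value in the input. That is likely fine for small integer values like those in the example, but worth checking before applying it to arrays with large values.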

Output of all the solutions above:

a[grpCountSize(a, 1, 2)]
#array([[2, 2, 5, 6, 2, 5],
#       [3, 3, 7, 7, 3, 3]])
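One way to gain confidence in the faster variants is to compare them with the original function on random data. The sketch below is only a sanity check under stated assumptions: grp_unique, grp_bincount and grp_bincount2d are hypothetical labels for the functions above, and the random test array (1000 rows, values 0 to 9) is made up.

```python
import numpy as np

def grp_unique(arr, grpCount, grpSize):
    # Original approach: per-row np.unique with counts
    counts = [np.unique(row, return_counts=True)[1] for row in arr]
    return np.array([np.count_nonzero(c == grpSize) == grpCount for c in counts])

def grp_bincount(arr, grpCount, grpSize):
    # Per-row np.bincount with a common number of bins
    m = arr.max()
    count = np.stack([np.bincount(row, minlength=m + 1) for row in arr])
    return (count == grpSize).sum(1) == grpCount

def grp_bincount2d(arr, grpCount, grpSize):
    # Single np.bincount call over row-offset values
    N = arr.max() + 1
    offs = arr + np.arange(arr.shape[0])[:, None] * N
    count = np.bincount(offs.ravel(), minlength=arr.shape[0] * N).reshape(-1, N)
    return (count == grpSize).sum(1) == grpCount

rng = np.random.default_rng(0)
test = rng.integers(0, 10, size=(1000, 6))  # non-negative values only

# Compare every group size 1..6 with every group count 0..6
all_equal = True
for size in range(1, 7):
    for cnt in range(0, 7):
        ref = grp_unique(test, cnt, size)
        if not (np.array_equal(ref, grp_bincount(test, cnt, size))
                and np.array_equal(ref, grp_bincount2d(test, cnt, size))):
            all_equal = False
print(all_equal)
```

Group size 0 is deliberately left out: the bincount-based variants would also count values that never appear in a row, so they are not expected to agree with the np.unique version in that case.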

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/66037744/
