我有一个问题,我想识别和删除逻辑矩阵中作为其他列子集的列。即 [1, 0, 1] 是 [1, 1, 1] 的子集;但是 [1, 1, 0] 和 [0, 1, 1] 都不是彼此的子集。我写了一段快速代码来识别子集的列,它使用一对嵌套的 for 循环进行 (n^2-n)/2 检查。
import numpy as np
A = np.array([[1, 0, 0, 0, 0, 1],
[0, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 0, 1, 0, 1, 0]])
rows,cols = A.shape
columns = [True]*cols
for i in range(cols):
for j in range(i+1,cols):
diff = A[:,i]-A[:,j]
if all(diff >= 0):
print "%d is a subset of %d" % (j, i)
columns[j] = False
elif all(diff <= 0):
print "%d is a subset of %d" % (i, j)
columns[i] = False
B = A[:,columns]
解决方案应该是
>>> print B
[[1 0 0]
[0 1 1]
[1 1 0]
[1 0 1]
[1 0 1]
[1 0 0]
[0 1 1]
[0 1 0]]
不过,对于大型矩阵,我确信有一种方法可以更快地完成此操作。一种想法是在我进行时消除子集列,这样我就不会检查已知是子集的列。另一个想法是对其进行矢量化,这样就没有 O(n^2) 操作。谢谢。
最佳答案
由于我实际处理的 A
矩阵是 5000x5000 并且密度大约为 4% 的稀疏矩阵,我决定尝试结合 Python 的“集合”对象的稀疏矩阵方法。总的来说,它比我原来的解决方案快得多,但我觉得我从矩阵 A
到集合列表 D
的过程并没有那么快。任何关于如何更好地做到这一点的想法都将受到赞赏。
解决方案
import numpy as np
A = np.array([[1, 0, 0, 0, 0, 1],
[0, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 0, 1, 0, 1, 0]])
rows,cols = A.shape
drops = np.zeros(cols).astype(bool)
# sparse nonzero elements
C = np.nonzero(A)
# create a list of sets containing the indices of non-zero elements of each column
D = [set() for j in range(cols)]
for i in range(len(C[0])):
D[C[1][i]].add(C[0][i])
# find subsets, ignoring columns that are known to already be subsets
for i in range(cols):
if drops[i]==True:
continue
col1 = D[i]
for j in range(i+1,cols):
col2 = D[j]
if col2.issubset(col1):
# I tried `if drops[j]==True: continue` here, but that was slower
print "%d is a subset of %d" % (j, i)
drops[j] = True
elif col1.issubset(col2):
print "%d is a subset of %d" % (i, j)
drops[i] = True
break
B = A[:, ~drops]
print B
关于python - 二维矩阵 : Finding and deleting columns that are subsets of other columns,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/37603764/