algorithm - 图问题松弛二分维简例的快速逼近

给定 bool 矩阵 M，我需要找到一组子矩阵 A = {A₁, ..., A_n} 这样 A 中的矩阵包含矩阵 M 中的所有 True 值并且只包含它们。子矩阵不必是连续的，即每个子矩阵由两组索引 {i₁, ..., i_k} 定义em>, {j₁, ..., j_t} 的 M。 (例如，子矩阵可以是 [{1, 2, 5}, {4, 7, 9, 13}] 之类的东西，它是这些行和列相交处的所有单元格。)可选地，子矩阵可以相交，如果这会产生更好的结果解决方案。子矩阵的总数 n 应该是最小的。

矩阵 M 的大小可以达到 10^4 x 10^4，所以我需要一个有效的算法。我想这个问题可能没有有效的精确算法，因为它让我想起了一些 NP-hard 问题。如果这是真的，那么任何好的和快速的近似都是可以的。我们还可以建议真实值的数量不是很大，即 < 所有值的 1/10，但为了避免在产品中出现意外 DOS，不使用此事实的解决方案更好。

我不需要任何代码，只要算法的一般概念及其属性的合理性(如果不是很明显的话)。

背景

我们正在为物流应用程序计算一些昂贵的距离矩阵。这些请求中的点通常是相交的，所以我们正在尝试开发一些缓存算法来不计算某些请求的部分。并将大请求拆分为只有未知子矩阵的小请求。此外，该算法可能不需要矩阵中的某些距离。一方面，少量的大组计算速度更快，另一方面，如果我们包含大量“假”值，并且我们的子矩阵大得不合理，这会减慢计算速度。确切的标准是复杂的，“昂贵”矩阵请求的时间复杂度很难估计。据我所知，对于方阵来说，它类似于 C*n^2.5，C 很大。因此很难制定一个好的优化标准，但欢迎提出任何想法。

关于数据

矩阵中的真值意味着这两个点之间的距离以前从未计算过。大多数请求(但不是全部)都是在两个轴上具有相同点的方阵。因此，大多数 M 预计几乎是对称的。还有一个简单的例子，几个全新的点和其他距离被缓存。我在预处理阶段处理这种情况。所有其他值都可以是非常随机的。如果它们太随机，我们可以放弃缓存并计算完整矩阵 M。但有时会有有用的模式。我认为由于数据的性质，预计它包含比随机数据更多的大 sumbatrices。大多数真值是偶然的，但形成我们需要找到的子矩阵模式。但我们不能完全依赖这一点，因为如果算法得到的矩阵太随机，它至少应该能够检测到它，而不是进行太长和复杂的计算。

更新

如 wikipedia 中所述这个问题称为图的二分维，已知是 NP-hard。因此，我们可以重新制定它的信息，为问题的简单情况找到快速松弛的近似值。我们可以允许一定比例的错误值，我们可以采用一些简单但最有效的贪婪启发式算法。

最佳答案

在您提供更新之前，我开始研究下面的算法。

此外，在这样做的过程中，我意识到虽然人们正在寻找真实值的 block ，但问题不在于 block 转换，正如您现在也更新的那样。

算法如下:

计算每行中的真值

对于任何具有最大真值的行，对列中的列进行排序矩阵使得该行的真值全部向左移动

对矩阵行进行降序排列左(现在将有一个左上角的全等三角形粗略三角形)

得到左上角最大的真值矩形

存储该矩形的行 ID 和列 ID(这是一个子矩阵定义)

将子矩阵的真值改为假值

从上往下重复，直到左上三角不为真

该算法将生成由仅包含真值的行列交集子矩阵组成的 bool 矩阵的完整覆盖。

我不确定在子矩阵中允许一些错误是否有帮助。虽然它将允许找到更大的子矩阵并因此减少 bool 矩阵查找覆盖的遍数，但可能需要更长的时间才能找到最大的此类子矩阵，因为要检查的组合更多。另外，我不确定如何阻止虚假子矩阵重叠。它可能需要维护一个单独的掩码矩阵而不是使用 bool 矩阵作为它自己的掩码，以确保不相交的子矩阵。

下面是上述算法在 python 中的首次实现。

我在 Intel Pentium N3700 @ 1.60Ghz 和 4GB RAM 的 Windows 10 上运行它按照原样，它会随机生成 ~10% 的正确率:

100 行 x 1000 列 < 7 秒

1000 行 x 100 列 < 6 秒

300 行 x 300 列 < 14 秒

3000 行 x 300 列 < 3 分钟

300 行 x 3000 列 < 15 分钟

1000 行 x 1000 列 < 8 分钟

我没有在近似对称矩阵上测试过，也没有在子矩阵比较大的矩阵上测试过。它可能对相对较大的子矩阵表现良好，例如，在极端情况下，即整个 bool 矩阵为真，只需要两次算法循环。

我认为可以进行大量优化的一个领域是行排序。下面的实现使用带有比较函数的内置 phython 排序。定制的排序函数可能会做得更好，如果它是类似于列排序的虚拟排序，可能尤其如此。

如果你能在一些真实的数据上尝试它，即方阵，近似对称矩阵，具有相对较大的子矩阵，那么最好知道它是如何进行的。

如果你愿意我尝试对 python 进行一些优化，请告知。我假设处理 10^4 x 10^4 bool 矩阵需要更快。

from functools import cmp_to_key booleanMatrix0 = [ ( 0, 0, 0, 0, 1, 1 ), ( 0, 1, 1, 0, 1, 1 ), ( 0, 1, 0, 1, 0, 1 ), ( 1, 1, 1, 0, 0, 0 ), ( 0, 1, 1, 1, 0, 0 ), ( 1, 1, 0, 1, 0, 0 ), ( 0, 0, 0, 0, 0, 0 ) ] booleanMatrix1 = [ ( 0, ) ] booleanMatrix2 = [ ( 1, ) ] booleanMatrix3 = [ ( 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 0 ) ] booleanMatrix4 = [ ( 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 1, 1, 1 ) ] booleanMatrix14 = [ ( 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0 ), ( 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1 ), ( 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0 ), ( 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0 ), ( 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1 ), ( 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1 ), ( 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1 ), ( 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1 ), ( 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1 ), ( 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ), ( 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ) ] booleanMatrix15 = [ ( 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ), ] booleanMatrix16 = [ ( 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ), ( 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ), ( 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ), ( 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ), ( 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1 ), ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1 ), ] import random booleanMatrix17 = [ ] for r in range(11): row = [] for c in range(21): if random.randrange(5) == 1: row.append(random.randrange(2)) else: row.append(0) booleanMatrix17.append(tuple(row)) booleanMatrix18 = [ ] for r in range(21): row = [] for c in range(11): if random.randrange(5) == 1: row.append(random.randrange(2)) else: row.append(0) booleanMatrix18.append(tuple(row)) booleanMatrix5 = [ ] for r in range(50): row = [] for c in range(200): row.append(random.randrange(2)) booleanMatrix5.append(tuple(row)) booleanMatrix6 = [ ] for r in range(200): row = [] for c in range(50): row.append(random.randrange(2)) booleanMatrix6.append(tuple(row)) booleanMatrix7 = [ ] for r in range(100): row = [] for c in range(100): row.append(random.randrange(2)) booleanMatrix7.append(tuple(row)) booleanMatrix8 = [ ] for r in range(100): row = [] for c in range(1000): if random.randrange(5) == 1: row.append(random.randrange(2)) else: row.append(0) booleanMatrix8.append(tuple(row)) booleanMatrix9 = [ ] for r in range(1000): row = [] for c in range(100): if random.randrange(5) == 1: row.append(random.randrange(2)) else: row.append(0) booleanMatrix9.append(tuple(row)) booleanMatrix10 = [ ] for r in range(317): row = [] for c in range(316): if random.randrange(5) == 1: row.append(random.randrange(2)) else: row.append(0) booleanMatrix10.append(tuple(row)) booleanMatrix11 = [ ] for r in range(3162): row = [] for c in range(316): if random.randrange(5) == 1: row.append(random.randrange(2)) else: row.append(0) booleanMatrix11.append(tuple(row)) booleanMatrix12 = [ ] for r in range(316): row = [] for c in range(3162): if random.randrange(5) == 1: row.append(random.randrange(2)) else: row.append(0) booleanMatrix12.append(tuple(row)) booleanMatrix13 = [ ] for r in range(1000): row = [] for c in range(1000): if random.randrange(5) == 1: row.append(random.randrange(2)) else: row.append(0) booleanMatrix13.append(tuple(row)) booleanMatrices = [ booleanMatrix0, booleanMatrix1, booleanMatrix2, booleanMatrix3, booleanMatrix4, booleanMatrix14, booleanMatrix15, booleanMatrix16, booleanMatrix17, booleanMatrix18, booleanMatrix6, booleanMatrix5, booleanMatrix7, booleanMatrix8, booleanMatrix9, booleanMatrix10, booleanMatrix11, booleanMatrix12, booleanMatrix13 ] def printMatrix(matrix, colOrder): for r in range(rows): row = "" for c in range(cols): row += str(matrix[r][0][colOrder[c]]) print(row) print() def rowUp(matrix): rowCount = [] maxRow = [ 0, 0 ] for r in range(rows): rowCount.append([ r, sum(matrix[r][0]) ]) if rowCount[-1][1] > maxRow[1]: maxRow = rowCount[-1] return rowCount, maxRow def colSort(matrix): # For a row with the highest number of trues, sort the true columns to the left newColOrder = [] otherCols = [] for c in range(cols): if matrix[maxRow[0]][0][colOrder[c]]: newColOrder.append(colOrder[c]) else: otherCols.append(colOrder[c]) newColOrder += otherCols return newColOrder def sorter(a, b): # Sort rows according to leading trues length = len(a) c = 0 while c < length: if a[0][colOrder[c]] == 1 and b[0][colOrder[c]] == 0: return -1 if b[0][colOrder[c]] == 1 and a[0][colOrder[c]] == 0: return 1 c += 1 return 0 def allTrues(rdx, cdx, matrix): count = 0 for r in range(rdx+1): for c in range(cdx+1): if matrix[r][0][colOrder[c]]: count += 1 else: return return rdx, cdx, count def getBiggestField(matrix): # Starting at (0, 0) find biggest rectangular field of 1s biggestField = (None, None, 0) cStop = cols for r in range(rows): for c in range(cStop): rtn = allTrues(r, c, matrix) if rtn: if rtn[2] > biggestField[2]: biggestField = rtn else: cStop = c break; if cStop == 0: break return biggestField def mask(matrix): maskMatrix = [] for r in range(rows): row = [] for c in range(cols): row.append(matrix[r][0][c]) maskMatrix.append([ row, matrix[r][1] ]) maskRows = [] for r in range(biggestField[0]+1): maskRows.append(maskMatrix[r][1]) for c in range(biggestField[1]+1): maskMatrix[r][0][colOrder[c]] = 0 maskCols= [] for c in range(biggestField[1]+1): maskCols.append(colOrder[c]) return maskMatrix, maskRows, maskCols # Add a row id to each row to keep track of rearranged rows rowIdedMatrices = [] for matrix in booleanMatrices: rowIdedMatrix = [] for r in range(len(matrix)): rowIdedMatrix.append((matrix[r], r)) rowIdedMatrices.append(rowIdedMatrix) import time for matrix in rowIdedMatrices: rows = len(matrix) cols = len(matrix[0][0]) colOrder = [] for c in range(cols): colOrder.append(c) subMatrices = [] startTime = time.thread_time() loopStart = time.thread_time() loop = 1 rowCount, maxRow = rowUp(matrix) ones = 0 for row in rowCount: ones += row[1] print( "_________________________\n", "Rows", rows, "Columns", cols, "Ones", str(int(ones * 10000 / rows / cols) / 100) +"%") colOrder = colSort(matrix) matrix.sort(key=cmp_to_key(sorter)) biggestField = getBiggestField(matrix) if biggestField[2] > 0: maskMatrix, maskRows, maskCols = mask(matrix) subMatrices.append(( maskRows, maskCols )) while biggestField[2] > 0: loop += 1 rowCount, maxRow = rowUp(maskMatrix) colOrder = colSort(maskMatrix) maskMatrix.sort(key=cmp_to_key(sorter)) biggestField = getBiggestField(maskMatrix) if biggestField[2] > 0: maskMatrix, maskRows, maskCols = mask(maskMatrix) subMatrices.append(( maskRows, maskCols) ) if loop % 100 == 0: print(loop, time.thread_time() - loopStart) loopStart = time.thread_time() endTime = time.thread_time() print("Sub-matrices:", len(subMatrices), endTime - startTime) for sm in subMatrices: print(sm) print() input("Next matrix")

关于algorithm - 图问题松弛二分维简例的快速逼近，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/74268238/

algorithm - 图问题松弛二分维简例的快速逼近

上一篇：c++ - GCC 或 Clang 能否针对调用堆栈内存占用进行优化？

下一篇：api - 使用 websockets 设计 API 的最佳方式是什么？