algorithm - 在局部敏感散列中搜索

我正在尝试理解 this paper 的第 5 节关于 LSH，特别是如何存储生成的哈希值。引用链接的论文:

Given bit vectors consisting of d bits each, we choose N = O(n 1/(1+epsilon) ) random permutations of the bits. For each random permutation σ, we maintain a sorted order O σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following: For each permu- tation σ, we perform a binary search on O σ to locate the two bit vectors closest to q (in the lexicographic order ob- tained by bits permuted by σ). We now search in each of the sorted orders O σ examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q. This can be done by maintaining two pointers for each sorted order O σ (one moves up and the other down). At each step we move one of the pointers up or down corresponding to the element with the longest matching prefix. (Here the length of the longest matching prefix in O σ is computed relative to q with its bits permuted by σ). We examine 2N = O(n 1/(1+epsilon) ) bit vec- tors in this way. Of all the bit vectors examined, we return the one that has the smallest Hamming distance to q.

我对这个算法感到困惑，我不认为我理解它是如何工作的。

我已经找到this question关于这个话题，但我不明白评论中的答案。也在 this第 2 点中的问题描述了相同的算法，但我还是不明白它是如何工作的。

能否请您尝试向我解释它是如何逐步工作的，并尽可能简单？

我什至试图列出我不明白的东西，但实际上写得太糟糕了，我不明白大部分句子!

在 gsamaras 回答后编辑:

我大致看懂了答案，但还是有一些疑惑:

执行 N 排列的总成本是 O(Nnlogn) 是否正确，因为我们必须对它们中的每一个进行排序？
上面描述的排列+排序过程只在预处理期间执行一次，或者每个查询q？即使在预处理中，它似乎已经相当昂贵了 O(Nnlogn)，如果我们必须在查询时执行此操作，那将是一场灾难:D
在最后一点，我们将 v0 和 v4 与 q 进行比较，我们比较它们的置换版本或原始版本(在排列之前)？

最佳答案

这个问题有点宽泛，所以我在这里只举一个最小的(抽象的)例子:

我们的数据集中有 6 (= n) 个向量，每个向量有 d 位。假设我们进行 2 (= N) 次随机排列。

让第一个随机排列开始吧!请记住，我们排列的是位，而不是向量的顺序。置换位后，它们保持顺序，例如:

v1
v5
v0
v3
v2
v4

现在查询向量 q 到达了，但它(几乎)不太可能与我们数据集中的向量相同(在排列之后) , 因此我们不会通过执行二进制搜索找到它。

但是，我们将在两个向量之间结束。所以现在我们可以想象场景是这样的(例如q位于v0和v3之间:

v1
v5
v0 <-- up pointer
   <-- q lies here
v3 <-- down pointer
v2
v4

现在我们向上或向下移动指针，寻找与 q 匹配最多位的 vi 向量。假设它是 v0。

类似地，我们进行第二次排列，找到向量 vi，假设为 v4。我们现在比较第一个排列的 v0 和 v4，看看哪个最接近 q，即哪个与 q 相同的位数最多。

编辑:

Is it correct to say that the total cost of performing the N permutations is O(Nnlogn), since we have to sort each one of them?

如果他们真的从头开始对每个排列进行排序，那么是，但我不清楚他们是如何做到的。

The permutation+sorting process described above is performed only once during the pre-processing or for every query q?

一次。

At the last point, where we compare v0 and v4 to q, we compare their permuted version or the original one (before their permutation)?

我认为他们使用置换版本来做到这一点(请参阅论文中 2N 之前的括号)。但这不会有任何区别，因为它们也使用相同的排列 (σ) 排列 q。

这quora answer也可能会有所启发。

关于algorithm - 在局部敏感散列中搜索，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/37377042/

algorithm - 在局部敏感散列中搜索

上一篇：string - 这已经是一个字符串相似度算法了吗？

下一篇：c - 修改堆栈中字符串的算法