algorithm - Is there a better way to find set intersection for search engine code?

Tags: algorithm set search-engine intersection information-retrieval

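I have been writing a small search engine and need to find out whether there is a faster way to compute set intersections. Currently I am using sorted linked lists, as explained in most search engine algorithms: for each word I keep a sorted list of documents, and I then find the intersection of those lists.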
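For reference, the baseline described in the question can be sketched roughly as follows. This is a minimal illustration, assuming each posting list is an ascending Python list of integer doc IDs; the name `intersect_sorted` is hypothetical and not taken from the asker's code.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of doc IDs with a linear merge."""
    i, j = 0, 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Doc ID present in both lists: keep it and advance both cursors.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a is behind: advance it.
            i += 1
        else:
            # b is behind: advance it.
            j += 1
    return result
```

Each call walks both lists once, so the cost is proportional to the combined length of the two lists, even when one of them is very short.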

A performance profile of this case is here. Are there any other ideas for faster set intersection?

Best answer

An efficient way to do it is "zig-zag":

Assuming your terms are in a list T:

lastDoc <- 0 //the first doc in the collection
currTerm <- 0 //the first term in T
while (lastDoc != infinity):
  if (currTerm > T.last): //if we have passed the last term:
     insert lastDoc into result
     currTerm <- 0
     lastDoc <- lastDoc + 1
     continue
  docID <- T[currTerm].getFirstAfter(lastDoc-1)
  if (docID != lastDoc):
     lastDoc <- docID
     currTerm <- 0
  else: 
     currTerm <- currTerm + 1

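This algorithm assumes an efficient getFirstAfter(), which returns the first document that matches the term and whose docID is greater than the specified parameter. It should return infinity if there is no such document.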
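The pseudocode above could be rendered in Python roughly as below. This is only a sketch under stated assumptions: the `PostingList` class and its `get_first_after` method are hypothetical stand-ins for getFirstAfter(), backed here by a sorted in-memory list and binary search; a real index might use skip lists or another structure instead. Doc IDs are assumed to be non-negative integers.

```python
import math
from bisect import bisect_right


class PostingList:
    """Hypothetical posting list: ascending doc IDs for one term."""

    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)

    def get_first_after(self, doc_id):
        """Return the smallest doc ID strictly greater than doc_id, or infinity."""
        i = bisect_right(self.doc_ids, doc_id)
        return self.doc_ids[i] if i < len(self.doc_ids) else math.inf


def zigzag_intersect(terms):
    """Return doc IDs present in every posting list in `terms`."""
    result = []
    if not terms:
        return result
    last_doc = 0   # current candidate doc ID
    curr_term = 0  # index of the term being checked
    while last_doc != math.inf:
        if curr_term >= len(terms):
            # Every term matched last_doc: record it and move on.
            result.append(last_doc)
            curr_term = 0
            last_doc += 1
            continue
        doc_id = terms[curr_term].get_first_after(last_doc - 1)
        if doc_id != last_doc:
            # This term skips past the candidate: restart from the new one.
            last_doc = doc_id
            curr_term = 0
        else:
            curr_term += 1
    return result
```

For example, intersecting the lists `[2, 5, 9]`, `[1, 2, 5, 7, 9]`, and `[2, 3, 9]` (rarest first) yields `[2, 9]`; the candidate 5 is dropped as soon as the third list jumps to 9.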

The algorithm is most efficient when the terms are sorted so that the rarest term comes first.
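One simple way to get that ordering, assuming the index is a plain dict mapping each term to its list of doc IDs (an assumption for illustration; `order_rarest_first` is a hypothetical helper), is to sort the query terms by posting-list length:

```python
def order_rarest_first(postings):
    """Return the terms of `postings` ordered from fewest to most matching docs.

    `postings` is assumed to map term -> list of doc IDs.
    """
    return sorted(postings, key=lambda term: len(postings[term]))
```

With the rarest term driving the loop, most candidate documents come from the shortest list, so the longer lists are mostly skipped over rather than scanned.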

The algorithm guarantees at most #docs_matching_first_term * #terms iterations, but in practice it usually needs far fewer.

More information can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].

Original question on Stack Overflow: https://stackoverflow.com/questions/9209693/