lucene - How to use query-term tf-idf as a factor in Lucene's document similarity scoring

Tags: lucene information-retrieval

I am trying to implement Explicit Semantic Analysis (ESA) with Lucene.

How can the TF-IDF of the terms in the query be taken into account when matching documents?

For example:

  • Query: "a b c a d a"
  • Doc1: "a b a"
  • Doc2: "a b c"

The query should match Doc1 better than it matches Doc2.

I would like this to work without hurting performance.

Currently I do this with query boosting, boosting each term in proportion to its TF-IDF.

Is there a better way?
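A minimal sketch of this boosting idea (hypothetical Python helper; it emits classic Lucene query-parser `term^boost` syntax, and it uses only the in-query term frequency as the boost, since that is the part of TF-IDF a single query can supply):

```python
from collections import Counter

def boost_by_query_tf(query):
    # Count how often each term occurs in the query itself, then emit
    # classic Lucene query-parser syntax with that count as the boost.
    counts = Counter(query.split())
    return " ".join(f"{term}^{tf}" for term, tf in counts.items())

print(boost_by_query_tf("a b c a d a"))  # → a^3 b^1 c^1 d^1
```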

Best Answer

Lucene already supports TF/IDF scoring, by default of course, so I'm not quite sure what you're looking for.

It actually sounds a bit like you want to weight query terms according to their TF/IDF within the query itself. So let's consider those two elements:

  • TF: Lucene sums the scores of the query's clauses. If the same term appears twice in the query (e.g. field:(a a b)), the doubled term receives heavier weight, comparable to (though never quite the same as) a boost of 2.

  • IDF: idf refers to statistics across a corpus of multiple documents. Since there is only one query, it doesn't apply. Or, if you prefer to think of it that way, the idf of every term in the query is 1.

So IDF doesn't really mean anything in this context, and TF is already handled for you. You shouldn't really need to do anything.
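The TF point can be illustrated with a toy computation (the per-clause scores below are made up, not real Lucene output): in a BooleanQuery every clause is scored and the results summed, so a repeated term simply contributes its score once per repetition, much like a boost would:

```python
# Made-up per-clause scores, purely illustrative.
term_score = {"a": 0.109, "b": 0.077}

# field:(a a b) -- the duplicated clause for "a" is scored and summed twice,
# so its contribution grows linearly with the repeat count, boost-like.
clauses = ["a", "a", "b"]
total = sum(term_score[t] for t in clauses)
print(round(total, 3))  # → 0.295
```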

Bear in mind, there are other scoring elements at play! The coord factor matters here.

  • a b a matches four query terms (a, b, a, and a; but not c or d)
  • a b c matches five query terms (a, b, c, a, and a; but not d)

So that particular scoring element will score the second document higher.


Here is the explain output (see IndexSearcher.explain) for the document a b a:

0.26880693 = (MATCH) product of:
  0.40321037 = (MATCH) sum of:
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.07690979 = (MATCH) weight(text:b in 0) [DefaultSimilarity], result of:
      0.07690979 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
  0.6666667 = coord(4/6)

And for the document a b c:

0.43768594 = (MATCH) product of:
  0.52522314 = (MATCH) sum of:
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:b in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.217584 = (MATCH) weight(text:c in 1) [DefaultSimilarity], result of:
      0.217584 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.435168 = queryWeight, product of:
          1.0 = idf(docFreq=1, maxDocs=2)
          0.435168 = queryNorm
        0.5 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.0 = idf(docFreq=1, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
  0.8333333 = coord(5/6)

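As a sanity check, the totals in the two explain listings can be reproduced outside Lucene. This is a sketch of the classic DefaultSimilarity formula reconstructed from the output above, with the fieldNorm taken as its decoded value of 0.5 and the docFreq/maxDocs figures as shown:

```python
import math

def idf(doc_freq, max_docs=2):
    # Classic Lucene idf: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

query = ["a", "b", "c", "a", "d", "a"]
doc_freq = {"a": 2, "b": 2, "c": 1, "d": 0}
docs = {"a b a": ["a", "b", "a"], "a b c": ["a", "b", "c"]}

# queryNorm = 1 / sqrt(sum of squared clause weights)
query_norm = 1.0 / math.sqrt(sum(idf(doc_freq[t]) ** 2 for t in query))

def score(doc_terms):
    matched = [t for t in query if t in doc_terms]
    total = 0.0
    for t in matched:
        query_weight = idf(doc_freq[t]) * query_norm
        tf = math.sqrt(doc_terms.count(t))          # tf = sqrt(freq)
        field_weight = tf * idf(doc_freq[t]) * 0.5  # 0.5 = decoded fieldNorm
        total += query_weight * field_weight
    return total * len(matched) / len(query)        # coord(overlap/maxOverlap)

print(round(score(docs["a b a"]), 4))  # ≈ 0.2688, matching 0.26880693
print(round(score(docs["a b c"]), 4))  # ≈ 0.4377, matching 0.43768594
```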
Note that, as desired, the match on the term a receives heavier weight in the first document, and you can also see each separate a evaluated individually and added into the score.

Note also, though, the difference in coord and in the idf of the term "c" for the second document. Those scoring effects just about cancel out the lift gained by adding multiples of the same term. If you add enough a's to the query, though, the two documents will eventually trade places; the single match on c is simply judged the more significant result.
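The "add enough a's and they trade places" claim can be checked with a rough reimplementation of the same classic formula (queryNorm omitted, since it is identical for both documents and cannot affect their relative order; stats as in the explain output above):

```python
import math

def idf(doc_freq, max_docs=2):
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

doc_freq = {"a": 2, "b": 2, "c": 1, "d": 0}
doc1, doc2 = ["a", "b", "a"], ["a", "b", "c"]

def score(query, doc_terms):
    matched = [t for t in query if t in doc_terms]
    total = sum(
        idf(doc_freq[t])                                          # clause weight (queryNorm dropped)
        * math.sqrt(doc_terms.count(t)) * idf(doc_freq[t]) * 0.5  # fieldWeight
        for t in matched
    )
    return total * len(matched) / len(query)                      # coord

# Add a's to the query until Doc1 overtakes Doc2.
for n in range(3, 20):
    query = ["a"] * n + ["b", "c", "d"]
    if score(query, doc1) > score(query, doc2):
        print(f"Doc1 overtakes Doc2 once the query contains {n} a's")
        break
```

With these toy corpus statistics the swap happens at ten a's; the exact crossover point of course depends on the real index statistics.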

This question (lucene - how to use query-term tf-idf as a factor in Lucene's document similarity scoring) was originally asked on Stack Overflow: https://stackoverflow.com/questions/23722507/
