I'm trying to implement Explicit Semantic Analysis (ESA) with Lucene.
How can the TF-IDF of terms within the query be taken into account when matching documents?
For example:
- Query: "a b c a d a"
- Doc1: "a b a"
- Doc2: "a b c"
The query should match Doc1 better than Doc2.
I'd like this to work without hurting performance.
Currently I do it with query boosting, boosting each term in proportion to its TF-IDF.
Is there a better way?
Best Answer
Lucene already supports TF/IDF scoring by default, of course, so I'm not quite sure what you are looking for.
It actually sounds a bit like you want to weight the query terms by their TF/IDF within the query itself. So let's consider those two elements:
- TF: Lucene sums the score of each query clause. If the same term appears twice in the query (e.g. field:(a a b)), the doubled term gets a heavier weight, comparable to (though by no means identical to) a boost of 2.
- IDF: idf refers to data across a corpus of multiple documents. Since there is only one query, it doesn't apply; or, if you want a consistent reading, every term's idf is 1.
So IDF doesn't really mean anything in this context, and TF is already done for you. You don't actually need to do anything.
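For reference, the classic DefaultSimilarity formulas behind the numbers in the explain output below can be sketched in a few lines (tf(freq) = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)) in Lucene's classic similarity; later Lucene versions default to BM25 instead):

```python
import math

# Lucene classic (DefaultSimilarity / TFIDFSimilarity) components.
def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))

# "a" appears in both indexed documents (docFreq=2, maxDocs=2):
print(idf(2, 2))  # ~0.5945349, matching the explain output
# "c" appears in only one of the two documents (docFreq=1, maxDocs=2):
print(idf(1, 2))  # 1.0
# Doc1 "a b a" contains "a" twice:
print(tf(2))      # ~1.4142135
```

These reproduce the idf(docFreq=2, maxDocs=2) = 0.5945349 and tf(freq=2.0) = 1.4142135 figures that appear repeatedly in the explain trees below.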
Keep in mind, though, that there are other scoring elements! The coord factor matters here:
- "a b a" matches four of the six query clauses (a, b, and the repeated a's, but not c or d)
- "a b c" matches five (a, b, c, and the repeated a's, but not d)
So this particular scoring element will score the second document higher.
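The coord computation itself is just overlap / maxOverlap, counting each of the six query clauses separately. A minimal sketch:

```python
# The query "a b c a d a" contributes six clauses; duplicates count separately
# in maxOverlap, and each duplicate clause of a matched term also counts in overlap.
query = "a b c a d a".split()

def coord(doc_text, query_terms):
    """Fraction of query clauses the document matches (overlap / maxOverlap)."""
    doc_terms = set(doc_text.split())
    overlap = sum(1 for t in query_terms if t in doc_terms)
    return overlap / len(query_terms)

print(coord("a b a", query))  # 4/6 ~ 0.6666667, as in coord(4/6) below
print(coord("a b c", query))  # 5/6 ~ 0.8333333, as in coord(5/6) below
```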
Here is the explain output (see IndexSearcher.explain) for document "a b a":
0.26880693 = (MATCH) product of:
  0.40321037 = (MATCH) sum of:
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.07690979 = (MATCH) weight(text:b in 0) [DefaultSimilarity], result of:
      0.07690979 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
  0.6666667 = coord(4/6)
And for document "a b c":
0.43768594 = (MATCH) product of:
  0.52522314 = (MATCH) sum of:
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:b in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.217584 = (MATCH) weight(text:c in 1) [DefaultSimilarity], result of:
      0.217584 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.435168 = queryWeight, product of:
          1.0 = idf(docFreq=1, maxDocs=2)
          0.435168 = queryNorm
        0.5 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.0 = idf(docFreq=1, maxDocs=2)
          0.5 = fieldNorm(doc=1)
  0.8333333 = coord(5/6)
Note that, as desired, the matches on term a receive a higher weight in the first document, and you can also see each duplicated a evaluated separately and added into the score.
Note, however, the difference in coord, and in the idf for term c, in the second document. Those scoring effects simply cancel out the boost you gain by adding multiples of the same term. If you added enough a's to the query they would eventually switch places, though; the match on c is just weighted as that much more important a result.
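As a sanity check on the two explain trees above, the top-level products can be recomputed from the per-clause weights (numbers copied from the explain output; tiny last-digit differences are float rounding):

```python
# Per-clause weights taken from the explain output above.
doc1_clauses = [0.10876686, 0.07690979, 0.10876686, 0.10876686]              # a, b, a, a
doc2_clauses = [0.07690979, 0.07690979, 0.07690979, 0.217584, 0.07690979]    # a, b, a, c, a

# Final score = sum of clause weights * coord.
doc1_score = sum(doc1_clauses) * (4 / 6)
doc2_score = sum(doc2_clauses) * (5 / 6)

print(doc1_score)  # ~0.26880693, matching the explain total for "a b a"
print(doc2_score)  # ~0.43768594, matching the explain total for "a b c"
```

So even though "a b a" gets a larger weight on each a clause, the extra c clause plus the larger coord put "a b c" ahead overall.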
Regarding "lucene - How to use query-term tfidf as a factor in document similarity calculation in Lucene", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23722507/