Java Lucene: best match is not the exact match

Tags: java lucene

Lucene scoring seems completely incomprehensible to me.

I have a set of documents with the following titles:

Senior Education Recruitment Consultant
Senior IT Recruitment Consultant
Senior Recruitment Consultant

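These were analyzed with EnglishAnalyzer.

The search query is built with QueryParser, also using EnglishAnalyzer.

When I search for Senior Recruitment Consultant, all of the documents above come back with the same score, whereas the desired (and expected) result would be Senior Recruitment Consultant as the top hit.

Is there a straightforward way to get the desired behavior that I have missed?

Here is my debug output: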

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22157) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)
  2.3421772 = (MATCH) weight(Title:recruit in 22157) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)
  1.2005073 = (MATCH) weight(Title:consult in 22157) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22292) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)
  2.3421772 = (MATCH) weight(Title:recruit in 22292) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)
  1.2005073 = (MATCH) weight(Title:consult in 22292) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22494) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)
  2.3421772 = (MATCH) weight(Title:recruit in 22494) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)
  1.2005073 = (MATCH) weight(Title:consult in 22494) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)


Senior Education Recruitment Consultant 4.6491017
Senior IT Recruitment Consultant 4.6491017
Senior Recruitment Consultant 4.6491017

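Best answer

The only scoring element you can rely on here is the length norm.

The lengthNorm is stored with the document at index time, combined with the field's boost. It gives shorter fields (those with fewer terms) a higher score.

So why isn't it working? You have two problems:

First: norms are stored with extremely lossy compression. They occupy only one byte and have roughly one significant decimal digit of precision. So, basically, the difference is not big enough to affect the score.

As for the rationale behind this lossiness, from the DefaultSimilarity documentation: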

...given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
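To make the precision loss concrete, here is a minimal standalone sketch. It assumes a field boost of 1, a length norm of 1/sqrt(numTerms), and a one-byte encoding that mirrors what I understand Lucene 4.x's SmallFloat.floatToByte315 / byte315ToFloat to do; the bit layout is re-implemented here by hand, so treat it as an illustration rather than the library's code. The class and method names are made up for the example.

```java
class NormPrecisionDemo {
    // Assumed re-implementation of the one-byte float encoding
    // (modelled on SmallFloat.floatToByte315; not copied from Lucene).
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            // Underflow: zero/negative maps to 0, tiny positives to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // Overflow: clamp to the largest encodable value.
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Inverse of the encoding above; low-order bits are simply lost.
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Length norm for a field with numTerms terms, after the byte round trip.
    static float storedNorm(int numTerms) {
        float exact = (float) (1.0 / Math.sqrt(numTerms));
        return byte315ToFloat(floatToByte315(exact));
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            float exact = (float) (1.0 / Math.sqrt(n));
            System.out.println(n + " terms: exact=" + exact + " stored=" + storedNorm(n));
        }
    }
}
```

Under these assumptions, 3 and 4 terms both come back as a stored norm of 0.5 (matching the fieldNorm in the debug output above), and only at 5 terms does the stored value drop, to 0.4375.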

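Second: "IT" is a stop word in English. You mean "Information Technology", but all the analyzer sees is the common English pronoun. No matter how many stop words you put in the field, they will not affect the length norm.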
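The effect of stop words on field length can be sketched without Lucene at all. The following is a simplified stand-in for the analyzer, not EnglishAnalyzer itself: it lowercases, splits on whitespace, and drops stop words, with no stemming. The stop list is my recollection of Lucene's default English stop set, so treat it as an assumption; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordLengthDemo {
    // Assumption: approximates Lucene's default English stop word set.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
            "such", "that", "the", "their", "then", "there", "these", "they",
            "this", "to", "was", "will", "with"));

    // Lowercase, split on whitespace, and keep only non-stop-word tokens.
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] titles = {
            "Senior Recruitment Consultant",
            "Senior IT Recruitment Consultant",
            "Senior Education Recruitment Consultant",
            "if and but Senior IT IT IT IT IT Recruitment this and that Consultant"
        };
        for (String title : titles) {
            // Number of terms that would count toward the field length.
            System.out.println(analyze(title).size() + " terms <- " + title);
        }
    }
}
```

In this sketch "Senior IT Recruitment Consultant" is left with the same three terms as "Senior Recruitment Consultant", so the two fields have identical length before any norm encoding happens.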

Here is a test showing some results I came up with:

Senior Education Recruitment Consultant ::: 0.732527
Senior IT Recruitment Consultant ::: 0.732527
Senior Recruitment Consultant ::: 0.732527
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.732527
Senior Education Recruitment Consultant Of Justice ::: 0.64096117
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.3662635

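As you can see, with "Senior Education Recruitment Consultant Of Justice" we have added only one more search term, and the lengthNorm starts to make a difference. But with "if and but Senior IT IT IT IT IT Recruitment this and that Consultant" there is still no difference, because all of the added terms are common English stop words.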
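As a rough check on these numbers, the arithmetic can be worked through by hand. This assumes a field boost of 1, a length norm of 1/sqrt(n) for n indexed terms (stop words excluded), a score that is linear in the field norm, and a one-byte encoding that rounds the norm down to the nearest representable value (taken here to be 0.5, 0.4375 and 0.25):

```latex
\begin{align*}
n = 3:\quad & 1/\sqrt{3} \approx 0.577 \;\to\; 0.5 \\
n = 4:\quad & 1/\sqrt{4} = 0.5 \;\to\; 0.5 \\
n = 5:\quad & 1/\sqrt{5} \approx 0.447 \;\to\; 0.4375 \\
n = 11:\quad & 1/\sqrt{11} \approx 0.302 \;\to\; 0.25 \\[4pt]
& 0.732527 \times \frac{0.4375}{0.5} \approx 0.640961 \\
& 0.732527 \times \frac{0.25}{0.5} \approx 0.366264
\end{align*}
```

Here n = 5 corresponds to "Senior Education Recruitment Consultant Of Justice" (with "Of" dropped as a stop word) and n = 11 to the long title (with "and" and "to" dropped). The two products agree with the reported scores of 0.64096117 and 0.3662635.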


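Solution: you can work around the norm precision problem with a custom Similarity implementation, which would not be that hard to code (copy DefaultSimilarity, and implement non-lossy encodeNormValue and decodeNormValue). You could also set up the analyzer with a custom or empty stop word list (via the EnglishAnalyzer constructor).

However, that may be throwing the baby out with the bathwater. If it is really important that exact matches score higher, you are probably better off expressing that in the query itself, by adding the phrase alongside the individual terms, like this: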

"Senior Recruitment Consultant" Senior Recruitment Consultant

Results:

Senior Recruitment Consultant ::: 1.465054
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.732527
Senior Education Recruitment Consultant ::: 0.27469763
Senior IT Recruitment Consultant ::: 0.27469763
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.27469763
Senior Education Recruitment Consultant Of Justice ::: 0.24036042

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29541678/
