django - django-haystack autocomplete returns results that are too broad

Tags: django autocomplete elasticsearch django-haystack

I have created an index with a title_auto field:

class GameIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, model_attr='title')
    title = indexes.CharField(model_attr='title')
    title_auto = indexes.NgramField(model_attr='title')

The Elasticsearch settings look like this:
ELASTICSEARCH_INDEX_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_ngram"],
                    "token_chars": ["letter", "digit"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"]
                }
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 15
                }
            }
        }
    }
}

I am trying to run an autocomplete search, but it returns too many irrelevant results:
qs = SearchQuerySet().models(Game).autocomplete(title_auto=search_phrase)
or
qs = SearchQuerySet().models(Game).filter(title_auto=search_phrase)
Both produce the same output.

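If search_phrase is "monopoly", the first results contain "Monopoly" in their titles. However, although only 2 items are relevant, 51 are returned. The others have nothing to do with "Monopoly".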

So my question is: how can I change the relevance of the results?

Best answer

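It is hard to say for sure without seeing your full mapping, but I suspect the problem is that the same analyzer (one of the two) is being used for both indexing and searching. When you index a document, many ngram terms are created and indexed. If your search text is analyzed the same way at query time, many search terms are generated as well. Since your smallest ngram is a single letter, almost any query will match a lot of documents.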
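To make the term explosion concrete, here is a minimal Python sketch of what an ngram filter with min_gram 1 and max_gram 15 produces. The `ngrams` and `analyze` helpers are hypothetical names written for this illustration, not part of Elasticsearch or Haystack, and they only approximate a real analyzer (lowercasing plus whitespace splitting).

```python
def ngrams(token, min_gram=1, max_gram=15):
    """Return every substring of `token` whose length is in [min_gram, max_gram]."""
    terms = set()
    longest = min(max_gram, len(token))
    for size in range(min_gram, longest + 1):
        for start in range(len(token) - size + 1):
            terms.add(token[start:start + size])
    return terms


def analyze(text, min_gram=1, max_gram=15):
    """Rough stand-in for an ngram analyzer: lowercase, split on whitespace, ngram each token."""
    terms = set()
    for token in text.lower().split():
        terms |= ngrams(token, min_gram, max_gram)
    return terms


query_terms = analyze("poly")
print(sorted(query_terms))
# ['l', 'ly', 'o', 'ol', 'oly', 'p', 'po', 'pol', 'poly', 'y']

# An unrelated title still shares single-letter terms with the query.
print(sorted(query_terms & analyze("democracy")))
# ['o', 'y']
```

A four-letter query already expands to ten search terms, and the single-letter ones are enough to hit titles that have nothing to do with the query.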

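We wrote a blog post about using ngrams for autocomplete that you may find helpful: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. Here, though, I will give a simpler example to illustrate what I mean. I am not very familiar with Haystack, so I probably cannot help you on that side, but I can explain the ngram problem in Elasticsearch.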

First, I will set up an index that uses an ngram analyzer for both indexing and searching:

PUT /test_index
{
   "settings": {
       "number_of_shards": 1,
      "analysis": {
         "filter": {
            "nGram_filter": {
               "type": "nGram",
               "min_gram": 1,
               "max_gram": 15,
               "token_chars": [
                  "letter",
                  "digit",
                  "punctuation",
                  "symbol"
               ]
            }
         },
         "analyzer": {
            "nGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "nGram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
        "doc": {
            "properties": {
                "title": {
                    "type": "string", 
                    "analyzer": "nGram_analyzer"
                }
            }
        }
   }
}

Then add a few documents:
PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"title":"monopoly"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"title":"oligopoly"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"title":"plutocracy"}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"title":"theocracy"}
{"index":{"_index":"test_index","_type":"doc","_id":5}}
{"title":"democracy"}

And run a simple match search for "poly":
POST /test_index/_search
{
    "query": {
        "match": {
           "title": "poly"
        }
    }
}

It returns all five documents:
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 4.729521,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 4.729521,
            "_source": {
               "title": "oligopoly"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 4.3608603,
            "_source": {
               "title": "monopoly"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 1.0197333,
            "_source": {
               "title": "plutocracy"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "4",
            "_score": 0.31496215,
            "_source": {
               "title": "theocracy"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "5",
            "_score": 0.31496215,
            "_source": {
               "title": "democracy"
            }
         }
      ]
   }
}

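This happens because the search term "poly" is tokenized into ngrams that include the single-letter terms "p", "o", "l", and "y". Since the "title" field of every document was also tokenized down to single-letter terms, the query matches every document.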
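The effect can be checked per document with a small Python sketch. It approximates the nGram_analyzer above (whitespace split, lowercase, ngrams of length 1 to 15, without asciifolding); `ngram_terms` is a hypothetical helper, and the code only lists which terms the query and each title have in common, it does not reproduce Elasticsearch scoring.

```python
TITLES = ["monopoly", "oligopoly", "plutocracy", "theocracy", "democracy"]


def ngram_terms(text, min_gram=1, max_gram=15):
    """Approximate the ngram analyzer: lowercase, split on whitespace, emit all ngrams."""
    terms = set()
    for token in text.lower().split():
        longest = min(max_gram, len(token))
        for size in range(min_gram, longest + 1):
            for start in range(len(token) - size + 1):
                terms.add(token[start:start + size])
    return terms


query_terms = ngram_terms("poly")

# Terms that the analyzed query and each analyzed title have in common.
shared = {title: query_terms & ngram_terms(title) for title in TITLES}

for title in TITLES:
    print(title, sorted(shared[title], key=lambda term: (len(term), term)))
```

Every title shares at least "o" and "y" with the query, which is why all five documents come back. "monopoly" and "oligopoly" share all ten query terms, which is consistent with them ranking first.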

If we rebuild the index with this mapping instead (same analyzer definition and same documents):
"mappings": {
  "doc": {
     "properties": {
        "title": {
           "type": "string",
           "index_analyzer": "nGram_analyzer",
           "search_analyzer": "standard"
        }
     }
  }
}

the same query returns the results we expect:
POST /test_index/_search
{
    "query": {
        "match": {
           "title": "poly"
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 1.5108256,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 1.5108256,
            "_source": {
               "title": "monopoly"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 1.5108256,
            "_source": {
               "title": "oligopoly"
            }
         }
      ]
   }
}
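The difference between the two mappings can be simulated in a few lines of Python. This is a sketch under simplifying assumptions: `ngram_analyzer` and `standard_analyzer` are hypothetical stand-ins (the latter just lowercases and splits on whitespace, which is much cruder than the real standard analyzer), and a document counts as a match when it shares at least one term with the analyzed query.

```python
TITLES = ["monopoly", "oligopoly", "plutocracy", "theocracy", "democracy"]


def ngram_analyzer(text, min_gram=1, max_gram=15):
    """Lowercase, split on whitespace, and emit all ngrams of each token."""
    terms = set()
    for token in text.lower().split():
        longest = min(max_gram, len(token))
        for size in range(min_gram, longest + 1):
            for start in range(len(token) - size + 1):
                terms.add(token[start:start + size])
    return terms


def standard_analyzer(text):
    """Crude stand-in for the standard analyzer: whole lowercase words only."""
    return set(text.lower().split())


def search(query, index_analyzer, search_analyzer):
    """Return titles sharing at least one term with the analyzed query."""
    query_terms = search_analyzer(query)
    return [title for title in TITLES if query_terms & index_analyzer(title)]


same_analyzer = search("poly", ngram_analyzer, ngram_analyzer)
split_analyzers = search("poly", ngram_analyzer, standard_analyzer)

print(len(same_analyzer), same_analyzer)      # 5 titles
print(len(split_analyzers), split_analyzers)  # 2 ['monopoly', 'oligopoly']
```

With the query left as the single term "poly", only titles whose indexed ngrams contain that exact term match.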

Edge ngrams work similarly, except that only terms starting at the beginning of each word are used.
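As a rough illustration, edge ngrams anchored at the front of a word are just its prefixes. The `edge_ngrams` helper below is a hypothetical name for this sketch, not an Elasticsearch API.

```python
def edge_ngrams(token, min_gram=1, max_gram=15):
    """Return the prefixes of `token` with length in [min_gram, max_gram]."""
    longest = min(max_gram, len(token))
    return {token[:size] for size in range(min_gram, longest + 1)}


indexed_terms = edge_ngrams("monopoly")
print(sorted(indexed_terms, key=len))
# ['m', 'mo', 'mon', 'mono', 'monop', 'monopo', 'monopol', 'monopoly']

# With an unanalyzed query term, only prefixes of the title match.
print("mono" in indexed_terms)  # True
print("poly" in indexed_terms)  # False
```

So with edge ngrams at index time and a plain search analyzer, typing "mono" finds "monopoly", while a mid-word fragment such as "poly" does not.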

Here is the code I used for this example:

http://sense.qbox.io/gist/b24cbc531b483650c085a42963a49d6a23fa5579

The original question, "django - django-haystack autocomplete returns results that are too broad", is on Stack Overflow: https://stackoverflow.com/questions/29008725/
