Elasticsearch: count occurrences of a term within a document

Tags: elasticsearch

I am new to Elasticsearch and am using version 6.5. My database contains website pages and their content, like this:

Url      Content
abc.com  There is some content about cars here. Lots of cars!
def.com  This page is all about cars.
ghi.com  Here it tells us something about insurances.
jkl.com  Another page about cars and how to buy cars.

I have been able to run a simple query that returns all documents whose content contains the word "cars" (using Python):
current_app.elasticsearch.search(index=index, doc_type=index, 
    body={"query": {"multi_match": {"query": "cars", "fields": ["*"]}}, 
    "from": 0, "size": 100})

The result looks like this:
{'took': 2521, 
'timed_out': False, 
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 
'hits': {'total': 29, 'max_score': 3.0240571, 'hits': [{'_index': 
'pages', '_type': 'pages', '_id': '17277', '_score': 3.0240571, 
'_source': {'content': '....'}}]}}

The "_id" refers to a domain, so I basically get back:
  • abc.com
  • def.com
  • jkl.com

However, I would now like to know how often the search term ("cars") occurs in each document, for example:
  • abc.com:2
  • def.com:1
  • jkl.com:2

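I found several solutions for getting the number of documents that contain the search term, but none that explains how to get the number of occurrences of the term within a document. I also did not find anything in the official documentation, although I am fairly sure it is in there somewhere and I just have not realized that it solves my problem.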

Update:

As suggested by @Curious_MInd, I tried a terms aggregation:
    current_app.elasticsearch.search(index=index, doc_type=index, 
        body={"aggs" : {"cars_count" : {"terms" : { "field" : "Content" 
    }}}})
    

Result:
    {'took': 729, 'timed_out': False, '_shards': {'total': 5, 'successful': 
    5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 48, 'max_score': 1.0, 
    'hits': [{'_index': 'pages', '_type': 'pages', '_id': '17252', 
    '_score': 1.0, '_source': {'content': '...'}}]}, 'aggregations': 
    {'skala_count': {'doc_count_error_upper_bound': 0, 
    'sum_other_doc_count': 0, 'buckets': []}}}
    

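I cannot see where this would show the count per document, but I assume that is because "buckets" is empty? One more note: the results found by the terms aggregation seem noticeably worse than those of the multi_match query. Is there any way to combine the two?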

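Best answer

What you are trying to achieve cannot be done in a single query. The first query filters the documents and returns the IDs of those for which the term needs to be counted.
Let's assume you have the following mapping: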

    {
      "test": {
        "mappings": {
          "_doc": {
            "properties": {
              "details": {
                "type": "text",
                "store": true,
                "term_vector": "with_positions_offsets_payloads"
              },
              "name": {
                "type": "keyword"
              }
            }
          }
        }
      }
    }
    

Assume your query returns the following two documents:
    {
      "hits": {
        "total": 2,
        "max_score": 1,
        "hits": [
          {
            "_index": "test",
            "_type": "_doc",
            "_id": "1",
            "_score": 1,
            "_source": {
              "details": "There is some content about cars here. Lots of cars!",
              "name": "n1"
            }
          },
          {
            "_index": "test",
            "_type": "_doc",
            "_id": "2",
            "_score": 1,
            "_source": {
              "details": "This page is all about cars",
              "name": "n2"
            }
          }
        ]
      }
    }
    

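From the response above you can collect the IDs of all documents that matched your query. Here we have "_id": "1" and "_id": "2".
Now we use the _mtermvectors API to get the frequency (count) of each term in the given field: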
    test/_doc/_mtermvectors
    {
      "docs": [
        {
          "_id": "1",
          "fields": [
            "details"
          ]
        },
        {
          "_id": "2",
          "fields": [
            "details"
          ]
        }
      ]
    }
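Since the question uses Python, the request body above could be built directly from the search hits. This is only a minimal sketch: `build_mtermvectors_body` is a hypothetical helper name, and the sample response is a trimmed copy of the one shown earlier.

```python
def build_mtermvectors_body(search_response, field):
    """Build an _mtermvectors request body from a search response.

    One entry is created per hit, asking for the term vectors of `field`.
    (Hypothetical helper, not part of the Elasticsearch client.)
    """
    return {
        "docs": [
            {"_id": hit["_id"], "fields": [field]}
            for hit in search_response["hits"]["hits"]
        ]
    }


# Trimmed-down copy of the search response from the answer.
search_response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "test", "_type": "_doc", "_id": "1"},
            {"_index": "test", "_type": "_doc", "_id": "2"},
        ],
    }
}

body = build_mtermvectors_body(search_response, "details")
print(body)
```

With the 6.x Python client, the body would presumably be sent with something like `es.mtermvectors(index="test", doc_type="_doc", body=body)`, where `es` is assumed to be an `Elasticsearch` client instance.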
    

The _mtermvectors request returns the following result:
    {
      "docs": [
        {
          "_index": "test",
          "_type": "_doc",
          "_id": "1",
          "_version": 1,
          "found": true,
          "took": 8,
          "term_vectors": {
            "details": {
              "field_statistics": {
                "sum_doc_freq": 15,
                "doc_count": 2,
                "sum_ttf": 16
              },
              "terms": {
                ....
                ,
                "cars": {
                  "term_freq": 2,
                  "tokens": [
                    {
                      "position": 5,
                      "start_offset": 28,
                      "end_offset": 32
                    },
                    {
                      "position": 9,
                      "start_offset": 47,
                      "end_offset": 51
                    }
                  ]
                },
                ....
              }
            }
          }
        },
        {
          "_index": "test",
          "_type": "_doc",
          "_id": "2",
          "_version": 1,
          "found": true,
          "took": 2,
          "term_vectors": {
            "details": {
              "field_statistics": {
                "sum_doc_freq": 15,
                "doc_count": 2,
                "sum_ttf": 16
              },
              "terms": {
                ....
                ,
                "cars": {
                  "term_freq": 1,
                  "tokens": [
                    {
                      "position": 5,
                      "start_offset": 23,
                      "end_offset": 27
                    }
                  ]
                },
                    ....
                  }
            }
          }
        }
      ]
    }
    

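Note that the term vectors API returns details for every term in the field, so I used .... to stand in for the data of the other terms.
You can extract the information for the term you need from the response above. Here it is shown for cars, and the field you are interested in is term_freq.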
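Pulling term_freq out of that response could look roughly like the sketch below. `term_counts` is a hypothetical helper name, and the sample response is reduced to the keys the helper reads. The terms in a term vector are analyzed tokens, so the lookup term is assumed to already be in its analyzed form (for example lowercase if the field uses the standard analyzer).

```python
def term_counts(mtermvectors_response, field, term):
    """Map each document ID to the term_freq of `term` in `field`.

    Documents that do not contain the term (or have no term vectors
    for the field) get a count of 0. Hypothetical helper for illustration.
    """
    counts = {}
    for doc in mtermvectors_response["docs"]:
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        counts[doc["_id"]] = terms.get(term, {}).get("term_freq", 0)
    return counts


# Reduced copy of the _mtermvectors response from the answer.
response = {
    "docs": [
        {
            "_id": "1",
            "found": True,
            "term_vectors": {"details": {"terms": {"cars": {"term_freq": 2}}}},
        },
        {
            "_id": "2",
            "found": True,
            "term_vectors": {"details": {"terms": {"cars": {"term_freq": 1}}}},
        },
    ]
}

counts = term_counts(response, "details", "cars")
print(counts)  # {'1': 2, '2': 1}
```

The resulting dictionary is keyed by document ID, which in the question's setup corresponds to a domain.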

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/53571702/
