elasticsearch - value_count instead of doc_count for duplicate values in an array

Tags: elasticsearch elasticsearch-aggregation

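If you have an array field with duplicate values, you can run a terms aggregation on it, but that gives you the doc_count (the number of documents containing each value). Is there a way to get the number of occurrences of each value (a value_count) instead of the doc_count?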

For example, if you have 2 documents containing the following arrays: ["A", "A", "A", "B", "B", "C"] and ["A", "C", "C", "D"]
a terms aggregation will give:

"A": 2,
"B": 1,
"C": 2,
"D": 1,

But is there a way to get the following result?
"A": 4,
"B": 2,
"C": 3,
"D": 1,
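
To make the difference between the two counts concrete, here is a small Python sketch (not part of the original question) that counts the two example arrays both ways. It runs entirely on the client and only illustrates the arithmetic; it does not call Elasticsearch.

```python
from collections import Counter

# The two example documents from the question.
docs = [
    ["A", "A", "A", "B", "B", "C"],
    ["A", "C", "C", "D"],
]

# doc_count: number of documents in which each value appears at least once.
doc_count = Counter(value for doc in docs for value in set(doc))

# Occurrence count: number of times each value appears across all documents.
value_count = Counter(value for doc in docs for value in doc)

print(dict(sorted(doc_count.items())))    # {'A': 2, 'B': 1, 'C': 2, 'D': 1}
print(dict(sorted(value_count.items())))  # {'A': 4, 'B': 2, 'C': 3, 'D': 1}
```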

Best answer

I can think of two ways to achieve this.

  • Use a nested type and compute each value's count on the client before indexing

    Mapping:
    PUT index82
    {
      "mappings": {
        "properties": {
          "tags": {
            "type": "nested",
            "properties": {
              "name": {
                "type": "keyword"
              },
              "count": {
                "type": "integer"
              }
            }
          }
        }
      }
    }
    

    Data:
    "hits" : [
          {
            "_index" : "index82",
            "_type" : "_doc",
            "_id" : "Uky-UnEB8es6kpJsB_Ak",
            "_score" : 1.0,
            "_source" : {
              "tags" : [
                {
                  "name" : "A",
                  "count" : 4
                },
                {
                  "name" : "B",
                  "count" : 2
                },
                {
                  "name" : "C",
                  "count" : 1
                }
              ]
            }
          },
          {
            "_index" : "index82",
            "_type" : "_doc",
            "_id" : "U0y-UnEB8es6kpJsNfCB",
            "_score" : 1.0,
            "_source" : {
              "tags" : [
                {
                  "name" : "A",
                  "count" : 1
                },
                {
                  "name" : "C",
                  "count" : 2
                },
                {
                  "name" : "D",
                  "count" : 1
                }
              ]
            }
          }
        ]
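
Documents in this shape have to be built on the client from the raw arrays. A minimal Python sketch of that step follows; the helper name to_nested_tags is hypothetical and not part of the original answer, and the resulting dict would still need to be sent to the index with whatever client you use.

```python
from collections import Counter


def to_nested_tags(values):
    """Collapse raw tag values into the nested name/count form used by the mapping."""
    counts = Counter(values)  # keeps first-seen order of the values
    return {"tags": [{"name": name, "count": n} for name, n in counts.items()]}


# The second example array from the question.
doc = to_nested_tags(["A", "C", "C", "D"])
print(doc)
# {'tags': [{'name': 'A', 'count': 1}, {'name': 'C', 'count': 2}, {'name': 'D', 'count': 1}]}
```

Storing the count once per distinct value is what lets the sum sub-aggregation add the counts up on the server side.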
    

    Query:
    GET index82/_search
    {
      "aggs": {
        "tags": {
          "nested": {
            "path": "tags"
          },
          "aggs": {
            "tag": {
              "terms": {
                "field": "tags.name",
                "size": 10
              },
              "aggs": {
                "count": {
                  "sum": {
                    "field": "tags.count"
                  }
                }
              }
            }
          }
        }
      }
    }
    

    Result:
    "aggregations" : {
        "tags" : {
          "doc_count" : 6,
          "tag" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "A",
                "doc_count" : 2,
                "count" : {
                  "value" : 5.0
                }
              },
              {
                "key" : "C",
                "doc_count" : 2,
                "count" : {
                  "value" : 3.0
                }
              },
              {
                "key" : "B",
                "doc_count" : 1,
                "count" : {
                  "value" : 2.0
                }
              },
              {
                "key" : "D",
                "doc_count" : 1,
                "count" : {
                  "value" : 1.0
                }
              }
            ]
          }
        }
      }
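
In this response the per-value total is in count.value of each bucket, not in doc_count. As a sketch, assuming the search response has already been parsed into a Python dict, the totals could be pulled out like this (the function name value_counts is hypothetical):

```python
def value_counts(response):
    """Map each bucket key to the summed count from the nested terms aggregation."""
    buckets = response["aggregations"]["tags"]["tag"]["buckets"]
    # The sum aggregation returns a float; the stored counts are integers.
    return {bucket["key"]: int(bucket["count"]["value"]) for bucket in buckets}


# Trimmed copy of the response shown above.
response = {
    "aggregations": {
        "tags": {
            "doc_count": 6,
            "tag": {
                "buckets": [
                    {"key": "A", "doc_count": 2, "count": {"value": 5.0}},
                    {"key": "C", "doc_count": 2, "count": {"value": 3.0}},
                    {"key": "B", "doc_count": 1, "count": {"value": 2.0}},
                    {"key": "D", "doc_count": 1, "count": {"value": 1.0}},
                ]
            },
        }
    }
}

print(value_counts(response))  # {'A': 5, 'C': 3, 'B': 2, 'D': 1}
```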
    
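  • Use term vectors

    The term vectors API takes document IDs as input and returns statistics for each document. You therefore need to fetch the document IDs on the client, send them in a multi term vectors request, and sum the term frequencies (term_freq) on the client.
    Query: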
    POST index81/_mtermvectors
    {
        "docs": [
          {
             "_id": "TUy3UnEB8es6kpJsfPC4",
             "term_statistics": true,
             "fields": [
                "tags"
             ]
          },
          {
             "_id": "Tky3UnEB8es6kpJshfAn",
             "term_statistics": true,
             "fields": [
                "tags"
             ]
          }
        ]
    }
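
Since the request needs one entry per document ID, the body is easier to generate than to write by hand. A minimal Python sketch, assuming the IDs have already been collected on the client (the helper name build_mtermvectors_body is hypothetical):

```python
import json


def build_mtermvectors_body(doc_ids, field="tags"):
    """Build a multi term vectors request body with one entry per document ID."""
    return {
        "docs": [
            {"_id": doc_id, "term_statistics": True, "fields": [field]}
            for doc_id in doc_ids
        ]
    }


body = build_mtermvectors_body(["TUy3UnEB8es6kpJsfPC4", "Tky3UnEB8es6kpJshfAn"])
print(json.dumps(body, indent=2))
```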
    
    

    Result:
      "docs" : [
        {
          "_index" : "index81",
          "_type" : "_doc",
          "_id" : "TUy3UnEB8es6kpJsfPC4",
          "_version" : 1,
          "found" : true,
          "took" : 5,
          "term_vectors" : {
            "tags" : {
              "field_statistics" : {
                "sum_doc_freq" : 6,
                "doc_count" : 2,
                "sum_ttf" : 10
              },
              "terms" : {
                "a" : {
                  "doc_freq" : 2,
                  "ttf" : 4,
                  "term_freq" : 3,
                  "tokens" : [
                    {
                      "position" : 0,
                      "start_offset" : 0,
                      "end_offset" : 1
                    },
                    {
                      "position" : 101,
                      "start_offset" : 2,
                      "end_offset" : 3
                    },
                    {
                      "position" : 202,
                      "start_offset" : 4,
                      "end_offset" : 5
                    }
                  ]
                },
                "b" : {
                  "doc_freq" : 1,
                  "ttf" : 2,
                  "term_freq" : 2,
                  "tokens" : [
                    {
                      "position" : 303,
                      "start_offset" : 6,
                      "end_offset" : 7
                    },
                    {
                      "position" : 404,
                      "start_offset" : 8,
                      "end_offset" : 9
                    }
                  ]
                },
                "c" : {
                  "doc_freq" : 2,
                  "ttf" : 3,
                  "term_freq" : 1,
                  "tokens" : [
                    {
                      "position" : 505,
                      "start_offset" : 10,
                      "end_offset" : 11
                    }
                  ]
                }
              }
            }
          }
        },
        {
          "_index" : "index81",
          "_type" : "_doc",
          "_id" : "Tky3UnEB8es6kpJshfAn",
          "_version" : 1,
          "found" : true,
          "took" : 2,
          "term_vectors" : {
            "tags" : {
              "field_statistics" : {
                "sum_doc_freq" : 6,
                "doc_count" : 2,
                "sum_ttf" : 10
              },
              "terms" : {
                "a" : {
                  "doc_freq" : 2,
                  "ttf" : 4,
                  "term_freq" : 1,
                  "tokens" : [
                    {
                      "position" : 0,
                      "start_offset" : 0,
                      "end_offset" : 1
                    }
                  ]
                },
                "c" : {
                  "doc_freq" : 2,
                  "ttf" : 3,
                  "term_freq" : 2,
                  "tokens" : [
                    {
                      "position" : 101,
                      "start_offset" : 2,
                      "end_offset" : 3
                    },
                    {
                      "position" : 202,
                      "start_offset" : 4,
                      "end_offset" : 5
                    }
                  ]
                },
                "d" : {
                  "doc_freq" : 1,
                  "ttf" : 1,
                  "term_freq" : 1,
                  "tokens" : [
                    {
                      "position" : 303,
                      "start_offset" : 6,
                      "end_offset" : 7
                    }
                  ]
                }
              }
            }
          }
        }
      ]
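
The final step, summing term_freq across the returned documents, happens on the client. A sketch in Python, assuming the response has been parsed into a dict (the function name sum_term_freq is hypothetical):

```python
from collections import Counter


def sum_term_freq(response, field="tags"):
    """Add up term_freq per term over every document in the response."""
    totals = Counter()
    for doc in response["docs"]:
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            totals[term] += stats["term_freq"]
    return totals


# Trimmed copy of the response shown above (only the fields used here).
response = {
    "docs": [
        {"_id": "TUy3UnEB8es6kpJsfPC4", "term_vectors": {"tags": {"terms": {
            "a": {"term_freq": 3}, "b": {"term_freq": 2}, "c": {"term_freq": 1}}}}},
        {"_id": "Tky3UnEB8es6kpJshfAn", "term_vectors": {"tags": {"terms": {
            "a": {"term_freq": 1}, "c": {"term_freq": 2}, "d": {"term_freq": 1}}}}},
    ]
}

print(dict(sum_term_freq(response)))  # {'a': 4, 'b': 2, 'c': 3, 'd': 1}
```

The keys come back in lowercase here, which suggests the tags field in index81 is an analyzed text field rather than a keyword. The ttf values in the response above match these sums because the request covers both documents; as far as I know ttf is a shard-level statistic, so summing term_freq is the safer choice when only a subset of documents should be counted.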
    

    Source: the original question on Stack Overflow, https://stackoverflow.com/questions/61069648/
