elasticsearch - Grouping by two fields in Elasticsearch, then filtering or sorting

Tags: elasticsearch

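There is a shareholder index, and I want to get the following information:

  • Which holder has invested in the same company the most times

    select hld_id, com_id, count(*) from shareholder group by hld_id, com_id order by count(*) desc;
  • Which holder has invested in a company exactly twice (possibly a duplicated record)

    select hld_id, com_id from shareholder group by hld_id, com_id having count(*) = 2;

  • So how can the requirements above be implemented with an Elasticsearch search query?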

    Best answer

    Below are a sample mapping, sample documents, and the aggregation queries. I have come up with three ways to achieve this.

    Mapping:

    PUT shareholder
    {
      "mappings": {
        "properties": {
          "hld_id": {
            "type": "keyword"
          },
          "com_id":{
            "type": "keyword"
          }
        }
      }
    }
    

    Documents:
    POST shareholder/_doc/1
    {
      "hld_id": "001",
      "com_id": "001"
    }
    
    POST shareholder/_doc/2
    {
      "hld_id": "001",
      "com_id": "002"
    }
    
    POST shareholder/_doc/3
    {
      "hld_id": "002",
      "com_id": "001"
    }
    
    POST shareholder/_doc/4
    {
      "hld_id": "002",
      "com_id": "002"
    }
    
    POST shareholder/_doc/5
    {
      "hld_id": "002",
      "com_id": "002"               <--- Note I've changed this 
    }
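
As a sanity check on the sample data, the expected result of both requirements can be worked out by hand. The following is a minimal Python sketch and not part of the original answer: the `docs` list simply mirrors the five documents above, and the variable names are illustrative.

```python
from collections import Counter

# The five sample documents indexed above (documents 4 and 5 are duplicates).
docs = [
    {"hld_id": "001", "com_id": "001"},
    {"hld_id": "001", "com_id": "002"},
    {"hld_id": "002", "com_id": "001"},
    {"hld_id": "002", "com_id": "002"},
    {"hld_id": "002", "com_id": "002"},
]

# GROUP BY hld_id, com_id with count(*)
pair_counts = Counter((d["hld_id"], d["com_id"]) for d in docs)

# ORDER BY count(*) DESC
ranked = pair_counts.most_common()

# HAVING count(*) = 2
exactly_twice = [pair for pair, n in pair_counts.items() if n == 2]

print(ranked[0])
print(exactly_twice)
```

Any of the three solutions below should agree with these numbers: the pair (002, 002) occurs twice and every other pair occurs once.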
    

    Solution 1: Using Elasticsearch aggregations

    Aggregation query 1:

    Note that I have simply nested two Terms aggregations, first on hld_id and then on com_id.
    POST shareholder/_search
    {
      "size": 0,
      "aggs": {
        "share_hoder": {
          "terms": {
            "field": "hld_id"
          },
          "aggs": {
            "com_aggs": {
              "terms": {
                "field": "com_id",
                "order": {
                  "_count": "desc"
                }
              }
            }
          }
        }
      }
    }
    

    Here is what the response looks like:

    Response:
     {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 5,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "share_hoder" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "002",
              "doc_count" : 3,
              "com_aggs" : {
                "doc_count_error_upper_bound" : 0,
                "sum_other_doc_count" : 0,
                "buckets" : [
                  {
                    "key" : "002",
                    "doc_count" : 2                    <---- Count you are looking for
                  },
                  {
                    "key" : "001",
                    "doc_count" : 1
                  }
                ]
              }
            },
            {
              "key" : "001",
              "doc_count" : 2,
              "com_aggs" : {
                "doc_count_error_upper_bound" : 0,
                "sum_other_doc_count" : 0,
                "buckets" : [
                  {
                    "key" : "001",
                    "doc_count" : 1
                  },
                  {
                    "key" : "002",
                    "doc_count" : 1
                  }
                ]
              }
            }
          ]
        }
      }
    }
    

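    Of course, because of the way Elasticsearch aggregations work, you may not get the result in exactly the representation you want: the counts come back as nested buckets rather than as flat rows.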
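
If a flat, SQL-like row layout is needed, the nested buckets can be flattened on the client. This is a minimal Python sketch that assumes the response has already been parsed into a dict. `flatten_buckets` is a hypothetical helper, and `response` is a trimmed copy of the aggregation part of the response above.

```python
def flatten_buckets(aggregations, outer="share_hoder", inner="com_aggs"):
    """Turn nested terms buckets into (hld_id, com_id, count) rows.

    The default `outer` name matches the aggregation name used in the
    query above, including its spelling.
    """
    rows = []
    for holder in aggregations[outer]["buckets"]:
        for company in holder[inner]["buckets"]:
            rows.append((holder["key"], company["key"], company["doc_count"]))
    # Highest count first, like ORDER BY count(*) DESC.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Trimmed copy of the response shown above.
response = {
    "aggregations": {
        "share_hoder": {
            "buckets": [
                {"key": "002", "doc_count": 3, "com_aggs": {"buckets": [
                    {"key": "002", "doc_count": 2},
                    {"key": "001", "doc_count": 1},
                ]}},
                {"key": "001", "doc_count": 2, "com_aggs": {"buckets": [
                    {"key": "001", "doc_count": 1},
                    {"key": "002", "doc_count": 1},
                ]}},
            ]
        }
    }
}

rows = flatten_buckets(response["aggregations"])
for row in rows:
    print(row)
```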

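    Aggregation query 2:

    Most of this is the same as aggregation query 1: it uses the same two nested Terms aggregations (the outer one is now ordered by key, descending). In addition, it uses a Bucket Selector Aggregation whose script keeps only the com_id buckets with a document count of exactly 2, which is the count(*) = 2 condition.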
    POST shareholder/_search
    {
      "size": 0,
      "aggs": {
        "share_holder": {
          "terms": {
            "field": "hld_id",
            "order": {
              "_key": "desc"
            }
          },
          "aggs": {
            "com_aggs": {
              "terms": {
                "field": "com_id"
              },
              "aggs": {
                "count_filter":{
                  "bucket_selector": {
                    "buckets_path": {
                      "count_path": "_count"
                    },
                    "script": "params.count_path == 2"
                  }
                }
              }
            }
          }
        }
      }
    }
    

    Here is what the response looks like.

    Response:
    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 5,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "share_holder" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "002",
              "doc_count" : 3,
              "com_aggs" : {
                "doc_count_error_upper_bound" : 0,
                "sum_other_doc_count" : 0,
                "buckets" : [
                  {
                    "key" : "002",                   
                    "doc_count" : 2                     <---- Count == 2
                  }
                ]
              }
            },
            {
              "key" : "001",
              "doc_count" : 2,
              "com_aggs" : {
                "doc_count_error_upper_bound" : 0,
                "sum_other_doc_count" : 0,
                "buckets" : [ ]
              }
            }
          ]
        }
      }
    }
    

    Note that the second bucket is empty. I tried to see whether the query above could be filtered so that "key": "001" would not appear in the first place.
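
One workaround, offered as a sketch rather than as the answer's method, is to drop the empty holders on the client after the response arrives. `drop_empty_holders` is a hypothetical helper, and `aggregations` is a trimmed copy of the response above.

```python
def drop_empty_holders(aggregations, outer="share_holder", inner="com_aggs"):
    """Keep only holder buckets that still contain at least one com_id bucket."""
    return [
        bucket
        for bucket in aggregations[outer]["buckets"]
        if bucket[inner]["buckets"]
    ]


# Trimmed copy of the aggregation part of the response shown above.
aggregations = {
    "share_holder": {
        "buckets": [
            {"key": "002", "doc_count": 3, "com_aggs": {"buckets": [
                {"key": "002", "doc_count": 2},
            ]}},
            {"key": "001", "doc_count": 2, "com_aggs": {"buckets": []}},
        ]
    }
}

kept = drop_empty_holders(aggregations)
print([bucket["key"] for bucket in kept])
```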

    Solution 2: Using Elasticsearch SQL

    If you have the X-Pack version of Kibana, you can run the following queries in a SQL-like style:

    Query 1:
    POST /_sql?format=txt
    {
        "query": "SELECT hld_id, com_id, count(*) FROM shareholder GROUP BY hld_id, com_id ORDER BY count(*) desc"
    }
    

    Response:
        hld_id     |    com_id     |   count(*)    
    ---------------+---------------+---------------
    002            |002            |2              
    001            |001            |1              
    001            |002            |1              
    002            |001            |1              
    

    Query 2:
    POST /_sql?format=txt
    {
        "query": "SELECT hld_id, com_id FROM shareholder GROUP BY hld_id, com_id HAVING count(*) = 2"
    }
    

    Response:
        hld_id     |    com_id     
    ---------------+---------------
    002            |002            
    

    Solution 3: Using a script in a Terms aggregation

    Aggregation query:
    POST shareholder/_search
    {
      "size": 0,
      "aggs": {
        "query_groupby_count": {
          "terms": {
            "script": {
              "source": """
                  doc['hld_id'].value + ", " + doc['com_id'].value
                """
            }
          }
        },
        "query_groupby_count_equals_2": {
          "terms": {
            "script": {
              "source": """
                  doc['hld_id'].value + ", " + doc['com_id'].value
                """
            }
          },
          "aggs": {
            "myaggs": {
              "bucket_selector": {
                "buckets_path": {
                  "count": "_count"
                },
                "script": "params.count == 2"
              }
    
            }
          }
        }
    
      }
    }
    

    Response:
    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 5,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "query_groupby_count_equals_2" : {               <---- Group By Query For Count == 2
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "002, 002",
              "doc_count" : 2
            }
          ]
        },
        "query_groupby_count" : {                        <---- Group By Query
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "002, 002",
              "doc_count" : 2
            },
            {
              "key" : "001, 001",
              "doc_count" : 1
            },
            {
              "key" : "001, 002",
              "doc_count" : 1
            },
            {
              "key" : "002, 001",
              "doc_count" : 1
            }
          ]
        }
      }
    }
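
The bucket keys in this approach are plain strings built by the script, so the two ids have to be separated again on the client. This is a minimal Python sketch, assuming the ", " separator used in the script never occurs inside an id. `split_bucket` is a hypothetical helper, and `buckets` copies the `query_groupby_count` buckets from the response above.

```python
def split_bucket(bucket, sep=", "):
    """Split a scripted 'hld_id, com_id' key back into its two parts."""
    hld_id, com_id = bucket["key"].split(sep, 1)
    return hld_id, com_id, bucket["doc_count"]


# Buckets copied from the query_groupby_count aggregation above.
buckets = [
    {"key": "002, 002", "doc_count": 2},
    {"key": "001, 001", "doc_count": 1},
    {"key": "001, 002", "doc_count": 1},
    {"key": "002, 001", "doc_count": 1},
]

rows = [split_bucket(b) for b in buckets]
print(rows[0])
```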
    

    Using curl:

    First, let's save the query in a .txt or .json file.

    For example, I created a file named query.json and simply copied and pasted the query into it.
    {
        "query": "SELECT hld_id, com_id, count(*) FROM shareholder GROUP BY hld_id, com_id ORDER BY count(*) desc"
    }
    

    Now run the following curl command, referencing the file as shown below:
    curl -XGET "http://localhost:9200/_sql?format=txt" -H "Content-Type: application/json" -d @query.json
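
The same request can be sent from a script. This is a minimal Python sketch using only the standard library. It assumes an unsecured cluster at http://localhost:9200, as in the curl command, and uses POST as in the Kibana examples. `build_sql_request` and `run_sql` are hypothetical helper names.

```python
import json
import urllib.request


def build_sql_request(query, base_url="http://localhost:9200"):
    """Build a POST request for the _sql endpoint with a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_sql?format=txt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(query, base_url="http://localhost:9200"):
    """Send the query and return the plain-text table (needs a running cluster)."""
    with urllib.request.urlopen(build_sql_request(query, base_url)) as resp:
        return resp.read().decode("utf-8")


request = build_sql_request(
    "SELECT hld_id, com_id, count(*) FROM shareholder "
    "GROUP BY hld_id, com_id ORDER BY count(*) desc"
)
print(request.get_method(), request.full_url)
```

Calling `run_sql(...)` with the same query string should return the same text table as the curl command, provided the cluster is reachable without authentication.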
    

    Hope this helps!

    Source: a similar question, "elasticsearch - Grouping by two fields in Elasticsearch, then filtering or sorting", on Stack Overflow: https://stackoverflow.com/questions/58568986/
