elasticsearch - How to count distinct values whose minimum date reaches a given date

I have documents with the following structure (simplified):

curl -XPOST "http://localhost:9200/test/aggtest/1" -d "{
    \"user_id\": 123,
    \"date_created\": \"2015-05-12T10:29:49-04:00\"
}"

curl -XPOST "http://localhost:9200/test/aggtest/2" -d "{
    \"user_id\": 123,
    \"date_created\": \"2014-05-12T10:29:49-04:00\"
}"

curl -XPOST "http://localhost:9200/test/aggtest/3" -d "{
    \"user_id\": 123,
    \"date_created\": \"2013-05-12T10:29:49-04:00\"
}"

curl -XPOST "http://localhost:9200/test/aggtest/4" -d "{
    \"user_id\": 456,
    \"date_created\": \"2015-05-12T10:29:49-04:00\"
}"

curl -XPOST "http://localhost:9200/test/aggtest/5" -d "{
    \"user_id\": 456,
    \"date_created\": \"2012-05-12T10:29:49-04:00\"
}"

curl -XPOST "http://localhost:9200/test/aggtest/6" -d "{
    \"user_id\": 456,
    \"date_created\": \"2011-05-12T10:29:49-04:00\"
}"

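How can I get a count of the user_ids that were first created before or after a specific date? For example, in the documents above, only one unique user_id (123) has all of its records after 2012.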

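In MongoDB this is quite simple. Using the aggregation framework, I can transform the documents into one entry per unique user_id together with its minimum creation date, then simply filter those entries by date and count them. I have not been able to write an equivalent query in Elasticsearch. Any help is appreciated.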
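To make the intended result concrete, here is a minimal sketch in plain Python of the "minimum date per user, then filter and count" logic described above. It is not Elasticsearch or MongoDB code; the function name and the DOCS list are made up for illustration, and the cutoff of 2012-01-01 UTC is an assumed reading of "after 2012".

```python
from datetime import datetime

# The six sample documents from the question, as (user_id, date_created) pairs.
DOCS = [
    (123, "2015-05-12T10:29:49-04:00"),
    (123, "2014-05-12T10:29:49-04:00"),
    (123, "2013-05-12T10:29:49-04:00"),
    (456, "2015-05-12T10:29:49-04:00"),
    (456, "2012-05-12T10:29:49-04:00"),
    (456, "2011-05-12T10:29:49-04:00"),
]


def count_users_first_seen_on_or_after(docs, cutoff):
    """Count distinct user_ids whose earliest date_created is >= cutoff."""
    earliest = {}
    for user_id, raw_date in docs:
        created = datetime.fromisoformat(raw_date)
        # Keep only the minimum date seen so far for each user.
        if user_id not in earliest or created < earliest[user_id]:
            earliest[user_id] = created
    return sum(1 for first_seen in earliest.values() if first_seen >= cutoff)


# Assumed cutoff: start of 2012 in UTC.
cutoff = datetime.fromisoformat("2012-01-01T00:00:00+00:00")
print(count_users_first_seen_on_or_after(DOCS, cutoff))
```

With the sample data, user 123 first appears in 2013 and user 456 in 2011, so only user 123 is counted.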

In SQL, the query would look like this:

 SELECT COUNT(DISTINCT(user_id)) FROM aggtest WHERE date_created >= 2015 AND user_id NOT IN (SELECT user_id FROM aggtest WHERE date_created < 2015)

Best answer

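From reading the comments, I think I understand what you are asking, although it was not entirely clear from the original question.

It sounds like you want to find the unique user_ids that have a date_created on or after some date, but none before it.

I could not think of a way to do this with your current data structure, but if you are willing to restructure your data using a parent/child relationship, setting up the query you need becomes quite simple.

To test it, I set up an index with two types, as follows: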

PUT /test_index
{
   "mappings": {
      "user": {
         "_id": {
            "path": "user_id"
         },
         "properties": {
            "user_id": {
               "type": "integer"
            }
         }
      },
      "creation_date": {
         "_parent": {
            "type": "user"
         }, 
         "properties": {
            "date_created": {
               "type": "date",
               "format": "dateOptionalTime"
            }
         }
      }
   }
}

Then I indexed the data you provided using the new schema:

POST /test_index/_bulk
{"index":{"_type":"user"}}
{"user_id":123}
{"index":{"_type":"creation_date","_parent":123}}
{"date_created":"2015-05-12T10:29:49-04:00"}
{"index":{"_type":"creation_date","_parent":123}}
{"date_created":"2014-05-12T10:29:49-04:00"}
{"index":{"_type":"creation_date","_parent":123}}
{"date_created":"2013-05-12T10:29:49-04:00"}
{"index":{"_type":"user"}}
{"user_id":456}
{"index":{"_type":"creation_date","_parent":456}}
{"date_created":"2015-05-12T10:29:49-04:00"}
{"index":{"_type":"creation_date","_parent":456}}
{"date_created":"2012-05-12T10:29:49-04:00"}
{"index":{"_type":"creation_date","_parent":456}}
{"date_created":"2011-05-12T10:29:49-04:00"}
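If the existing flat documents need to be converted into this parent/child layout, the bulk body could be generated rather than written by hand. The following is a rough Python sketch under that assumption; build_bulk_lines and DOCS are hypothetical names, and the code only builds the text lines shown above, it does not send anything to Elasticsearch.

```python
import json

# The flat sample documents from the question, as (user_id, date_created) pairs.
DOCS = [
    (123, "2015-05-12T10:29:49-04:00"),
    (123, "2014-05-12T10:29:49-04:00"),
    (123, "2013-05-12T10:29:49-04:00"),
    (456, "2015-05-12T10:29:49-04:00"),
    (456, "2012-05-12T10:29:49-04:00"),
    (456, "2011-05-12T10:29:49-04:00"),
]


def build_bulk_lines(docs):
    """Turn flat (user_id, date_created) pairs into _bulk action/source lines."""
    compact = {"separators": (",", ":")}  # no spaces, one JSON object per line
    lines = []
    seen_users = set()
    for user_id, date_created in docs:
        if user_id not in seen_users:
            # Emit the parent "user" document once per distinct user_id.
            seen_users.add(user_id)
            lines.append(json.dumps({"index": {"_type": "user"}}, **compact))
            lines.append(json.dumps({"user_id": user_id}, **compact))
        # Emit one child "creation_date" document per original record.
        action = {"index": {"_type": "creation_date", "_parent": user_id}}
        lines.append(json.dumps(action, **compact))
        lines.append(json.dumps({"date_created": date_created}, **compact))
    return lines


# A bulk request body is newline-delimited and must end with a newline.
bulk_body = "\n".join(build_bulk_lines(DOCS)) + "\n"
print(bulk_body, end="")
```

For the six sample documents this produces 16 lines: two for each of the two users and two for each of the six dates.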

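Now I can get back what you asked for (assuming I understood correctly) with the following query. In other words, I filter for the (parent) user documents that have at least one (child) creation_date greater than or equal to "2012-05-12" and no (child) creation_date less than "2012-05-12", and then I return those ids in an aggregation. The aggregation is somewhat redundant here, but I assume your real index is more complex, so not returning the full user documents may be useful: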

POST /test_index/user/_search?search_type=count
{
   "query": {
      "filtered": {
         "filter": {
            "bool": {
               "must": [
                  {
                     "has_child": {
                        "type": "creation_date",
                        "filter": {
                           "range": {
                              "date_created": {
                                 "gte": "2012-05-12"
                              }
                           }
                        }
                     }
                  },
                  {
                     "not": {
                        "filter": {
                           "has_child": {
                              "type": "creation_date",
                              "filter": {
                                 "range": {
                                    "date_created": {
                                       "lt": "2012-05-12"
                                    }
                                 }
                              }
                           }
                        }
                     }
                  }
               ]
            }
         }
      }
   },
   "aggs": {
      "distinct_user_ids": {
         "terms": {
            "field": "user_id"
         }
      }
   }
}

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "distinct_user_ids": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": 123,
               "doc_count": 1
            }
         ]
      }
   }
}
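As a sanity check on the filter logic, the two has_child clauses can be mirrored in plain Python. This is only an illustrative sketch: CHILDREN is a hypothetical in-memory stand-in for the parent/child index, matching_users is a made-up name, and the cutoff "2012-05-12" is assumed to mean midnight UTC on that day.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the index: parent user_id -> child date_created values.
CHILDREN = {
    123: [
        "2015-05-12T10:29:49-04:00",
        "2014-05-12T10:29:49-04:00",
        "2013-05-12T10:29:49-04:00",
    ],
    456: [
        "2015-05-12T10:29:49-04:00",
        "2012-05-12T10:29:49-04:00",
        "2011-05-12T10:29:49-04:00",
    ],
}


def matching_users(children, cutoff):
    """Return user_ids with a child date >= cutoff and no child date < cutoff."""
    matches = []
    for user_id, raw_dates in children.items():
        dates = [datetime.fromisoformat(raw) for raw in raw_dates]
        has_on_or_after = any(d >= cutoff for d in dates)  # first has_child clause
        has_before = any(d < cutoff for d in dates)        # has_child wrapped in "not"
        if has_on_or_after and not has_before:
            matches.append(user_id)
    return sorted(matches)


# Assumption: the date-only cutoff is interpreted as midnight UTC.
cutoff = datetime(2012, 5, 12, tzinfo=timezone.utc)
print(matching_users(CHILDREN, cutoff))
```

For a user with at least one date, this pair of conditions is the same as requiring the earliest date to be on or after the cutoff. User 456 has a 2011 record and is excluded, which agrees with the single bucket for key 123 in the response above.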

Here is all the code I used to test it:

http://sense.qbox.io/gist/1fbe448a85b9c74cb25cd5245d4e77f1eec46ea7

The original question can be found on Stack Overflow: https://stackoverflow.com/questions/32400244/
