If you have an array with repeated values, you can run a terms aggregation, but it gives you doc_count. Is there a way to get a value_count for each value instead of doc_count?
For example, if you have 2 rows containing the following:
["A", "A", "A", "B", "B", "C"]
["A", "C", "C", "D"]
A terms aggregation will give:
"A": 2,
"B": 1,
"C": 2,
"D": 1,
But is there a way to get the following result?
"A": 4,
"B": 2,
"C": 3,
"D": 1,
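The difference between the two counts can be illustrated outside Elasticsearch. The following is a minimal Python sketch, not part of the original question; the function names doc_counts and value_counts are hypothetical and only mimic the two ways of counting on the example rows.

```python
from collections import Counter

# The two example rows from the question.
rows = [
    ["A", "A", "A", "B", "B", "C"],
    ["A", "C", "C", "D"],
]


def doc_counts(rows):
    """Count each value at most once per row, like doc_count in a terms aggregation."""
    counts = Counter()
    for row in rows:
        counts.update(set(row))  # duplicates inside a row collapse to one
    return dict(counts)


def value_counts(rows):
    """Count every occurrence of each value across all rows."""
    counts = Counter()
    for row in rows:
        counts.update(row)  # duplicates inside a row are all counted
    return dict(counts)
```

Here doc_counts(rows) counts A twice (once per row), while value_counts(rows) counts A four times (three occurrences in the first row and one in the second).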
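Best answer
I can think of two ways to do it.
The first is to store each value together with its count in a nested field.
Mapping: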
PUT index82
{
"mappings": {
"properties": {
"tags":{
"type": "nested",
"properties": {
"name":{
"type":"keyword"
},
"count":{
"type":"integer"
}
}
}
}
}
}
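This mapping expects every document to carry a precomputed count per value. A minimal Python sketch of how a client might derive that structure from a raw array before indexing; the helper name to_nested_tags is hypothetical and not part of Elasticsearch or of the original answer.

```python
from collections import Counter


def to_nested_tags(values):
    """Collapse a raw list of values into the nested {"name", "count"} layout.

    Counter keeps first-seen order, so the tags come out in the order
    in which each distinct value first appears in the input.
    """
    counts = Counter(values)
    return {"tags": [{"name": name, "count": n} for name, n in counts.items()]}


# Document body for the first row of the question.
doc = to_nested_tags(["A", "A", "A", "B", "B", "C"])
```

The resulting dictionary can be sent as the document source when indexing into an index that uses the nested mapping.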
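Data (note that the first document here stores A with a count of 4, while the question's first row contains A only three times):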
"hits" : [
{
"_index" : "index82",
"_type" : "_doc",
"_id" : "Uky-UnEB8es6kpJsB_Ak",
"_score" : 1.0,
"_source" : {
"tags" : [
{
"name" : "A",
"count" : 4
},
{
"name" : "B",
"count" : 2
},
{
"name" : "C",
"count" : 1
}
]
}
},
{
"_index" : "index82",
"_type" : "_doc",
"_id" : "U0y-UnEB8es6kpJsNfCB",
"_score" : 1.0,
"_source" : {
"tags" : [
{
"name" : "A",
"count" : 1
},
{
"name" : "C",
"count" : 2
},
{
"name" : "D",
"count" : 1
}
]
}
}
]
Query:
{
"aggs": {
"tags": {
"nested": {
"path": "tags"
},
"aggs": {
"tag": {
"terms": {
"field": "tags.name",
"size": 10
},
"aggs": {
"count": {
"sum": {
"field": "tags.count"
}
}
}
}
}
}
}
}
Result:
"aggregations" : {
"tags" : {
"doc_count" : 6,
"tag" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "A",
"doc_count" : 2,
"count" : {
"value" : 5.0
}
},
{
"key" : "C",
"doc_count" : 2,
"count" : {
"value" : 3.0
}
},
{
"key" : "B",
"doc_count" : 1,
"count" : {
"value" : 2.0
}
},
{
"key" : "D",
"doc_count" : 1,
"count" : {
"value" : 1.0
}
}
]
}
}
}
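To turn a response like this into a plain value-to-count mapping, the buckets can be flattened on the client. A minimal Python sketch, assuming the response has already been parsed into a dictionary; the function name bucket_sums is hypothetical, and sample_response is a trimmed copy of the result above.

```python
def bucket_sums(response):
    """Map each terms bucket key to the value of its "count" sum sub-aggregation."""
    buckets = response["aggregations"]["tags"]["tag"]["buckets"]
    # The sum aggregation reports a float; the stored counts are integers,
    # so converting back to int is an assumption that holds for this mapping.
    return {b["key"]: int(b["count"]["value"]) for b in buckets}


# Trimmed copy of the aggregation result shown in the answer.
sample_response = {
    "aggregations": {
        "tags": {
            "doc_count": 6,
            "tag": {
                "buckets": [
                    {"key": "A", "doc_count": 2, "count": {"value": 5.0}},
                    {"key": "C", "doc_count": 2, "count": {"value": 3.0}},
                    {"key": "B", "doc_count": 1, "count": {"value": 2.0}},
                    {"key": "D", "doc_count": 1, "count": {"value": 1.0}},
                ]
            },
        }
    }
}

totals = bucket_sums(sample_response)
```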
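The second way is to use term vectors. The term vectors API takes document IDs as input and returns statistics for each document. So you will need to fetch the document IDs on the client, pass them as the _id values of a multi term vectors request, and sum the term frequencies on the client side.
Query: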
POST index81/_mtermvectors
{
"docs": [
{
"_id": "TUy3UnEB8es6kpJsfPC4",
"term_statistics": true,
"fields": [
"tags"
]
},
{
"_id": "Tky3UnEB8es6kpJshfAn",
"term_statistics": true,
"fields": [
"tags"
]
}
]
}
Result:
"docs" : [
{
"_index" : "index81",
"_type" : "_doc",
"_id" : "TUy3UnEB8es6kpJsfPC4",
"_version" : 1,
"found" : true,
"took" : 5,
"term_vectors" : {
"tags" : {
"field_statistics" : {
"sum_doc_freq" : 6,
"doc_count" : 2,
"sum_ttf" : 10
},
"terms" : {
"a" : {
"doc_freq" : 2,
"ttf" : 4,
"term_freq" : 3,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 1
},
{
"position" : 101,
"start_offset" : 2,
"end_offset" : 3
},
{
"position" : 202,
"start_offset" : 4,
"end_offset" : 5
}
]
},
"b" : {
"doc_freq" : 1,
"ttf" : 2,
"term_freq" : 2,
"tokens" : [
{
"position" : 303,
"start_offset" : 6,
"end_offset" : 7
},
{
"position" : 404,
"start_offset" : 8,
"end_offset" : 9
}
]
},
"c" : {
"doc_freq" : 2,
"ttf" : 3,
"term_freq" : 1,
"tokens" : [
{
"position" : 505,
"start_offset" : 10,
"end_offset" : 11
}
]
}
}
}
}
},
{
"_index" : "index81",
"_type" : "_doc",
"_id" : "Tky3UnEB8es6kpJshfAn",
"_version" : 1,
"found" : true,
"took" : 2,
"term_vectors" : {
"tags" : {
"field_statistics" : {
"sum_doc_freq" : 6,
"doc_count" : 2,
"sum_ttf" : 10
},
"terms" : {
"a" : {
"doc_freq" : 2,
"ttf" : 4,
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 1
}
]
},
"c" : {
"doc_freq" : 2,
"ttf" : 3,
"term_freq" : 2,
"tokens" : [
{
"position" : 101,
"start_offset" : 2,
"end_offset" : 3
},
{
"position" : 202,
"start_offset" : 4,
"end_offset" : 5
}
]
},
"d" : {
"doc_freq" : 1,
"ttf" : 1,
"term_freq" : 1,
"tokens" : [
{
"position" : 303,
"start_offset" : 6,
"end_offset" : 7
}
]
}
}
}
}
}
]
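The client-side step of summing term frequencies could look roughly like the following. This is a minimal Python sketch, assuming the multi term vectors response has been parsed into a dictionary; the function name total_term_freqs is hypothetical, and sample is a trimmed copy of the result above that keeps only term_freq. The keys are lowercase, as in the output above, presumably because the field is analyzed.

```python
from collections import Counter


def total_term_freqs(mtermvectors_response, field="tags"):
    """Sum term_freq per term over every document in an _mtermvectors response."""
    totals = Counter()
    for doc in mtermvectors_response.get("docs", []):
        # Documents that were not found carry no term_vectors entry.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            totals[term] += stats["term_freq"]
    return dict(totals)


# Trimmed copy of the response shown in the answer.
sample = {
    "docs": [
        {
            "_id": "TUy3UnEB8es6kpJsfPC4",
            "term_vectors": {"tags": {"terms": {
                "a": {"term_freq": 3},
                "b": {"term_freq": 2},
                "c": {"term_freq": 1},
            }}},
        },
        {
            "_id": "Tky3UnEB8es6kpJshfAn",
            "term_vectors": {"tags": {"terms": {
                "a": {"term_freq": 1},
                "c": {"term_freq": 2},
                "d": {"term_freq": 1},
            }}},
        },
    ]
}

totals = total_term_freqs(sample)
```

For the two documents above this yields 4 for a, 2 for b, 3 for c and 1 for d, which matches the per-value counts asked for in the question.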
Regarding elasticsearch - value_count instead of doc_count for repeated values in an array, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/61069648/