I have millions of records in ElasticSearch. Today I realized that some of them are duplicated. Is there any way to remove the duplicate records?
This is my query.
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must": [
            {"match": {"sensorId": "14FA084408"}},
            {"match": {"variableName": "FORWARD_FLOW"}}
          ]
        }
      },
      "filter": {
        "range": {
          "timestamp": {"gt": "2015-07-04", "lt": "2015-07-06"}
        }
      }
    }
  }
}
And this is what I get back from it.
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 21,
"max_score": 8.272615,
"hits": [
{
"_index": "iotsens-summarizedmeasures",
"_type": "summarizedmeasure",
"_id": "AU5isxVcMpd7AZtvmZcK",
"_score": 8.272615,
"_source": {
"id": null,
"sensorId": "14FA084408",
"variableName": "FORWARD_FLOW",
"rawValue": "0.2",
"value": "0.2",
"timestamp": 1436047200000,
"summaryTimeUnit": "DAYS"
}
},
{
"_index": "iotsens-summarizedmeasures",
"_type": "summarizedmeasure",
"_id": "AU5isxVnMpd7AZtvmZcL",
"_score": 8.272615,
"_source": {
"id": null,
"sensorId": "14FA084408",
"variableName": "FORWARD_FLOW",
"rawValue": "0.2",
"value": "0.2",
"timestamp": 1436047200000,
"summaryTimeUnit": "DAYS"
}
},
{
"_index": "iotsens-summarizedmeasures",
"_type": "summarizedmeasure",
"_id": "AU5isxV6Mpd7AZtvmZcN",
"_score": 8.0957,
"_source": {
"id": null,
"sensorId": "14FA084408",
"variableName": "FORWARD_FLOW",
"rawValue": "0.2",
"value": "0.2",
"timestamp": 1436047200000,
"summaryTimeUnit": "DAYS"
}
},
{
"_index": "iotsens-summarizedmeasures",
"_type": "summarizedmeasure",
"_id": "AU5isxWOMpd7AZtvmZcP",
"_score": 8.0957,
"_source": {
"id": null,
"sensorId": "14FA084408",
"variableName": "FORWARD_FLOW",
"rawValue": "0.2",
"value": "0.2",
"timestamp": 1436047200000,
"summaryTimeUnit": "DAYS"
}
},
{
"_index": "iotsens-summarizedmeasures",
"_type": "summarizedmeasure",
"_id": "AU5isxW8Mpd7AZtvmZcT",
"_score": 8.0957,
"_source": {
"id": null,
"sensorId": "14FA084408",
"variableName": "FORWARD_FLOW",
"rawValue": "0.2",
"value": "0.2",
"timestamp": 1436047200000,
"summaryTimeUnit": "DAYS"
}
},
{
"_index": "iotsens-summarizedmeasures",
"_type": "summarizedmeasure",
"_id": "AU5isxXFMpd7AZtvmZcU",
"_score": 8.0957,
"_source": {
"id": null,
"sensorId": "14FA084408",
"variableName": "FORWARD_FLOW",
"rawValue": "0.2",
"value": "0.2",
"timestamp": 1436047200000,
"summaryTimeUnit": "DAYS"
}
},
{
"_index": "iotsens-summarizedmeasures",
"_type": "summarizedmeasure",
"_id": "AU5isxXbMpd7AZtvmZcW",
"_score": 8.0957,
"_source": {
"id": null,
"sensorId": "14FA084408",
"variableName": "FORWARD_FLOW",
"rawValue": "0.2",
"value": "0.2",
"timestamp": 1436047200000,
"summaryTimeUnit": "DAYS"
}
},
{
"_index": "iotsens-summarizedmeasures",
"_type": "summarizedmeasure",
"_id": "AU5isxUtMpd7AZtvmZcG",
"_score": 8.077545,
"_source": {
"id": null,
"sensorId": "14FA084408",
"variableName": "FORWARD_FLOW",
"rawValue": "0.2",
"value": "0.2",
"timestamp": 1436047200000,
"summaryTimeUnit": "DAYS"
}
},
{
"_index": "iotsens-summarizedmeasures",
"_type": "summarizedmeasure",
"_id": "AU5isxXPMpd7AZtvmZcV",
"_score": 8.077545,
"_source": {
"id": null,
"sensorId": "14FA084408",
"variableName": "FORWARD_FLOW",
"rawValue": "0.2",
"value": "0.2",
"timestamp": 1436047200000,
"summaryTimeUnit": "DAYS"
}
},
{
"_index": "iotsens-summarizedmeasures",
"_type": "summarizedmeasure",
"_id": "AU5isxUZMpd7AZtvmZcE",
"_score": 7.9553676,
"_source": {
"id": null,
"sensorId": "14FA084408",
"variableName": "FORWARD_FLOW",
"rawValue": "0.2",
"value": "0.2",
"timestamp": 1436047200000,
"summaryTimeUnit": "DAYS"
}
}
]
}
}
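As you can see, I have 21 duplicate records for the same day. How can I delete the duplicates so that only one copy per day is kept? Thanks.
Best answer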
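Do a count (use the Count API for this), then use delete by query with the query size set to one less than the count, so that a single copy remains. (Combine Delete by query with the From/Size API to do this.)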
Count API
From/size API
Delete by query API
In this case, you should write your query so that it fetches only the duplicate records.
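One possible way to find only the duplicated keys is a nested terms aggregation with `min_doc_count` set to 2. The following is a minimal sketch, not something taken from the answer: the function name `build_duplicate_agg`, the aggregation names, and the bucket limit of 1000 are all illustrative assumptions, and it assumes `sensorId`, `variableName` and `timestamp` are indexed in a way that lets a terms aggregation bucket on their exact values.

```python
import json


def build_duplicate_agg(min_copies=2, bucket_limit=1000):
    """Build a search body whose innermost buckets are
    (sensorId, variableName, timestamp) keys seen at least min_copies times.

    Hypothetical helper; the names and the bucket limit are assumptions.
    """
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {
            "by_sensor": {
                "terms": {"field": "sensorId", "size": bucket_limit},
                "aggs": {
                    "by_variable": {
                        "terms": {"field": "variableName", "size": bucket_limit},
                        "aggs": {
                            "by_timestamp": {
                                "terms": {
                                    "field": "timestamp",
                                    "size": bucket_limit,
                                    # keep only keys that occur more than once
                                    "min_doc_count": min_copies,
                                }
                            }
                        },
                    }
                },
            }
        },
    }


body = build_duplicate_agg()
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a search request against the index. Each innermost bucket then names one duplicated key, which can be fed into a follow-up query that fetches the matching documents.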
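Or just query for the IDs and call a bulk delete on all of them except one. However, I guess you can't do this since you don't have an ID. IMHO, I don't see any other smart way to do this.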
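Note that every hit in the response above does carry an `_id`, even though the `id` field inside `_source` is null, so the bulk-delete route looks workable. A minimal sketch under stated assumptions: the helper `build_bulk_delete` is hypothetical, duplicates are assumed to be documents sharing `sensorId`, `variableName` and `timestamp`, and `hits` is assumed to be the `hits.hits` array of a search response. It only builds the newline-delimited body for the Bulk API and does not talk to a cluster. The `_type` entry mirrors the response in the question; newer Elasticsearch versions no longer use types, so it would be dropped there.

```python
import json
from collections import defaultdict


def build_bulk_delete(hits, key_fields=("sensorId", "variableName", "timestamp")):
    """Return a Bulk API body that deletes all but the first hit of each group.

    Hypothetical helper: key_fields is an assumption about what makes
    two documents duplicates of each other.
    """
    groups = defaultdict(list)
    for hit in hits:
        key = tuple(hit["_source"].get(field) for field in key_fields)
        groups[key].append(hit)

    lines = []
    for duplicates in groups.values():
        # keep the first document of each group, delete the rest
        for hit in duplicates[1:]:
            action = {
                "delete": {
                    "_index": hit["_index"],
                    "_type": hit["_type"],
                    "_id": hit["_id"],
                }
            }
            lines.append(json.dumps(action))
    # the Bulk API expects one JSON action per line and a trailing newline
    return "\n".join(lines) + "\n" if lines else ""


# Three of the hits from the question, trimmed to the relevant fields.
source = {"sensorId": "14FA084408", "variableName": "FORWARD_FLOW",
          "timestamp": 1436047200000}
sample_hits = [
    {"_index": "iotsens-summarizedmeasures", "_type": "summarizedmeasure",
     "_id": doc_id, "_source": dict(source)}
    for doc_id in ("AU5isxVcMpd7AZtvmZcK", "AU5isxVnMpd7AZtvmZcL",
                   "AU5isxV6Mpd7AZtvmZcN")
]

bulk_body = build_bulk_delete(sample_hits)
print(bulk_body)
```

With millions of records the hits would have to be collected in pages rather than in one response, and it is worth running this against a copy of the index first, since which copy survives depends only on the order of the hits.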
Regarding elasticsearch - deleting duplicate records in ElasticSearch, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31263636/