I'm new to Elasticsearch and am using version 6.5. My database contains website pages and their content, like this:
Url      Content
abc.com  There is some content about cars here. Lots of cars!
def.com  This page is all about cars.
ghi.com  Here it tells us something about insurances.
jkl.com  Another page about cars and how to buy cars.
I have been able to run a simple query that returns all documents whose content contains the word "cars" (using Python):
current_app.elasticsearch.search(index=index, doc_type=index,
    body={"query": {"multi_match": {"query": "cars", "fields": ["*"]}},
          "from": 0, "size": 100})
The result looks like this:
{'took': 2521,
'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': 29, 'max_score': 3.0240571, 'hits': [{'_index':
'pages', '_type': 'pages', '_id': '17277', '_score': 3.0240571,
'_source': {'content': '....'}}]}}
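Here "_id" refers to a domain, so what I basically get back is the list of domains whose pages match.
However, I now want to know how often the search term ("cars") occurs in each of those documents, for example 2 times for abc.com and 1 time for def.com.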
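I found several solutions for getting the number of documents that contain a search term, but none that explain how to get the number of occurrences of a term within a single document. I also couldn't find anything in the official documentation, although I'm fairly sure it is in there somewhere and I just haven't recognized it as the solution to my problem.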
Update:
As suggested by @Curious_MInd, I tried a terms aggregation:
current_app.elasticsearch.search(index=index, doc_type=index,
    body={"aggs": {"cars_count": {"terms": {"field": "Content"}}}})
Result:
{'took': 729, 'timed_out': False, '_shards': {'total': 5, 'successful':
5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 48, 'max_score': 1.0,
'hits': [{'_index': 'pages', '_type': 'pages', '_id': '17252',
'_score': 1.0, '_source': {'content': '...'}}]}, 'aggregations':
{'skala_count': {'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0, 'buckets': []}}}
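I can't see where this would show the count per document, but I assume that is because "buckets" is empty? Another thing to note: the results found by the terms aggregation are significantly worse than those of the multi_match query. Is there any way to combine the two?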
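Best answer
What you are trying to achieve cannot be done in a single query. The first query filters the index and returns the IDs of the documents in which the term needs to be counted.
Let's assume you have the following mapping: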
{
  "test": {
    "mappings": {
      "_doc": {
        "properties": {
          "details": {
            "type": "text",
            "store": true,
            "term_vector": "with_positions_offsets_payloads"
          },
          "name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Assume your query returns the following two documents:
{
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "details": "There is some content about cars here. Lots of cars!",
          "name": "n1"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "details": "This page is all about cars",
          "name": "n2"
        }
      }
    ]
  }
}
From the response above you can collect the IDs of all documents that matched your query. Here those are "_id": "1" and "_id": "2".
Now we use the _mtermvectors API to get the frequency (count) of each term in the given field:
test/_doc/_mtermvectors
{
  "docs": [
    {
      "_id": "1",
      "fields": [
        "details"
      ]
    },
    {
      "_id": "2",
      "fields": [
        "details"
      ]
    }
  ]
}
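The request body above can be assembled from the hits of the first query rather than written by hand. The following is a minimal sketch of that step; the helper name `build_mtermvectors_body` is made up for illustration and is not part of any library, and the sample data is a trimmed copy of the search response shown earlier.

```python
def build_mtermvectors_body(search_response, fields):
    """Build an _mtermvectors request body from a search response.

    `search_response` is the dict returned by the search call and
    `fields` lists the fields whose term vectors are wanted.
    (Hypothetical helper, named here for illustration only.)
    """
    hits = search_response["hits"]["hits"]
    return {
        "docs": [{"_id": hit["_id"], "fields": list(fields)} for hit in hits]
    }


# Trimmed copy of the search response shown earlier in this answer.
search_response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "test", "_type": "_doc", "_id": "1"},
            {"_index": "test", "_type": "_doc", "_id": "2"},
        ],
    }
}

body = build_mtermvectors_body(search_response, ["details"])
print(body)
```

With the Python client used in the question, a body like this would presumably be passed along the lines of `es.mtermvectors(index="test", doc_type="_doc", body=body)`; treat the exact call as an assumption and check it against your client version.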
The _mtermvectors request returns the following result:
{
  "docs": [
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "1",
      "_version": 1,
      "found": true,
      "took": 8,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....,
            "cars": {
              "term_freq": 2,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 28,
                  "end_offset": 32
                },
                {
                  "position": 9,
                  "start_offset": 47,
                  "end_offset": 51
                }
              ]
            },
            ....
          }
        }
      }
    },
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 2,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....,
            "cars": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 23,
                  "end_offset": 27
                }
              ]
            },
            ....
          }
        }
      }
    }
  ]
}
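Note that I have used .... in the response to stand for the data of the other terms, because the term vectors API returns term-related details for every term in the field. You can extract the information for the term you need from the response above. It is shown here for cars, and the field you are interested in is term_freq.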
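As a rough illustration of that extraction step, the sketch below walks a response of this shape and collects term_freq per document ID. The function name `term_frequencies` is an assumption made for this example, and the sample data is a trimmed copy of the response above that keeps only the "cars" entries.

```python
def term_frequencies(mtermvectors_response, field, term):
    """Map each document ID to the term_freq of `term` in `field`.

    Documents without term vectors for the field, or without the
    term, get a count of 0. (Hypothetical helper for illustration.)
    """
    counts = {}
    for doc in mtermvectors_response.get("docs", []):
        terms = (
            doc.get("term_vectors", {})
            .get(field, {})
            .get("terms", {})
        )
        counts[doc["_id"]] = terms.get(term, {}).get("term_freq", 0)
    return counts


# Trimmed copy of the response above, keeping only the "cars" entries.
response = {
    "docs": [
        {
            "_id": "1",
            "found": True,
            "term_vectors": {"details": {"terms": {"cars": {"term_freq": 2}}}},
        },
        {
            "_id": "2",
            "found": True,
            "term_vectors": {"details": {"terms": {"cars": {"term_freq": 1}}}},
        },
    ]
}

counts = term_frequencies(response, "details", "cars")
print(counts)  # {'1': 2, '2': 1}
```

One thing to keep in mind: the keys under "terms" are the analyzed tokens, so the lookup term has to match the analyzed form. With the default analyzer for a text field, tokens are lowercased, so "cars" would match while "Cars" would not.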
Regarding elasticsearch - Elasticsearch: counting terms in a document, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53571702/