elasticsearch - Extracting keywords (multi-word) from text using Elasticsearch

Tags: elasticsearch

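I have an index full of keywords, and I want to use those keywords to extract the ones that appear in an input text.

Below is a sample of the keyword index. Note that a keyword can consist of several words; each one is essentially a unique tag.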

{
  "hits": {
    "total": 2000,
    "hits": [
      {
        "id": 1,
        "keyword": "thousand eyes"
      },
      {
        "id": 2,
        "keyword": "facebook"
      },
      {
        "id": 3,
        "keyword": "superdoc"
      },
      {
        "id": 4,
        "keyword": "quora"
      },
      {
        "id": 5,
        "keyword": "your story"
      },
      {
        "id": 6,
        "keyword": "Surgery"
      },
      {
        "id": 7,
        "keyword": "lending club"
      },
      {
        "id": 8,
        "keyword": "ad roll"
      },
      {
        "id": 9,
        "keyword": "the honest company"
      },
      {
        "id": 10,
        "keyword": "Draft kings"
      }
    ]
  }
}

Now, if I enter the text "I saw the news of lending club on facebook, your story and quora", the search should return ["lending club", "facebook", "your story", "quora"]. The search should also be case-insensitive.

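Best answer

There is really only one way to do this: index your data as keywords, and analyze the search text with shingles.

See this reproduction:

First, we create two custom analyzers, one based on the keyword tokenizer and one that adds the shingle filter: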

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        },
        "my_analyzer_shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "keyword": {
          "type": "string",
          "index_analyzer": "my_analyzer_keyword",
          "search_analyzer": "my_analyzer_shingle"
        }
      }
    }
  }
}

Now let's create some sample data from what you gave us:

POST /test/your_type/1
{
  "id": 1,
  "keyword": "thousand eyes"
}
POST /test/your_type/2
{
  "id": 2,
  "keyword": "facebook"
}
POST /test/your_type/3
{
  "id": 3,
  "keyword": "superdoc"
}
POST /test/your_type/4
{
  "id": 4,
  "keyword": "quora"
}
POST /test/your_type/5
{
  "id": 5,
  "keyword": "your story"
}
POST /test/your_type/6
{
  "id": 6,
  "keyword": "Surgery"
}
POST /test/your_type/7
{
  "id": 7,
  "keyword": "lending club"
}
POST /test/your_type/8
{
  "id": 8,
  "keyword": "ad roll"
}
POST /test/your_type/9
{
  "id": 9,
  "keyword": "the honest company"
}
POST /test/your_type/10
{
  "id": 10,
  "keyword": "Draft kings"
}

And finally, the query that runs the search:

POST /test/your_type/_search
{
  "query": {
    "match": {
      "keyword": "I saw the news of lending club on facebook, your story and quora"
    }
  }
}

This is the result:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.009332742,
    "hits": [
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "2",
        "_score": 0.009332742,
        "_source": {
          "id": 2,
          "keyword": "facebook"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "7",
        "_score": 0.009332742,
        "_source": {
          "id": 7,
          "keyword": "lending club"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "4",
        "_score": 0.009207102,
        "_source": {
          "id": 4,
          "keyword": "quora"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "5",
        "_score": 0.0014755741,
        "_source": {
          "id": 5,
          "keyword": "your story"
        }
      }
    ]
  }
}
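
To get from this response to the flat list the question asks for, the `keyword` field of each hit can be collected on the client side. The following is a minimal sketch in Python; the function name `keywords_from_response` is hypothetical, and the `response` dictionary is a trimmed copy of the result shown above rather than the output of a real client call.

```python
def keywords_from_response(response):
    """Collect the `keyword` field of every hit, in the order returned."""
    return [hit["_source"]["keyword"] for hit in response["hits"]["hits"]]


# Trimmed copy of the search response shown above (scores and shard info omitted).
response = {
    "hits": {
        "total": 4,
        "hits": [
            {"_id": "2", "_source": {"id": 2, "keyword": "facebook"}},
            {"_id": "7", "_source": {"id": 7, "keyword": "lending club"}},
            {"_id": "4", "_source": {"id": 4, "keyword": "quora"}},
            {"_id": "5", "_source": {"id": 5, "keyword": "your story"}},
        ],
    }
}

print(keywords_from_response(response))
```

The order of the list follows the relevance ranking of the hits, so it may differ from the order in which the keywords appear in the input text.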

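So what does it do behind the scenes?

  1. It indexes each document as a whole keyword (the keyword tokenizer emits the entire string as a single token). I also added the asciifolding filter, which normalizes letters (for example, é becomes e), and the lowercase filter (for case-insensitive search). So, for instance, Draft kings is indexed as draft kings.
  2. The search analyzer uses the same logic, except that its tokenizer emits one token per word and the shingle filter then builds shingles (combinations of adjacent tokens) on top of them. Those shingles match the keywords indexed in the first step.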
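
The two steps above can be imitated in plain Python to see why the match works. This is only a rough sketch, not what Elasticsearch runs internally: `fold_and_lowercase`, `keyword_analyze`, `shingle_analyze` and `extract_keywords` are hypothetical helper names, the regular expression is a crude stand-in for the standard tokenizer, and accent stripping via `unicodedata` only approximates asciifolding. The `max_shingle_size` parameter mirrors the shingle filter setting of the same name, which defaults to 2, so a three-word keyword such as "the honest company" would presumably need a larger value to be matched.

```python
import re
import unicodedata


def fold_and_lowercase(text):
    """Rough stand-in for the asciifolding + lowercase filters."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def keyword_analyze(text):
    """Index side: the whole string becomes a single token."""
    return [fold_and_lowercase(text)]


def shingle_analyze(text, max_shingle_size=2):
    """Search side: single-word tokens plus shingles of adjacent words."""
    words = re.findall(r"\w+", fold_and_lowercase(text))
    tokens = []
    for size in range(1, max_shingle_size + 1):
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens


def extract_keywords(text, keywords, max_shingle_size=2):
    """Return stored keywords whose index token appears among the query tokens."""
    query_tokens = set(shingle_analyze(text, max_shingle_size))
    return [kw for kw in keywords if keyword_analyze(kw)[0] in query_tokens]


keywords = ["thousand eyes", "facebook", "superdoc", "quora", "your story",
            "Surgery", "lending club", "ad roll", "the honest company", "Draft kings"]
text = "I saw the news of lending club on facebook, your story and quora"

print(extract_keywords(text, keywords))
```

With the sample keywords this prints the same four keywords the search returned, in index order rather than by score.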

Source: a similar question on Stack Overflow, "elasticsearch - Extracting keywords (multi-word) from text using Elasticsearch": https://stackoverflow.com/questions/33581029/
