java - Indexing PDF/Word documents in Elasticsearch from Java code with the ingest-attachment plugin

Tags: java elasticsearch elastic-stack

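I am trying to index my Word/PDF documents. To do that, I wrote a Java utility that encodes my files as Base64, and I am now trying to index the encoded content in Elasticsearch.

The code below encodes the file as Base64 successfully. What I am not sure about is how to index the result in Elasticsearch.

Here is my Java code: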

public static void main(String args[]) throws IOException {
    String filePath = "D:\\\\1SearchEngine\\testing.pdf";
    String encodedfile = null;
    RestHighLevelClient restHighLevelClient = null;
    File file = new File(filePath);
    try {
        FileInputStream fileInputStreamReader = new FileInputStream(file);
        byte[] bytes = new byte[(int) file.length()];
        fileInputStreamReader.read(bytes);
        encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
        //System.out.println(encodedfile);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }

    try {
        if (restHighLevelClient != null) {
            restHighLevelClient.close();
        }
    } catch (final Exception e) {
        System.out.println("Error closing ElasticSearch client: ");
    }

    try {
        restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"),
                new HttpHost("localhost", 9201, "http")));
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }

    IndexRequest request = new IndexRequest( "attach_local", "doc", "103");   
    Map<String, Object> jsonMap = new HashMap<>();
    jsonMap.put("resume", "Karthikeyan");
    jsonMap.put("postDate", new Date());
    jsonMap.put("resume", encodedfile);
    try {
        IndexResponse response = restHighLevelClient.index(request);
    } catch(ElasticsearchException e) {
        if (e.status() == RestStatus.CONFLICT) {

        }
    }
}

I am using Elasticsearch 6.2.3, and I have installed version 6.3.0 of the ingest-attachment plugin.

I am using the following dependency for the Elasticsearch client:

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>6.1.2</version>
</dependency>

Here are my mapping and ingest pipeline definitions:

PUT attach_local
{
  "mappings" : {
    "doc" : {
      "properties" : {
        "attachment" : {
          "properties" : {
            "content" : {
              "type" : "binary"
            },
            "content_length" : {
              "type" : "long"
            },
            "content_type" : {
              "type" : "text"
            },
            "language" : {
              "type" : "text"
            }
          }
        },
        "resume" : {
          "type" : "text"
        }
      }
    }
  }
}

PUT _ingest/pipeline/attach_local
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "resume"
      }
    }
  ]
}

When I index the document from Java, I get the following error:

Exception in thread "main" org.elasticsearch.action.ActionRequestValidationException: Validation Failed: 1: source is missing;2: content type is missing;
    at org.elasticsearch.action.ValidateActions.addValidationError(ValidateActions.java:26)
    at org.elasticsearch.action.index.IndexRequest.validate(IndexRequest.java:153)
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:436)
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:429)
    at org.elasticsearch.client.RestHighLevelClient.index(RestHighLevelClient.java:312)
    at com.es.utility.DocumentIndex.main(DocumentIndex.java:82)

Best answer

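I finally found a way to index PDF/Word documents in Elasticsearch through the Java API. The request in the question was never given a source, which is what the "source is missing" validation error reports. The code below passes the map to the request with .source(jsonMap) and routes the document through the ingest pipeline with .setPipeline("attach_local").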

// Imports used by this snippet (high-level REST client 6.x).
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpHost;
import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.rest.RestStatus;

// The statements below belong inside a method that declares "throws IOException".
String filePath = "D:\\\\1SearchEngine\\testing.pdf";

// Read the whole file and Base64-encode it; the attachment processor expects Base64 input.
// Files.readAllBytes avoids the partial read that a single InputStream.read(byte[]) call can return.
String encodedfile = Base64.getEncoder().encodeToString(Files.readAllBytes(Paths.get(filePath)));

Map<String, Object> jsonMap = new HashMap<>();
jsonMap.put("Name", "Karthikeyan");
jsonMap.put("postDate", new Date());
jsonMap.put("resume", encodedfile);

// Attach the document source and send the request through the ingest pipeline.
IndexRequest request = new IndexRequest("attach_local", "doc", "104")
        .source(jsonMap)
        .setPipeline("attach_local");

// try-with-resources closes the client once indexing is done.
try (RestHighLevelClient restHighLevelClient = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http"),
                new HttpHost("localhost", 9201, "http")))) {
    IndexResponse response = restHighLevelClient.index(request);
    System.out.println("Indexed document id: " + response.getId());
} catch (ElasticsearchException e) {
    if (e.status() == RestStatus.CONFLICT) {
        // Version conflict: handle as appropriate for the application.
    }
}
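
The document body does not depend on the Elasticsearch client, so it can be built and checked on its own before it is handed to .source(...). The following is only a minimal sketch using the JDK: the class name AttachmentDocument and the method build are hypothetical, while the field names (Name, postDate, resume) are the ones used in the snippet above. Files.readAllBytes reads the whole file in one call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of any Elasticsearch API.
public class AttachmentDocument {

    /** Builds the document source: metadata plus the file content, Base64-encoded, under "resume". */
    public static Map<String, Object> build(String name, Path file) throws IOException {
        Map<String, Object> source = new HashMap<>();
        source.put("Name", name);
        source.put("postDate", new Date());
        source.put("resume", Base64.getEncoder().encodeToString(Files.readAllBytes(file)));
        return source;
    }

    public static void main(String[] args) throws IOException {
        // A small temporary file stands in for a real PDF or Word document.
        Path tmp = Files.createTempFile("resume", ".pdf");
        try {
            Files.write(tmp, new byte[] {1, 2, 3, 4, 5});
            Map<String, Object> source = build("Karthikeyan", tmp);
            // Decode again to confirm the payload round-trips to the original bytes.
            byte[] decoded = Base64.getDecoder().decode((String) source.get("resume"));
            System.out.println(decoded.length + " bytes round-tripped");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The map returned by build can be passed directly to IndexRequest.source(...), exactly as jsonMap is in the answer.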

Mapping details:

PUT attach_local
{
  "mappings" : {
    "doc" : {
      "properties" : {
        "attachment" : {
          "properties" : {
            "content" : {
              "type" : "binary"
            },
            "content_length" : {
              "type" : "long"
            },
            "content_type" : {
              "type" : "text"
            },
            "language" : {
              "type" : "text"
            }
          }
        },
        "resume" : {
          "type" : "text"
        }
      }
    }
  }
}


PUT _ingest/pipeline/attach_local
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "resume"
      }
    }
  ]
}
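
For reference, setPipeline("attach_local") on the Java request corresponds to the pipeline query parameter of the index API, and the attachment processor writes the extracted fields under attachment by default. The sketch below only builds the equivalent REST path and a minimal JSON body with the JDK, so the request can be compared with what is typed into a console. The class name PipelineRequestSketch and its methods are hypothetical, and the JSON is assembled by hand purely for illustration; real code would normally use a JSON library or the client itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration; it does not contact Elasticsearch.
public class PipelineRequestSketch {

    /** Path of the index API call, with the ingest pipeline passed as a query parameter. */
    public static String indexPath(String index, String type, String id, String pipeline) {
        return "/" + index + "/" + type + "/" + id + "?pipeline=" + pipeline;
    }

    /** Minimal JSON body holding the Base64 payload in the field the pipeline reads ("resume"). */
    public static String body(byte[] fileBytes) {
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        // Standard Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=', so no JSON escaping is needed.
        return "{\"resume\":\"" + encoded + "\"}";
    }

    public static void main(String[] args) {
        System.out.println("PUT " + indexPath("attach_local", "doc", "104", "attach_local"));
        System.out.println(body("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```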

This question, "java - Indexing PDF/Word documents in Elasticsearch from Java code with the ingest-attachment plugin", is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/50927198/
