curl - How to avoid per-document index updates in the Elasticsearch Bulk API

I am using curl to add Apache log lines as documents to Elasticsearch through the Bulk API. I post the following:

{"index": {"_type": "apache", "_id": "123", "_index": "apache-2017-01"}}
{"s": 200, "d": "example.se", "@t": "2017-01-01T00:00:00.000Z", "p": "/foo"}
{"index": {"_type": "apache", "_id": "124", "_index": "apache-2017-01"}}
{"s": 200, "d": "example.se", "@t": "2017-01-01T00:00:00.000Z", "p": "/bar"}
... more of the same ...
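
For reference, a body of this shape can be generated rather than written by hand. The following is only a minimal sketch: the function name `build_bulk_body` and its parameters are hypothetical and not part of the original question. It emits one action line and one source line per document, and ends the body with a newline, which the Bulk API requires.

```python
import json


def build_bulk_body(index, doc_type, docs):
    """Build a newline-delimited JSON body for the Bulk API.

    `docs` is an iterable of (doc_id, source_dict) pairs.
    This helper is hypothetical; it only builds the string and sends nothing.
    """
    lines = []
    for doc_id, source in docs:
        # Action line: tells Elasticsearch where the next line should be indexed.
        action = {"index": {"_type": doc_type, "_id": doc_id, "_index": index}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    if not lines:
        return ""
    # The bulk body must be terminated by a final newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "apache-2017-01",
    "apache",
    [
        ("123", {"s": 200, "d": "example.se", "@t": "2017-01-01T00:00:00.000Z", "p": "/foo"}),
        ("124", {"s": 200, "d": "example.se", "@t": "2017-01-01T00:00:00.000Z", "p": "/bar"}),
    ],
)
```

If the body is written to a file and posted with curl, `--data-binary` is the option that preserves the newlines; plain `-d` strips them.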

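My guess is that Lucene updates its index for every log row. But I don't need Elasticsearch to do that. It would be fine to add all the log files first and update the index afterwards.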

Is this possible? Is it a good idea? Would it improve performance?

Best answer

Your intuition is not far from the truth. By default, Elasticsearch refreshes its index every second:

The default index.refresh_interval is 1s, which forces Elasticsearch to create a new segment every second. Increasing this value (to say, 30s) will allow larger segments to flush and decreases future merge pressure.

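So one way to increase indexing throughput is to raise index.refresh_interval, or even disable refresh entirely (a value of -1) while loading, and then turn it back on once the inserts are finished. (Note that inserted documents only become searchable after a refresh has written them into a segment.)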
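
As a rough illustration of that sequence, the sketch below builds the two settings requests with Python's standard library. The helper name `refresh_interval_request`, the host `http://localhost:9200`, and the choice of `1s` as the value to restore are assumptions made for the example; the requests are only constructed here, not sent.

```python
import json
import urllib.request


def refresh_interval_request(base_url, index, interval):
    """Build (but do not send) a PUT <index>/_settings request.

    Hypothetical helper: base_url and index are placeholders for your cluster.
    """
    payload = {"index": {"refresh_interval": interval}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed local cluster and the index name from the question.
disable = refresh_interval_request("http://localhost:9200", "apache-2017-01", "-1")
restore = refresh_interval_request("http://localhost:9200", "apache-2017-01", "1s")

# Against a real cluster the order would be:
#   urllib.request.urlopen(disable)   # stop periodic refreshes
#   ... post the bulk requests ...
#   urllib.request.urlopen(restore)   # make the new documents searchable again
```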

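However, this is not the only bottleneck when inserting documents into Elasticsearch. For example, you might consider using several threads to bulk-insert documents, or applying the other adjustments described in the "Tune for indexing speed" section of the Elasticsearch documentation. The "Dynamic index settings" section lists other index parameters you can change.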
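
The multi-threaded approach could look roughly like the sketch below. It is not taken from the original answer: `chunked`, `parallel_bulk`, and the `send_batch` callable are hypothetical names, and the batch size and worker count are arbitrary starting points to tune against your own cluster. `send_batch` stands in for whatever function actually posts one batch to the `_bulk` endpoint.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parallel_bulk(docs, send_batch, batch_size=1000, workers=4):
    """Split `docs` into batches and hand each batch to `send_batch` on a thread pool.

    `send_batch` is a caller-supplied function (hypothetical here) that posts
    one batch and returns whatever result the caller wants to collect.
    Results come back in batch order.
    """
    batches = list(chunked(docs, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send_batch, batches))


# Stand-in sender that just reports the batch size instead of doing HTTP.
results = parallel_bulk(list(range(2500)), len, batch_size=1000, workers=2)
```

Threads fit here because each worker spends most of its time waiting on the HTTP round trip rather than on CPU work.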

Hope that helps!

A similar question on Stack Overflow: https://stackoverflow.com/questions/47348876/
