I'm completely confused by these two parameters:
es.scroll.size
es.scroll.limit
I did some testing and still can't tell. Is it that
es.scroll.limit = es.scroll.size * num_of_scrolls ???
Best Answer
es.scroll.size and es.scroll.limit are both configuration parameters passed to elasticsearch-hadoop when issuing requests from a distributed cluster (Apache Spark, for example).
Before looking at the two parameters, it is important to understand the following about elasticsearch-hadoop, from the docs:
Shards play a critical role when reading information from Elasticsearch. Since it acts as a source, elasticsearch-hadoop will create one Hadoop InputSplit per Elasticsearch shard, or in case of Apache Spark one Partition, that is given a query that works against index I. elasticsearch-hadoop will dynamically discover the number of shards backing I and then for each shard will create, in case of Hadoop an input split (which will determine the maximum number of Hadoop tasks to be executed) or in case of Spark a partition which will determine the RDD maximum parallelism.
So we learn that the number of shards affects the number of queries that run. ES team member james.baiera also says here:
ES-Hadoop uses the scroll endpoint to collect all the data for processing within Spark. ES-Hadoop performs the multiple scroll requests under the hood on its own...
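The "under the hood" paging can be sketched with a small stand-in (a pure-Python simulation, not the real Elasticsearch client): each request within a scroll returns at most `size` hits, and requests repeat until the matches are exhausted.

```python
def scroll_pages(total_hits, size):
    """Simulate one scroll: yield pages of at most `size` hits
    until all matching documents have been returned."""
    returned = 0
    while returned < total_hits:
        page = min(size, total_hits - returned)
        returned += page
        yield page

# A shard with 120 matching docs and es.scroll.size = 50
# takes three requests under the hood:
print(list(scroll_pages(120, 50)))  # [50, 50, 20]
```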
So the cluster creates one scroll per partition, and there is one partition per shard! Each of these scrolls is governed by the limit and size parameters above. Again, following the documentation:
es.scroll.size (default 50)
Number of results/items returned by each individual scroll request.
es.scroll.limit (default -1)
Number of total results/items returned by each individual scroll. A negative value indicates that all documents that match should be returned. Do note that this applies per scroll which is typically bound to one of the job tasks. Thus the total number of documents returned is LIMIT * NUMBER_OF_SCROLLS (OR TASKS)
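In Spark these are simply options on the elasticsearch-hadoop connector. A minimal sketch (the node address, index name, and the `spark` session are placeholders; only the `es.scroll.*` keys, quoted from the docs above, are the point here):

```python
# Hypothetical connector options; addresses and index names are made up.
es_options = {
    "es.nodes": "localhost:9200",  # placeholder node address
    "es.resource": "my-index",     # placeholder index name
    "es.scroll.size": "100",       # docs per individual scroll request
    "es.scroll.limit": "1000",     # max docs per scroll, i.e. per task/shard
}

# Applied to a read (requires a running Spark session and ES cluster):
# df = (spark.read.format("org.elasticsearch.spark.sql")
#       .options(**es_options)
#       .load())
```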
Size states the number of documents requested per call within a scroll, not for the entire scroll. Limit states the maximum number of documents to retrieve across all the calls of a single scroll (and remember, there are as many scrolls as there are shards in the index). So now this calculation makes sense:
Total documents retrieved by the whole cluster = limit per scroll (es.scroll.limit) * number of scrolls (one per shard in the index).
When I tried this myself I got matching results: I queried an index with 14 shards, limit was 1, and the cluster indeed fetched 14 documents. And as nefo_x states in his answer, limit also caps size, which is only reasonable: no single call within a scroll should be larger than the overall limit on that whole scroll, right?
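The arithmetic above can be checked with a few lines (a simulation, assuming one scroll per shard and a hard stop at the limit; a negative limit means "no cap", matching the default of -1):

```python
def docs_fetched(num_shards, limit, matches_per_shard):
    """Total documents the cluster retrieves: one scroll per shard,
    each capped at `limit` (negative limit = return everything)."""
    per_shard = matches_per_shard if limit < 0 else min(limit, matches_per_shard)
    return num_shards * per_shard

# The experiment from the answer: 14 shards, es.scroll.limit = 1
print(docs_fetched(14, 1, 1000))   # 14 documents in total
# Default es.scroll.limit = -1: every matching document comes back
print(docs_fetched(14, -1, 1000))  # 14000
```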
Regarding apache-spark - what is the difference between es.scroll.limit and es.scroll.size, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47193321/