hadoop - Is it possible for Spark to read HDFS data and do some computation at the same time?

For example, I ran the following word count application on Spark:

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
             .map(word => (word, 1))
             .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Suppose a worker has to process 1 GB of data. Can that worker start doing some computation (such as flatMap) before it has fetched all 1 GB?

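Best answer

Generally speaking, yes, but your question is a bit broad, so I am not sure whether you are looking for an answer to a specific case.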

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, I mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
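The pattern in the quoted passage can be sketched as follows. This is a minimal illustration, not part of the original answer: it assumes Spark is on the classpath and runs in local mode, and the object name, app name, and data set are hypothetical. Each Future calls an action from its own thread, so the two jobs are submitted to the same SparkContext concurrently.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ConcurrentJobsSketch {
  // Returns (number of even values, sum of all values) for 1..1000000.
  def run(): (Long, Long) = {
    // Assumption: local mode with 4 threads, purely for illustration.
    val conf = new SparkConf().setAppName("concurrent-jobs-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 1000000)
      // Each action (count, reduce) is a separate job, submitted from a separate thread.
      val countJob = Future { data.filter(_ % 2 == 0).count() }
      val sumJob = Future { data.map(_.toLong).reduce(_ + _) }
      (Await.result(countJob, 5.minutes), Await.result(sumJob, 5.minutes))
    } finally {
      sc.stop()
    }
  }
}
```

Whether the two jobs actually overlap depends on the scheduler mode and on how many cores the first job leaves free.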

Sometimes you need to share resources among different users.

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
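The policy described in the quote is selected through the spark.scheduler.mode setting, whose documented values are FIFO (the default) and FAIR. A minimal sketch of setting it explicitly, assuming Spark is on the classpath; the object and helper names are hypothetical.

```scala
import org.apache.spark.SparkConf

object SchedulerModeSketch {
  // Builds a SparkConf with an explicit scheduling mode ("FIFO" or "FAIR").
  // loadDefaults = false keeps the sketch independent of JVM system properties.
  def confWithMode(mode: String): SparkConf =
    new SparkConf(false)
      .setAppName("scheduler-mode-sketch") // hypothetical app name
      .setMaster("local[2]")               // assumption: local mode for illustration
      .set("spark.scheduler.mode", mode)
}
```

The resulting configuration would be passed to new SparkContext(conf) before any jobs are submitted.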

In general, it all depends on which scheduler you use and what you use it for.

Ref.: official Spark documentation > Job Scheduling > Scheduling Within an Application.

So, back to your specific question: suppose a worker has to process 1 GB of data. Can that worker start doing some computation (such as flatMap) before it has fetched all 1 GB?

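Yes. sc.textFile is lazy, and each task consumes its input split through an iterator, so flatMap and map are applied record by record while the data is still being read. Only the reduce side of reduceByKey has to wait for the map tasks to finish, because it requires a shuffle.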
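The record-by-record behaviour can be illustrated without Spark. The sketch below uses plain Scala iterators as a stand-in for a single task. It is not Spark's actual implementation, and readLines, run, and events are hypothetical names introduced for the example. It logs when a line is read and when a word is emitted, which shows that words from the first line are produced before the second line is read.

```scala
import scala.collection.mutable.ArrayBuffer

object PipelineSketch {
  // Records the order in which lines are read and words are emitted.
  val events = ArrayBuffer.empty[String]

  // Hypothetical stand-in for an HDFS split: lines are produced lazily, one at a time.
  def readLines(lines: Seq[String]): Iterator[String] =
    lines.iterator.map { line => events += s"read: $line"; line }

  def run(lines: Seq[String]): Map[String, Int] = {
    events.clear()
    val pairs = readLines(lines)
      .flatMap(line => line.split(" ").iterator)
      .map { word => events += s"emit: $word"; (word, 1) }
    // Rough analogue of combining counts within one partition before a shuffle.
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }
  }
}
```

Calling PipelineSketch.run(Seq("a b", "c a")) and then inspecting PipelineSketch.events shows read and emit entries interleaved rather than all reads first.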

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/36880832/
