apache-spark - Spark on K8s: Job proceeds although some executors are still pending

Tags: apache-spark, kubernetes

I'm using Spark 3.1.2 and have created a cluster with 4 executors, each with 15 cores.

My total partition count should therefore be 60, yet only 30 are allocated.

The job starts up as follows, requesting 4 executors:

21/12/23 23:51:11 DEBUG ExecutorPodsAllocator: Set total expected execs to {0=4}

A few minutes later, it is still waiting for them:

21/12/23 23:53:13 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 0 running, 4 unknown pending, 0 scheduler backend known pending, 0 unknown newly created, 0 scheduler backend known newly created.
21/12/23 23:53:13 DEBUG ExecutorPodsAllocator: Still waiting for 4 executors for ResourceProfile Id 0 before requesting more.

Finally, 2 come up:

21/12/23 23:53:14 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named io-getspectrum-data-acquisition-modelscoringprocessor-8b92877de9b4ab13-exec-1, action MODIFIED
21/12/23 23:53:14 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named io-getspectrum-data-acquisition-modelscoringprocessor-8b92877de9b4ab13-exec-3, action MODIFIED
21/12/23 23:53:15 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 2 running, 2 unknown pending, 0 scheduler backend known pending, 0 unknown newly created, 0 scheduler backend known newly created.

Then a third:

21/12/23 23:53:17 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named io-getspectrum-data-acquisition-modelscoringprocessor-8b92877de9b4ab13-exec-2, action MODIFIED
21/12/23 23:53:18 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 3 running, 1 unknown pending, 0 scheduler backend known pending, 0 unknown newly created, 0 scheduler backend known newly created.

...and then the job finally proceeds:

21/12/23 23:53:30 DEBUG KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Launching task 0 on executor id: 1 hostname: 10.128.35.137.
21/12/23 23:53:33 INFO MyProcessor: Calculated partitions are read 45 write 1

I don't understand why it suddenly decides to proceed once we have 3 executors instead of waiting for the 4th.

I have looked through the Spark and Spark-on-K8s configuration options and don't see any setting that would affect this behavior.

Why does it proceed when we have 3 executors?

Best Answer

According to the Spark docs, scheduling is controlled by these settings:

spark.scheduler.maxRegisteredResourcesWaitingTime
default=30s
Maximum amount of time to wait for resources to register before scheduling begins.

spark.scheduler.minRegisteredResourcesRatio
default=0.8 for KUBERNETES mode; 0.8 for YARN mode; 0.0 for standalone mode and Mesos coarse-grained mode
The minimum ratio of registered resources (registered resources / total expected resources) (resources are executors in yarn mode and Kubernetes mode, CPU cores in standalone mode and Mesos coarse-grained mode ['spark.cores.max' value is total expected resources for Mesos coarse-grained mode] ) to wait for before scheduling begins. Specified as a double between 0.0 and 1.0. Regardless of whether the minimum ratio of resources has been reached, the maximum amount of time it will wait before scheduling begins is controlled by config spark.scheduler.maxRegisteredResourcesWaitingTime.

In your case, it appears spark.scheduler.maxRegisteredResourcesWaitingTime was reached: with only 3 of 4 executors registered, the ratio was 3/4 = 0.75, below the 0.8 default for Kubernetes mode, so the scheduler waited out the 30s timeout and then began scheduling anyway.
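If you want the job to wait for all executors before scheduling begins, you can raise the ratio to 1.0 and extend the timeout to cover pod startup. A minimal sketch of the relevant spark-submit flags (the master URL and other settings here are placeholders, not from the question):

```shell
# Sketch: require all requested executors to register before scheduling starts.
# minRegisteredResourcesRatio=1.0 -> wait for 4/4 executors, not 0.8 (default on K8s).
# maxRegisteredResourcesWaitingTime=5min -> upper bound; scheduling starts anyway
# after this, so size it to your cluster's typical pod startup time.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.executor.cores=15 \
  --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=5min \
  <your-application>
```

Note the trade-off: with a ratio of 1.0, a single slow or unschedulable pod delays the whole job up to the full waiting time, so the default of 0.8 is a deliberate compromise.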

Regarding apache-spark - Spark on K8s: Job proceeds although some executors are still pending, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70468725/
