apache-spark - Spark on K8s: Job proceeds although some executors are still pending

Tags: apache-spark, kubernetes

I'm using Spark 3.1.2 and have created a cluster with 4 executors, each with 15 cores.

My total partition count should therefore be 60, yet only 30 are allocated.

The job starts up as follows, requesting 4 executors:

21/12/23 23:51:11 DEBUG ExecutorPodsAllocator: Set total expected execs to {0=4}

A few minutes later, it is still waiting for them:

21/12/23 23:53:13 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 0 running, 4 unknown pending, 0 scheduler backend known pending, 0 unknown newly created, 0 scheduler backend known newly created.
21/12/23 23:53:13 DEBUG ExecutorPodsAllocator: Still waiting for 4 executors for ResourceProfile Id 0 before requesting more.

Finally, 2 come up:

21/12/23 23:53:14 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named io-getspectrum-data-acquisition-modelscoringprocessor-8b92877de9b4ab13-exec-1, action MODIFIED
21/12/23 23:53:14 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named io-getspectrum-data-acquisition-modelscoringprocessor-8b92877de9b4ab13-exec-3, action MODIFIED
21/12/23 23:53:15 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 2 running, 2 unknown pending, 0 scheduler backend known pending, 0 unknown newly created, 0 scheduler backend known newly created.

Then a third:

21/12/23 23:53:17 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named io-getspectrum-data-acquisition-modelscoringprocessor-8b92877de9b4ab13-exec-2, action MODIFIED
21/12/23 23:53:18 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 3 running, 1 unknown pending, 0 scheduler backend known pending, 0 unknown newly created, 0 scheduler backend known newly created.

...and then the job finally proceeds:

21/12/23 23:53:30 DEBUG KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Launching task 0 on executor id: 1 hostname: 10.128.35.137.
21/12/23 23:53:33 INFO MyProcessor: Calculated partitions are read 45 write 1

I don't understand why it suddenly decides to proceed once we have 3 executors instead of waiting for the 4th.

I have looked through the Spark and Spark-on-K8s configuration options and don't see any setting that would affect this behavior.

Why does it proceed when we have 3 executors?

Best Answer

According to the Spark docs, scheduling is controlled by these settings:

spark.scheduler.maxRegisteredResourcesWaitingTime
default=30s
Maximum amount of time to wait for resources to register before scheduling begins.

spark.scheduler.minRegisteredResourcesRatio
default=0.8 for KUBERNETES mode; 0.8 for YARN mode; 0.0 for standalone mode and Mesos coarse-grained mode
The minimum ratio of registered resources (registered resources / total expected resources) (resources are executors in yarn mode and Kubernetes mode, CPU cores in standalone mode and Mesos coarse-grained mode ['spark.cores.max' value is total expected resources for Mesos coarse-grained mode] ) to wait for before scheduling begins. Specified as a double between 0.0 and 1.0. Regardless of whether the minimum ratio of resources has been reached, the maximum amount of time it will wait before scheduling begins is controlled by config spark.scheduler.maxRegisteredResourcesWaitingTime.

In your case, it appears spark.scheduler.maxRegisteredResourcesWaitingTime was reached: with only 3 of 4 executors registered, the ratio was 3/4 = 0.75, below the 0.8 default for Kubernetes mode, so the scheduler waited out the 30s timeout and then began scheduling anyway.
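If you want the job to wait for all executors before scheduling begins, you can raise the ratio to 1.0 and extend the timeout to cover pod startup. A minimal sketch of the relevant spark-submit flags (the master URL and other settings here are placeholders, not from the question):

```shell
# Sketch: require all requested executors to register before scheduling starts.
# minRegisteredResourcesRatio=1.0 -> wait for 4/4 executors, not 0.8 (default on K8s).
# maxRegisteredResourcesWaitingTime=5min -> upper bound; scheduling starts anyway
# after this, so size it to your cluster's typical pod startup time.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.executor.instances=4 \
  --conf spark.executor.cores=15 \
  --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=5min \
  <your-application>
```

Note the trade-off: with a ratio of 1.0, a single slow or unschedulable pod delays the whole job up to the full waiting time, so the default of 0.8 is a deliberate compromise.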

Regarding apache-spark - Spark on K8s: Job proceeds although some executors are still pending, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70468725/
