java - Spark Master endlessly resubmits jobs after IOException following application/job submission on a worker node

Tags: java scala apache-spark pyspark

I am fairly new to the Spark world. In our application we have an embedded Spark standalone cluster (version 2.4.3) that receives jobs submitted by our main data-engine loader application through a spark-submit master URL.

We have 3 worker slave nodes on different VMs. The interesting part is that, after some IOException (which I am posting below in a very limited and redacted form to keep system internals private), the Master assumes it needs to resubmit the same job/application to the same worker over and over again (hundreds of thousands of times).
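For context, a minimal sketch of how such a job is submitted to a standalone master; the host, class, and jar names here are hypothetical placeholders, not our actual setup:

spark-submit \
  --master spark://spark-master-host:7077 \
  --class com.example.LoaderJob \
  /path/to/loader-job.jar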

The worker application/job log is identical for every job resubmission:

2020-04-28 11:31:15,466 INFO spark.SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users with view permissions: Set(app_prod); groups with view permissions: Set(); users with modify permissions: Set(app_prod); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$1(CoarseGrainedExecutorBackend.scala:201)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:64)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
... 4 more
Caused by: java.io.IOException: Failed to connect to load.box.ancestor.com/xx.xxx.xx.xxx:30xxx
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: load.box.ancestor.com/xx.xxx.xx.xxx:30xxx
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)

In the Master log below, it resubmits the same job over and over, even though, by all appearances, the worker job/application is exiting with EXIT(1).

Spark Master job log:

2020-04-28 11:30:49,750 INFO master.Master: Launching executor app-27789323082123-23782/11635 on worker worker-29990224233349-yy.yy.yyy.yyy-7078
2020-04-28 11:30:52,195 INFO master.Master: Removing executor app-27789323082123-23782/11635 because it is EXITED
2020-04-28 11:30:52,195 INFO master.Master: Launching executor app-27789323082123-23782/11636 on worker worker-29990224233349-yy.yy.yyy.yyy-7078
2020-04-28 11:30:54,651 INFO master.Master: Removing executor app-27789323082123-23782/11636 because it is EXITED
2020-04-28 11:30:54,651 INFO master.Master: Launching executor app-27789323082123-23782/11637 on worker worker-29990224233349-yy.yy.yyy.yyy-7078
2020-04-28 11:30:57,201 INFO master.Master: Removing executor app-27789323082123-23782/11637 because it is EXITED
2020-04-28 11:30:57,201 INFO master.Master: Launching executor app-27789323082123-23782/11638 on worker worker-29990224233349-yy.yy.yyy.yyy-7078
2020-04-28 11:30:59,769 INFO master.Master: Removing executor app-27789323082123-23782/11638 because it is EXITED
2020-04-28 11:30:59,769 INFO master.Master: Launching executor app-27789323082123-23782/11639 on worker worker-29990224233349-yy.yy.yyy.yyy-7078

My question is: we have not modified spark.deploy.maxExecutorRetries, so it should be at its default value of 10.

Is this error / repeated resubmission governed by that parameter, or is there another configuration we need to check, in case the Spark Master is failing to recognize the worker job failures?
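For reference, spark.deploy.maxExecutorRetries is a Master-side property; a sketch of how the default would look in conf/spark-defaults.conf on the master (the value shown is just the documented default, not something we set):

# conf/spark-defaults.conf on the Master; 10 is the documented default
spark.deploy.maxExecutorRetries   10

Note that the Spark standalone docs describe this as a limit on back-to-back executor failures, and state that an application is never removed while it still has any running executors, which may explain how the retry count can go far beyond 10.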

Best answer

Try setting the following configuration:

spark.task.maxFailures = 2
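If that suggestion fits your setup, one way to pass it in (a sketch, reusing the hypothetical submit command from above) is:

spark-submit \
  --master spark://spark-master-host:7077 \
  --conf spark.task.maxFailures=2 \
  --class com.example.LoaderJob \
  /path/to/loader-job.jar

Note that spark.task.maxFailures (default 4) governs how many times a single task may fail before the job is aborted, which is a different layer from the Master-side executor relaunching shown in the logs above.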

Regarding java - Spark Master endlessly resubmits jobs after IOException following application/job submission on a worker node, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61484866/
