docker - Running the Spark driver in a Docker container - no connection from executors to the driver?

Tags: docker apache-spark mesos apache-spark-standalone

Update: the problem is solved. The Docker image is here: docker-spark-submit

I am running spark-submit with a fat jar inside a Docker container. My standalone Spark cluster runs on 3 virtual machines - one master and two workers. From the executor logs on a worker machine, I can see that the executors are given the following driver URL:

"--driver-url" "spark://CoarseGrainedScheduler@172.17.0.2:5001"

172.17.0.2 is actually the address of the container running the driver, not the address of the host the container runs on. This IP is not reachable from the worker machines, so the workers cannot communicate with the driver. As I can see from the source code of StandaloneSchedulerBackend, it builds driverUrl from the spark.driver.host setting:

val driverUrl = RpcEndpointAddress(
  sc.conf.get("spark.driver.host"),
  sc.conf.get("spark.driver.port").toInt,
  CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString

It does not take the SPARK_PUBLIC_DNS environment variable into account - is that correct? Inside the container, I cannot set spark.driver.host to anything other than the container's "internal" IP address (172.17.0.2 in this example). When I try to set spark.driver.host to the host machine's IP address, I get errors like this:

WARN Utils: Service 'sparkDriver' could not bind on port 5001. Attempting port 5002.

I also tried setting spark.driver.bindAddress to the host machine's IP address, but got the same error. So, how do I configure Spark to use the host machine's IP address, rather than the Docker container's address, to communicate with the driver?
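
For illustration, the failing attempt looked roughly like the sketch below. The host IP 10.0.7.100 and the master address spark-master:7077 are hypothetical placeholders, not values from my setup. When spark.driver.bindAddress is unset, Spark binds the driver's RPC server to the spark.driver.host address; since the host's IP is not a network interface inside the container, the bind fails and Spark keeps retrying on successive ports:

# Hypothetical sketch of the failing attempt, run inside the container.
# 10.0.7.100 stands for the Docker host's IP, which is not an interface
# inside the container, so the driver cannot bind a listening socket to it:
spark-submit \
  --master spark://spark-master:7077 \
  --conf spark.driver.host=10.0.7.100 \
  --conf spark.driver.port=5001 \
  app.jar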

UPD: stack trace from the executor:

ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
    at scala.util.Try$.apply(Try.scala:192)
    at scala.util.Failure.recover(Try.scala:216)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.complete(Promise.scala:55)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
    at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
    at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
    at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
    ... 8 more

Best Answer

So the working configuration is (a command-line sketch follows after this list):

  • set spark.driver.host to the host machine's IP address
  • set spark.driver.bindAddress to the container's IP address
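
A minimal sketch of that setup, assuming hypothetical values: 10.0.7.100 for the Docker host's IP, 172.17.0.2 for the container's IP, spark-master:7077 for the standalone master, and a placeholder image name. The driver ports are pinned and published 1:1 so that executors connecting to the host's address actually reach the driver inside the container:

# Hypothetical sketch; adjust the addresses, ports and image name to your setup.
docker run -p 5001:5001 -p 5002:5002 my-spark-submit-image \
  spark-submit \
    --master spark://spark-master:7077 \
    --conf spark.driver.host=10.0.7.100 \
    --conf spark.driver.bindAddress=172.17.0.2 \
    --conf spark.driver.port=5001 \
    --conf spark.driver.blockManager.port=5002 \
    app.jar

With this split, the driver listens on the container's own interface but advertises the host's address to the master and executors.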

The working Docker image is here: docker-spark-submit.

Original question on Stack Overflow: https://stackoverflow.com/questions/45489248/
