amazon-web-services - Cannot connect to a Spark cluster in EC2 using pyspark

Tags: amazon-web-services amazon-ec2 apache-spark

I followed the instructions on the Spark website and have 1 master and 1 slave running in Amazon EC2. However, I cannot connect to the master node using pyspark.

I can SSH into the master node without any problems.

Here is my launch command:

spark-ec2 --key-pair=graph-cluster --identity-file=/Users/.ssh/pem.pem --region=us-east-1 --zone=us-east-1a launch graph-cluster

I can go to http://ec2-54-152-xx-xxx.compute-1.amazonaws.com:8080/ and see that Spark is up and running. It also shows the Spark master at:

spark://ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077

But when I run the command:

MASTER=spark://ec2-54-152-xx-xx.compute-1.amazonaws.com:7077 pyspark

I get this error:

2015-09-16 15:39:31,800 ERROR actor.OneForOneStrategy (Slf4jLogger.scala:apply$mcV$sp(66)) -
java.lang.NullPointerException
    at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2015-09-16 15:39:31,804 INFO  client.AppClient$ClientActor (Logging.scala:logInfo(59)) - Connecting to master akka.tcp://sparkMaster@ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077/user/Master...
2015-09-16 15:39:31,955 INFO  util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52333.
2015-09-16 15:39:31,956 INFO  netty.NettyBlockTransferService (Logging.scala:logInfo(59)) - Server created on 52333
2015-09-16 15:39:31,959 INFO  storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Trying to register BlockManager
2015-09-16 15:39:31,964 INFO  storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(59)) - Registering block manager xxx:52333 with 265.1 MB RAM, BlockManagerId(driver, xxx, 52333)
2015-09-16 15:39:31,969 INFO  storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Registered BlockManager
2015-09-16 15:39:32,458 ERROR spark.SparkContext (Logging.scala:logError(96)) - Error initializing SparkContext.
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
    at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503)
    at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
2015-09-16 15:39:32,460 INFO  spark.SparkContext (Logging.scala:logInfo(59)) - SparkContext already stopped.
Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 113, in __init__
    conf, jsc, profiler_cls)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 165, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 219, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
    at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503)
    at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Best Answer

spark-ec2 does not open port 7077 on the master node to incoming connections from outside the cluster.

You can verify this in the AWS Console: under EC2 / Network & Security / Security Groups, check the Inbound tab of the graph-cluster-master security group.

You could add a rule there to allow inbound connections to port 7077.
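The same rule can also be added programmatically. A minimal sketch using boto3, assuming AWS credentials are configured and the group can be addressed by name (as with the groups spark-ec2 creates); the CIDR is a placeholder for your own IP, not a value from the original answer:

# a sketch, not from the original answer; group name is the one spark-ec2 creates
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupName="graph-cluster-master",  # master's security group from the answer
    IpProtocol="tcp",
    FromPort=7077,
    ToPort=7077,
    CidrIp="203.0.113.10/32",          # replace with your machine's public IP
)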

However, it is recommended to run pyspark (which is essentially Spark's application driver) from a machine inside the EC2 cluster, and to avoid running the driver outside the network. The reasons: increased latency, and firewall configuration problems, since you would need to open several ports so that the executors can connect back to the driver on your machine.
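For reference, if you nevertheless run the driver outside the cluster, its callback ports must be fixed and opened in your firewall. A hedged sketch of the relevant Spark 1.x settings; the host and port values below are placeholders, not from the original post:

# a sketch only; every port set here must also be reachable from the cluster
from pyspark import SparkConf

conf = (SparkConf()
        .setMaster("spark://ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077")
        .set("spark.driver.host", "your.public.ip")   # placeholder public IP
        .set("spark.driver.port", "50001")            # executors -> driver RPC
        .set("spark.blockManager.port", "50002")      # block transfers
        .set("spark.fileserver.port", "50003"))       # file/jar serving (Spark 1.x)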

So the right way is to SSH into the cluster with this command:

spark-ec2 --key-pair=graph-cluster --identity-file=/Users/.ssh/pem.pem --region=us-east-1 --zone=us-east-1a login graph-cluster

and run the commands from the master itself:

cd spark
bin/pyspark
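Once the shell comes up, a quick smoke test (a hypothetical example, not from the original answer) confirms the executors are reachable:

# run inside the pyspark shell, where `sc` has already been created
data = sc.parallelize(range(100))
print(data.sum())  # prints 4950 once the executor on the slave responds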

You will need to transfer the relevant files (your scripts and data) to the master. I usually keep the data on S3 and edit script files with vim, or start an IPython notebook.
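For instance, a sketch of reading a dataset from S3 inside the shell; the bucket and key are placeholders, and the s3n:// scheme assumes AWS credentials are available in the Hadoop configuration, as was usual on spark-ec2 clusters of that era:

# hypothetical paths -- replace with your own bucket and key
lines = sc.textFile("s3n://my-bucket/data/input.txt")
print(lines.count())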

By the way, the latter is quite easy: in the EC2 console, add an inbound rule to the master host's security group allowing connections from your machine's IP to port 18888. Then run this command on the cluster:

IPYTHON_OPTS="notebook --pylab inline --port=18888 --ip='*'" pyspark

Then you can access it at http://ec2-54-152-xx-xxx.compute-1.amazonaws.com:18888/

Regarding "amazon-web-services - Cannot connect to a Spark cluster in EC2 using pyspark", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32617321/
