python - Where do output messages go when a Python file is submitted to Spark with spark-submit?

Tags: python amazon-web-services apache-spark

I am trying to submit my Python application to a cluster (a 3-machine cluster on AWS EMR) using the spark-submit command.



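from pyspark import SparkContext

if __name__ == "__main__":
    # Create the Spark context for this application.
    sc = SparkContext(appName="sparkSubmitTest")

    # Print from the driver process; no RDD operations are involved.
    for item in range(50):
        print("I love this game!")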



./spark/bin/spark-submit --master yarn-cluster ./


[hadoop@ip-172-31-34-124 ~]$ ./spark/bin/spark-submit --master yarn-cluster ./
15/08/04 23:50:25 INFO client.RMProxy: Connecting to ResourceManager at /
15/08/04 23:50:25 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
15/08/04 23:50:25 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
15/08/04 23:50:25 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/08/04 23:50:25 INFO yarn.Client: Setting up container launch context for our AM
15/08/04 23:50:25 INFO yarn.Client: Preparing resources for our AM container
15/08/04 23:50:25 INFO yarn.Client: Uploading resource file:/home/hadoop/.versions/spark-1.3.1.e/lib/spark-assembly-1.3.1-hadoop2.4.0.jar -> hdfs://
15/08/04 23:50:26 INFO metrics.MetricsSaver: MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false maxMemoryMb: 3072 maxInstanceCount: 500 
15/08/04 23:50:26 INFO metrics.MetricsSaver: Created MetricsSaver j-2LU0EQ3JH58CK:i-048c1ded:SparkSubmit:24928 period:60 /mnt/var/em/raw/i-048c1ded_20150804_SparkSubmit_24928_raw.bin
15/08/04 23:50:27 INFO metrics.MetricsSaver: 1 aggregated HDFSWriteDelay 1053 raw values into 1 aggregated values, total 1
15/08/04 23:50:27 INFO yarn.Client: Uploading resource file:/home/hadoop/ -> hdfs://
15/08/04 23:50:27 INFO yarn.Client: Setting up the launch environment for our AM container
15/08/04 23:50:27 INFO spark.SecurityManager: Changing view acls to: hadoop
15/08/04 23:50:27 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/08/04 23:50:27 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/08/04 23:50:27 INFO yarn.Client: Submitting application 7 to ResourceManager
15/08/04 23:50:27 INFO impl.YarnClientImpl: Submitted application application_1438724051797_0007
15/08/04 23:50:28 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:28 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1438732227551
     final status: UNDEFINED
     tracking URL:
     user: hadoop
15/08/04 23:50:29 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:30 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:31 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:32 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:33 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:34 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:34 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: ip-172-31-39-205.ec2.internal
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1438732227551
     final status: UNDEFINED
     tracking URL:
     user: hadoop
15/08/04 23:50:35 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:36 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:37 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:38 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:39 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:40 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:41 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:42 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:43 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:44 INFO yarn.Client: Application report for application_1438724051797_0007 (state: FINISHED)
15/08/04 23:50:44 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: ip-172-31-39-205.ec2.internal
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1438732227551
     final status: SUCCEEDED
     tracking URL:
     user: hadoop
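
Nothing in the client output above comes from the script itself. With --master yarn-cluster the driver runs inside the ApplicationMaster container on the cluster, so the submitting shell only shows YARN status reports. The two facts worth pulling out of that output are the application ID and the final status. A minimal sketch for extracting them follows; the function name and the regular expressions are illustrative assumptions based on the log lines shown here, not part of Spark.

```python
import re

# Patterns inferred from the client output above (assumption: other Spark
# versions may word these lines differently).
REPORT_RE = re.compile(r"Application report for (application_\d+_\d+) \(state: (\w+)\)")
FINAL_STATUS_RE = re.compile(r"final status:\s*(\w+)")


def summarize_client_log(text):
    """Return (application_id, last_state, last_final_status) from client output."""
    app_id, state, final_status = None, None, None
    for line in text.splitlines():
        report = REPORT_RE.search(line)
        if report:
            app_id, state = report.group(1), report.group(2)
            continue
        status = FINAL_STATUS_RE.search(line)
        if status:
            final_status = status.group(1)
    return app_id, state, final_status


sample = """\
15/08/04 23:50:43 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:44 INFO yarn.Client: Application report for application_1438724051797_0007 (state: FINISHED)
     final status: SUCCEEDED
"""
print(summarize_client_log(sample))
# ('application_1438724051797_0007', 'FINISHED', 'SUCCEEDED')
```

The application ID is what you need later to locate the container logs on the cluster.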



I first tried: yarn logs -applicationId applicationid_xxxx, but was told "Log aggregation has not completed or is not enabled." The following steps worked instead:


1. Follow the link at the end of the execution output (this requires a reverse SSH tunnel and a proxy to be set up).
2. On the application overview page, find the AppMaster node ID: ip-172-31-41-6.ec2.internal:9035
3. Go back to the AWS EMR cluster list and find the public DNS name for this ID.
4. SSH from the driver node into this AppMaster node, using the same key pair.
5. cd /var/log/hadoop/userlogs/application_1438796304215_0005/container_1438796304215_0005_01_000001 (always choose the first container).
6. cat stdout
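
The directory in step 5 can be derived from the application ID instead of being typed by hand. The sketch below assumes the naming pattern seen in that example path (attempt 01, container 000001) and the /var/log/hadoop/userlogs root used on this cluster; the helper name is hypothetical and other clusters may use a different log root.

```python
import posixpath

# Assumption: the NodeManager log root from step 5 above; it is configurable
# and may differ on other clusters.
USERLOGS_ROOT = "/var/log/hadoop/userlogs"


def first_container_dir(app_id, attempt=1, container=1, root=USERLOGS_ROOT):
    """Build a container log directory from a YARN application ID.

    The naming pattern is inferred from the example path in step 5.
    """
    prefix = "application_"
    if not app_id.startswith(prefix):
        raise ValueError("expected an ID like application_<timestamp>_<seq>")
    suffix = app_id[len(prefix):]  # e.g. "1438796304215_0005"
    container_id = "container_%s_%02d_%06d" % (suffix, attempt, container)
    return posixpath.join(root, app_id, container_id)


log_dir = first_container_dir("application_1438796304215_0005")
print(posixpath.join(log_dir, "stdout"))
# /var/log/hadoop/userlogs/application_1438796304215_0005/container_1438796304215_0005_01_000001/stdout
```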

As you can see, this is quite convoluted. Writing the output to a file hosted in S3 would probably be better.
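
One way to do that, sketched under assumptions: collect the messages in a list on the driver and write them out with the RDD method saveAsTextFile. The helper names, the bucket, and the prefix are placeholders, and the s3:// scheme assumes the cluster's Hadoop configuration can write to S3, as EMR clusters normally can.

```python
def build_messages(count=50):
    """Collect the messages in a list instead of printing them."""
    return ["I love this game!" for _ in range(count)]


def save_messages(sc, messages, output_path):
    """Write driver-side messages under `output_path` as text.

    `sc` is an existing SparkContext. `output_path` names a directory that
    must not exist yet, because saveAsTextFile refuses to overwrite.
    """
    # A single partition keeps all lines together in one part file.
    rdd = sc.parallelize(messages, 1)
    rdd.saveAsTextFile(output_path)


# Usage inside the submitted script (bucket and prefix are placeholders):
#
#     sc = SparkContext(appName="sparkSubmitTest")
#     save_messages(sc, build_messages(), "s3://my-bucket/spark-submit-test/output")
```

After the job finishes, the lines should be readable from the part file under that prefix without logging in to any cluster node.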
