python - Where do the output messages go when submitting a Python file to Spark using spark-submit

Tags: python amazon-web-services apache-spark

I am trying to submit my Python application to a cluster (a 3-machine cluster on AWS EMR) using the spark-submit command.

Surprisingly, I cannot see any of the task's expected output. I then simplified the application so that it only prints a few fixed strings, but I still see none of the printed messages. The application and the command are attached below. Hopefully someone can help me find the cause. Many thanks!

submit-test.py:

import sys

from pyspark import SparkContext

if __name__ == "__main__":

    sc = SparkContext(appName="sparkSubmitTest")

    for item in range(50):
        print("I love this game!")

    sc.stop()

The command I used is:

./spark/bin/spark-submit --master yarn-cluster ./submit-test.py

The output I get is as follows:

[hadoop@ip-172-31-34-124 ~]$ ./spark/bin/spark-submit --master yarn-cluster ./submit-test.py
15/08/04 23:50:25 INFO client.RMProxy: Connecting to ResourceManager at /172.31.34.124:9022
15/08/04 23:50:25 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
15/08/04 23:50:25 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
15/08/04 23:50:25 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/08/04 23:50:25 INFO yarn.Client: Setting up container launch context for our AM
15/08/04 23:50:25 INFO yarn.Client: Preparing resources for our AM container
15/08/04 23:50:25 INFO yarn.Client: Uploading resource file:/home/hadoop/.versions/spark-1.3.1.e/lib/spark-assembly-1.3.1-hadoop2.4.0.jar -> hdfs://172.31.34.124:9000/user/hadoop/.sparkStaging/application_1438724051797_0007/spark-assembly-1.3.1-hadoop2.4.0.jar
15/08/04 23:50:26 INFO metrics.MetricsSaver: MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false maxMemoryMb: 3072 maxInstanceCount: 500 
15/08/04 23:50:26 INFO metrics.MetricsSaver: Created MetricsSaver j-2LU0EQ3JH58CK:i-048c1ded:SparkSubmit:24928 period:60 /mnt/var/em/raw/i-048c1ded_20150804_SparkSubmit_24928_raw.bin
15/08/04 23:50:27 INFO metrics.MetricsSaver: 1 aggregated HDFSWriteDelay 1053 raw values into 1 aggregated values, total 1
15/08/04 23:50:27 INFO yarn.Client: Uploading resource file:/home/hadoop/submit-test.py -> hdfs://172.31.34.124:9000/user/hadoop/.sparkStaging/application_1438724051797_0007/submit-test.py
15/08/04 23:50:27 INFO yarn.Client: Setting up the launch environment for our AM container
15/08/04 23:50:27 INFO spark.SecurityManager: Changing view acls to: hadoop
15/08/04 23:50:27 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/08/04 23:50:27 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/08/04 23:50:27 INFO yarn.Client: Submitting application 7 to ResourceManager
15/08/04 23:50:27 INFO impl.YarnClientImpl: Submitted application application_1438724051797_0007
15/08/04 23:50:28 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:28 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1438732227551
     final status: UNDEFINED
     tracking URL: http://172.31.34.124:9046/proxy/application_1438724051797_0007/
     user: hadoop
15/08/04 23:50:29 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:30 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:31 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:32 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:33 INFO yarn.Client: Application report for application_1438724051797_0007 (state: ACCEPTED)
15/08/04 23:50:34 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:34 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: ip-172-31-39-205.ec2.internal
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1438732227551
     final status: UNDEFINED
     tracking URL: http://172.31.34.124:9046/proxy/application_1438724051797_0007/
     user: hadoop
15/08/04 23:50:35 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:36 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:37 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:38 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:39 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:40 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:41 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:42 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:43 INFO yarn.Client: Application report for application_1438724051797_0007 (state: RUNNING)
15/08/04 23:50:44 INFO yarn.Client: Application report for application_1438724051797_0007 (state: FINISHED)
15/08/04 23:50:44 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: ip-172-31-39-205.ec2.internal
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1438732227551
     final status: SUCCEEDED
     tracking URL: http://172.31.34.124:9046/proxy/application_1438724051797_0007/A
     user: hadoop

Best Answer

Posting my answer here, since I did not find it anywhere else.

My first attempt was yarn logs -applicationId application_xxxx, which returned "Log aggregation has not completed or is not enabled".
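If log aggregation is enabled on the cluster (yarn.log-aggregation-enable set to true), that same command does work once the application finishes; a sketch, using the application ID from the run above:

```shell
# With YARN log aggregation enabled, pull every container's logs --
# including the driver's stdout in yarn-cluster mode -- back to the
# submitting host after the application has finished:
yarn logs -applicationId application_1438724051797_0007
```

The driver's print output then appears in the stdout section of the first container (the ApplicationMaster container).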

Here are the steps to dig out the printed messages:

1. Follow the tracking URL printed at the end of the run, http://172.31.34.124:9046/proxy/application_1438724051797_0007/A (a reverse SSH tunnel and proxy need to be set up to reach it).
2. On the application overview page, find the AppMaster node ID: ip-172-31-41-6.ec2.internal:9035.
3. Go back to the AWS EMR cluster list and find the public DNS for this node ID.
4. SSH from the driver node into this AppMaster node, using the same key pair.
5. cd /var/log/hadoop/userlogs/application_1438796304215_0005/container_1438796304215_0005_01_000001 (always choose the first container).
6. cat stdout
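Condensed into commands, steps 4 through 6 look roughly like this (the key pair file name is hypothetical; the hostname and application ID are the ones from the run above and will differ on another cluster):

```shell
# Step 4: SSH from the driver (master) node into the AppMaster node
# found on the application overview page. The .pem name is hypothetical.
ssh -i ~/my-key-pair.pem hadoop@ip-172-31-41-6.ec2.internal

# Steps 5-6: on the AppMaster node, the driver's stdout lives in the
# log directory of the first container.
cd /var/log/hadoop/userlogs/application_1438796304215_0005/container_1438796304215_0005_01_000001
cat stdout
```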

As you can see, this is quite convoluted. It would probably be better to write the output to a file hosted on S3.
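A sketch of that approach, assuming the EMR nodes can write to an S3 bucket you own (the bucket name below is hypothetical):

```python
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="sparkSubmitTest")

    # In yarn-cluster mode, print() lands in a container log on some
    # cluster node; writing an RDD out to S3 keeps the output reachable.
    messages = ["I love this game!" for _ in range(50)]
    sc.parallelize(messages).saveAsTextFile("s3://my-bucket/spark-output")  # hypothetical bucket

    sc.stop()
```

Depending on which Hadoop S3 filesystem the cluster is configured with, the URI scheme may need to be s3n:// instead of s3://.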

Regarding python - where do the output messages go when submitting a Python file to Spark using spark-submit, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31821410/
