I am trying to submit a basic Spark action through an Oozie workflow, to be executed on the Hadoop cluster on YARN, but I get the following error (from the YARN application logs):
>>> Invoking Spark class now >>>
python: can't open file '/absolute/local/path/to/script.py': [Errno 2] No such file or directory
Hadoop Job IDs executed by Spark:
Intercepting System.exit(2)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [2]
But I am sure the file is there. In fact, when I run the following command:
spark-submit --master yarn --deploy-mode client /absolute/local/path/to/script.py arg1 arg2
it works, and I get the output I want.
Note: I set everything up by following this article (I am using Spark2):
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_spark-component-guide/content/ch_oozie-spark-action.html
Any ideas?
workflow.xml (simplified for clarity)
<action name="action1">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>${sparkMaster}</master>
<mode>${sparkMode}</mode>
<name>action1</name>
<jar>${integrate_script}</jar>
<arg>arg1</arg>
<arg>arg2</arg>
</spark>
<ok to="end"/>
<error to="kill_job"/>
</action>
job.properties (simplified for clarity)
oozie.wf.application.path=${nameNode}/user/${user.name}/${user.name}/${zone}
oozie.use.system.libpath=true
nameNode=hdfs://myNameNode:8020
jobTracker=myJobTracker:8050
oozie.action.sharelib.for.spark=spark2
sparkMaster=yarn
sparkMode=client
integrate_script=/absolute/local/path/to/script.py
zone=somethingUsefulForMe
Exception when running in CLUSTER mode:
diagnostics: Application application_1502381591395_1000 failed 2 times due to AM Container for appattempt_1502381591395_1000_000002 exited with exitCode: -1000
For more detailed output, check the application tracking page: http://hostname:port/cluster/app/application_1502381591395_1000 Then click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://hostname:port/user/oozie/.sparkStaging/application_1502381591395_1000/__spark_conf__.zip
java.io.FileNotFoundException: File does not exist: hdfs://hostname:port/user/oozie/.sparkStaging/application_1502381591395_1000/__spark_conf__.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1427)
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
EDIT2:
I just tried it from the shell, and it fails because of the imports. My layout:
/scripts/functions/tools.py
/scripts/functions/__init__.py
/scripts/myScript.py
from functions.tools import *
That is where it fails. I assume the script is first copied to the cluster and run there. How do I get all the required modules shipped along with it? Modify the PYTHONPATH on HDFS? I understand why it is not working; I just don't know how to fix it.
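One common way to ship a helper package with a PySpark job (a sketch; the `lib` directory and the HDFS paths below are assumptions, not taken from the question) is to zip the `functions` package and pass it with `--py-files`, which Spark distributes to the workers and adds to their PYTHONPATH:

```shell
# Zip the helper package so Spark can distribute it to the executors.
# Paths are hypothetical; adjust to your layout.
cd /scripts
zip -r functions.zip functions/

# Put the archive on HDFS so any node can fetch it.
hdfs dfs -mkdir -p /user/myUser/lib
hdfs dfs -put -f functions.zip /user/myUser/lib/functions.zip

# Reference it at submit time; `from functions.tools import *` then
# resolves on the executors as well as the driver.
spark-submit --master yarn --deploy-mode cluster \
  --py-files hdfs:///user/myUser/lib/functions.zip \
  hdfs:///user/myUser/scripts/myScript.py arg1 arg2
```

In the Oozie Spark action, the equivalent would be a `<spark-opts>--py-files hdfs:///user/myUser/lib/functions.zip</spark-opts>` element inside `<spark>`.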
EDIT3:
See the stack trace below. Most comments online say the problem is the Python code setting the master to "local". That is not the case here. What's more, I even removed everything Spark-related from the Python script and still hit the same problem.
Diagnostics: File does not exist: hdfs://hdfs/path/user/myUser/.sparkStaging/application_1502381591395_1783/pyspark.zip
java.io.FileNotFoundException: File does not exist: hdfs://hdfs/path/user/myUser/.sparkStaging/application_1502381591395_1783/pyspark.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1427)
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Best answer
If you want to invoke the script with Oozie, it needs to be on HDFS (since you never know which node will run the launcher).
Once it is on HDFS, you need to tell spark-submit explicitly to fetch it from the remote file system, so in job.properties set:
integrate_script=hdfs:///absolute/hdfs/path/to/script.py
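Concretely, that means uploading the script first (the target directory below is an assumption for illustration):

```shell
# The Oozie launcher can run on any node, so a local path is only
# valid on the machine you happened to test spark-submit from.
# Upload the script to HDFS and point integrate_script at it.
hdfs dfs -mkdir -p /user/myUser/scripts
hdfs dfs -put -f /absolute/local/path/to/script.py /user/myUser/scripts/script.py
```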
Regarding "hadoop - oozie fails on Pyspark action submission: '[Errno 2] No such file or directory'", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46008216/