hadoop - Oozie fails on Pyspark action submission: '[Errno 2] No such file or directory'

Tags: hadoop apache-spark pyspark hadoop-yarn oozie

I am trying to submit a basic Spark action through an Oozie workflow, to be executed on my Hadoop cluster on YARN, but I am getting the following error (from the YARN application logs):

>>> Invoking Spark class now >>>

python: can't open file '/absolute/local/path/to/script.py': [Errno 2] No such file or directory
Hadoop Job IDs executed by Spark:

Intercepting System.exit(2)

<<< Invocation of Main class completed <<<

Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [2]

But I am certain the file is there. In fact, when I run the following command:
spark-submit --master yarn --deploy-mode client /absolute/local/path/to/script.py arg1 arg2

it works, and I get the output I expect.

Note: I set everything up following this article (I am using Spark2):
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_spark-component-guide/content/ch_oozie-spark-action.html
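(Side note: one way to verify that the spark2 sharelib described there is actually registered is the Oozie admin CLI; the server URL below is a placeholder:)

oozie admin -oozie http://myOozieHost:11000/oozie -shareliblist spark2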

Any ideas?

workflow.xml (simplified for clarity)
<action name="action1">
  <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <master>${sparkMaster}</master>
      <mode>${sparkMode}</mode>
      <name>action1</name>
      <jar>${integrate_script}</jar>
      <arg>arg1</arg>
      <arg>arg2</arg>
  </spark>

  <ok to="end"/>
  <error to="kill_job"/>
</action>

job.properties (simplified for clarity)
oozie.wf.application.path=${nameNode}/user/${user.name}/${user.name}/${zone}
oozie.use.system.libpath=true
nameNode=hdfs://myNameNode:8020
jobTracker=myJobTracker:8050
oozie.action.sharelib.for.spark=spark2
sparkMaster=yarn
sparkMode=client
integrate_script=/absolute/local/path/to/script.py
zone=somethingUsefulForMe

EDIT: Exception when running in CLUSTER mode:
diagnostics: Application application_1502381591395_1000 failed 2 times due to AM Container for appattempt_1502381591395_1000_000002 exited with  exitCode: -1000
For more detailed output, check the application tracking page: http://hostname:port/cluster/app/application_1502381591395_1000 Then click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://hostname:port/user/oozie/.sparkStaging/application_1502381591395_1000/__spark_conf__.zip
java.io.FileNotFoundException: File does not exist: hdfs://hostname:port/user/oozie/.sparkStaging/application_1502381591395_1000/__spark_conf__.zip
    at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1427)
    at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

EDIT2:

I just tried it from the shell, and it fails because of the imports. The layout is:
/scripts/functions/tools.py
/scripts/functions/__init__.py
/scripts/myScript.py

from functions.tools import *

This is where it fails. I assume the script is first copied to the cluster and run there. How do I get all the required modules shipped along with it? Modify the PYTHONPATH on HDFS? I understand why it isn't working; I just don't know how to fix it.
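(For reference, a common pattern for shipping local modules alongside a PySpark job, not confirmed as the fix in this thread, is to zip the package and pass it with --py-files; a minimal sketch, assuming the layout above:

# build an archive whose root contains the functions/ package
cd /scripts
zip -r functions.zip functions/
# ship the archive so workers can resolve "from functions.tools import *"
spark-submit --master yarn --deploy-mode client --py-files functions.zip myScript.py arg1 arg2

In the Oozie Spark action, the same flag can go into a <spark-opts> element, pointing at a copy of the zip on HDFS.)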

EDIT3:

See the stack trace below. Most of the comments online say the problem is the Python code setting the master to "local". That is not the case here. What's more, I even removed everything Spark-related from the Python script, and I still hit the same problem.
Diagnostics: File does not exist: hdfs://hdfs/path/user/myUser/.sparkStaging/application_1502381591395_1783/pyspark.zip
java.io.FileNotFoundException: File does not exist: hdfs://hdfs/path/user/myUser/.sparkStaging/application_1502381591395_1783/pyspark.zip
    at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1427)
    at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Best answer

If you want to invoke the script with Oozie, it needs to sit on HDFS (since you never know which node will run the launcher).

Once it is on HDFS, you need to tell spark-submit explicitly to fetch it from the remote filesystem, so in job.properties set:

integrate_script=hdfs:///absolute/hdfs/path/to/script.py
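A minimal sketch of getting it there in the first place (the HDFS target directory is just an example):

# copy the local script to HDFS; -f overwrites any existing copy
hdfs dfs -mkdir -p /absolute/hdfs/path/to
hdfs dfs -put -f /absolute/local/path/to/script.py /absolute/hdfs/path/to/script.py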
