apache-spark - Running a Spark job on a YARN cluster with additional files

I am writing a simple Spark application that takes an input RDD, pipes it to an external script, and writes the script's output to a file. The driver code looks like this:

val input = args(0)
val scriptPath = args(1)
val output = args(2)
val sc = getSparkContext
if (args.length == 4) {
  //Here I pass an additional argument which contains an absolute path to a script on my local machine, only for local testing
  sc.addFile(args(3))
}

sc.textFile(input).pipe(Seq("python2", SparkFiles.get(scriptPath))).saveAsTextFile(output)

It works fine when I run it on my local machine. But when I submit it to a YARN cluster with

spark-submit --master yarn --deploy-mode cluster --files /absolute/path/to/local/test.py --class somepackage.PythonLauncher path/to/driver.jar path/to/input/part-* test.py path/to/output

it fails with an exception:

Lost task 1.0 in stage 0.0 (TID 1, rwds2.1dmp.ru): java.lang.Exception: Subprocess exited with status 2

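I tried different variations of the pipe command. For example, .pipe("cat") works fine and behaves as expected, but .pipe(Seq("cat", scriptPath)) also fails, with exit code 1, so it seems that Spark cannot resolve the path to the script on the cluster nodes.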

Any suggestions?

Best answer

I don't use Python myself, but I found some clues that may be useful to you (in the source code of SparkSubmitArguments in Spark 1.3):

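--py-files PY_FILES, comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.

--files FILES, comma-separated list of files to be placed in the working directory of each executor.

--archives ARCHIVES, comma-separated list of archives to be extracted into the working directory of each executor.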

Also, your arguments to spark-submit should follow this form:
Usage: spark-submit [options] <app jar | python file> [app arguments]
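
Since --files is described as placing files in each executor's working directory, one thing to try is referring to the script by its bare file name relative to that directory, rather than through SparkFiles.get, which is evaluated on the driver and may return a path that does not exist on the executors. The following is only a sketch under that assumption; the pipeCommand helper and the application name are hypothetical and not part of the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PythonLauncher {
  // Hypothetical helper: builds the pipe command from the bare file name.
  // Assumes the file passed via --files ends up in the working directory
  // of every executor, as the option description above states.
  def pipeCommand(scriptName: String): Seq[String] =
    Seq("python2", "./" + scriptName)

  def main(args: Array[String]): Unit = {
    // Expects exactly: <input> <script file name> <output>
    val Array(input, scriptName, output) = args.take(3)
    val sc = new SparkContext(new SparkConf().setAppName("PythonLauncher"))
    try {
      // Each partition is streamed to the script's stdin, and the
      // script's stdout lines become the elements of the resulting RDD.
      sc.textFile(input).pipe(pipeCommand(scriptName)).saveAsTextFile(output)
    } finally {
      sc.stop()
    }
  }
}
```

With this variant the spark-submit command from the question would stay the same: --files still carries the absolute local path, and the application argument is just test.py.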

Regarding "apache-spark - Running a Spark job on a YARN cluster with additional files", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30047760/
