apache-spark - Using wildcards in spark.jars when submitting an Apache Spark job

Tags: apache-spark

I have a set of JARs, stored on HDFS, that I want to make available to my Spark job.

The Spark 2.3 documentation says spark.jars is the parameter for this:
spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
However, setting spark.jars to hdfs:///path/to/my/libs/*.jar fails: the driver starts fine and launches a stage, but the tasks then die with:
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, xxxx, executor 1): java.io.FileNotFoundException: File hdfs:/path/to/my/libs/*.jar does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:901)
    at org.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:724)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:692)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:472)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:755)
    ...
That is, the glob does not appear to be expanded by the time the path is fetched on the executors.

Explicitly setting spark.jars to hdfs:///path/to/my/libs/libA.jar,hdfs:///path/to/my/libs/libB.jar does work.

How can I use globs in spark.jars, as the documentation says is possible?
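Since the explicit comma-separated list is reported to work, one possible workaround is to expand the glob yourself before submitting. The following is only a minimal sketch of that idea, not a fix for the glob handling itself. The helper name join_jars is hypothetical, and the commented usage assumes an hdfs client whose ls supports the -C option (print paths only); the path is the placeholder from the question.

```shell
#!/bin/sh
# Join jar paths read from stdin (one per line) into a single
# comma-separated string. Lines that do not end in .jar are skipped.
# (join_jars is a hypothetical helper name, not part of Spark or Hadoop.)
join_jars() {
    grep '\.jar$' | paste -sd, -
}

# Hypothetical usage, assuming `hdfs dfs -ls -C` is available on the
# submitting machine. The pattern is quoted so that HDFS, not the local
# shell, expands it:
#   JARS="$(hdfs dfs -ls -C 'hdfs:///path/to/my/libs/*.jar' | join_jars)"
#   spark-submit --conf "spark.jars=${JARS}" ...
```

With this approach Spark only ever sees fully resolved paths, so the executors never have to interpret a wildcard.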

Best answer

I run all my Spark batch and streaming applications from the local file system. I am not sure why you need to store the JARs in HDFS.

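However, if you prefer to keep your JARs on the local file system, you can use a wildcard as shown below. Note that here the wildcard is expanded by the shell before spark-submit runs, so if it matches more than one file, spark-submit treats the first match as the application JAR and the remaining matches as application arguments: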

export BASE_DIR="/local/file/path/where/jar/is/available"

spark-submit \
    --class ${class} \
    --deploy-mode cluster \
    --master yarn \
...
...
...
    --name ${APPLICATION_NAME} \
    ${BASE_DIR}/*.jar

Hope this helps.
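If several local JARs need to be passed as dependencies rather than as the application JAR, the shell-expanded wildcard can be turned into the comma-separated form that spark-submit expects. This is a minimal sketch under that assumption; local_jars_csv is a hypothetical helper name, and BASE_DIR and app.jar in the commented usage are placeholders.

```shell
#!/bin/sh
# Print the .jar files in a local directory as one comma-separated string.
# The wildcard is expanded here by the shell, not by Spark.
# (local_jars_csv is a hypothetical helper name.)
local_jars_csv() {
    # Replace the function's positional parameters with the glob matches.
    set -- "$1"/*.jar
    # An unmatched pattern stays literal, so check that the first entry exists.
    [ -e "$1" ] || return 1
    # "$*" joins the parameters with the first character of IFS.
    ( IFS=,; printf '%s\n' "$*" )
}

# Hypothetical usage (BASE_DIR and app.jar are placeholders):
#   spark-submit --jars "$(local_jars_csv "${BASE_DIR}")" ... app.jar
```

The function returns a non-zero status when the directory contains no JARs, so a caller can stop before submitting with an empty list.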

For "apache-spark - Using wildcards in spark.jars when submitting an Apache Spark job", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/50220378/
