docker - How do I pass --driver-class-path to spark-submit in cluster mode?

Tags: docker apache-spark hadoop pyspark kerberos

OK, with the pyspark shell and --driver-class-path, I can access Hive resources from the Docker image:

$ pyspark --driver-class-path /etc/spark2/conf:/etc/hive/conf
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
Using Python version 3.7.4 (default, Aug 13 2019 20:35:49)
SparkSession available as 'spark'.
>>> from pyspark.sql import SparkSession
>>>
>>> # declaration
>>> appName = "test_hive_minimal"
>>> master = "yarn"
>>>
>>> sc = SparkSession.builder \
...     .appName(appName) \
...     .master(master) \
...     .enableHiveSupport() \
...     .config("spark.hadoop.hive.enforce.bucketing", "True") \
...     .config("spark.hadoop.hive.support.quoted.identifiers", "none") \
...     .config("hive.exec.dynamic.partition", "True") \
...     .config("hive.exec.dynamic.partition.mode", "nonstrict") \
...     .getOrCreate()
>>> sql = "show tables in user_tables"
>>> df_new = sc.sql(sql)
20/08/20 15:08:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> df_new.show()
+-----------+--------------------+-----------+
|   database|           tableName|isTemporary|
+-----------+--------------------+-----------+
|user_tables|              dummyt|      false|
|user_tables|abcdefg...dummytable|      false|
But when I run the same script through spark-submit as follows, I get the error below:
spark-submit --master local --deploy-mode cluster --name test_hive --executor-memory 2g --num-executors 1 -- test_hive_minimal.py --verbose

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/opt/conda/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/conda/lib/python3.7/site-packages/pyspark/sql/utils.py", line 71, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Database 'user_tables' not found;"
test_hive_minimal.py is a simple script that checks a Hive database:
from pyspark.sql import SparkSession

appName = "test_hive_minimal"
master = "yarn"
# Creating Spark session
sc = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .config("spark.hadoop.hive.enforce.bucketing", "True") \
    .config("spark.hadoop.hive.support.quoted.identifiers", "none") \
    .config("hive.exec.dynamic.partition", "True") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .getOrCreate()

sql = "show tables in user_tables"
df_new = sc.sql(sql)
df_new.show()
sc.stop()
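I have tried several approaches: passing hive.metastore.uris and spark.sql.warehouse.dir, and passing the XML files via --files. Somehow my executors cannot access this configuration. Can anyone help?
Update:
I managed to pass hive-site.xml via --files to spark-submit in cluster mode, and the logs show that it no longer creates a local derby.db for the metastore. However, I am now facing another problem, shown below: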
20/08/21 09:59:29 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/08/21 09:59:31 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
20/08/21 09:59:31 INFO hive.metastore: Trying to connect to metastore with URI thrift://cluster01.cdh.com:9083
20/08/21 09:59:32 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
        at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
        at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
        at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
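This looks like a Kerberos problem, but I already have a valid Kerberos token, and I can access HDFS from the terminal and also through spark-shell in Docker. What needs to be done here? Isn't this set up automatically by YARN when submitting to the cluster?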

Best answer

I think you should pass a keytab in the spark-submit command. Is this code being run over SSH?
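
A minimal sketch of what passing a keytab could look like, written in Python so the argument list can be inspected before anything is launched. It uses spark-submit's --principal and --keytab options together with the --files option already mentioned in the question. The principal, the keytab path, and the location of hive-site.xml are placeholders, not values from the question, and the sketch assumes the job is meant to run on YARN (the script itself sets master to "yarn"), since --principal and --keytab apply to YARN deployments.

```python
import shlex


def build_spark_submit(script, principal, keytab, hive_site, name="test_hive"):
    """Build the spark-submit argument list for a Kerberized YARN cluster-mode run."""
    return [
        "spark-submit",
        "--master", "yarn",            # assumption: the job targets YARN
        "--deploy-mode", "cluster",
        "--name", name,
        "--principal", principal,      # Kerberos principal to log in as
        "--keytab", keytab,            # keytab file holding that principal's key
        "--files", hive_site,          # ship hive-site.xml with the application
        "--executor-memory", "2g",
        "--num-executors", "1",
        script,
    ]


cmd = build_spark_submit(
    script="test_hive_minimal.py",
    principal="user@EXAMPLE.COM",              # hypothetical principal
    keytab="/path/to/user.keytab",             # hypothetical keytab path
    hive_site="/etc/hive/conf/hive-site.xml",  # assumed location of hive-site.xml
)

# Print the command as a single shell-safe line for review.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Once the printed command looks right, the same list can be handed to subprocess.run(cmd, check=True). Whether this resolves the SASL failure depends on the cluster setup; the likely reason a keytab helps is that in cluster mode the driver runs inside a YARN container, which cannot see the ticket cache of the terminal that submitted the job.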

Regarding "docker - How do I pass --driver-class-path to spark-submit in cluster mode?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63509049/
