apache-spark - Apache Spark 2.3.1 with Hive Metastore 3.1.0

Tags: apache-spark hive apache-spark-sql hive-metastore hdp

We have upgraded our HDP cluster to 3.1.1.3.0.1.0-187 and found that:

  • Hive has a new metastore location
  • Spark cannot see the Hive databases

  • In fact, we see:
    org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database ... not found
    

    Can you help me understand what is going on here and how to solve it?
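    As a quick way to confirm the symptom, the sketch below (a hypothetical diagnostic snippet, not part of the job) prints the warehouse directory Spark resolved and the databases its catalog can actually see; when Spark is not connected to the metastore that holds the data, typically only default is listed.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("catalog-check")
      .enableHiveSupport()
      .getOrCreate()

    // Which warehouse directory did Spark actually resolve?
    println(spark.conf.get("spark.sql.warehouse.dir"))

    // Which databases does Spark's catalog see? If only `default` is listed,
    // Spark is not talking to the Hive 3 metastore that holds the databases.
    spark.catalog.listDatabases().show(truncate = false)
    spark.sql("SHOW DATABASES").show()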

    Update:

    Configuration:

    (spark.sql.warehouse.dir,/warehouse/tablespace/external/hive/)
    (spark.admin.acls,)
    (spark.yarn.dist.files,file:///opt/folder/config.yml,file:///opt/jdk1.8.0_172/jre/lib/security/cacerts)
    (spark.history.kerberos.keytab,/etc/security/keytabs/spark.service.keytab)
    (spark.io.compression.lz4.blockSize,128kb)
    (spark.executor.extraJavaOptions,-Djavax.net.ssl.trustStore=cacerts)
    (spark.history.fs.logDirectory,hdfs:///spark2-history/)
    (spark.io.encryption.keygen.algorithm,HmacSHA1)
    (spark.sql.autoBroadcastJoinThreshold,26214400)
    (spark.eventLog.enabled,true)
    (spark.shuffle.service.enabled,true)
    (spark.driver.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
    (spark.ssl.keyStore,/etc/security/serverKeys/server-keystore.jks)
    (spark.yarn.queue,default)
    (spark.jars,file:/opt/folder/component-assembly-0.1.0-SNAPSHOT.jar)
    (spark.ssl.enabled,true)
    (spark.sql.orc.filterPushdown,true)
    (spark.shuffle.unsafe.file.output.buffer,5m)
    (spark.yarn.historyServer.address,master2.env.project:18481)
    (spark.ssl.trustStore,/etc/security/clientKeys/all.jks)
    (spark.app.name,com.company.env.component.MyClass)
    (spark.sql.hive.metastore.jars,/usr/hdp/current/spark2-client/standalone-metastore/*)
    (spark.io.encryption.keySizeBits,128)
    (spark.driver.memory,2g)
    (spark.executor.instances,10)
    (spark.history.kerberos.principal,spark/edge.env.project@ENV.PROJECT)
    (spark.unsafe.sorter.spill.reader.buffer.size,1m)
    (spark.ssl.keyPassword,*********(redacted))
    (spark.ssl.keyStorePassword,*********(redacted))
    (spark.history.fs.cleaner.enabled,true)
    (spark.shuffle.io.serverThreads,128)
    (spark.sql.hive.convertMetastoreOrc,true)
    (spark.submit.deployMode,client)
    (spark.sql.orc.char.enabled,true)
    (spark.master,yarn)
    (spark.authenticate.enableSaslEncryption,true)
    (spark.history.fs.cleaner.interval,7d)
    (spark.authenticate,true)
    (spark.history.fs.cleaner.maxAge,90d)
    (spark.history.ui.acls.enable,true)
    (spark.acls.enable,true)
    (spark.history.provider,org.apache.spark.deploy.history.FsHistoryProvider)
    (spark.executor.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
    (spark.executor.memory,2g)
    (spark.io.encryption.enabled,true)
    (spark.shuffle.file.buffer,1m)
    (spark.eventLog.dir,hdfs:///spark2-history/)
    (spark.ssl.protocol,TLS)
    (spark.dynamicAllocation.enabled,true)
    (spark.executor.cores,3)
    (spark.history.ui.port,18081)
    (spark.sql.statistics.fallBackToHdfs,true)
    (spark.repl.local.jars,file:///opt/folder/postgresql-42.2.2.jar,file:///opt/folder/ojdbc6.jar)
    (spark.ssl.trustStorePassword,*********(redacted))
    (spark.history.ui.admin.acls,)
    (spark.history.kerberos.enabled,true)
    (spark.shuffle.io.backLog,8192)
    (spark.sql.orc.impl,native)
    (spark.ssl.enabledAlgorithms,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA)
    (spark.sql.orc.enabled,true)
    (spark.yarn.dist.jars,file:///opt/folder/postgresql-42.2.2.jar,file:///opt/folder/ojdbc6.jar)
    (spark.sql.hive.metastore.version,3.0)



    From hive-site.xml:
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/warehouse/tablespace/managed/hive</value>
    </property>
    

    The code:
    import org.apache.spark.sql.{SaveMode, SparkSession}

    // Hive support is enabled so that saveAsTable goes through the Hive metastore
    val spark = SparkSession
      .builder()
      .appName(getClass.getSimpleName)
      .enableHiveSupport()
      .getOrCreate()
    ...
    // Append the DataFrame as ORC to a table registered in the metastore
    dataFrame.write
      .format("orc")
      .options(Map("spark.sql.hive.convertMetastoreOrc" -> true.toString))
      .mode(SaveMode.Append)
      .saveAsTable("name")
    

    spark-submit:
        spark-submit \
        --master yarn \
        --deploy-mode client \
        --driver-memory 2g \
        --driver-cores 4 \
        --executor-memory 2g \
        --num-executors 10 \
        --executor-cores 3 \
        --conf "spark.dynamicAllocation.enabled=true" \
        --conf "spark.shuffle.service.enabled=true" \
        --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=cacerts" \
        --conf "spark.sql.warehouse.dir=/warehouse/tablespace/external/hive/" \
        --jars postgresql-42.2.2.jar,ojdbc6.jar \
        --files config.yml,/opt/jdk1.8.0_172/jre/lib/security/cacerts \
        --verbose \
        component-assembly-0.1.0-SNAPSHOT.jar
    

    Best answer

    It looks like this is a Spark feature that has not been implemented yet. The only way I have found to work with Spark and Hive since 3.0 is the HiveWarehouseConnector from Hortonworks. Documentation here. There is also a good guide from the Hortonworks community here.
    I will leave the question open until the Spark developers come up with their own solution.
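
    For illustration only, below is a minimal sketch of what going through the HiveWarehouseConnector looks like on HDP 3.x. The database and table names are made up, and it assumes the HWC assembly jar shipped with HDP is on the classpath and that spark.sql.hive.hiveserver2.jdbc.url (plus the LLAP settings needed for reads) is configured on the session; see the Hortonworks documentation linked above for the exact options.

    import com.hortonworks.hwc.HiveWarehouseSession
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hwc-example")
      .getOrCreate()

    // Build an HWC session; it talks to Hive 3 through HiveServer2 instead of
    // Spark's own catalog, so it can see the databases Spark cannot.
    val hive = HiveWarehouseSession.session(spark).build()

    hive.showDatabases().show()
    val df = hive.executeQuery("SELECT * FROM some_db.some_table")

    // Writes go through the connector rather than saveAsTable on Spark's catalog
    df.write
      .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
      .option("table", "some_db.some_table_copy")
      .save()

    In practice the job would also be submitted with --jars pointing at the hive-warehouse-connector assembly jar; the exact path depends on the HDP installation.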

    Original question on Stack Overflow: https://stackoverflow.com/questions/53010746/
