I am trying to save a DataFrame as an external table that will be queried both with Spark and possibly with Hive, but somehow I cannot query or see any data through Hive. It works fine in Spark.
Here is how to reproduce the problem:
scala> println(spark.conf.get("spark.sql.catalogImplementation"))
hive
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("spark.sql.sources.bucketing.enabled", true)
scala> spark.conf.set("hive.enforce.bucketing","true")
scala> spark.conf.set("optimize.sort.dynamic.partitionining","true")
scala> spark.conf.set("hive.vectorized.execution.enabled","true")
scala> spark.conf.set("hive.enforce.sorting","true")
scala> spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
scala> spark.conf.set("hive.metastore.uris", "thrift://localhost:9083")
scala> var df = spark.range(20).withColumn("random", round(rand()*90))
df: org.apache.spark.sql.DataFrame = [id: bigint, random: double]
scala> df.head
res19: org.apache.spark.sql.Row = [0,46.0]
scala> df.repartition(10, col("random")).write.mode("overwrite").option("compression", "snappy").option("path", "s3a://company-bucket/dev/hive_confs/").format("orc").bucketBy(10, "random").sortBy("random").saveAsTable("hive_random")
19/08/01 19:26:55 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`hive_random` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Here is how I query it in Hive:
Beeline version 2.3.4-amzn-2 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select * from hive_random;
+------------------+
| hive_random.col |
+------------------+
+------------------+
No rows selected (0.213 seconds)
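The single placeholder column `hive_random.col` in the beeline output is itself a clue: Hive sees only a stub schema, while the real schema lives in Spark-specific table properties that Hive ignores. A hedged diagnostic sketch (not part of the original transcript) to confirm this from the Spark shell:

```scala
// Hypothetical diagnostic: inspect the metastore entry Spark created.
// Tables persisted in "Spark SQL specific format" typically carry properties
// such as spark.sql.sources.provider and a serialized Spark schema, which
// Hive cannot interpret -- hence the single placeholder `col` column.
spark.sql("SHOW TBLPROPERTIES hive_random").show(truncate = false)
spark.sql("DESCRIBE EXTENDED hive_random").show(truncate = false)
```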
But it works fine in Spark:
scala> spark.sql("SELECT * FROM hive_random").show
+---+------+
| id|random|
+---+------+
| 3| 13.0|
| 15| 13.0|
...
| 8| 46.0|
| 9| 65.0|
+---+------+
Best answer
There is a warning right after your `saveAsTable` call, and that is where the hint lies:
"Persisting bucketed data source table `default`.`hive_random` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive."
The reason is that `saveAsTable` persists the table in Spark's own format with Spark-managed partitions rather than Hive partitions. The workaround is to create the table via HQL before writing to it from the DataFrame.
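A minimal sketch of that workaround, reusing the table name, schema, and S3 path from the question. This is an illustration, not a verified recipe: note that some Spark versions refuse to write into Hive bucketed tables, so the `CLUSTERED BY` clause may need to be dropped depending on your Spark/Hive combination.

```scala
// Step 1 (hypothetical): define the table in Hive's own DDL so the metastore
// entry is Hive-readable, instead of letting saveAsTable persist Spark metadata.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS hive_random (id BIGINT, random DOUBLE)
  CLUSTERED BY (random) SORTED BY (random) INTO 10 BUCKETS
  STORED AS ORC
  LOCATION 's3a://company-bucket/dev/hive_confs/'
""")

// Step 2: write into the pre-created table with insertInto, which targets the
// existing Hive definition rather than creating a Spark-format table.
// (Caveat: writing to Hive bucketed tables is version-dependent in Spark.)
df.repartition(10, col("random"))
  .write
  .mode("overwrite")
  .format("orc")
  .insertInto("hive_random")
```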
Regarding "scala - Can't query Spark DF from Hive after `saveAsTable` - Spark SQL specific format, which is not compatible with Hive", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57315909/