apache-spark-sql - Spark SQL ignores the parquet.compression property specified in TBLPROPERTIES

I need to create a Hive table from Spark SQL that is stored in PARQUET format with SNAPPY compression.
The following code creates the table in PARQUET format, but with GZIP compression:

hiveContext.sql("create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='SNAPPY') as select * from OLD_TABLE")

Yet in Hue, "Metastore Tables" -> TABLE -> "Properties" still shows:

|  Parameter            |  Value   |
| ================================ |
|  parquet.compression  |  SNAPPY  |

If I change SNAPPY to any other string, e.g. ABCDE, the code still runs fine, but the compression is still GZIP:

hiveContext.sql("create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='ABCDE') as select * from OLD_TABLE")

Hue "Metastore Tables" -> TABLE -> "Properties" shows:

|  Parameter            |  Value   |
| ================================ |
|  parquet.compression  |  ABCDE   |

This makes me think that TBLPROPERTIES is simply ignored by Spark SQL.

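Note: I tried running the same query directly from Hive. When the property is set to SNAPPY, the table is created successfully with the proper compression (i.e. SNAPPY, not GZIP).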

create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='ABCDE') as select * from OLD_TABLE

When the property is ABCDE, the query does not fail, but the table is not created.

What is the problem?

Best answer

This is the combination that worked for me (Spark 2.1.0):

spark.sql("SET spark.sql.parquet.compression.codec=GZIP")
spark.sql("CREATE TABLE test_table USING PARQUET PARTITIONED BY (date) AS SELECT * FROM test_temp_table")
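
The answer sets the codec at session level and then runs the CTAS, instead of relying on TBLPROPERTIES. Its example uses GZIP, while the question asks for SNAPPY; presumably the same setting accepts snappy. A minimal Python sketch of that two-step pattern follows. The helper name `ctas_statements` and the `SUPPORTED_CODECS` set are illustrative assumptions, not part of Spark; the codec list reflects what Spark 2.1 is understood to accept and may differ in other releases.

```python
# Codec names assumed to be accepted by spark.sql.parquet.compression.codec
# in Spark 2.1 (assumption; newer releases accept additional values).
SUPPORTED_CODECS = {"none", "uncompressed", "snappy", "gzip", "lzo"}


def ctas_statements(target, source, codec, partition_by=()):
    """Return the SET and CREATE TABLE ... AS SELECT statements, in run order.

    Hypothetical helper: it only builds SQL strings and does not talk to Spark.
    """
    normalized = codec.lower()
    if normalized not in SUPPORTED_CODECS:
        # Fail early instead of letting an unknown value such as 'ABCDE' through.
        raise ValueError(f"unsupported Parquet codec: {codec!r}")

    partition_clause = ""
    if partition_by:
        partition_clause = " PARTITIONED BY ({})".format(", ".join(partition_by))

    return [
        f"SET spark.sql.parquet.compression.codec={normalized}",
        f"CREATE TABLE {target} USING PARQUET{partition_clause} "
        f"AS SELECT * FROM {source}",
    ]
```

With an existing session, the statements would be run in order, e.g. `for stmt in ctas_statements("NEW_TABLE", "OLD_TABLE", "snappy"): spark.sql(stmt)`. Validating the codec up front avoids the situation from the question, where an arbitrary string was stored without complaint.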

Verified in HDFS:

/user/hive/warehouse/test_table/date=2017-05-14/part-00000-uid.gz.parquet
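
The `.gz.parquet` suffix in the path above is what signals GZIP compression. As a rough sketch, the check can be automated by reading the suffix that precedes `.parquet` in a part-file name. The function name `codec_from_filename` and the suffix table are assumptions for illustration. The file name is only a hint: the authoritative codec is recorded in the Parquet file footer.

```python
# Suffixes that appear before ".parquet" in part-file names
# (assumed mapping; "gz" matches the file listed above).
_SUFFIX_TO_CODEC = {"gz": "GZIP", "snappy": "SNAPPY", "lzo": "LZO"}


def codec_from_filename(path):
    """Guess the Parquet compression codec from a part-file name.

    Returns None for non-Parquet files and for unrecognized suffixes.
    """
    name = path.rsplit("/", 1)[-1]  # keep only the file name
    if not name.endswith(".parquet"):
        return None
    stem = name[: -len(".parquet")]
    if "." not in stem:
        # No codec suffix at all, e.g. "part-00000-uid.parquet".
        return "UNCOMPRESSED"
    suffix = stem.rsplit(".", 1)[-1]
    return _SUFFIX_TO_CODEC.get(suffix)
```

For the path shown above, `codec_from_filename(...)` returns `"GZIP"`.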

This question, "apache-spark-sql - Spark SQL ignores the parquet.compression property specified in TBLPROPERTIES", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/36941122/
