pyspark - Why are new columns added to a Parquet table not available in a Glue PySpark ETL job?

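We have been exploring using Glue to convert some JSON data to Parquet. One scenario we tried was adding a column to the Parquet table, so partition 1 has columns [A] and partition 2 has columns [A, B]. We then wanted to write further Glue ETL jobs to aggregate the Parquet table, but the new column was not available. When we loaded a dynamic frame with glue_context.create_dynamic_frame.from_catalog, the new column never appeared in the schema.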

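We tried several configurations for our table crawler: a single schema for all partitions, a single schema per S3 path, and a schema per partition. We could always see the new column in the Glue table data, but it was always null when we queried it from a Glue job using pyspark. The column is present in the Parquet files when we download some samples, and it can be queried through Athena.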

Why is the new column not available to pyspark?

Best answer

This turned out to be a Spark configuration issue. From the Spark docs:

Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by

setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or

setting the global SQL option spark.sql.parquet.mergeSchema to true.
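To make the quoted behaviour concrete, here is a simplified model in plain Python. It is not Spark's implementation: the column types are hypothetical, and the assumption that an unmerged read takes its schema from the first file is only for illustration. It shows why column B from the question can disappear without merging and reappear with it.

```python
def schema_without_merging(file_schemas):
    """Model of a read with merging off: one file's schema is used for all.

    Assumption for illustration only: the first file is the one sampled.
    """
    return dict(file_schemas[0])


def schema_with_merging(file_schemas):
    """Model of mergeSchema=true: the union of all mutually compatible schemas."""
    merged = {}
    for schema in file_schemas:
        for name, dtype in schema.items():
            # Compatible schemas must agree on the type of a shared column.
            if name in merged and merged[name] != dtype:
                raise ValueError(f"incompatible types for column {name!r}")
            merged.setdefault(name, dtype)
    return merged


# Partition 1 has [A]; partition 2 has [A, B] (types are hypothetical).
partitions = [{"A": "string"}, {"A": "string", "B": "int"}]

print(schema_without_merging(partitions))
print(schema_with_merging(partitions))
```

Visiting every file to build the union is what makes merging relatively expensive, which matches the reason the docs give for leaving it off by default.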

We can enable schema merging in two ways.

Set the option on the Spark session: spark.conf.set("spark.sql.parquet.mergeSchema", "true")

Set mergeSchema to true in additional_options when loading the dynamic frame.

source = glueContext.create_dynamic_frame.from_catalog(
   database="db",
   table_name="table",
   additional_options={"mergeSchema": "true"}
)

After that, the new column is available in the frame's schema.
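For the session-level option, a minimal sketch might look like the following. It assumes `spark` is the job's SparkSession (in a Glue job, typically `glueContext.spark_session`); the helper names and the S3 path in the usage comment are hypothetical. The second helper shows the per-read `mergeSchema` option that the Spark docs mention, for reading the Parquet files directly rather than through the catalog.

```python
def enable_parquet_schema_merging(spark):
    """Turn on Parquet schema merging for every read in this session."""
    # Global SQL option from the Spark docs; it is off by default.
    spark.conf.set("spark.sql.parquet.mergeSchema", "true")
    return spark


def read_parquet_merged(spark, path):
    """Read the Parquet files under `path`, merging the schemas of all files."""
    # Per-read alternative: only this read pays the cost of merging.
    return spark.read.option("mergeSchema", "true").parquet(path)


# Usage inside a Glue job (not executed here; the path is hypothetical):
#   spark = glueContext.spark_session
#   enable_parquet_schema_merging(spark)
#   df = read_parquet_merged(spark, "s3://my-bucket/my-table/")
#   df.printSchema()  # the new column should now be listed
```

The session-wide setting affects every Parquet read in the job, so the per-read option may be preferable when only one table has an evolving schema.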

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/55585066/
