apache-spark - Why does Iceberg rewriteDataFiles not rewrite files into a single file?

I have an Iceberg table in S3 with 2 Parquet files that store 4 rows in total. I tried the following command:

import org.apache.iceberg.hadoop.HadoopTables
import org.apache.iceberg.spark.actions.SparkActions

val tables = new HadoopTables(conf)
val table = tables.load("s3://iceberg-tests-storage/data/db/test5")
SparkActions.get(spark).rewriteDataFiles(table).option("target-file-size-bytes", "52428800").execute()

But nothing changed. What am I doing wrong?

Best answer

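A few things to note:

By default, Iceberg will not compact files unless each file group, within each partition, contains a minimum number of small files eligible for compaction. The default minimum is 5.
- This can be configured by passing min-input-files as an option.
Iceberg will not compact files across partitions, because a file must map 1:1 to a tuple of partition values.
- For example: in a table partitioned by col1 and col2, a file with col1=A and col2=1 cannot be compacted together with a file with col1=A and col2=4.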

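In your case, if you set min-input-files to 2, the files should be compacted together, provided they belong to the same partition or the table is unpartitioned.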
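As a rough sketch of that suggestion, the call from the question could be extended with the min-input-files option as shown below. This assumes the Iceberg Spark runtime is on the classpath, that S3 access is already configured for the Spark session, and that both files sit in the same partition (or the table is unpartitioned). The object name CompactSmallFiles and the application name are placeholders chosen for illustration; the table path is the one from the question.

```scala
import org.apache.iceberg.hadoop.HadoopTables
import org.apache.iceberg.spark.actions.SparkActions
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; only the option() calls matter for the answer.
object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    // Assumption: master, catalog and S3 credentials come from spark-submit / cluster config.
    val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

    // Load the path-based table from the question using the session's Hadoop configuration.
    val tables = new HadoopTables(spark.sparkContext.hadoopConfiguration)
    val table = tables.load("s3://iceberg-tests-storage/data/db/test5")

    val result = SparkActions.get(spark)
      .rewriteDataFiles(table)
      .option("target-file-size-bytes", "52428800") // 50 MiB target, as in the question
      .option("min-input-files", "2")               // lower the default threshold of 5
      .execute()

    // Both counts stay at 0 when no file group qualified for rewriting.
    println(s"rewritten: ${result.rewrittenDataFilesCount()}, added: ${result.addedDataFilesCount()}")

    spark.stop()
  }
}
```

Printing the counts from the returned result is one way to check whether the action actually rewrote anything, rather than inferring it from the files in S3.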

Source: a similar question on Stack Overflow, "apache-spark - Why does Iceberg rewriteDataFiles not rewrite files into a single file?": https://stackoverflow.com/questions/72362044/
