apache-spark - Saving an empty DataFrame with a known schema (Spark 2.2.1)

Tags: apache-spark, parquet, databricks

Is it possible to save an empty DataFrame with a known schema, so that the schema is written out even though it contains 0 records?

import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.StructType

def example(spark: SparkSession, path: String, schema: StructType) = {
  // Build an empty DataFrame that still carries the full schema
  val dataframe = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  dataframe.write.mode(SaveMode.Overwrite).format("parquet").save(path)

  spark.read.load(path) // ERROR!! No files to read, so schema unknown
}

Best Answer

This is the answer I received from Databricks support:

This is actually a known issue in Spark. A fix has already been merged in the open-source JIRA: https://issues.apache.org/jira/browse/SPARK-23271. For more details on how this behavior will change in 2.4, see this doc change: https://github.com/apache/spark/pull/20525/files#diff-d8aa7a37d17a1227cba38c99f9f22511R1808. The behavior will change starting with Spark 2.4. Until then, you need to use one of the following workarounds:

  1. Save a DataFrame with at least one record to preserve its schema.
  2. Save the schema to a JSON file and apply it when reading back (see the sketch after this list).
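A minimal sketch of the second workaround, assuming Parquet as the format. saveSchema, loadSchema, and readWithKnownSchema are hypothetical helper names; schema.json and DataType.fromJson are the Spark APIs that do the actual JSON round-trip:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DataType, StructType}

// Persist the schema itself next to the data (hypothetical helper).
def saveSchema(schema: StructType, schemaPath: String): Unit =
  Files.write(Paths.get(schemaPath), schema.json.getBytes(StandardCharsets.UTF_8))

// Restore the StructType from its JSON representation.
def loadSchema(schemaPath: String): StructType = {
  val json = new String(Files.readAllBytes(Paths.get(schemaPath)), StandardCharsets.UTF_8)
  DataType.fromJson(json).asInstanceOf[StructType]
}

def readWithKnownSchema(spark: SparkSession, path: String, schemaPath: String): DataFrame =
  // Passing the schema explicitly skips inference, so a directory with
  // no data files can still be read back as a 0-row DataFrame.
  spark.read.schema(loadSchema(schemaPath)).parquet(path)

With this approach the data directory itself may legitimately be empty; the JSON file is the single source of truth for the schema.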

This question, apache-spark - saving an empty DataFrame with a known schema (Spark 2.2.1), is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/49821408/
