scala - Why doesn't writing timestamps before 1900 on spark-3 throw SparkUpgradeException?

Tags: scala apache-spark parquet

On the page https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-proleptic-calendar-date-time-management/read
we can read:

reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar
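The read side that this quote describes is controlled by spark.sql.legacy.parquet.datetimeRebaseModeInRead. A minimal sketch of reading such a file with an explicit rebase mode (the /tmp/someDate path simply matches the write shown below; choosing CORRECTED assumes the file was written by Spark 3.0+):

scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED") // or "LEGACY" for files written by Spark 2.x / legacy Hive
scala> spark.read.parquet("/tmp/someDate").show()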

Consider the following scenario, which does not throw an exception:

scala> spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInWrite")
res27: String = EXCEPTION
scala> Seq(java.sql.Timestamp.valueOf("1899-01-01 00:00:00")).toDF("col").write.parquet("/tmp/someDate")
scala> // why didn't it throw an exception?
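Note that the snippet above only inspects spark.sql.legacy.parquet.datetimeRebaseModeInWrite. As the stack trace in the answer below shows, timestamps written as Parquet INT96 are governed by a separate setting, spark.sql.legacy.parquet.int96RebaseModeInWrite. A minimal sketch that checks both (getOption is used because the INT96 key is not present on every 3.x build):

scala> spark.conf.getOption("spark.sql.legacy.parquet.datetimeRebaseModeInWrite")
scala> spark.conf.getOption("spark.sql.legacy.parquet.int96RebaseModeInWrite") // None on Spark builds that do not have this setting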

Whereas for a date before 1582, the exception is thrown:

scala> Seq(java.sql.Date.valueOf("1581-01-01")).toDF("col").write.parquet("/tmp/someOtherDate")
21/03/10 19:07:19 ERROR Utils: Aborting task
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.

Can anyone explain this difference?

Best Answer

I am on Spark version 3.1.2. I have tested both cases and the exception is thrown in both... Please refer to the following:

scala> Seq(java.sql.Timestamp.valueOf("1899-01-01 00:00:00")).toDF("col").write.parquet("/tmp/someDate")
22/01/04 18:03:53 ERROR Utils: Aborting task
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet INT96 files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.int96RebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.

And the second case:

scala> Seq(java.sql.Date.valueOf("1581-01-01")).toDF("col").write.parquet("/tmp/someOtherDate1")
22/01/04 18:05:08 ERROR Utils: Aborting task
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.
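
As the error messages themselves suggest, both writes go through once the corresponding write-side rebase mode is set to LEGACY (rebase to the legacy hybrid calendar) or CORRECTED (write the values as-is). A minimal sketch, assuming the files will only ever be read by Spark 3.0+ (the output paths are hypothetical):

scala> spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")   // governs INT96 timestamps
scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED") // governs dates and non-INT96 timestamps
scala> Seq(java.sql.Timestamp.valueOf("1899-01-01 00:00:00")).toDF("col").write.parquet("/tmp/someDate2")
scala> Seq(java.sql.Date.valueOf("1581-01-01")).toDF("col").write.parquet("/tmp/someOtherDate2")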

Regarding "scala - Why doesn't writing timestamps before 1900 on spark-3 throw SparkUpgradeException?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66571309/
