On the page:
https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-proleptic-calendar-date-time-management/read
we can read:
reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar
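For the read side described in that quote, Spark 3 exposes an analogous rebase option, spark.sql.legacy.parquet.datetimeRebaseModeInRead. A minimal spark-shell sketch, assuming a hypothetical /tmp/legacyParquet directory written by Spark 2.x:

scala> // Rebase old dates/timestamps from the legacy hybrid Julian+Gregorian calendar on read
scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
scala> spark.read.parquet("/tmp/legacyParquet").show()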
Consider the following scenario, in which no exception is thrown:
scala> spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInWrite")
res27: String = EXCEPTION
scala> Seq(java.sql.Timestamp.valueOf("1899-01-01 00:00:00")).toDF("col").write.parquet("/tmp/someDate")
scala> // why didn't it throw an exception?
Whereas for dates before 1582, an exception is thrown:
scala> Seq(java.sql.Date.valueOf("1581-01-01")).toDF("col").write.parquet("/tmp/someOtherDate")
21/03/10 19:07:19 ERROR Utils: Aborting task
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.
Can anyone explain this difference?
Best answer
I am on Spark 3.1.2. I have tested both cases, and the exception is thrown in both of them. Please refer to the following:
scala> Seq(java.sql.Timestamp.valueOf("1899-01-01 00:00:00")).toDF("col").write.parquet("/tmp/someDate")
22/01/04 18:03:53 ERROR Utils: Aborting task
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet INT96 files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.int96RebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.
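Note that the timestamp case fails with the INT96 variant of the exception: java.sql.Timestamp columns are written as Parquet INT96 by default, and that path is governed by the separate spark.sql.legacy.parquet.int96RebaseModeInWrite config, which, as far as I can tell, was only introduced in Spark 3.1 (SPARK-33160). That would explain why the Spark 3.0 shell in the question threw for dates but not for INT96 timestamps. As the message itself suggests, setting that config lets the write go through; a minimal sketch, assuming the files will only ever be read by Spark 3.0+:

scala> // CORRECTED writes the values as-is in the Proleptic Gregorian calendar
scala> spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
scala> Seq(java.sql.Timestamp.valueOf("1899-01-01 00:00:00")).toDF("col").write.mode("overwrite").parquet("/tmp/someDate")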
And the second case:
scala> Seq(java.sql.Date.valueOf("1581-01-01")).toDF("col").write.parquet("/tmp/someOtherDate1")
22/01/04 18:05:08 ERROR Utils: Aborting task
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.
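And likewise for the date case; a minimal sketch using LEGACY, which, per the error message, rebases the values to the hybrid calendar for maximum interoperability with Spark 2.x and older Hive readers:

scala> // LEGACY rebases to the legacy hybrid Julian+Gregorian calendar on write
scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
scala> Seq(java.sql.Date.valueOf("1581-01-01")).toDF("col").write.mode("overwrite").parquet("/tmp/someOtherDate1")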
A similar question, "scala - Why does writing timestamps before 1900 not throw SparkUpgradeException on spark-3?", can be found on Stack Overflow: https://stackoverflow.com/questions/66571309/