apache-spark - Convert a UTC timestamp to local time based on a time zone column in PySpark

Tags: apache-spark pyspark apache-spark-sql

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC, and I want to create a new column that contains the local time based on the time_zone column. How can I do this in PySpark?

df
    +-------------------------+------------+
    |  hour                   | time_zone  |
    +-------------------------+------------+
    |2019-10-16T20:00:00+0000 | US/Eastern |
    |2019-10-15T23:00:00+0000 | US/Central |
    +-------------------------+------------+

#What I want:
    +-------------------------+------------+---------------------+
    |  hour                   | time_zone  | local_time          |
    +-------------------------+------------+---------------------+
    |2019-10-16T20:00:00+0000 | US/Eastern | 2019-10-16T16:00:00 |
    |2019-10-15T23:00:00+0000 | US/Central | 2019-10-15T18:00:00 |
    +-------------------------+------------+---------------------+

Best answer

You can use the built-in from_utc_timestamp function. Note that the hour column needs to be passed to the function as a string without a time zone.

The code below works for Spark versions 2.4 and later.

from pyspark.sql.functions import from_utc_timestamp, split

# Pass the hour column without its '+0000' offset.
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone).alias('local_time')).show()
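
For reference, here is a minimal, self-contained sketch of the whole flow; the SparkSession setup and the inline sample data are assumptions added for illustration, not part of the original answer. Using withColumn instead of select keeps the original columns, as in the wanted table above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_utc_timestamp, split

spark = SparkSession.builder.getOrCreate()

# Sample rows mirroring the question's DataFrame.
df = spark.createDataFrame(
    [('2019-10-16T20:00:00+0000', 'US/Eastern'),
     ('2019-10-15T23:00:00+0000', 'US/Central')],
    ['hour', 'time_zone'])

# Strip the '+0000' offset so the value is timezone-naive, then shift it
# from UTC into each row's time zone.
df.withColumn('local_time',
              from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone)) \
  .show(truncate=False)

Since daylight saving time is in effect on both dates, this prints 2019-10-16 16:00:00 for US/Eastern and 2019-10-15 18:00:00 for US/Central.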

For Spark versions before 2.4, the second argument to the function must be a constant string representing the time zone, so a per-row time zone column cannot be passed directly.
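
On those older versions, a common workaround for a per-row time zone is a plain Python UDF. The sketch below is an assumption on my part rather than part of the original answer: the helper name utc_to_local is hypothetical, and it requires the pytz package to be available on the executors.

from datetime import datetime

import pytz
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical helper: parse the UTC string ('%z' consumes the '+0000'
# offset), convert to the row's time zone, and return a naive local string.
@udf(StringType())
def utc_to_local(ts, tz):
    dt = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S%z')
    return dt.astimezone(pytz.timezone(tz)).strftime('%Y-%m-%dT%H:%M:%S')

df.withColumn('local_time', utc_to_local(df.hour, df.time_zone)).show(truncate=False)

A UDF pays per-row serialization overhead, so the built-in function is preferable once an upgrade to 2.4+ is possible.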

Documentation

pyspark.sql.functions.from_utc_timestamp(timestamp, tz)

This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, and interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.

However, timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the UTC time zone to the given time zone.

This function may return a confusing result if the input is a string with a time zone, e.g. '2018-03-13T06:18:23+00:00'. The reason is that Spark first casts the string to a timestamp according to the time zone in the string, and finally displays the result by converting the timestamp to a string according to the session-local time zone.

Parameters:

timestamp – the column that contains timestamps

tz – a string that has the ID of a time zone, e.g. "GMT", "America/Los_Angeles", etc.

Changed in version 2.4: tz can take a Column containing timezone ID strings.
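
As the documentation note above says, the displayed result depends on the session-local time zone. When comparing outputs across environments it can help to pin that setting explicitly; the config key below is standard Spark SQL, and "UTC" is just an example value.

# Pin the session time zone so rendered timestamps are deterministic.
spark.conf.set("spark.sql.session.timeZone", "UTC")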

On apache-spark - converting a UTC timestamp to local time based on a time zone in PySpark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59145296/
