我正在尝试获取两个时间戳列之间的差异,但毫秒数消失了。
如何纠正这个?
from pyspark.sql.functions import unix_timestamp
timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"
data = [
(1, '2018-07-25 17:15:06.39','2018-07-25 17:15:06.377'),
(2,'2018-07-25 11:12:49.317','2018-07-25 11:12:48.883')
]
df = spark.createDataFrame(data, ['ID', 'max_ts','min_ts']).withColumn('diff',F.unix_timestamp('max_ts', format=timeFmt) - F.unix_timestamp('min_ts', format=timeFmt))
df.show(truncate = False)
最佳答案
这是 unix_timestamp
的预期行为- 它在 source code docstring 中明确说明它只返回秒,因此在计算时删除毫秒组件。
如果您想进行该计算,可以使用 substring
函数来连接数字然后做差异。请参阅下面的示例。请注意,这假设数据完全形成,例如毫秒完全满足(所有 3 位数字):
import pyspark.sql.functions as F
timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"
data = [
(1, '2018-07-25 17:15:06.390', '2018-07-25 17:15:06.377'), # note the '390'
(2, '2018-07-25 11:12:49.317', '2018-07-25 11:12:48.883')
]
df = spark.createDataFrame(data, ['ID', 'max_ts', 'min_ts'])\
.withColumn('max_milli', F.unix_timestamp('max_ts', format=timeFmt) + F.substring('max_ts', -3, 3).cast('float')/1000)\
.withColumn('min_milli', F.unix_timestamp('min_ts', format=timeFmt) + F.substring('min_ts', -3, 3).cast('float')/1000)\
.withColumn('diff', (F.col('max_milli') - F.col('min_milli')).cast('float') * 1000)
df.show(truncate=False)
+---+-----------------------+-----------------------+----------------+----------------+---------+
|ID |max_ts |min_ts |max_milli |min_milli |diff |
+---+-----------------------+-----------------------+----------------+----------------+---------+
|1 |2018-07-25 17:15:06.390|2018-07-25 17:15:06.377|1.53255330639E9 |1.532553306377E9|13.000011|
|2 |2018-07-25 11:12:49.317|2018-07-25 11:12:48.883|1.532531569317E9|1.532531568883E9|434.0 |
+---+-----------------------+-----------------------+----------------+----------------+---------+
关于PySpark 毫秒的时间戳,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/54951348/