Consider my data:
+---+-------------------+-------------------+
| id|          starttime|            endtime|
+---+-------------------+-------------------+
|  1|1970-01-01 07:00:00|1970-01-01 07:03:00|
+---+-------------------+-------------------+
Based on this, I want an SQL query that creates one row for every minute of difference between the end time and the start time, so that my data ends up looking exactly like this:
+---+-------------------+-------------------+
| id|          starttime|            endtime|
+---+-------------------+-------------------+
|  1|1970-01-01 07:00:00|1970-01-01 07:03:00|
|  1|1970-01-01 07:01:00|1970-01-01 07:03:00|
|  1|1970-01-01 07:02:00|1970-01-01 07:03:00|
|  1|1970-01-01 07:03:00|1970-01-01 07:03:00|
+---+-------------------+-------------------+
I have a strong preference for SQL, but if that is not possible, you may use PySpark.
Best Answer
Try this:
import pyspark.sql.functions as f
df.show()
+---+-------------------+-------------------+
| id|          starttime|            endtime|
+---+-------------------+-------------------+
|  1|1970-01-01 07:00:00|1970-01-01 07:03:00|
+---+-------------------+-------------------+
#df.printSchema()
# root
# |-- id: long (nullable = true)
# |-- starttime: timestamp (nullable = true)
# |-- endtime: timestamp (nullable = true)
Combining expr and sequence with an interval of one minute gives you an array of timestamps, one per minute; explode then turns that array into one row per element.
df.select(
    'id',
    f.explode(f.expr('sequence(starttime, endtime, interval 1 minute)')).alias('starttime'),
    'endtime'
).show(truncate=False)
+---+-------------------+-------------------+
|id |starttime          |endtime            |
+---+-------------------+-------------------+
|1  |1970-01-01 07:00:00|1970-01-01 07:03:00|
|1  |1970-01-01 07:01:00|1970-01-01 07:03:00|
|1  |1970-01-01 07:02:00|1970-01-01 07:03:00|
|1  |1970-01-01 07:03:00|1970-01-01 07:03:00|
+---+-------------------+-------------------+
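Since the question states a strong preference for SQL, the same transformation can also be written directly in Spark SQL. The following is only a sketch of that approach, not part of the original answer: it assumes the DataFrame is registered as a temporary view named t, that spark is the active SparkSession, and that sequence() over timestamps is available (Spark 2.4+).

# Hypothetical pure-SQL variant (assumed view name "t" and SparkSession "spark").
df.createOrReplaceTempView('t')

spark.sql("""
    SELECT id,
           explode(sequence(starttime, endtime, interval 1 minute)) AS starttime,
           endtime
    FROM   t
""").show(truncate=False)

This should produce the same four rows as the DataFrame-API version above; explode can appear directly in the SELECT list as a generator function, so no LATERAL VIEW is required here.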
Regarding pyspark - creating a new row for every minute of difference in Spark SQL, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60170879/