pandas - Converting a string to an exact date while identifying the correct year

I have data like this:

+---+------+
| id|   col|
+---+------+
|  1|210927|
|  2|210928|
|  3|210929|
|  4|210930|
|  5|211001|
+---+------+

I want output like this:

+---+------+----------+
| id|   col|   t_date1|
+---+------+----------+
|  1|210927|27-09-2021|
|  2|210928|28-09-2021|
|  3|210929|29-09-2021|
|  4|210930|30-09-2021|
|  5|211001|01-10-2021|
+---+------+----------+

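I can get this result using pandas and strptime. Here is my code:

from datetime import datetime

# Pull the Spark DataFrame into pandas (this collects every row to the driver)
pDF = df.toPandas()
valuesList = pDF['col'].to_list()
modifiedList = list()

for i in valuesList:
    # Parse yyMMdd, then render it as dd-MM-yyyy
    modifiedList.append(datetime.strptime(i, "%y%m%d").strftime('%d-%m-%Y'))

pDF['t_date1'] = modifiedList

df = spark.createDataFrame(pDF)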

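Now, the main problem is that I want to avoid using pandas and lists, because I will be dealing with millions or even billions of rows, and pandas slows the process down at that scale.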

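I tried various approaches in Spark, such as unixtime, to_date, and timestamp with the format I need, but had no luck. Because strptime only works on strings, I cannot apply it directly to a column. I am also reluctant to create a UDF, since UDFs are slow as well.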

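The core difficulty is identifying the exact year. I have not managed to do that in Spark, but I would like to solve this using Spark only. What needs to change? Where am I going wrong?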

Best answer

According to the implementation of Python's datetime.strptime:

# Open Group specification for strptime() states that a %y
#value in the range of [00, 68] is in the century 2000, while
#[69,99] is in the century 1900
if year <= 68:
    year += 2000
else:
    year += 1900
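The pivot rule quoted above can be checked in plain Python. The following is a minimal sketch: `expand_two_digit_year` is a hypothetical helper name introduced here for illustration (it is not part of the standard library), and the two boundary strings are made-up sample values.

```python
from datetime import datetime


def expand_two_digit_year(yy: int) -> int:
    """Hypothetical helper mirroring the rule above: 00-68 -> 2000s, 69-99 -> 1900s."""
    return yy + 2000 if yy <= 68 else yy + 1900


# Parse values on either side of the pivot with strptime itself.
boundary = {s: datetime.strptime(s, "%y%m%d").year for s in ("680101", "690101")}

for text, year in boundary.items():
    # The helper should agree with what strptime produced.
    print(text, year, expand_two_digit_year(int(text[:2])))
```

Both columns printed for each row should match, which is the behaviour the Spark expression needs to reproduce.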

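Implementing this with PySpark's when and otherwise is straightforward:

from pyspark.sql import functions as F

(df
    # The first two characters are the two-digit year (substring positions are 1-based)
    .withColumn('y', F.substring('col', 1, 2).cast('int'))
    # Same pivot as strptime: 00-68 -> 20xx, 69-99 -> 19xx
    .withColumn('y', F
        .when(F.col('y') <= 68, F.col('y') + 2000)
        .otherwise(F.col('y') + 1900)
    )
    # Replace the two-digit year with the four-digit one: yyMMdd -> yyyy-MM-dd
    .withColumn('t_date', F.concat('y', F.regexp_replace('col', r'(\d{2})(\d{2})(\d{2})', '-$2-$3')))
    .show()
)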

# Output
# +---+------+----+----------+
# | id|   col|   y|    t_date|
# +---+------+----+----------+
# |  1|210927|2021|2021-09-27|
# |  2|910927|1991|1991-09-27|
# +---+------+----+----------+

Technically, you could argue all day about this approach (00-68, then 69-99). But it is something of a "standard" here, so I do not see anything wrong with using it.

This question and answer were found on Stack Overflow: https://stackoverflow.com/questions/69690620/
