python - Calculating the difference between two dates in YYYYMM format in PySpark or Python

Tags: python apache-spark pyspark apache-spark-sql

I have a table like this:

ID  Index_month  Month_ID
1     201701      201701
1     201701      201702
1     201701      201703
1     201701      201704
1     201701      201705
1     201701      201706
2     201501      201701
2     201501      201702
2     201501      201703
2     201501      201704
2     201501      201705
2     201501      201706

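I want to compute, for each ID, the length from its index month to its maximum Month_ID, counting both endpoints. For example, for ID 1 the length should be 201706 - 201701, i.e. 6 months; for ID 2 it is 201706 - 201501, i.e. 30 months.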

The expected output is:

ID  length
1     6
2     30

Both Index_month and Month_ID are integers. At first I simply computed Month_ID - Index_month, but 201706 - 201501 gives 205, which is not a number of months.

Is there a function in PySpark, something like month_between, that can do this?

Best answer

You can write a quick-and-dirty function that converts your YYYYMM string into a datetime object, for example:

from datetime import datetime

def datestring_to_datetime(datestring):
    return datetime.strptime(str(datestring), '%Y%m')  # str() also accepts integer input such as 201706

The difference in months can then be computed like this:

datestring1 = "201706"
datestring2 = "201501"
d1 = datestring_to_datetime(datestring1)
d2 = datestring_to_datetime(datestring2)
# The trailing + 1 counts both the first and the last month (inclusive range)
difference = (d1.year - d2.year) * 12 + (d1.month - d2.month) + 1

This gives 30 for difference.
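
Since the columns in the question are integers, the same calculation can also be done without parsing dates at all, by splitting each YYYYMM value into year and month with integer arithmetic. The following is only a minimal sketch in plain Python, not part of the original answer; the helper name `months_inclusive` and the `rows` list (which mirrors the question's table) are assumptions made for illustration.

```python
from collections import defaultdict


def months_inclusive(index_month: int, month_id: int) -> int:
    """Count months from index_month to month_id (both YYYYMM ints), inclusive."""
    start_year, start_month = divmod(index_month, 100)  # 201501 -> (2015, 1)
    end_year, end_month = divmod(month_id, 100)         # 201706 -> (2017, 6)
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


# Rows mirroring the question's table: (ID, Index_month, Month_ID)
rows = [(1, 201701, m) for m in range(201701, 201707)] + \
       [(2, 201501, m) for m in range(201701, 201707)]

# Track the Index_month and the largest Month_ID seen for each ID.
# Comparing YYYYMM integers directly is safe because they sort chronologically.
index_by_id = {}
max_month_by_id = defaultdict(int)
for row_id, index_month, month_id in rows:
    index_by_id[row_id] = index_month
    max_month_by_id[row_id] = max(max_month_by_id[row_id], month_id)

lengths = {
    row_id: months_inclusive(index_by_id[row_id], max_month_by_id[row_id])
    for row_id in index_by_id
}
print(lengths)  # {1: 6, 2: 30}
```

The same integer division and modulo steps should carry over to Spark column expressions after grouping by ID and taking the maximum Month_ID. PySpark also provides `pyspark.sql.functions.months_between`, but it works on date columns rather than YYYYMM integers, so the values would need converting first (for example with `to_date`), and it does not add the inclusive + 1 by itself.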

For more on python - Calculating the difference between two dates in YYYYMM format in PySpark or Python, see the similar question on Stack Overflow: https://stackoverflow.com/questions/65636554/
