python - A more efficient way to get monthly counts in Python/PySpark

I have a table DF that looks like this

I want to get counts like the following:

month                           count
30 days and greater             20,000
60 days and greater             15,323
90 days and greater             11,232
.....
3600 days and greater           55

My code is very straightforward: I apply a filter for each month, get the count, and then copy and paste the result into Excel, like this:

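from pyspark.sql.functions import countDistinct

# show() prints the result and returns None, so there is nothing useful to assign
df.filter("Days >= 30").agg(countDistinct('ID')).show()
df.filter("Days >= 60").agg(countDistinct('ID')).show()
df.filter("Days >= 90").agg(countDistinct('ID')).show()
# ....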

This really isn't efficient.

Is there a simpler way to do this, and to produce a table like the one above?

Thanks in advance!

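Best answer

First divide Days by 30 and take the floor to get a bucket number, then group by that bucket, count the distinct IDs in each, and take a running sum from the largest bucket down: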

import pyspark.sql.functions as f
from pyspark.sql import Row, SparkSession
from pyspark.sql.window import Window

# Create (or reuse) a session instead of importing one from pyspark.shell
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(ID=1, Days=30),
    Row(ID=2, Days=55),
    Row(ID=3, Days=32),
    Row(ID=4, Days=12),
    Row(ID=5, Days=100),
    Row(ID=6, Days=3600)
])

# Calculating quantity of months
df = df.withColumn('total_months', f.floor(f.col('Days') / f.lit(30)))

# Group and count distinct
df = df.groupBy('total_months').agg(f.countDistinct('ID').alias('count'))

# Adding description
df = df.withColumn('month', f.concat(f.col('total_months') * f.lit(30), f.lit(' days and greater')))

# Cumulative sum
window = Window.orderBy(f.col('total_months').desc()).rangeBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn('count', f.sum('count').over(window))

# Selecting only required columns and sorting asc
(df
 .select('month', 'count')
 .sort('total_months')
 .show(truncate=False))

Output

+---------------------+-----+
|month                |count|
+---------------------+-----+
|0 days and greater   |6    |
|30 days and greater  |5    |
|90 days and greater  |2    |
|3600 days and greater|1    |
+---------------------+-----+
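
Two things about this output are worth checking on a small sample. Thresholds with no rows in their own bucket (60 days here) do not appear at all, and summing per-bucket distinct counts only equals a true distinct count when each ID falls in a single bucket, as it does in the sample data. The plain-Python sketch below is not Spark code; it mirrors the same floor-and-accumulate arithmetic so the numbers can be verified by hand. The function name `counts_at_or_above` and the `step` parameter are made up for this illustration, and Days is assumed to be non-negative.

```python
from collections import defaultdict


def counts_at_or_above(rows, step=30):
    """Return {threshold: number of distinct IDs with days >= threshold}.

    rows is an iterable of (id, days) pairs. Thresholds are every multiple
    of `step` from 0 up to the largest bucket present in the data.
    """
    # Bucket each ID by floor(days / step), like the total_months column.
    ids_by_bucket = defaultdict(set)
    for row_id, days in rows:
        ids_by_bucket[days // step].add(row_id)

    if not ids_by_bucket:
        return {}

    # Walk from the largest bucket down, accumulating the IDs seen so far.
    # Using a set means an ID that appears in several buckets counts once.
    seen = set()
    result = {}
    for bucket in range(max(ids_by_bucket), -1, -1):
        seen |= ids_by_bucket.get(bucket, set())
        result[bucket * step] = len(seen)

    # Return thresholds in ascending order.
    return dict(sorted(result.items()))


# Same sample data as the answer above.
rows = [(1, 30), (2, 55), (3, 32), (4, 12), (5, 100), (6, 3600)]
counts = counts_at_or_above(rows)
for threshold in (0, 30, 60, 90, 3600):
    print(f"{threshold} days and greater: {counts[threshold]}")
```

For the sample rows this gives 6, 5, 2, 2 and 1 for the thresholds 0, 30, 60, 90 and 3600, which agrees with the Spark output and fills in the 60-day row that the grouped result leaves out.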

For more on "python - A more efficient way to get monthly counts in Python/PySpark", see the similar question on Stack Overflow: https://stackoverflow.com/questions/67524851/
