apache-spark - Spark-Python : Select rows and dates

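I have the following df in Spark (Python). For each "grupo_edad", I just want to select the first day on which the "datos_acumulados" column exceeds 20480. Groups that never exceed that value should still appear, with nulls in the other columns. In this case, the output should be a table like this one: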

Expected result:

|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
|         1|2020-08-04|        4864|           20921|
|         4|      null|        null|            null|

DataFrame: df_datos_acumulados

    |grupo_edad|     fecha|acumuladosMB|datos_acumulados|
    +----------+----------+------------+----------------+
    |         1|2020-08-01|        6185|            6185|
    |         1|2020-08-02|        5854|           12039|
    |         1|2020-08-03|        4018|           16057|
    |         1|2020-08-04|        4864|           20921|
    |         1|2020-08-05|        5526|           26447|
    |         1|2020-08-06|        4818|           31265|
    |         1|2020-08-07|        5359|           36624|
    |         4|2020-08-01|         674|             674|
    |         4|2020-08-02|         744|            1418|
    |         4|2020-08-03|         490|            1908|
    |         4|2020-08-04|         355|            2263|
    |         4|2020-08-05|        1061|            3324|
    |         4|2020-08-06|         752|            4076|
    |         4|2020-08-07|         560|            4636|

Thanks!

Thanks to @pasha701's answer, I can get the final table, but it does not show the null rows, which I also need:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

grupoDistinctDF = df_datos_acumulados.withColumn("grupo_edad", col("grupo_edad"))


grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")

df_datos_acumulados = df_datos_acumulados.where(col("datos_acumulados") >= 20480) \
  .withColumn("row_number", row_number().over(grupoWindow)) \
  .where(col("row_number") == 1) \
  .drop("row_number")


grupoDistinctDF = grupoDistinctDF.join(df_datos_acumulados,["grupo_edad"], "left")

Output:

|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
|         1|2020-08-04|        4864|           20921|

Best answer

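If you need the first row where "datos_acumulados" > 20480, you can use the window function row_number() to pick that first row per group, and then left-join it to the distinct "grupo_edad" values (Scala):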

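import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import spark.implicits._

val df = Seq(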
  (1, "2020-08-01", 6185, 6185),
  (1, "2020-08-02", 5854, 12039),
  (1, "2020-08-03", 4018, 16057),
  (1, "2020-08-04", 4864, 20921),
  (1, "2020-08-05", 5526, 26447),
  (1, "2020-08-06", 4818, 31265),
  (1, "2020-08-07", 5359, 36624),
  (4, "2020-08-01", 674, 674),
  (4, "2020-08-02", 744, 1418),
  (4, "2020-08-03", 490, 1908),
  (4, "2020-08-04", 355, 2263),
  (4, "2020-08-05", 1061, 3324),
  (4, "2020-08-06", 752, 4076),
  (4, "2020-08-07", 560, 4636)
).toDF("grupo_edad", "fecha", "acumuladosMB", "datos_acumulados")

val grupoDistinctDF = df.select("grupo_edad").distinct()

val grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")

val firstMatchingRowDF = df
  .where($"datos_acumulados" > 20480)
  .withColumn("row_number", row_number().over(grupoWindow))
  .where($"row_number" === 1)
  .drop("row_number")


// The left join keeps every grupo_edad; groups without a matching row get nulls.
grupoDistinctDF.join(firstMatchingRowDF, Seq("grupo_edad"), "left").show(false)

Output:

+----------+----------+------------+----------------+
|grupo_edad|fecha     |acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|4         |null      |null        |null            |
|1         |2020-08-04|4864        |20921           |
+----------+----------+------------+----------------+

Regarding apache-spark - Spark-Python : Select rows and dates, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/69193818/
