python-2.7 - Merging duplicate records into a single record in a PySpark dataframe

Tags: python-2.7 pyspark apache-spark-sql

I have a dataframe with duplicate rows, and I want to merge them into a single record that carries all of the distinct column values.

A sample of my code is below:

df1= sqlContext.createDataFrame([("81A01","TERR NAME 01","NJ","",""),("81A01","TERR NAME 01","","NY",""),("81A01","TERR NAME 01","","","LA"),("81A02","TERR NAME 01","CA","",""),("81A02","TERR NAME 01","","","NY")], ["zip_code","territory_name","state","state1","state2"])

The resulting dataframe looks like this:
df1.show()
+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
|   81A01|  TERR NAME 01|   NJ|      |      |
|   81A01|  TERR NAME 01|     |    NY|      |
|   81A01|  TERR NAME 01|     |      |    LA|
|   81A02|  TERR NAME 01|   CA|      |      |
|   81A02|  TERR NAME 01|     |      |    NY|
+--------+--------------+-----+------+------+

I need to merge/combine the duplicate records on zip_code and get all of the distinct state values into a single row.

Expected result:
+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
|   81A01|  TERR NAME 01|   NJ|    NY|    LA|
|   81A02|  TERR NAME 01|   CA|      |    NY|
+--------+--------------+-----+------+------+

I'm new to pyspark and not sure how to use groupBy/join here. Could someone help with the code?

Best Answer

You can use the code below if you are sure there is only one state, one state1, and one state2 per zip_code/territory_name combination. The max function works on the strings here because, within each group, a non-empty string has a higher value (ASCII-wise) than the empty string "".

from pyspark.sql.types import *
from pyspark.sql.functions import *
df1 = sqlContext.createDataFrame([("81A01","TERR NAME 01","NJ","",""),("81A01","TERR NAME 01","","NY",""),("81A01","TERR NAME 01","","","LA"),("81A02","TERR NAME 01","CA","",""),("81A02","TERR NAME 01","","","NY")], ["zip_code","territory_name","state","state1","state2"])

# max() returns the non-empty value per group, since "" sorts below any non-empty string
df1.groupBy("zip_code", "territory_name") \
   .agg(max("state").alias("state"),
        max("state1").alias("state1"),
        max("state2").alias("state2")) \
   .show()

Result:
+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
|   81A02|  TERR NAME 01|   CA|      |    NY|
|   81A01|  TERR NAME 01|   NJ|    NY|    LA|
+--------+--------------+-----+------+------+
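As a side note, relying on max over strings assumes the blanks really are empty strings "". If the missing values can show up as null instead (or you'd rather not depend on string ordering), a sketch of an alternative is below: it normalizes empty strings to null and then takes the first non-null value per group with first(..., ignorenulls=True). The state_cols list and df2 name are just illustrative, and missing values come back as null rather than "".

# Sketch: normalize empty strings to null, then take the first non-null value per group
from pyspark.sql.functions import col, when, first

state_cols = ["state", "state1", "state2"]  # illustrative helper list

# when(cond, value) without otherwise() yields null when the condition is false,
# so "" becomes null and real values pass through unchanged
df2 = df1.select("zip_code", "territory_name",
                 *[when(col(c) != "", col(c)).alias(c) for c in state_cols])

df2.groupBy("zip_code", "territory_name") \
   .agg(*[first(c, ignorenulls=True).alias(c) for c in state_cols]) \
   .show()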

Regarding python-2.7 - merging duplicate records into a single record in a pyspark dataframe, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53881651/
