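I have a dataframe that contains duplicate rows, and I want to merge them into a single record that carries all of the distinct column values.
My sample code is as follows: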
df1= sqlContext.createDataFrame([("81A01","TERR NAME 01","NJ","",""),("81A01","TERR NAME 01","","NY",""),("81A01","TERR NAME 01","","","LA"),("81A02","TERR NAME 01","CA","",""),("81A02","TERR NAME 01","","","NY")], ["zip_code","territory_name","state","state1","state2"])
The resulting dataframe looks like this:
df1.show()
+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
|   81A01|  TERR NAME 01|   NJ|      |      |
|   81A01|  TERR NAME 01|     |    NY|      |
|   81A01|  TERR NAME 01|     |      |    LA|
|   81A02|  TERR NAME 01|   CA|      |      |
|   81A02|  TERR NAME 01|     |      |    NY|
+--------+--------------+-----+------+------+
I need to merge the duplicate records based on zip_code and get all of the different state values in a single row.
Expected result:
+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
|   81A01|  TERR NAME 01|   NJ|    NY|    LA|
|   81A02|  TERR NAME 01|   CA|      |    NY|
+--------+--------------+-----+------+------+
I am new to pyspark and not sure how to use group/join here. Could someone help with the code?
Best answer
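If you are sure that each zip_code and territory_name combination has only one state, one state1 and one state2 value, you can use the code below. The max function works on strings here: whenever a group contains a non-empty string, max returns it, because a non-empty string has a higher value (probably in ASCII order) than the empty string "".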
from pyspark.sql import functions as F  # namespaced import avoids shadowing Python's built-in max
# sqlContext is assumed to already exist, as it does in the pyspark shell
df1 = sqlContext.createDataFrame([("81A01","TERR NAME 01","NJ","",""),("81A01","TERR NAME 01","","NY",""),("81A01","TERR NAME 01","","","LA"),("81A02","TERR NAME 01","CA","",""),("81A02","TERR NAME 01","","","NY")], ["zip_code","territory_name","state","state1","state2"])
# One output row per (zip_code, territory_name); max keeps the non-empty value in each state column
df1.groupBy("zip_code", "territory_name").agg(
    F.max("state").alias("state"), F.max("state1").alias("state1"), F.max("state2").alias("state2")
).show()
Result:
+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
|   81A02|  TERR NAME 01|   CA|      |    NY|
|   81A01|  TERR NAME 01|   NJ|    NY|    LA|
+--------+--------------+-----+------+------+
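The reasoning about max and empty strings can be checked without a Spark session. The following is only a plain-Python sketch of the same group-and-max idea, run on the sample rows from the question. The helper name merge_with_max is made up for illustration and is not part of pyspark, and the sketch assumes that Spark orders these short ASCII strings the same way Python does.

```python
from collections import defaultdict

# Sample rows copied from the question:
# (zip_code, territory_name, state, state1, state2)
rows = [
    ("81A01", "TERR NAME 01", "NJ", "", ""),
    ("81A01", "TERR NAME 01", "", "NY", ""),
    ("81A01", "TERR NAME 01", "", "", "LA"),
    ("81A02", "TERR NAME 01", "CA", "", ""),
    ("81A02", "TERR NAME 01", "", "", "NY"),
]


def merge_with_max(rows):
    """Hypothetical helper: group by the first two fields, keep the max of each other field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[:2]].append(row[2:])
    # zip(*values) transposes a group so that each column is reduced on its own.
    return {
        key: tuple(max(column) for column in zip(*values))
        for key, values in groups.items()
    }


merged = merge_with_max(rows)

# The empty string sorts before every non-empty string, so max keeps the filled-in value.
assert max(["", "NJ", ""]) == "NJ"
assert merged[("81A01", "TERR NAME 01")] == ("NJ", "NY", "LA")
assert merged[("81A02", "TERR NAME 01")] == ("CA", "", "NY")

# Caveat: if one column holds two different non-empty values within a group,
# max keeps only the larger one and the other value is silently dropped.
assert max(["NJ", "NY"]) == "NY"
```

The last assertion is the reason for the "only one value per group" condition stated in the answer: when that condition does not hold, taking the max still returns a single row per group, but it loses data.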
Regarding "python-2.7 - Merge duplicate records into a single record in a pyspark dataframe", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53881651/