I want to join the following Spark DataFrames on the Name column:
df1 = spark.createDataFrame([("Mark", 68), ("John", 59), ("Mary", 49)], ['Name', 'Weight'])
df2 = spark.createDataFrame([(31, "Mark"), (32, "Mark"), (41, "John"), (42, "John"), (43, "John")],[ 'Age', 'Name'])
But I want the result to be the following DataFrame:
df3 = spark.createDataFrame([([31, 32], "Mark", 68), ([41, 42, 43], "John", 59), (None, "Mary", 49)], ['Age', 'Name', 'Weight'])
Best answer
You can use collect_list from the pyspark.sql.functions module. It collects all values of a given column associated with a given key. If you want a list containing only unique elements, use collect_set instead.
import pyspark.sql.functions as F
df1 = spark.createDataFrame([("Mark", 68), ("John", 59), ("Mary", 49)], ['Name', 'Weight'])
df2 = spark.createDataFrame([(31, "Mark"), (32, "Mark"), (41, "John"), (42, "John"), (43, "John")],[ 'Age', 'Name'])
# group df2 by Name, collecting all ages for each name into a list
df2_grouped = df2.groupBy("Name").agg(F.collect_list(F.col("Age")).alias("Age"))
# the outer join keeps names with no ages in df2 (e.g. Mary), yielding null
df_joined = df2_grouped.join(df1, "Name", "outer")
df_joined.show()
Result:
+----+------------+------+
|Name| Age|Weight|
+----+------------+------+
|Mary| null| 49|
|Mark| [32, 31]| 68|
|John|[42, 43, 41]| 59|
+----+------------+------+
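Note that collect_list gives no guarantee about the order of the collected elements. If duplicate ages should be dropped or the list order made deterministic, collect_set and sort_array can be combined in the same aggregation; a minimal sketch reusing df1 and df2 from above:

import pyspark.sql.functions as F

# collect_set removes duplicate ages; sort_array returns each list in ascending order
df2_unique = df2.groupBy("Name").agg(
    F.sort_array(F.collect_set(F.col("Age"))).alias("Age")
)
df2_unique.join(df1, "Name", "outer").show()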
Regarding "python - PySpark: how to group a column into a list when joining two Spark dataframes?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40049380/