I need to use a Spark DataFrame to convert row values into columns, split the result course-wise, and create a CSV file per course.
val someDF = Seq(
("user1", "math","algebra-1","90"),
("user1", "physics","gravity","70"),
("user3", "biology","health","50"),
("user2", "biology","health","100"),
("user1", "math","algebra-1","40"),
("user2", "physics","gravity-2","20")
).toDF("user_id", "course_id","lesson_name","score")
someDF.show(false)
+-------+---------+-----------+-----+
|user_id|course_id|lesson_name|score|
+-------+---------+-----------+-----+
| user1| math| algebra-1| 90|
| user1| physics| gravity| 70|
| user3| biology| health| 50|
| user2| biology| health| 100|
| user1| math| algebra-1| 40|
| user2| physics| gravity-2| 20|
+-------+---------+-----------+-----+
val result = someDF.groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
result.show(false)
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
| user3| biology| null| null| null| 50|
| user1| math| 90| null| null| null|
| user2| biology| null| null| null| 100|
| user2| physics| null| null| 20| null|
| user1| physics| null| 70| null| null|
+-------+---------+---------+-------+---------+------+
With the code above I can convert the row values (lesson names) into column names.
But I need to save the output course_wise.
In CSV format it is expected to look like this:
biology.csv // Expected Output
+-------+---------+------+
|user_id|course_id|health|
+-------+---------+------+
|  user3|  biology|    50|
|  user2|  biology|   100|
+-------+---------+------+
physics.csv // Expected Output
+-------+---------+---------+-------+
|user_id|course_id|gravity-2|gravity|
+-------+---------+---------+-------+
|  user2|  physics|       20|   null|
|  user1|  physics|     null|     70|
+-------+---------+---------+-------+
**Note: the CSV for each course should contain only that course's lesson columns, with no columns for unrelated lessons.
Currently, the CSV I can actually produce is in the format below:**
result.write
.partitionBy("course_id")
.mode("overwrite")
.format("com.databricks.spark.csv")
.option("header", "true")
.save(somepath)
For example:
biology.csv // Wrong output, because it contains lessons from non-relevant courses (algebra-1, gravity, gravity-2)
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
| user3| biology| null| null| null| 50|
| user2| biology| null| null| null| 100|
+-------+---------+---------+-------+---------+------+
Can anyone help solve this?
Best answer
Just filter by course before pivoting:
val result = someDF.filter($"course_id" === "physics").groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
+-------+---------+-------+---------+
|user_id|course_id|gravity|gravity-2|
+-------+---------+-------+---------+
|user2 |physics |null |20 |
|user1 |physics |70 |null |
+-------+---------+-------+---------+
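The answer above pivots a single course; to get one CSV per course you can loop over the distinct course ids and repeat the filter-then-pivot-then-write steps for each. A minimal sketch under stated assumptions: the local `SparkSession`, the `course_wise` output directory, and the `coalesce(1)` call are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

// Assumption: a local SparkSession for illustration.
val spark = SparkSession.builder().master("local[*]").appName("course-wise-pivot").getOrCreate()
import spark.implicits._

val someDF = Seq(
  ("user1", "math",    "algebra-1", "90"),
  ("user1", "physics", "gravity",   "70"),
  ("user3", "biology", "health",    "50"),
  ("user2", "biology", "health",    "100"),
  ("user1", "math",    "algebra-1", "40"),
  ("user2", "physics", "gravity-2", "20")
).toDF("user_id", "course_id", "lesson_name", "score")

// Collect the distinct course ids to drive one write per course.
val courses = someDF.select("course_id").distinct().as[String].collect()

courses.foreach { course =>
  someDF
    .filter($"course_id" === course)   // filter BEFORE pivoting,
    .groupBy("user_id", "course_id")   // so only this course's lessons
    .pivot("lesson_name")              // become columns
    .agg(first("score"))
    .coalesce(1)                       // a single part file per course
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv(s"course_wise/$course")       // hypothetical output path
}
```

Note that Spark still names each output file `part-*.csv` inside its `course_wise/<course>/` directory; producing literal `biology.csv` / `physics.csv` file names would need a filesystem rename step after the write.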
Regarding "scala - spark dataframe convert row values to column names", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57727480/