sql - 如何在 SQL 中使用数据透视表(不是作为 DataFrame 分组运算符)？

我有这个数据框，我试图将此数据框操作转换为 sql 使用 pivot功能

  val df = Seq(
    (1, "a,b,c"),
    (2, "b,c")
  ).toDF("id", "page_path")
  df.createOrReplaceTempView("df")

  df.show()
  df
    .withColumn("splitted", split($"page_path", ","))
    .withColumn("exploded", explode($"splitted"))
    .groupBy("id")
    .pivot("exploded")
    .count().show

产生的输出:

+---+----+---+---+
| id|   a|  b|  c|
+---+----+---+---+
|  1|   1|  1|  1|
|  2|null|  1|  1|
+---+----+---+---+

我看到这个 databricks link使用 pivot以sql方式运行，我尝试申请但失败了..

任何人都知道以 sql 方式应用数据透视函数吗？

我就是这样试的

  spark.sql(
    """
      (select * from (Select id, exploded from ( select id, explode(split( page_path ,',')) as exploded from df )
      group by id, exploded )
      Pivot
      (id, exploded)   )
             """.stripMargin
  ).show

正在产生这样的结果

+---+--------+
| id|exploded|
+---+--------+
|  2|       c|
|  1|       c|
|  1|       b|
|  2|       b|
|  1|       a|
+---+--------+

不像上面使用数据帧操作显示的输出

我也试过这个

 spark.sql(
    """
      select * from  (  select * from df lateral view explode(split( page_path ,',')) as exploded )
          pivot (exploded)
             """.stripMargin
  ).show

异常(exception) :

Exception in thread "main" org.apache.spark.sql.AnalysisException: Number of column aliases does not match number of columns. Number of column aliases: 1; number of columns: 3.; line 2 pos 21

打断我的头，在想是否可能。

最佳答案

我们don't不得不提 groupBy 子句使用 sql pivot 作为 Spark 隐式执行 groupBy。

From documentation:

An important idea about pivot is that it performs a grouped aggregation based on a list of implicit group-by columns together with the pivot column. The implicit group-by columns are columns from the FROM clause that do not appear in any aggregate function or as the pivot column.

在 pivot 我们需要在透视列和 中添加聚合查询for 提及最终数据集中所需的所需列列表。

Using Spark-sql Pivot from spark-2.4:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.functions._

val df1: DataFrame = Seq((1, "a,b,c"),(2, "b,c")).toDF("id", "page_path")
df1.createOrReplaceTempView("df1")
spark.sql(
    """
      |Select * from
      |( select id, explode(split( page_path ,',')) as exploded from df )
      |pivot(count(exploded) for exploded in ('a','b','c')
      |)
   """.stripMargin)
.show

结果 :

+---+----+---+---+
| id|   a|  b|  c|
+---+----+---+---+
|  1|   1|  1|  1|
|  2|null|  1|  1|
+---+----+---+---+

Using Dataframe api:

df1.withColumn("splitted", split($"page_path", ",")).
withColumn("exploded", explode($"splitted")).
groupBy("id").
pivot("exploded").
count().
show

结果 :

+---+----+---+---+
| id|   a|  b|  c|
+---+----+---+---+
|  1|   1|  1|  1|
|  2|null|  1|  1|
+---+----+---+---+

关于sql - 如何在 SQL 中使用数据透视表(不是作为 DataFrame 分组运算符)？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/61739018/

sql - 如何在 SQL 中使用数据透视表(不是作为 DataFrame 分组运算符)？

上一篇：bash - 让 CloudFormation 等待用户数据

下一篇：python - 我如何创建一个继承自 tf.Tensor 的新类？