apache-spark - Group by to combine values from multiple Hive columns into one column

Tags: apache-spark hive apache-spark-sql hiveql

I'm trying to collect the values of several columns into a single column, grouped by key. I'll basically be using this to create nested JSON with the Spark 1.6 DataFrame API.

Sample input table abc:-

a     b     c       d       e       f       g
---------------------------------------------
aa    bb    cc      dd      ee      ff      gg
aa    bb    cc1     dd1     ee1     ff1     gg1
aa    bb    cc2     dd2     ee2     ff2     gg2
aa1   bb1   cc3     dd3     ee3     ff3     gg3
aa1   bb1   cc4     dd4     ee4     ff4     gg4

Final output, grouped by a,b:-
aa      bb      {{cc,dd,ee,ff,gg},{cc1,dd1,ee1,ff1,gg1},{cc2,dd2,ee2,ff2,gg2}}
aa1     bb1     {{cc3,dd3,ee3,ff3,gg3},{cc4,dd4,ee4,ff4,gg4}}

I tried collect_list, but it can only aggregate a single column, and I don't know how to group several columns together. I also tried concatenating the columns into a string and collecting that, but then I lose the schema mapping, which matters because I ultimately have to dump the result as JSON. Collecting the columns as a map or a struct would work for me too. Please suggest an elegant approach/solution to this. Thanks.
Note: using Spark 1.6

Best answer

Both of the following queries work with sqlContext.sql("select ...");

select      a,b
           ,collect_list(array(c,d,e,f,g))
from        abc
group by    a,b
;
+-----+-----+----------------------------------------------------------------------------------------------+
| aa  | bb  | [["cc","dd","ee","ff","gg"],["cc1","dd1","ee1","ff1","gg1"],["cc2","dd2","ee2","ff2","gg2"]] |
+-----+-----+----------------------------------------------------------------------------------------------+
| aa1 | bb1 | [["cc3","dd3","ee3","ff3","gg3"],["cc4","dd4","ee4","ff4","gg4"]]                            |
+-----+-----+----------------------------------------------------------------------------------------------+
select      a,b
           ,collect_list(struct(c,d,e,f,g))
from        abc
group by    a,b
;
+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| aa  | bb  | [{"col1":"cc","col2":"dd","col3":"ee","col4":"ff","col5":"gg"},{"col1":"cc1","col2":"dd1","col3":"ee1","col4":"ff1","col5":"gg1"},{"col1":"cc2","col2":"dd2","col3":"ee2","col4":"ff2","col5":"gg2"}] |
+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| aa1 | bb1 | [{"col1":"cc3","col2":"dd3","col3":"ee3","col4":"ff3","col5":"gg3"},{"col1":"cc4","col2":"dd4","col3":"ee4","col4":"ff4","col5":"gg4"}]                                                               |
+-----+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
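
The question mentions the Spark 1.6 DataFrame API, and the same aggregation can be written there. A minimal sketch for spark-shell, assuming sqlContext is a HiveContext (the spark-shell default; in Spark 1.6 collect_list is backed by Hive's UDAF and does not work with a plain SQLContext) and that abc is a registered table:

import org.apache.spark.sql.functions.{collect_list, struct}

// Collect the five detail columns into one array-of-structs per (a, b) group.
val df = sqlContext.table("abc")
val grouped = df
  .groupBy("a", "b")
  .agg(collect_list(struct("c", "d", "e", "f", "g")).as("vals"))

grouped.show(false)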

Spark demo
[cloudera@quickstart ~]$ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Type --help for more information.
[cloudera@quickstart ~]$ 
[cloudera@quickstart ~]$ spark-shell

scala> sqlContext.sql("select * from abc").show;
+---+---+---+---+---+---+---+
|  a|  b|  c|  d|  e|  f|  g|
+---+---+---+---+---+---+---+
| aa| bb| cc| dd| ee| ff| gg|
| aa| bb|cc1|dd1|ee1|ff1|gg1|
| aa| bb|cc2|dd2|ee2|ff2|gg2|
|aa1|bb1|cc3|dd3|ee3|ff3|gg3|
|aa1|bb1|cc4|dd4|ee4|ff4|gg4|
+---+---+---+---+---+---+---+


scala> sqlContext.sql("select a,b,collect_list(array(c,d,e,f,g)) from abc group by a,b").show;
+---+---+--------------------+                                                  
|  a|  b|                 _c2|
+---+---+--------------------+
|aa1|bb1|[[cc3, dd3, ee3, ...|
| aa| bb|[[cc, dd, ee, ff,...|
+---+---+--------------------+


scala> sqlContext.sql("select a,b,collect_list(struct(c,d,e,f,g)) from abc group by a,b").show;
+---+---+--------------------+                                                  
|  a|  b|                 _c2|
+---+---+--------------------+
|aa1|bb1|[[cc3,dd3,ee3,ff3...|
| aa| bb|[[cc,dd,ee,ff,gg]...|
+---+---+--------------------+
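
Since the end goal is nested JSON, toJSON gets there directly from the grouped DataFrame (in Spark 1.x it returns an RDD[String]). A sketch continuing from the DataFrame snippet above; note that functions.struct should keep the source column names as JSON keys, whereas the SQL struct() above labels the fields col1..col5 (in SQL, Hive's named_struct('c', c, 'd', d, ...) would preserve the names as well):

// One JSON document per grouped row; collect() is fine for a demo,
// use saveAsTextFile(...) for real data volumes.
grouped.toJSON.collect().foreach(println)

// Illustrative output shape:
// {"a":"aa","b":"bb","vals":[{"c":"cc","d":"dd","e":"ee","f":"ff","g":"gg"}, ...]}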

Regarding apache-spark - group by to combine values from multiple Hive columns into one column, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42769952/
