sql - Hive query logic and optimization

I have data in the following format:

Input:

**ID     col1     Rank**
ID1      C1_abc      R1_1
ID1      C1_xce      R1_2
ID1      C1_fde      R1_3
ID1      C1_sde      R1_4
ID2      C1_sds      R1_1
ID2      C1_hhh      R1_2
ID3      C1_aaa      R1_1
ID4      C1_asw      R1_1
ID4      C1_eee      R1_2
ID4      C1_ttt      R1_3

Output:

**ID    col1    col2      col3**
1     C1_abc     C1_xce    C1_fde      
2     C1_sds     C1_hhh    null
3     C1_aaa     null      null
4     C1_asw     C1_eee    C1_ttt

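I want to do this with a Hive script. I know there are several ways to implement it, but because the data volume is very large, I need the most efficient one.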

Best answer

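Just use conditional aggregation. The comparisons below assume rank holds the numbers 1, 2, 3; if it holds strings such as R1_1, as in the sample data, compare against those values instead: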

select id,
       max(case when rank = 1 then col1 end) as col1,
       max(case when rank = 2 then col1 end) as col2,
       max(case when rank = 3 then col1 end) as col3
from t
where rank in (1, 2, 3)
group by id;
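
To check the pivot against the sample rows from the question, here is a minimal sketch that runs the same conditional aggregation in an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for Hive here, and storing rank as the strings R1_1, R1_2, ... is an assumption taken from the sample input, so the comparisons use those strings rather than 1, 2, 3.

```python
import sqlite3

# Sample rows copied from the question; rank is assumed to be stored as text.
rows = [
    ("ID1", "C1_abc", "R1_1"), ("ID1", "C1_xce", "R1_2"),
    ("ID1", "C1_fde", "R1_3"), ("ID1", "C1_sde", "R1_4"),
    ("ID2", "C1_sds", "R1_1"), ("ID2", "C1_hhh", "R1_2"),
    ("ID3", "C1_aaa", "R1_1"),
    ("ID4", "C1_asw", "R1_1"), ("ID4", "C1_eee", "R1_2"),
    ("ID4", "C1_ttt", "R1_3"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, col1 TEXT, rank TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# One pass over the table: each CASE picks col1 for a single rank,
# and MAX collapses the group to that value (or NULL if the rank is absent).
pivot_sql = """
SELECT id,
       MAX(CASE WHEN rank = 'R1_1' THEN col1 END) AS col1,
       MAX(CASE WHEN rank = 'R1_2' THEN col1 END) AS col2,
       MAX(CASE WHEN rank = 'R1_3' THEN col1 END) AS col3
FROM t
WHERE rank IN ('R1_1', 'R1_2', 'R1_3')
GROUP BY id
ORDER BY id
"""
result = conn.execute(pivot_sql).fetchall()
for row in result:
    print(row)
```

Rows with rank R1_4 are dropped by the WHERE clause, and ids with fewer than three ranks get NULL (None in Python) in the missing columns.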

Another approach is a multi-way join:

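-- The rank = 1 filter on the driving table goes in WHERE: inside the ON
-- clause of a left join it would not remove t1 rows with other ranks.
select t1.id, t1.col1, t2.col1 as col2, t3.col1 as col3
from t t1 left join
     t t2
     on t1.id = t2.id and t2.rank = 2 left join
     t t3
     on t1.id = t3.id and t3.rank = 3
where t1.rank = 1;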

You may want to try both and see which one runs faster. The result can vary depending on your data.
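
Before timing the two approaches on the real table, it can help to confirm that they return the same rows on a small sample. The sketch below does that in SQLite via Python's sqlite3 module, which is only a stand-in for Hive, so it says nothing about relative speed. It assumes rank is stored as strings such as R1_1 and that each id has exactly one row per rank; with duplicate ranks the join would return extra rows while the aggregation would not.

```python
import sqlite3

# A small subset of the sample rows from the question (rank assumed to be text).
rows = [
    ("ID1", "C1_abc", "R1_1"), ("ID1", "C1_xce", "R1_2"),
    ("ID1", "C1_fde", "R1_3"), ("ID1", "C1_sde", "R1_4"),
    ("ID2", "C1_sds", "R1_1"), ("ID2", "C1_hhh", "R1_2"),
    ("ID3", "C1_aaa", "R1_1"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, col1 TEXT, rank TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# Approach 1: conditional aggregation, a single scan plus GROUP BY.
agg_sql = """
SELECT id,
       MAX(CASE WHEN rank = 'R1_1' THEN col1 END) AS col1,
       MAX(CASE WHEN rank = 'R1_2' THEN col1 END) AS col2,
       MAX(CASE WHEN rank = 'R1_3' THEN col1 END) AS col3
FROM t
WHERE rank IN ('R1_1', 'R1_2', 'R1_3')
GROUP BY id
ORDER BY id
"""

# Approach 2: self-joins; rank-1 rows drive, ranks 2 and 3 are attached.
join_sql = """
SELECT t1.id, t1.col1, t2.col1 AS col2, t3.col1 AS col3
FROM t t1
LEFT JOIN t t2 ON t1.id = t2.id AND t2.rank = 'R1_2'
LEFT JOIN t t3 ON t1.id = t3.id AND t3.rank = 'R1_3'
WHERE t1.rank = 'R1_1'
ORDER BY t1.id
"""

agg_rows = conn.execute(agg_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print("same result:", agg_rows == join_rows)
```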

A similar question about sql - Hive query logic and optimization can be found on Stack Overflow: https://stackoverflow.com/questions/47114663/
