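sql - Hive collect_set(), but removing only consecutive duplicates

I want to remove consecutive duplicates from an array in Hive.
collect_list() keeps every duplicate, while collect_set() keeps only distinct entries. I need something in between.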

For example, from the following table:

id  |  number
==============
fk        4
fk        4
fk        2
4f        1
4f        8
4f        8
h9        7
h9        4
h9        7

I would like to get something like this:

id | aggregate
===========================
fk   Array<int>(4,2)
4f   Array<int>(1,8)
h9   Array<int>(7,4,7)

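Best answer

Use the lag() analytic function to fetch the previous number within each id and compare it with the current number. A row whose number equals its predecessor is mapped to NULL, and collect_list() skips NULLs, so only the first value of each consecutive run is kept. "Consecutive" is only defined relative to some row order, so the demo adds an order_ts column to sort by.

Demo: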

with your_table as (--replace this subquery with your table
select stack(11, --the number of tuples
'fk',4,'2019-01-01 10:10:10.123',
'fk',4,'2019-01-01 10:10:10.124',
'fk',2,'2019-01-01 10:10:10.125',
'4f',1,'2019-01-01 10:10:10.126',
'4f',8,'2019-01-01 10:10:10.127',
'4f',8,'2019-01-01 10:10:10.128',
'h9',7,'2019-01-01 10:10:10.129',
'h9',4,'2019-01-01 10:10:10.130',
'h9',7,'2019-01-01 10:10:10.131',
'h9',7,'2019-01-01 10:10:10.132',
'h9',7,'2019-01-01 10:10:10.133'
) as (id, number, order_ts)
) --replace this subquery with your table

select id, collect_list(case when number = lag_number then null else number end) as aggregate
  from 
      (select id, number, order_ts,
              lag(number) over (partition by id order by order_ts) lag_number
         from your_table 
       distribute by id sort by order_ts
      )s         
  group by id;

Result:

id  aggregate   
4f  [1,8]   
fk  [4,2]   
h9  [7,4,7]
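
As a variation on the query above, the comparison can be moved into a WHERE clause, so repeated rows are filtered out before aggregation instead of being turned into NULLs. This is only a sketch: the integer order_ts values are hypothetical (the question's table has no ordering column), and the resulting arrays match the expected output only if rows reach collect_list() in order_ts order, which the distribute by / sort by subquery encourages but Hive does not strictly guarantee.

```sql
with your_table as ( -- hypothetical sample data; replace with your table
select stack(9,
'fk',4,1,
'fk',4,2,
'fk',2,3,
'4f',1,4,
'4f',8,5,
'4f',8,6,
'h9',7,7,
'h9',4,8,
'h9',7,9
) as (id, number, order_ts)
)

select id, collect_list(number) as aggregate
  from
      (select id, number, order_ts,
              -- previous number within the same id; NULL for the first row
              lag(number) over (partition by id order by order_ts) as lag_number
         from your_table
       distribute by id sort by order_ts
      ) s
 -- keep the first row of each id and every row that differs from its predecessor
 where lag_number is null or number <> lag_number
 group by id;
```

The CASE form in the original answer and this WHERE form keep the same rows. The WHERE form simply does not depend on collect_list() dropping NULLs.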

Source: a similar question on Stack Overflow, "sql - Hive collect_set(), but removing only consecutive duplicates": https://stackoverflow.com/questions/55978504/
