hadoop - How to write queries that avoid a single reducer in select distinct and size of collect_set Hive queries?

How can I rewrite these queries so the reduce stage does not use a single reducer? It takes forever, and I lose the benefit of parallelism.

select id
, count(distinct locations) AS unique_locations
  from
  mytable
;

and

select id
, size(collect_set(locations)) AS unique_locations
  from
  mytable
;

Best answer

Splitting it into two queries works for count(distinct var):

SELECT
 count(1)
FROM (
 SELECT DISTINCT locations as unique_locations
 from mytable
 ) t;
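
The subquery helps because deduplicating on locations (DISTINCT, or equivalently GROUP BY) is partitioned by key and can be spread over many reducers, so only a cheap row count is left for the final single-reducer stage. One caveat: count(distinct locations) ignores NULLs, while count(1) over the subquery also counts the NULL row if there is one. A minimal sketch of the GROUP BY form that keeps the original NULL behavior, assuming the question's table mytable and column locations. The second statement is a guess at a per-id variant, since the question selects id without a GROUP BY:

```sql
-- Inner query: one row per distinct location; this stage can use many reducers.
-- Outer query: count(locations) skips the NULL group, like count(distinct locations).
SELECT count(locations) AS unique_locations
FROM (
  SELECT locations
  FROM mytable
  GROUP BY locations
) t;

-- Assumed per-id variant: deduplicate (id, locations) pairs first, then count per id.
SELECT id, count(locations) AS unique_locations
FROM (
  SELECT id, locations
  FROM mytable
  GROUP BY id, locations
) t
GROUP BY id;
```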

I think the same applies to size of collect_set:

SELECT
  size(unique_locations)
FROM (
 SELECT collect_set(locations) as unique_locations
 from mytable
 ) t;
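
Note that collect_set without a GROUP BY is still a global aggregate, so the inner query here may itself end up on one reducer. A sketch of one way around that, assuming the goal is only the number of distinct values: deduplicate in the subquery first, so the final stage (still a single reducer) only sees one row per distinct location instead of the whole table.

```sql
-- Inner query: deduplicate in parallel across reducers.
-- Outer query: collect_set now runs over the already-distinct rows only.
SELECT size(collect_set(locations)) AS unique_locations
FROM (
  SELECT DISTINCT locations
  FROM mytable
) t;
```

If the array itself is not needed, the count over a deduplicating subquery returns the number of distinct values without building a set in memory.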

Original question on Stack Overflow: https://stackoverflow.com/questions/31217198/