How can I rewrite these queries so that the reduce phase does not run on a single reducer? It takes forever, and I lose the benefit of parallelism.
select id
, count(distinct locations) AS unique_locations
from
mytable
;
and
select id
, size(collect_set(locations)) AS unique_locations
from
mytable
;
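Best answer
Splitting the work into two queries works for count(distinct var). The inner SELECT DISTINCT can be spread across many reducers, and the outer count only has to add up the rows that remain: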
SELECT
count(1)
FROM (
SELECT DISTINCT locations AS unique_locations
FROM mytable
) t;
I think the same goes for size(collect_set(...)):
SELECT
size(unique_locations)
FROM (
SELECT collect_set(locations) AS unique_locations
FROM mytable
) t;
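The queries in the question also select id, which the rewrites above drop. Assuming the intent was one distinct count per id (the original queries would then need a GROUP BY id), the same two-step idea could be sketched as below. This is only a sketch under that assumption, not a tested plan. Note that count(distinct ...) ignores NULLs, so the first variant filters them out in the inner query to keep the result comparable.

```sql
-- Assumption: one result row per id is wanted, i.e. the question's
-- queries were meant to end with GROUP BY id.

-- Variant 1: deduplicate first, then count.
SELECT
  id,
  count(1) AS unique_locations
FROM (
  -- The shuffle is keyed on (id, locations), so deduplication can be
  -- spread across many reducers instead of funnelling into one.
  SELECT id, locations
  FROM mytable
  WHERE locations IS NOT NULL   -- count(distinct ...) skips NULLs
  GROUP BY id, locations
) t
GROUP BY id;

-- Variant 2: collect_set per id; rows are distributed to reducers by id.
SELECT
  id,
  size(collect_set(locations)) AS unique_locations
FROM mytable
GROUP BY id;
```

Variant 1 tends to cope better when a few ids hold most of the rows, because the expensive deduplication is keyed on both columns rather than on id alone.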
Regarding hadoop - how to write queries that avoid a single reducer in select distinct and size(collect_set) Hive queries, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31217198/