performance - Hive: Is there a better way to percentile rank a column?

Tags: performance hadoop hive rank percentile

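Currently, to percentile rank a column in Hive, I am using something like the following. I am trying to rank the items in a column by the percentile they fall into, assigning each item a value from 0 to 1. The code below assigns a value from 0 to 9, essentially saying that an item with a char_percentile_rank of 0 is in the bottom 10% of items, and an item with a value of 9 is in the top 10%. Is there a better way to do this?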

select item
    , characteristic
    , case when characteristic <= char_perc[0] then 0
        when characteristic <= char_perc[1] then 1
        when characteristic <= char_perc[2] then 2
        when characteristic <= char_perc[3] then 3
        when characteristic <= char_perc[4] then 4
        when characteristic <= char_perc[5] then 5
        when characteristic <= char_perc[6] then 6
        when characteristic <= char_perc[7] then 7
        when characteristic <= char_perc[8] then 8
        else 9
      end as char_percentile_rank
from (
    select split(item_id,'-')[0] as item
        , split(item_id,'-')[1] as characteristic
        , char_perc
    from (
        select collect_set(concat_ws('-',item,characteristic)) as item_set
            , PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as char_perc
        from(
            select item
                , sum(characteristic) as characteristic
            from table
            group by item
        ) t1
    ) t2
    lateral view explode(item_set) explodetable as item_id
) t3

Note: I have to do the collect_set to avoid a self join, because the percentile function implicitly performs a group by.

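I have found the percentile function to be very slow (at least in this usage). Perhaps it would be better to compute the percentiles manually?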

Best answer

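Try removing one of your derived tables. With PERCENTILE applied as a window function (over ()), every row carries the percentile array, so the collect_set / explode round trip is no longer needed: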

select item
    , characteristic
    , case when characteristic <= char_perc[0] then 0
        when characteristic <= char_perc[1] then 1
        when characteristic <= char_perc[2] then 2
        when characteristic <= char_perc[3] then 3
        when characteristic <= char_perc[4] then 4
        when characteristic <= char_perc[5] then 5
        when characteristic <= char_perc[6] then 6
        when characteristic <= char_perc[7] then 7
        when characteristic <= char_perc[8] then 8
        else 9
      end as char_percentile_rank
from (
     select item, characteristic
         , PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) over () as char_perc
     from (
       select item
         , sum(characteristic) as characteristic             
       from table
       group by item            
     ) t1
) t2
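
As a further option that is not part of the original answer, Hive's built-in windowing functions percent_rank() and ntile() (available since Hive 0.11) can produce a 0-to-1 rank or a decile bucket directly, without computing percentile thresholds. The following is a minimal sketch under assumptions: `items` is a hypothetical table name standing in for the `table` placeholder used above, and whether this is actually faster than PERCENTILE would need to be measured on the real data, since a window without a partition clause still sends all rows through a single ordering step.

```sql
-- Sketch only: `items` is a hypothetical stand-in for the real table name.
select item
    , characteristic
    -- percent_rank() = (rank - 1) / (row count - 1), so values lie in [0, 1];
    -- tied characteristic values receive the same rank.
    , percent_rank() over (order by characteristic) as char_percent_rank
    -- ntile(10) splits the ordered rows into 10 near-equal-sized buckets,
    -- numbered 1 (lowest values) through 10 (highest values).
    , ntile(10) over (order by characteristic) as char_decile
from (
    select item
        , sum(characteristic) as characteristic
    from items
    group by item
) t1
```

Note that ntile(10) numbers its buckets 1 to 10 rather than 0 to 9, and it balances row counts per bucket, so rows with equal characteristic values can land in different buckets. The threshold-based case expression above keeps tied values together, so the two approaches can disagree at bucket boundaries.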

Regarding "performance - Hive: Is there a better way to percentile rank a column?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31883480/
