hadoop - Average function over non-null columns - Hive

Tags: hadoop hive aggregate-functions hiveql

I want to compute the average income over the 3 most recent years whose value is not NULL. For example:

employee id    2016  2015 2014 2013  2012  2011  2010
      1         100  NULL 200   50   10     50    50

The average should be (100 + 200 + 50) / 3.

employee id    2016  2015 2014 2013  2012   2011 2010
      2        NULL  100  NULL  50    NULL  25   100

The average should be (100 + 50 + 25) / 3.

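Best answer

Use union all to turn each year column into its own row. Then use the row_number function to rank the rows per employee so that non-null rows come first, most recent year first. Finally, take the average of the first 3 rows.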

-- Unpivot the year columns into rows, rank non-null incomes first
-- (most recent year first), then average the top 3 ranked rows.
-- avg() ignores nulls, so null rows that land in the top 3 do not count.
select employee_id,avg(income) as avg_income
from (select employee_id,yr,income
      ,row_number() over(partition by employee_id order by cast((income is not null) as int) desc,yr desc) as rnum
      from (select employee_id,2016 as yr,`2016` as income from tbl
            union all
            select employee_id,2015 as yr,`2015` as income from tbl
            union all
            select employee_id,2014 as yr,`2014` as income from tbl
            union all
            select employee_id,2013 as yr,`2013` as income from tbl
            union all
            select employee_id,2012 as yr,`2012` as income from tbl
            union all
            select employee_id,2011 as yr,`2011` as income from tbl
            union all
            select employee_id,2010 as yr,`2010` as income from tbl
           ) u
      ) r
where rnum <= 3
group by employee_id

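When only 2 columns have a value, the result is (val1+val2)/2.
When only one column has a value, the result is that value.
When all columns are null, the result is null.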
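
To sanity-check the expected numbers outside Hive, the same selection rule can be sketched in plain Python. This is only an illustration of the logic, not part of the answer's query: the function name avg_recent_non_null and the list layout (incomes ordered from 2016 down to 2010, with None standing in for NULL) are assumptions made for this example.

```python
from typing import Optional, Sequence


def avg_recent_non_null(
    incomes_newest_first: Sequence[Optional[float]], n: int = 3
) -> Optional[float]:
    """Average the first n non-None values, or return None if there are none.

    Hypothetical helper: mirrors ranking non-null rows first (newest year
    first), keeping the top n, and letting avg() skip nulls.
    """
    picked = [v for v in incomes_newest_first if v is not None][:n]
    if not picked:
        return None  # every column was NULL
    return sum(picked) / len(picked)


# Rows from the question, columns ordered 2016, 2015, ..., 2010.
emp1 = [100, None, 200, 50, 10, 50, 50]
emp2 = [None, 100, None, 50, None, 25, 100]

print(avg_recent_non_null(emp1))  # (100 + 200 + 50) / 3
print(avg_recent_non_null(emp2))  # (100 + 50 + 25) / 3
```

With fewer than three non-null values the divisor shrinks accordingly, which matches the three cases listed above.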

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/48712287/
