sql - Median calculation for pre-aggregated data (already grouped by measure)

Tags: sql oracle statistics

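Is there a straightforward way to compute the median of data that has already been aggregated by measure? In other words, I have a table in which the measurement is part of the group-by key, and a count is recorded for each measurement.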

CREATE TABLE MEASUREMENTS AS 
SELECT 'RED'  COLOR, 4 MEASUREMENT, 5 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'RED'  COLOR, 5 MEASUREMENT, 3 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'RED'  COLOR, 6 MEASUREMENT, 1 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'BLUE' COLOR, 5 MEASUREMENT, 4 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'BLUE' COLOR, 6 MEASUREMENT, 5 MEASURE_COUNT FROM DUAL ;

╔═══════╦═════════════╦═══════════════╗
║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║
╠═══════╬═════════════╬═══════════════╣
║ RED   ║           4 ║             5 ║
║ RED   ║           5 ║             3 ║
║ RED   ║           6 ║             1 ║
║ BLUE  ║           5 ║             4 ║
║ BLUE  ║           6 ║             5 ║
╚═══════╩═════════════╩═══════════════╝

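The "natural" solution would be to expand the measurement counts into individual rows, one per value, and then group them with Oracle's built-in MEDIAN. The math looks like this: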

RED=>(4,4,4,4,4,5,5,5,6), median = 4
BLUE=>(5,5,5,5,6,6,6,6,6), median = 6

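However, (1) I am working with millions of rows, which would explode into far more individual measurements, and (2) it feels like I would be "undoing and redoing" the mathematically expensive work of computing the median.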
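For reference, the expand-then-median approach described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the sample data, not something to run against millions of rows; the names `rows` and `expanded_median` are made up for this sketch.

```python
from statistics import median

# (color, measurement, measure_count) rows copied from the MEASUREMENTS table.
rows = [
    ("RED", 4, 5), ("RED", 5, 3), ("RED", 6, 1),
    ("BLUE", 5, 4), ("BLUE", 6, 5),
]

def expanded_median(rows, color):
    # Repeat each measurement measure_count times, then take the plain median.
    values = [m for c, m, n in rows if c == color for _ in range(n)]
    return median(values)

print(expanded_median(rows, "RED"))   # 4
print(expanded_median(rows, "BLUE"))  # 6
```

The intermediate list grows with the sum of the counts, which is exactly the cost I am hoping to avoid.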

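Because I want to put a view definition on top of this, and embedding analytic functions in a view tends to cripple the execution plan, I would like to avoid something like this: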

    SELECT  COLOR,
            MIN(MEASUREMENT) MEDIAN_MEASUREMENT
    FROM 
      (        
        SELECT  COLOR, 
                MEASUREMENT, 
                SUM(MEASURE_COUNT) OVER (PARTITION BY COLOR ORDER BY MEASUREMENT)  / 
                    SUM(MEASURE_COUNT) OVER (PARTITION BY COLOR) PCT
        FROM    MEASUREMENTS            
      )
    WHERE PCT >=.5  
    GROUP BY COLOR               
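The cumulative-share idea behind that query can be sketched in Python as follows, assuming the running total is accumulated in order of MEASUREMENT. The function name `lower_weighted_median` is hypothetical. Note that this picks the first measurement whose cumulative share reaches one half, so for an even number of measurements it returns the lower of the two middle values rather than their average.

```python
from itertools import groupby

# (color, measurement, measure_count) rows copied from the MEASUREMENTS table.
rows = [
    ("RED", 4, 5), ("RED", 5, 3), ("RED", 6, 1),
    ("BLUE", 5, 4), ("BLUE", 6, 5),
]

def lower_weighted_median(rows):
    result = {}
    ordered = sorted(rows)  # sorts by color, then by measurement
    for color, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        total = sum(n for _, _, n in group)
        running = 0
        for _, measurement, n in group:
            running += n
            # Same test as PCT >= .5, written with integers to avoid floats.
            if 2 * running >= total:
                result[color] = measurement
                break
    return result

print(lower_weighted_median(rows))  # {'BLUE': 6, 'RED': 4}
```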

If it is mathematically possible, I would prefer to get this done with a plain GROUP BY, as in this example for AVG:

SELECT  COLOR, 
        SUM(MEASUREMENT * MEASURE_COUNT) / SUM(MEASURE_COUNT) AVG_MEASUREMENT
        -- MEDIAN LOGIC (???) HERE  
FROM    MEASUREMENTS
GROUP BY COLOR
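The mean works with a plain GROUP BY because it only needs two sums over the pre-aggregated rows: the count-weighted sum of the measurements and the total count. A small Python check of that arithmetic on the sample data (the helper name `weighted_avg` is an assumption of this sketch):

```python
from fractions import Fraction

# (color, measurement, measure_count) rows copied from the MEASUREMENTS table.
rows = [
    ("RED", 4, 5), ("RED", 5, 3), ("RED", 6, 1),
    ("BLUE", 5, 4), ("BLUE", 6, 5),
]

def weighted_avg(rows, color):
    # Each measurement contributes measurement * measure_count to the sum.
    weighted_sum = sum(m * n for c, m, n in rows if c == color)
    count = sum(n for c, m, n in rows if c == color)
    return Fraction(weighted_sum, count)

print(weighted_avg(rows, "RED"))   # 41/9
print(weighted_avg(rows, "BLUE"))  # 50/9
```

The median has no equivalent pair of sums, since it depends on the order of the values, which is why I am asking.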

Best answer

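If I understand you correctly, I can see a fairly straightforward way, and I think I can describe it clearly. I'm pretty sure I can't express it in SQL today, but I'll keep this tab open in my browser and try again tomorrow if nobody else has contributed.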

╔═══════╦═════════════╦═══════════════╗
║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║
╠═══════╬═════════════╬═══════════════╣
║ RED   ║           4 ║             5 ║
║ RED   ║           5 ║             3 ║
║ RED   ║           6 ║             1 ║
║ BLUE  ║           5 ║             4 ║
║ BLUE  ║           6 ║             5 ║
╚═══════╩═════════════╩═══════════════╝

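First, work out which measurement position represents the median. You can do that from the counts alone. For red, for example, there are nine measurements in total, so the median is the 5th measurement. The SQL for that should be simple.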

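Second, I think you can use analytic functions to determine which row the median measurement falls in. For red, you determine which row holds the 5th measurement; it is the first row. This is a bit like a "running balance" problem. The value in that row's MEASUREMENT column is the value you are looking for.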
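The two steps above can be sketched in Python for a single color. This is a minimal illustration that assumes the rows are already sorted by measurement and that the total count is odd, so there is exactly one median position; `median_row` is a name invented for this sketch.

```python
from bisect import bisect_left
from itertools import accumulate

# RED rows as (measurement, measure_count), sorted by measurement.
red = [(4, 5), (5, 3), (6, 1)]

def median_row(group):
    # Running total of counts, like a running balance: [5, 8, 9] for red.
    running = list(accumulate(n for _, n in group))
    # Step 1: the median position, from the counts alone (5th of 9).
    position = (running[-1] + 1) // 2
    # Step 2: the first row whose running total reaches that position.
    index = bisect_left(running, position)
    return position, index, group[index][0]

print(median_row(red))  # (5, 0, 4)
```

The result reads as: the 5th measurement lives in row index 0, whose measurement is 4.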

Wall of code (standard SQL, I think)

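"Expanding" the aggregates is expensive, so this might not be useful to you. I rely on common table expressions to reduce the load on my brain.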

with measurements as (
  select 'red'   color, 4 measurement, 5 measure_count union all
  select 'red'   color, 5 measurement, 3 measure_count union all
  select 'red'   color, 6 measurement, 1 measure_count union all
  select 'blue'  color, 5 measurement, 4 measure_count union all
  select 'blue'  color, 6 measurement, 5 measure_count union all
  -- Added green, even number of measurements, median should be 5.5.
  select 'green' color, 5 measurement, 4 measure_count union all
  select 'green' color, 6 measurement, 4 measure_count union all
  -- Added yellow, extreme differences in measurements, median should be 6.
  select 'yellow' color, 6 measurement, 2 measure_count union all
  select 'yellow' color, 100 measurement, 1 measure_count 
)
, measurement_starts as (
  select 
    *,
    sum(measure_count) over (partition by color order by measurement) total_rows_so_far
  from measurements
)
, extended_measurements as (
  select 
    color, measurement, measure_count,
    coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + 1 measure_start_row,
    coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + measure_count measure_end_row 
  from measurement_starts
)
, median_row_range as (
  select color, 
    sum(measure_count) num_measurements, 
    ceiling(sum(measure_count)/2.0) start_measurement, 
    case 
      when sum(measure_count) % 2 = 0 then ceiling(sum(measure_count)/2.0)+1
      else ceiling(sum(measure_count)/2.0)
    end
    end_measurement
  from measurements
  group by color
)
, median_row_values as (
  select m.color, c.measurement
  from median_row_range m
  inner join extended_measurements c 
          on c.color = m.color 
         and m.start_measurement between c.measure_start_row and c.measure_end_row
  union all
  select m.color, c.measurement
  from median_row_range m
  inner join extended_measurements c 
          on c.color = m.color 
         and m.end_measurement between c.measure_start_row and c.measure_end_row
)
select color, avg(measurement)
from median_row_values
group by color
order by color;

blue    6.00
green   5.50
red     4.00
yellow  6.00

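The CTE "extended_measurements" extends the measurements table with the starting and ending "row" numbers that each measurement would occupy in the unaggregated data.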

color  measurement  measure_count  measure_start_row  measure_end_row
--
blue   5            4              1                  4
blue   6            5              5                  9
green  5            4              1                  4
green  6            4              5                  8
red    4            5              1                  5
red    5            3              6                  8
red    6            1              9                  9
yellow 6            2              1                  2
yellow 100          1              3                  3
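The start and end "row" numbers follow directly from a running total of the counts. A short Python sketch of the same bookkeeping for one color, assuming the rows are sorted by measurement (`row_ranges` is a hypothetical helper, not part of the query):

```python
def row_ranges(group):
    """group: (measurement, measure_count) pairs sorted by measurement."""
    ranges = []
    end = 0
    for measurement, count in group:
        start = end + 1   # previous running total + 1
        end += count      # previous running total + measure_count
        ranges.append((measurement, count, start, end))
    return ranges

print(row_ranges([(4, 5), (5, 3), (6, 1)]))
# [(4, 5, 1, 5), (5, 3, 6, 8), (6, 1, 9, 9)]
```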

The CTE "median_row_range" determines the starting and ending "rows" for the median.

color  num_measurements  start_measurement  end_measurement
--
blue   9                 5                  5
green  8                 4                  5
red    9                 5                  5
yellow 3                 2                  2
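The same start and end positions can be derived from the total count alone. A minimal Python sketch of that rule (the function name `median_positions` is made up for illustration):

```python
def median_positions(total):
    # ceiling(total / 2) for a positive integer total.
    start = (total + 1) // 2
    # An even total has two middle positions; an odd total has one.
    end = start + 1 if total % 2 == 0 else start
    return start, end

for total in (9, 8, 3):
    print(total, median_positions(total))
# 9 (5, 5)
# 8 (4, 5)
# 3 (2, 2)
```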

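This means the median for "blue" can be calculated as the average of the 5th "row" and the 5th "row"; that is, the median for "blue" is simply the 5th value. The median for green is the average of the 4th and 5th "rows".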
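As a sanity check, the whole calculation can be reproduced in Python and compared with the median of the fully expanded data. This is a sketch under the assumption that each color's rows are sorted by measurement; `value_at` and `aggregated_median` are names invented here, and the expansion is only used to verify the result on this small sample.

```python
from statistics import median

# (measurement, measure_count) pairs per color, sorted by measurement.
data = {
    "blue": [(5, 4), (6, 5)],
    "green": [(5, 4), (6, 4)],
    "red": [(4, 5), (5, 3), (6, 1)],
    "yellow": [(6, 2), (100, 1)],
}

def value_at(group, position):
    # Walk the running total until it covers the requested 1-based position.
    running = 0
    for measurement, count in group:
        running += count
        if position <= running:
            return measurement
    raise ValueError("position out of range")

def aggregated_median(group):
    total = sum(count for _, count in group)
    start = (total + 1) // 2
    end = start + 1 if total % 2 == 0 else start
    # Average the values at the two positions (identical when total is odd).
    return (value_at(group, start) + value_at(group, end)) / 2

for color, group in data.items():
    expanded = [m for m, n in group for _ in range(n)]
    assert aggregated_median(group) == median(expanded)
    print(color, aggregated_median(group))
# blue 6.0
# green 5.5
# red 4.0
# yellow 6.0
```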

For more on sql - median calculation for pre-aggregated data (already grouped by measure), see the similar question on Stack Overflow: https://stackoverflow.com/questions/22827133/
