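Is there a straightforward way to compute the median of data that has already been aggregated by the metric? In other words, I have a table in which the measurement is part of the group-by, and a count is recorded for each measurement.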
CREATE TABLE MEASUREMENTS AS
SELECT 'RED' COLOR, 4 MEASUREMENT, 5 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'RED' COLOR, 5 MEASUREMENT, 3 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'RED' COLOR, 6 MEASUREMENT, 1 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'BLUE' COLOR, 5 MEASUREMENT, 4 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'BLUE' COLOR, 6 MEASUREMENT, 5 MEASURE_COUNT FROM DUAL ;
╔═══════╦═════════════╦═══════════════╗
║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║
╠═══════╬═════════════╬═══════════════╣
║ RED ║ 4 ║ 5 ║
║ RED ║ 5 ║ 3 ║
║ RED ║ 6 ║ 1 ║
║ BLUE ║ 5 ║ 4 ║
║ BLUE ║ 6 ║ 5 ║
╚═══════╩═════════════╩═══════════════╝
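The "natural" solution would be to explode the measure counts into individual rows holding the values, and then group using Oracle's built-in MEDIAN. The math looks like this: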
RED=>(4,4,4,4,4,5,5,5,6), median = 4
BLUE=>(5,5,5,5,6,6,6,6,6), median = 6
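As a sanity check on the arithmetic above, the expansion can be sketched in a few lines of Python (used here only for illustration; `ROWS` and `expanded_median` are hypothetical names, not part of the question). It materialises every individual measurement, so it is only practical for small inputs.

```python
from statistics import median

# (color, measurement, measure_count) rows copied from the MEASUREMENTS table.
ROWS = [
    ("RED", 4, 5), ("RED", 5, 3), ("RED", 6, 1),
    ("BLUE", 5, 4), ("BLUE", 6, 5),
]

def expanded_median(rows, color):
    """Expand each (measurement, count) pair into single values, then take the median."""
    values = []
    for row_color, measurement, count in rows:
        if row_color == color:
            values.extend([measurement] * count)
    return median(values)

print(expanded_median(ROWS, "RED"))   # 4
print(expanded_median(ROWS, "BLUE"))  # 6
```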
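However, (1) I am dealing with millions of rows, which would explode into many millions of individual measurements, and (2) it feels like I would be "undoing and redoing" the mathematically expensive work for the median.
Because I want a view definition over this, and embedding analytic functions in a view tends to cripple the execution plan, I would like to avoid something like this: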
SELECT COLOR,
MIN(MEASUREMENT) MEDIAN_MEASUREMENT
FROM
(
SELECT COLOR,
MEASUREMENT,
SUM(MEASURE_COUNT) OVER (PARTITION BY COLOR ORDER BY MEASUREMENT) /
SUM(MEASURE_COUNT) OVER (PARTITION BY COLOR) PCT
FROM MEASUREMENTS
)
WHERE PCT >=.5
GROUP BY COLOR
If it is mathematically feasible, I would much prefer to accomplish this with a straight GROUP BY (example given for AVG):
SELECT COLOR,
SUM(MEASUREMENT * MEASURE_COUNT) / SUM(MEASURE_COUNT) AVG_MEASUREMENT
-- MEDIAN LOGIC (???) HERE
FROM MEASUREMENTS
GROUP BY COLOR
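The mean fits a plain GROUP BY because it needs only two running sums per group: the count-weighted sum of the measurements and the total count. A minimal Python sketch of that arithmetic follows (the names `ROWS` and `weighted_averages` are hypothetical). The median cannot be assembled from sums in the same way, since it depends on the order of the values.

```python
from collections import defaultdict

# (color, measurement, measure_count) rows copied from the MEASUREMENTS table.
ROWS = [
    ("RED", 4, 5), ("RED", 5, 3), ("RED", 6, 1),
    ("BLUE", 5, 4), ("BLUE", 6, 5),
]

def weighted_averages(rows):
    """Return {color: mean of the underlying measurements} without expanding the rows."""
    weighted_sum = defaultdict(int)  # sum of measurement * count per color
    total_count = defaultdict(int)   # sum of count per color
    for color, measurement, count in rows:
        weighted_sum[color] += measurement * count
        total_count[color] += count
    return {color: weighted_sum[color] / total_count[color] for color in total_count}

# RED is 41/9 and BLUE is 50/9.
print(weighted_averages(ROWS))
```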
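Best answer
If I understand you correctly, I can see a fairly straightforward way, and I think I can describe it clearly. I'm pretty sure I can't express it in SQL today, but I'll keep this tab open in my browser and try again tomorrow if nobody else has contributed by then.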
╔═══════╦═════════════╦═══════════════╗
║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║
╠═══════╬═════════════╬═══════════════╣
║ RED ║ 4 ║ 5 ║
║ RED ║ 5 ║ 3 ║
║ RED ║ 6 ║ 1 ║
║ BLUE ║ 5 ║ 4 ║
║ BLUE ║ 6 ║ 5 ║
╚═══════╩═════════════╩═══════════════╝
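First, work out which measurement represents the median. You can do that from the counts alone. For red, for example, there are nine measurements in total, so the median is the 5th measurement. The SQL for that should be simple.
Second, I think you can use analytic functions to determine which row the median measurement falls in. For red, you determine which row holds the 5th measurement; it is the first row. This is somewhat like a "running balance" problem. The value in the "measurement" column of that row is the value you are after.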
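The two steps can be sketched in Python before turning to SQL. This is only an illustration of the running-total idea for an odd number of measurements; `kth_measurement` and `RED` are hypothetical names introduced here.

```python
from itertools import accumulate

def kth_measurement(pairs, k):
    """Return the measurement that holds the k-th (1-based) value of the expanded data.

    pairs is a list of (measurement, measure_count) tuples for one color.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ordered = sorted(pairs)  # order by measurement
    running = accumulate(count for _, count in ordered)  # running total of counts
    for (measurement, _), total_so_far in zip(ordered, running):
        if k <= total_so_far:
            return measurement
    raise ValueError("k exceeds the total number of measurements")

RED = [(4, 5), (5, 3), (6, 1)]
total = sum(count for _, count in RED)  # 9 measurements in total
k = (total + 1) // 2                    # the 5th measurement is the median
print(kth_measurement(RED, k))          # 4
```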
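Wall of code (standard SQL, I think)
"Unwinding" the aggregation is expensive, so this may not be useful to you. I rely on common table expressions to lighten the load on my brain.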
with measurements as (
select 'red' color, 4 measurement, 5 measure_count union all
select 'red' color, 5 measurement, 3 measure_count union all
select 'red' color, 6 measurement, 1 measure_count union all
select 'blue' color, 5 measurement, 4 measure_count union all
select 'blue' color, 6 measurement, 5 measure_count union all
-- Added green, even number of measurements, median should be 5.5.
select 'green' color, 5 measurement, 4 measure_count union all
select 'green' color, 6 measurement, 4 measure_count union all
-- Added yellow, extreme differences in measurements, median should be 6.
select 'yellow' color, 6 measurement, 2 measure_count union all
select 'yellow' color, 100 measurement, 1 measure_count
)
, measurement_starts as (
select
*,
sum(measure_count) over (partition by color order by measurement) total_rows_so_far
from measurements
)
, extended_measurements as (
select
color, measurement, measure_count,
coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + 1 measure_start_row,
coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + measure_count measure_end_row
from measurement_starts
)
, median_row_range as (
select color,
sum(measure_count) num_measurements,
ceiling(sum(measure_count)/2.0) start_measurement,
case
when sum(measure_count) % 2 = 0 then ceiling(sum(measure_count)/2.0)+1
else ceiling(sum(measure_count)/2.0)
end
end_measurement
from measurements
group by color
)
, median_row_values as (
select m.color, c.measurement
from median_row_range m
inner join extended_measurements c
on c.color = m.color
and m.start_measurement between c.measure_start_row and c.measure_end_row
union all
select m.color, c.measurement
from median_row_range m
inner join extended_measurements c
on c.color = m.color
and m.end_measurement between c.measure_start_row and c.measure_end_row
)
select color, avg(measurement)
from median_row_values
group by color
order by color;
blue 6.00
green 5.50
red 4.00
yellow 6.00
The CTE "extended_measurements" extends the measurements table with the starting "row" number and ending "row" number you would find in the unaggregated data.
color   measurement  measure_count  measure_start_row  measure_end_row
--
blue    5            4              1                  4
blue    6            5              5                  9
green   5            4              1                  4
green   6            4              5                  8
red     4            5              1                  5
red     5            3              6                  8
red     6            1              9                  9
yellow  6            2              1                  2
yellow  100          1              3                  3
The CTE "median_row_range" determines the starting "row" and ending "row" of the median.
color   num_measurements  start_measurement  end_measurement
--
blue    9                 5                  5
green   8                 4                  5
red     9                 5                  5
yellow  3                 2                  2
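This means the median for "blue" can be computed as the average of the 5th "row" and the 5th "row"; that is, the median for "blue" is simply the 5th value. The median for "green" is the average of the 4th "row" and the 5th "row".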
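The same start/end logic, including the even-count case, can be checked with a short Python sketch. It mirrors the CTEs in spirit only; `median_positions` and `aggregated_median` are hypothetical names, and the sketch assumes each color has at least one row with a positive count.

```python
def median_positions(n):
    """Return the 1-based (start, end) positions whose values average to the median."""
    start = (n + 1) // 2                      # same as ceiling(n / 2.0)
    end = start + 1 if n % 2 == 0 else start  # even counts need the next position too
    return start, end

def aggregated_median(pairs):
    """Median of the expanded data, computed from (measurement, measure_count) pairs."""
    ordered = sorted(pairs)  # order by measurement
    n = sum(count for _, count in ordered)
    start, end = median_positions(n)
    picked = []
    seen = 0
    for measurement, count in ordered:
        first, last = seen + 1, seen + count  # measure_start_row, measure_end_row
        for position in (start, end):
            if first <= position <= last:
                picked.append(measurement)
        seen = last
    return sum(picked) / len(picked)

print(aggregated_median([(5, 4), (6, 5)]))    # blue: 6.0
print(aggregated_median([(5, 4), (6, 4)]))    # green: 5.5
print(aggregated_median([(6, 2), (100, 1)]))  # yellow: 6.0
```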
Regarding sql - computing the median of data that is pre-aggregated (already grouped by the metric), a similar question was found on Stack Overflow: https://stackoverflow.com/questions/22827133/