sql - Ordered count of consecutive repeats / duplicates

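I strongly doubt that I am doing this in the most efficient way, which is why I tagged plpgsql here. I need to run this on 2 billion rows for a thousand measurement systems.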

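You have measurement systems that often report the previous value when they lose connectivity. They usually lose connectivity in short bursts, but sometimes for a long time. You need to aggregate, but when you do, you need to look at how long a value was repeating and build various filters on that information. Say you are measuring mpg on a car, but it is stuck at 20 mpg for an hour, then moves to 20.1 and so on. You will want to evaluate the accuracy while it is stuck. You could also add alternative rules that detect when the car is on the highway. With window functions you can generate the "state" of the car and have something to group on. Without further ado: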

--here's my data: you have different systems, the time of measurement, and the actual measurement
--the raw data also carries a flag for whether or not a row is a repeat (hence the included window function)
select * into temporary table cumulative_repeat_calculator_data
FROM
    (
    select 
    system_measured, time_of_measurement, measurement, 
    case when 
     measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc) 
     then 1 else 0 end as repeat
    FROM
    (
    SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
    ) as data
) as data;

--unfortunately you can't nest window functions inside window functions, so I had to break it down into a subquery
--what we need is something to partition on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
--this creates a column that stays the same while your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
    (
    select 
    *,
    sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
    from cumulative_repeat_calculator_data
    order by system_measured, time_of_measurement
) as data;

--finally, the query. I didn't bother showing my desired output, because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating, for example  
select *, 
case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement
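
The comment above mentions averaging under a condition on how long a value has been repeating. A minimal sketch of that idea, reusing the step-1 table from this question; the cutoff of 2 repeats is an arbitrary assumption chosen only for illustration:

```sql
-- Average measurement per system, ignoring readings that have already
-- repeated more than 2 times in a row (threshold is an assumed example value).
SELECT system_measured
     , avg(measurement) AS avg_measurement
FROM  (
   SELECT *
        , CASE WHEN repeat = 0 THEN 0
               ELSE row_number() OVER (PARTITION BY system_measured
                                                  , cumlative_sum_of_nonrepeats_by_system
                                       ORDER BY time_of_measurement) - 1
          END AS ordered_repeat
   FROM   cumulative_repeat_calculator_step_1
   ) sub
WHERE  ordered_repeat <= 2
GROUP  BY system_measured
ORDER  BY system_measured;
```

The window function has to sit in a subquery because its result cannot be referenced in the WHERE clause of the same query level.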

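So, to run this on a massive table, what would you do differently, or which alternative tools would you use? I am thinking of plpgsql because I suspect this needs to be done in the database, or during data insertion, although I generally work with the data after it is loaded. Is there a way to get this in one sweep without resorting to subqueries?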

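I have tested one alternative, but it still relies on a subquery, and I think this is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, system. Then you join to the larger table, and if the timestamp falls between those two, you classify the row as being in that state, which is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 for thousands of devices and thousands or millions of "events". Do you think that is a better way to go?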
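
For reference, the "starts and stops" idea could look roughly like the sketch below. The table name state_ranges and the column state_id are hypothetical, and the ranges are derived from the step-1 table purely so there is something to join against; in practice they would come from wherever the events are recorded. Unlike the 1=1 join described above, this sketch also joins on the system, which is an assumption about what is wanted:

```sql
-- Hypothetical "starts and stops" table: one row per run of identical values.
CREATE TEMP TABLE state_ranges AS
SELECT system_measured
     , cumlative_sum_of_nonrepeats_by_system AS state_id
     , min(time_of_measurement) AS start_timestamp
     , max(time_of_measurement) AS end_timestamp
FROM   cumulative_repeat_calculator_step_1
GROUP  BY system_measured, cumlative_sum_of_nonrepeats_by_system;

-- Range join: each measurement is assigned to the state whose interval contains it.
SELECT d.*, s.state_id
FROM   cumulative_repeat_calculator_data d
JOIN   state_ranges s
       ON  s.system_measured = d.system_measured
       AND d.time_of_measurement BETWEEN s.start_timestamp AND s.end_timestamp
ORDER  BY d.system_measured, d.time_of_measurement;
```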

Best answer

Test case

First, a more useful way to present your data, or even better, put it in an sqlfiddle, ready to play with:

CREATE TEMP TABLE data(
   system_measured int
 , time_of_measurement int
 , measurement int
);

INSERT INTO data VALUES
 (1, 1, 5)
,(1, 2, 150)
,(1, 3, 5)
,(1, 4, 5)
,(2, 1, 5)
,(2, 2, 5)
,(2, 3, 5)
,(2, 4, 5)
,(2, 5, 150)
,(2, 6, 5)
,(2, 7, 5)
,(2, 8, 5);

Simplified query

Since it remains unclear, I am assuming only the above as given.
Next, I simplified your query to arrive at:

WITH x AS (
   SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                               ORDER BY time_of_measurement) = measurement
                  THEN 0 ELSE 1 END AS step
   FROM   data
   )
   , y AS (
   SELECT *, sum(step) OVER(PARTITION BY system_measured
                            ORDER BY time_of_measurement) AS grp
   FROM   x
   )
SELECT * ,row_number() OVER (PARTITION BY system_measured, grp
                             ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM   y
ORDER  BY system_measured, time_of_measurement;

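Now, while pure SQL is all nice and shiny, this will be much faster with a plpgsql function, because it can do the job in a single table scan, where this query needs at least three scans.

Faster with a plpgsql function: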

CREATE OR REPLACE FUNCTION x.f_repeat_ct()
  RETURNS TABLE (
    system_measured int
  , time_of_measurement int
  , measurement int, repeat_ct int
  )  LANGUAGE plpgsql AS
$func$
DECLARE
   r    data;     -- table name serves as record type
   r0   data;
BEGIN

-- SET LOCAL work_mem = '1000 MB';  -- uncomment and adapt if needed, see below!

repeat_ct := 0;   -- init

FOR r IN
   SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
LOOP
   IF  r.system_measured = r0.system_measured
       AND r.measurement = r0.measurement THEN
      repeat_ct := repeat_ct + 1;   -- same system, same value: increment count
   ELSE
      repeat_ct := 0;               -- start new count
   END IF;

   RETURN QUERY SELECT r.*, repeat_ct;

   r0 := r;                         -- remember last row
END LOOP;

END
$func$;

Call:

SELECT * FROM x.f_repeat_ct();

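Be sure to table-qualify your column names at all times in this kind of plpgsql function, because we use the same names as output parameters, which would take precedence if not qualified.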

A billion rows

If you have a billion rows, you may want to split this operation up. I quote the manual:

Note: The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated. A future version of PL/pgSQL might allow users to define set-returning functions that do not have this limitation. Currently, the point at which data begins being written to disk is controlled by the work_mem configuration variable. Administrators who have sufficient memory to store larger result sets in memory should consider increasing this parameter.

Consider computing rows for one system at a time, or set a high enough value for work_mem to cope with the load. See the manual's entry on work_mem for more.
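
Processing one system at a time could be sketched as a parameterized variant of the function. The name f_repeat_ct_one and the parameter _system are hypothetical, and the sketch assumes the same data table as the test case:

```sql
-- Hypothetical variant of f_repeat_ct() restricted to a single system.
CREATE OR REPLACE FUNCTION f_repeat_ct_one(_system int)
  RETURNS TABLE (
    system_measured int
  , time_of_measurement int
  , measurement int
  , repeat_ct int
  )  LANGUAGE plpgsql AS
$func$
DECLARE
   r    data;     -- current row
   r0   data;     -- previous row (all NULL before the first iteration)
BEGIN
   repeat_ct := 0;

   FOR r IN
      SELECT * FROM data d
      WHERE  d.system_measured = _system
      ORDER  BY d.time_of_measurement
   LOOP
      IF r.measurement = r0.measurement THEN
         repeat_ct := repeat_ct + 1;   -- same value again: increment count
      ELSE
         repeat_ct := 0;               -- value changed (or first row): restart
      END IF;

      RETURN QUERY SELECT r.*, repeat_ct;

      r0 := r;                         -- remember last row
   END LOOP;
END
$func$;

-- Each call only materializes the result set of one system.
SELECT * FROM f_repeat_ct_one(1);
```

An index on (system_measured, time_of_measurement) would probably be needed so that each call does not scan the whole table; that is an assumption about the real schema, not something stated in the question.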

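One way would be to set a very high value for work_mem with SET LOCAL inside your function, which is only effective for the current transaction. I added a commented line in the function. Do not set it very high globally, as this could nuke your server. Read the manual.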
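
As an alternative to uncommenting the line inside the function, the same setting can be applied from the calling session. A minimal sketch; the value is an example only and has to be sized to the memory actually available:

```sql
BEGIN;
SET LOCAL work_mem = '1000MB';   -- example value, affects this transaction only
SELECT * FROM x.f_repeat_ct();
COMMIT;                          -- work_mem reverts to its previous value
```

Outside a transaction block SET LOCAL has no lasting effect, which is why the call is wrapped in BEGIN and COMMIT.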

Regarding sql - Ordered count of consecutive repeats / duplicates, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/13078964/
