python - Identifying duplicate time series in Postgres

I have a time series table (in a Postgres database) with the columns

item_id,  country_id,  year,  month, value

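The table contains duplicate time series: they have the same country_id and the same dates and values, but were assigned different item_ids, for example "Red Apples" and "Apples, Red".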

How can I identify these duplicate time series? I want (country_id, year, month, value) to match for every date on which the item exists.

I'm a beginner, so please forgive any details I have left out. I'm mainly looking for a conceptual approach; I can implement it in either Postgres or Python/pandas.

For example, I'd like to be able to detect something like this:

item_id,     country_id,     year,     month,    value
-------------------------------------------------------
Red Apples   5               1996      1         300
Red Apples   5               1996      2         500
Red Apples   5               1996      3         370
Apples, Red  5               1996      1         300
Apples, Red  5               1996      2         500
Apples, Red  5               1996      3         370

I'd like the output to look like this:

item_id1,     item_id2,      country_id,     year,     month_range
-----------------------------------------------------------------
Red Apples    Apples, Red         5          1996       [1,3]

Something like this would also be fine:

item_id1,     item_id2,      country_id,     year,     time_month,   value
--------------------------------------------------------------------------
Red Apples    Apples, Red         5          1996         1           300
Red Apples    Apples, Red         5          1996         2           500
Red Apples    Apples, Red         5          1996         3           370

I was thinking of trying something like this:

select distinct A.country_id, A.item_id, B.item_id, A.year, A.month, A.value
                      from my_table as A,
                      my_table as B 
                      where
                      (A.country_id=B.country_id and 
                      A.item_id<>B.item_id and 
                      A.year=B.year and 
                      A.month=B.month and 
                      A.value=B.value )

Then I would check that every item_id pair identified shows up for all dates/values. If possible, though, I'd like to check all dates/values in one go.

I'm not sure whether a table join is the right approach...?

Best answer

(See the update below!)

Unless you provide more details about the sample data and the expected results, I think the following query may help:

SELECT country_id,  year,  month, value
  FROM a_table
 GROUP BY country_id,  year,  month, value
HAVING count(*) > 1;

This query shows all entries that are identical apart from item_id. To find all rows that belong to the duplicate groups, use the following query:

SELECT item_id, country_id,  year,  month, value
  FROM a_table
 WHERE (country_id,  year,  month, value)
    IN (
    SELECT country_id,  year,  month, value
      FROM a_table
     GROUP BY country_id,  year,  month, value
    HAVING count(*) > 1)
 ORDER BY country_id,  year,  month, value, item_id;

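I put the item_id column last in the sort order, which should make the duplicates easier to spot. Feel free to adjust. This query may take a while, depending on your data.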
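Since the question mentions Python as an option, here is a minimal sketch of the same grouping idea outside the database. It assumes the rows have already been fetched as plain tuples; the function name duplicate_groups and the "Pears" row are hypothetical and only there for illustration.

```python
from collections import defaultdict

# Sample rows from the question: (item_id, country_id, year, month, value).
rows = [
    ("Red Apples", 5, 1996, 1, 300),
    ("Red Apples", 5, 1996, 2, 500),
    ("Red Apples", 5, 1996, 3, 370),
    ("Apples, Red", 5, 1996, 1, 300),
    ("Apples, Red", 5, 1996, 2, 500),
    ("Apples, Red", 5, 1996, 3, 370),
    ("Pears", 5, 1996, 1, 120),  # hypothetical row with no duplicate
]

def duplicate_groups(rows):
    """Mimic GROUP BY country_id, year, month, value HAVING count(*) > 1.

    Returns a dict mapping each shared key to the sorted item_ids using it.
    """
    groups = defaultdict(list)
    for item_id, country_id, year, month, value in rows:
        groups[(country_id, year, month, value)].append(item_id)
    # Keep only keys that occur in more than one row.
    return {key: sorted(items) for key, items in groups.items() if len(items) > 1}

dups = duplicate_groups(rows)
for key in sorted(dups):
    print(key, dups[key])
```

Like the SQL above, this only reports individual (country_id, year, month, value) combinations shared by several rows; it does not yet check whether two items agree over their whole history.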

To avoid this kind of situation (duplicated entries) in the future, you may want to create a unique constraint, like this:

ALTER TABLE a_table ADD CONSTRAINT u_cymv
    UNIQUE (country_id,  year,  month, value);

EDIT: After the comments were added, I came up with the following query to find series of duplicates:

WITH a_table(item_id,country_id,year,month,value) AS (VALUES
    ('Red Apples'::text,5,1996,1,300::numeric),
    ('Red Apples',5,1996,2,500),
    ('Red Apples',5,1996,3,370),
    ('Apples, Red',5,1996,1,300),
    ('Apples, Red',5,1996,2,500),
    ('Apples, Red',5,1996,3,370)
), dups AS (
    SELECT string_agg(item_id,'/') AS items,
           country_id,value,
           daterange(to_date(year::text||month,'YYYYMM'),
                     (to_date(year::text||month,'YYYYMM')
                      +INTERVAL'1mon')::date,'[)') AS range
      FROM a_table
     GROUP BY country_id,year,month,value
    HAVING count(*) > 1
)
SELECT grp,count(*),items,country_id,
       daterange(min(lower(range)), max(upper(range)), '[)') r,
       array_agg(value)
  FROM ( 
    SELECT items,country_id,range,value,
           sum(g) OVER (ORDER BY country_id, range) grp
      FROM (
        SELECT items,country_id,
               range,value,
               CASE WHEN lag(range) OVER (PARTITION BY country_id
                                          ORDER BY range) -|- range
                    THEN NULL ELSE 1 END g
          FROM dups) s
    ) s
 GROUP BY grp,country_id,items
HAVING count(*) >= 3
 ORDER BY country_id,r,items;

What it does:

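a_table is a copy of the sample data provided;
dups is the part that finds the duplicate records. It also converts the year and month columns into a daterange, because I see no other way to correctly find series that cross a New Year boundary;
with the duplicates listed, I compare the previous range (within the same country_id) with the current one, and if the two are not adjacent, the group flag g is set;
next, I use the running-total effect of the sum() window function to create the group identifier grp. For the sample data this yields only one group;
finally, I use grp in the GROUP BY to group the data into series. I also include country_id and items in the GROUP BY, but only to avoid wrapping them in aggregate functions; they are unique per grp anyway. I also build a new daterange column by hand, because range types had no built-in aggregate function (PostgreSQL 14 and later provide range_agg, which returns a multirange).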
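The flag-and-running-total steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of the query: the names month_index and find_runs are hypothetical, the 1997 row is made up to show a break in the run, and the sketch partitions by (country_id, items), whereas the SQL partitions by country_id only.

```python
from itertools import groupby

# Duplicate observations, shaped like the output of the dups CTE:
# (items, country_id, year, month, value).
dups = [
    ("Red Apples/Apples, Red", 5, 1996, 1, 300),
    ("Red Apples/Apples, Red", 5, 1996, 2, 500),
    ("Red Apples/Apples, Red", 5, 1996, 3, 370),
    ("Red Apples/Apples, Red", 5, 1997, 1, 410),  # hypothetical, not adjacent
]

def month_index(year, month):
    """Number months consecutively so December and January are neighbours."""
    return year * 12 + (month - 1)

def find_runs(dups, min_length=3):
    """Group adjacent months into runs and keep runs of at least min_length."""
    ordered = sorted(dups, key=lambda r: (r[1], r[0], month_index(r[2], r[3])))
    labelled = []
    grp = 0
    prev_key = prev_idx = None
    for items, country_id, year, month, value in ordered:
        key = (country_id, items)
        idx = month_index(year, month)
        # Start a new group when the partition changes or the month is not
        # directly after the previous one (the role of flag g and sum(g)).
        if key != prev_key or idx != prev_idx + 1:
            grp += 1
        labelled.append((grp, items, country_id, year, month, value))
        prev_key, prev_idx = key, idx

    runs = []
    for _, members in groupby(labelled, key=lambda r: r[0]):
        members = list(members)
        if len(members) >= min_length:  # same role as HAVING count(*) >= 3
            runs.append({
                "items": members[0][1],
                "country_id": members[0][2],
                "start": (members[0][3], members[0][4]),
                "end": (members[-1][3], members[-1][4]),
                "values": [m[5] for m in members],
            })
    return runs

runs = find_runs(dups)
print(runs)
```

Numbering months consecutively plays the same role here as the daterange adjacency test in the query: a run that goes from December into January stays in one group.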

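You may need to increase work_mem before running this query, up to 1GB I'd say (depending on the number of rows in the real table). Please give it a try and let me know whether it works for you. It would be great if you could share the EXPLAIN (analyze, buffers) output for it.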
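If the table is small enough to pull into Python, which the question allows, the requirement that two items match on every date can also be checked directly. The sketch below is a different technique from the query above: it builds one "signature" per item (the set of all its points) and reports items whose signatures are equal. The name identical_series and the "Pears" rows are hypothetical.

```python
from collections import defaultdict

# Rows as (item_id, country_id, year, month, value).
rows = [
    ("Red Apples", 5, 1996, 1, 300),
    ("Red Apples", 5, 1996, 2, 500),
    ("Red Apples", 5, 1996, 3, 370),
    ("Apples, Red", 5, 1996, 1, 300),
    ("Apples, Red", 5, 1996, 2, 500),
    ("Apples, Red", 5, 1996, 3, 370),
    ("Pears", 5, 1996, 1, 300),  # hypothetical: shares one point only
    ("Pears", 5, 1996, 2, 999),
]

def identical_series(rows):
    """Return (country_id, item_ids, points) for items whose full series match."""
    # Step 1: collect every (year, month, value) point per item and country.
    series = defaultdict(set)
    for item_id, country_id, year, month, value in rows:
        series[(item_id, country_id)].add((year, month, value))

    # Step 2: use the frozen set of points as a signature and group items by it.
    by_signature = defaultdict(list)
    for (item_id, country_id), points in series.items():
        by_signature[(country_id, frozenset(points))].append(item_id)

    # Step 3: keep signatures shared by more than one item.
    return [
        (country_id, sorted(items), sorted(points))
        for (country_id, points), items in by_signature.items()
        if len(items) > 1
    ]

result = identical_series(rows)
for country_id, items, points in result:
    print(country_id, items, points)
```

Because the whole set of points has to be equal, an item that merely shares a few months with another one (like the hypothetical "Pears" rows) is not reported.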

For "python - Identifying duplicate time series in Postgres", the original question can be found on Stack Overflow: https://stackoverflow.com/questions/26366248/
