PostgreSQL: Selecting a count and a maximum from two tables

Tags: postgresql join aggregate-functions

I have two tables that are linked through a common ID column, as follows:

CREATE TABLE IF NOT EXISTS names (
    uid BIGSERIAL,
    name VARCHAR(255) NOT NULL,
    PRIMARY KEY (uid)
);
CREATE TABLE IF NOT EXISTS texts (
    name_uid BIGINT NOT NULL REFERENCES names,
    timestamp TIMESTAMP NOT NULL,
    some_value TEXT NULL
);

Here is some sample data to work with:

INSERT INTO names VALUES ( 0, '1/a' );
INSERT INTO names VALUES ( 1, '1/b' );
INSERT INTO names VALUES ( 2, '2/c' );
INSERT INTO names VALUES ( 3, '3/d' );
INSERT INTO names VALUES ( 4, '3/e' );
INSERT INTO names VALUES ( 5, '3/f' );
INSERT INTO texts VALUES ( 0, '2018-01-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 1, '2018-01-02 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 2, '2018-02-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 2, '2018-02-02 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 3, '2018-03-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 3, '2018-06-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 4, '2018-06-02 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 5, '2018-06-03 00:00:00', 'text...' );

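What I need now is to apply the following logic rules:

  • Select names.uid and names.name according to a SIMILAR TO pattern on column name in table names, and group them by a prefix
  • For the rows selected from names: get the latest timestamp entry from texts (no matter when that was)
  • For the rows selected from names: count the corresponding rows in table texts that have a specific name prefix and are after a specific date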

This can be achieved with the following query:

SELECT substring(names.name, '[^/]+' ) AS name_prefix, COALESCE( sum( text_counts.count ), 0) AS counter, max(text_timestamps.timestamp) AS timestamp
FROM names
LEFT JOIN (
    SELECT texts.name_uid, count(*)
    FROM texts
    WHERE texts.timestamp > '2018-05-01 00:00:00'
    GROUP BY texts.name_uid
) text_counts ON text_counts.name_uid = names.uid
LEFT JOIN(
    SELECT texts.name_uid, max(texts.timestamp) AS timestamp
    FROM texts
    GROUP BY texts.name_uid
) text_timestamps ON text_timestamps.name_uid = names.uid
WHERE names.name SIMILAR TO '1%|3%'
GROUP BY name_prefix

However, this query is slow. So I tried to come up with a better solution, but have failed so far. What I have is this:

SELECT name_info.name_prefix, count(*) AS counter, max(timestamp) AS timestamp
FROM texts
RIGHT JOIN (
    SELECT names.uid, substring(names.name, '[^/]+' ) AS name_prefix
    FROM names
    WHERE names.name SIMILAR TO '1%|3%'
) name_info ON name_info.uid = texts.name_uid
WHERE texts.timestamp > '2018-05-01 00:00:00'
GROUP BY name_info.name_prefix

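Compared to the first solution, this one is very fast. The problem is that rows with a count of zero are now missing from the result.

My question is how to design a query that offers performance close to query 2 but still includes the rows with a count of zero in the result.

Some context: I am using PostgreSQL 10, and table texts has roughly a million times as many rows as table names. In the real world texts is even partitioned, but I decided to skip that in this example.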

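Best answer

Because of the timestamp condition in the WHERE clause, the right join in the second query behaves like an inner join: for a name without matching rows, texts.timestamp is NULL, so the condition is not satisfied and the row is discarded. Remove the condition and use the count(*) aggregate with FILTER instead: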

SELECT 
    name_info.name_prefix, 
    count(*) FILTER (WHERE texts.timestamp > '2018-05-01 00:00:00') AS counter, 
    max(timestamp) AS timestamp
FROM texts
RIGHT JOIN (
    SELECT names.uid, substring(names.name, '[^/]+' ) AS name_prefix
    FROM names
    WHERE names.name SIMILAR TO '1%|3%'
    ) name_info ON name_info.uid = texts.name_uid 
GROUP BY name_info.name_prefix;

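To see the effect in isolation, here is a minimal sketch that inlines a reduced, hypothetical subset of the sample data as CTEs. The names n, t, ts and latest are made up for this illustration and do not appear in the original schema. It uses a LEFT JOIN starting from the names side, which is equivalent to the RIGHT JOIN above with the tables swapped:

```sql
-- Hypothetical miniature of the sample data, inlined so the snippet runs on its own.
WITH n(uid, name) AS (
    VALUES (0, '1/a'), (3, '3/d')
),
t(name_uid, ts) AS (
    VALUES (0, TIMESTAMP '2018-01-01 00:00:00'),
           (3, TIMESTAMP '2018-03-01 00:00:00'),
           (3, TIMESTAMP '2018-06-01 00:00:00')
)
SELECT
    substring(n.name, '[^/]+') AS name_prefix,
    -- FILTER restricts only what this aggregate counts; no joined row is removed
    count(*) FILTER (WHERE t.ts > '2018-05-01 00:00:00') AS counter,
    -- max() still sees every joined row, including those before the cutoff
    max(t.ts) AS latest
FROM n
LEFT JOIN t ON t.name_uid = n.uid
GROUP BY name_prefix
ORDER BY name_prefix;
```

With this data, prefix 1 should come back with a counter of 0 instead of disappearing, because the filter changes what count(*) sees, not which rows survive the join.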

You can also try two-stage grouping, for example:

select 
    name_prefix, 
    sum(counter) as counter, 
    max(timestamp) as timestamp
from (
    select 
        substring(name, '[^/]+' ) as name_prefix,
        sum((timestamp > '2018-05-01 00:00:00')::int) as counter,
        max(timestamp) as timestamp
    from texts
    -- inner join: a name with no rows at all in texts produces no group here
    join names on name_uid = uid
    where name similar to '1%|3%'
    group by uid
    ) s
group by name_prefix
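
The two-stage query uses an inner join, so a name that has no rows in texts at all would not appear in its result. If such names can occur and should be reported with a count of zero, one possible variant is sketched below. It is an assumption on my part rather than part of the original answer: it starts from names, uses a LEFT JOIN, and swaps the sum over a boolean cast for count(*) with FILTER. The alias latest is hypothetical.

```sql
-- Sketch only: same two-stage idea, but starting from names with a LEFT JOIN
-- so that a name without any row in texts still yields a group.
SELECT
    name_prefix,
    sum(counter) AS counter,
    max(latest)  AS "timestamp"
FROM (
    SELECT
        substring(names.name, '[^/]+') AS name_prefix,
        -- counts 0 for names with no texts, since NULL fails the filter
        count(*) FILTER (WHERE texts.timestamp > '2018-05-01 00:00:00') AS counter,
        max(texts.timestamp) AS latest
    FROM names
    LEFT JOIN texts ON texts.name_uid = names.uid
    WHERE names.name SIMILAR TO '1%|3%'
    GROUP BY names.uid  -- uid is the primary key, so names.name may be selected
) s
GROUP BY name_prefix;
```

Whether this is faster or slower than the single-stage query depends on the data and the available indexes, so it is worth comparing the plans with EXPLAIN (ANALYZE) on the real tables before choosing one.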

This question and answer originate from Stack Overflow: https://stackoverflow.com/questions/50863563/
