sql - How to order distinct tuples in a PostgreSQL query

Tags: sql postgresql distinct-on

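I'm trying to write a query in Postgres that returns only distinct tuples. In my sample query, I don't want duplicate rows when the same cluster_id/feed_id combination occurs more than once. If I do a simple: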

select distinct on (cluster_info.cluster_id, feed_id) 
   cluster_info.cluster_id, num_docs, feed_id, url_time 
   from url_info 
   join cluster_info on (cluster_info.cluster_id = url_info.cluster_id) 
   where feed_id in (select pot_seeder from potentials) 
   and num_docs > 5 and url_time > '2012-04-16';

That works, but I would also like to group according to num_docs. So when I run the following:

select distinct on (cluster_info.cluster_id, feed_id) 
   cluster_info.cluster_id, num_docs, feed_id, url_time 
   from url_info join cluster_info 
   on (cluster_info.cluster_id = url_info.cluster_id) 
   where feed_id in (select pot_seeder from potentials) 
   and num_docs > 5 and url_time > '2012-04-16' 
   order by num_docs desc;

I get the following error:

ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: select distinct on (cluster_info.cluster_id, feed_id) cluste...

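I think I understand why I'm getting the error (I can't group according to a tuple unless I explicitly describe the group somehow), but how do I do that? Or, if my interpretation of the error is wrong, is there another way to accomplish my initial goal?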

Best answer

The leftmost ORDER BY items can't disagree with the items of the DISTINCT ON clause. I quote the manual about DISTINCT:

The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
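
To make the rule concrete, here is a minimal sketch on a throwaway table. The table name `demo` and its rows are made up for illustration and are not part of the question's schema. It shows an ORDER BY that starts with the DISTINCT ON expressions and then uses an extra expression to decide which row of each group survives.

```sql
-- Hypothetical table and data, for illustration only.
CREATE TEMP TABLE demo (cluster_id int, feed_id int, num_docs int);
INSERT INTO demo VALUES
    (1, 10, 7),
    (1, 10, 9),
    (2, 20, 6);

-- Valid: the ORDER BY starts with the DISTINCT ON expressions.
-- The trailing num_docs DESC keeps the row with the highest num_docs per group.
SELECT DISTINCT ON (cluster_id, feed_id)
       cluster_id, feed_id, num_docs
FROM   demo
ORDER  BY cluster_id, feed_id, num_docs DESC;
-- returns (1, 10, 9) and (2, 20, 6)

-- Invalid: the leftmost ORDER BY expression is not a DISTINCT ON expression,
-- so PostgreSQL raises the error quoted in the question.
-- SELECT DISTINCT ON (cluster_id, feed_id)
--        cluster_id, feed_id, num_docs
-- FROM   demo
-- ORDER  BY num_docs DESC;
```

Without the trailing `num_docs DESC`, which of the duplicate rows is returned for a group is unpredictable.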

Try:

SELECT *
FROM  (
    SELECT DISTINCT ON (c.cluster_id, feed_id) 
           c.cluster_id, num_docs, feed_id, url_time 
    FROM   url_info u
    JOIN   cluster_info c ON (c.cluster_id = u.cluster_id) 
    WHERE  feed_id IN (SELECT pot_seeder FROM potentials) 
    AND    num_docs > 5
    AND    url_time > '2012-04-16'
    ORDER  BY c.cluster_id, feed_id, num_docs, url_time
           -- first columns match DISTINCT
           -- the rest to pick certain values for dupes
           -- or did you want to pick random values for dupes?
    ) x
ORDER  BY num_docs DESC;

Or use GROUP BY:

SELECT c.cluster_id
     , num_docs
     , feed_id
     , url_time 
FROM   url_info u
JOIN   cluster_info c ON (c.cluster_id = u.cluster_id) 
WHERE  feed_id IN (SELECT pot_seeder FROM potentials) 
AND    num_docs > 5
AND    url_time > '2012-04-16'
GROUP  BY c.cluster_id, feed_id 
ORDER  BY num_docs DESC;

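If c.cluster_id, feed_id are the primary key columns of all tables (both, in this case) whose columns you include in the SELECT list, then this just works in PostgreSQL 9.1 or later, because the remaining columns are functionally dependent on the grouped primary key.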

Otherwise you need to GROUP BY the rest of the columns, aggregate them, or provide more information.
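
As a rough sketch of the "aggregate" option, the query below groups by the two key columns and aggregates the others. The table `joined_demo` and its rows are hypothetical stand-ins for the joined url_info / cluster_info rows, and using max() for both columns is only an assumption about which values are wanted.

```sql
-- Hypothetical stand-in for the joined url_info / cluster_info rows.
CREATE TEMP TABLE joined_demo (
    cluster_id int,
    feed_id    int,
    num_docs   int,
    url_time   date
);
INSERT INTO joined_demo VALUES
    (1, 10, 7, '2012-04-17'),
    (1, 10, 9, '2012-04-18'),
    (2, 20, 6, '2012-04-19');

-- One row per (cluster_id, feed_id); every other column is aggregated,
-- so no primary key dependency is needed.
SELECT cluster_id
     , feed_id
     , max(num_docs) AS max_num_docs
     , max(url_time) AS latest_url_time
FROM   joined_demo
GROUP  BY cluster_id, feed_id
ORDER  BY max_num_docs DESC;
-- returns (1, 10, 9, 2012-04-18) and (2, 20, 6, 2012-04-19)
```

Note that each max() is computed independently, so the two aggregated values of a group may come from different source rows. If all values must come from the same row, DISTINCT ON is the better fit.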

Source: this question and answer appear on Stack Overflow: https://stackoverflow.com/questions/10261627/
