sql - 需要 SQL 优化(也许 DISTINCT ON 是原因?)

标签 sql postgresql query-optimization aggregate-functions postgis

相关的,前面的问题:
Select a random entry from a group after grouping by a value (not column)?

我当前的查询如下所示:

WITH
  points AS (
    SELECT unnest(array_of_points) AS p
  ),

 gtps AS (
   SELECT DISTINCT ON(points.p)
     points.p, m.groundtruth
   FROM measurement m, points
   WHERE st_distance(m.groundtruth, points.p) < distance
   ORDER BY points.p, RANDOM()
 )

SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
  m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()

语义:
  • 有两个输入值:
  • 第 4 行:点数组 array_of_points
  • 第 12 行: double 数:distance
  • 第一段(第 1-6 行):
  • 从点数组创建一个表,用于...
  • 第二段(第 8-14 行):
  • 对于 points 内的每个点表:得到一个随机(!)groundtruth点从measurement表,距离 < distance
  • 将这些元组保存在 gtps 中表
  • 第三段(第 16-19 行):
  • 每个groundtruth gtps 内的值表:获取所有 anchor_id值和...
  • 如果 anchor_id值不是唯一的:然后随机选择一个
  • 输出:id , anchor_id , groundtruth , p (来自 array_of_points 的输入值)

  • 示例表:
    id | anchor_id | groundtruth | data
    -----------------------------------
    1  | 1         | POINT(1 4)  | ...
    2  | 3         | POINT(1 4)  | ...
    3  | 8         | POINT(1 4)  | ...
    4  | 6         | POINT(1 4)  | ...
    -----------------------------------
    5  | 2         | POINT(3 2)  | ...
    6  | 4         | POINT(3 2)  | ...
    -----------------------------------
    7  | 1         | POINT(4 3)  | ...
    8  | 1         | POINT(4 3)  | ...
    9  | 6         | POINT(4 3)  | ...
    10 | 7         | POINT(4 3)  | ...
    11 | 3         | POINT(4 3)  | ...
    -----------------------------------
    12 | 1         | POINT(6 2)  | ...
    13 | 5         | POINT(6 2)  | ...
    

    示例结果:
    id  | anchor_id | groundtruth | p
    -----------------------------------------
    1   | 1         | POINT(1 4)  | POINT(1 0)
    2   | 3         | POINT(1 4)  | POINT(1 0)
    4   | 6         | POINT(1 4)  | POINT(1 0)
    3   | 8         | POINT(1 4)  | POINT(1 0)
    5   | 2         | POINT(3 2)  | POINT(2 2)
    6   | 4         | POINT(3 2)  | POINT(2 2)
    1   | 1         | POINT(1 4)  | POINT(4 8)
    2   | 3         | POINT(1 4)  | POINT(4 8)
    4   | 6         | POINT(1 4)  | POINT(4 8)
    3   | 8         | POINT(1 4)  | POINT(4 8)
    12  | 1         | POINT(6 2)  | POINT(7 3)
    13  | 5         | POINT(6 2)  | POINT(7 3)
    1   | 1         | POINT(4 3)  | POINT(9 1)
    11  | 3         | POINT(4 3)  | POINT(9 1)
    9   | 6         | POINT(4 3)  | POINT(9 1)
    10  | 7         | POINT(4 3)  | POINT(9 1)
    

    如你看到的:
  • 每个输入值可以有多个相等的 groundtruth值。
  • 如果一个输入值有多个 groundtruth值,这些值都必须相等。
  • 每个 groundtruth-inputPoint-tuple 都与每个可能的连接 anchor_id对于那个基本事实。
  • 两个不同的输入值可以有相同的对应 groundtruth值(value)。
  • 两个不同的groundtruth-inputPoint-tuples可以有相同的anchor_id
  • 两个相同的groundtruth-inputPoint-tuples必须有不同的anchor_id s

  • 基准 (对于两个输入值):
  • 第 1-6 行:16 毫秒
  • 第 8-14 行:48 毫秒
  • 第 16-19 行:600 毫秒

  • 详细解释:
    Unique  (cost=11119.32..11348.33 rows=18 width=72)
      Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
      CTE points
        ->  Result  (cost=0.00..0.01 rows=1 width=0)
              Output: unnest('{0101000000EE7C3F355EF24F4019390B7BDA011940:01010000003480B74082FA44402CD49AE61D173C40}'::geometry[])
      CTE gtps
        ->  Unique  (cost=7659.95..7698.12 rows=1 width=160)
              Output: points.p, m.groundtruth, (random())
              ->  Sort  (cost=7659.95..7679.04 rows=7634 width=160)
                    Output: points.p, m.groundtruth, (random())
                    Sort Key: points.p, (random())
                    ->  Nested Loop  (cost=0.00..6565.63 rows=7634 width=160)
                          Output: points.p, m.groundtruth, random()
                          Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
                          ->  CTE Scan on points  (cost=0.00..0.02 rows=1 width=32)
                                Output: points.p
                          ->  Seq Scan on public.measurement m  (cost=0.00..535.01 rows=22901 width=132)
                                Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
      ->  Sort  (cost=3421.18..3478.43 rows=22901 width=72)
            Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
            Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
            ->  Nested Loop  (cost=0.00..821.29 rows=22901 width=72)
                  Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, random()
                  ->  CTE Scan on gtps  (cost=0.00..0.02 rows=1 width=64)
                        Output: gtps.p, gtps.groundtruth
                  ->  Seq Scan on public.measurement m  (cost=0.00..535.01 rows=22901 width=8)
                        Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
    

    解释分析:
    Unique  (cost=11119.32..11348.33 rows=18 width=72) (actual time=548.991..657.992 rows=36 loops=1)
      CTE points
        ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.004..0.011 rows=2 loops=1)
      CTE gtps
        ->  Unique  (cost=7659.95..7698.12 rows=1 width=160) (actual time=133.416..146.745 rows=2 loops=1)
              ->  Sort  (cost=7659.95..7679.04 rows=7634 width=160) (actual time=133.415..142.255 rows=15683 loops=1)
                    Sort Key: points.p, (random())
                    Sort Method: external merge  Disk: 1248kB
                    ->  Nested Loop  (cost=0.00..6565.63 rows=7634 width=160) (actual time=0.045..46.670 rows=15683 loops=1)
                          Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
                          ->  CTE Scan on points  (cost=0.00..0.02 rows=1 width=32) (actual time=0.007..0.020 rows=2 loops=1)
                          ->  Seq Scan on measurement m  (cost=0.00..535.01 rows=22901 width=132) (actual time=0.013..3.902 rows=22901 loops=2)
      ->  Sort  (cost=3421.18..3478.43 rows=22901 width=72) (actual time=548.989..631.323 rows=45802 loops=1)
            Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())"
            Sort Method: external merge  Disk: 4008kB
            ->  Nested Loop  (cost=0.00..821.29 rows=22901 width=72) (actual time=133.449..166.294 rows=45802 loops=1)
                  ->  CTE Scan on gtps  (cost=0.00..0.02 rows=1 width=64) (actual time=133.420..146.753 rows=2 loops=1)
                  ->  Seq Scan on measurement m  (cost=0.00..535.01 rows=22901 width=8) (actual time=0.014..4.409 rows=22901 loops=2)
    Total runtime: 834.626 ms
    

    实时运行时,这应该使用大约 100-1000 个输入值运行。所以现在需要 35 到 350 秒,这太长了。

    我已经尝试删除 RANDOM()职能。这将运行时间(对于 2 个输入值)从大约 670 毫秒减少到大约 530 毫秒。所以这不是目前的主要影响。

    如果更容易/更快,也可以运行 2 或 3 个单独的查询并在软件中执行某些部分(它在 Ruby on Rails 服务器上运行)。例如随机选择?!

    工作正在进行中:
    SELECT
      m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
    FROM
      measurement m
    JOIN
      (SELECT unnest(point_array) AS p) AS ps
      ON ST_DWithin(ps.p, m.groundtruth, distance)
    GROUP BY groundtruth, ps.p
    

    有了这个查询,它非常快( 15ms ),但是缺少很多:
  • 我只需要为每个 ps.p 随机一行
  • 这两个数组彼此属于。意思是:里面元素的顺序很重要!
  • 这两个数组需要过滤(随机):
    每个anchor_id在多次出现的数组中:保留一个随机一个并删除所有其他的。这也意味着删除相应的id来自 id -array 用于每个删除的 anchor_id

  • 如果 anchor_id 也不错和 id可以存储在一个元组数组中。例如:{[4,1],[6,3],[4,2],[8,5],[4,4]} (约束:每个元组都是唯一的,每个 id(示例中的 == 2nd 值)都是唯一的,anchor_ids 不是唯一的)。此示例显示没有仍必须应用的过滤器的查询。应用过滤器后,它看起来像这样 {[6,3],[4,4],[8,5]} .

    正在进行的工作二:
    SELECT DISTINCT ON (ps.p)
      m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
    FROM
      measurement m
    JOIN
      (SELECT unnest(point_array) AS p) AS ps
      ON ST_DWithin(ps.p, m.groundtruth, distance)
    GROUP BY ps.p, m.groundtruth
    ORDER BY ps.p, RANDOM()
    

    这现在给出了非常好的结果并且仍然非常快: 16 毫秒
    只剩下一件事要做:
  • ARRAY_AGG(m.anchor_id)已经随机化,但是:
  • 它包含很多重复的条目,所以:
  • 我想在上面使用类似 DISTINCT 的东西,但是:
  • 它必须与 ARRAY_AGG(m.id) 同步.这意味着:
    如果 DISTINCT 命令保留 anchor_id 的索引 1、4 和 7数组,那么它还必须保留 id 的索引 1、4 和 7数组(当然也删除所有其他的)
  • 最佳答案

    It would also be nice if anchor_id and id could be stored inside an array of tuples.



    多维数组的聚合函数

    我想你创建了一个 二维数组为了那个原因。这比 ARRAY of record 更容易处理.标准array_agg()无法聚合多维数组。但是你可以很容易地编写自己的聚合函数:
    CREATE AGGREGATE array_agg_mult (anyarray)  (
        SFUNC     = array_cat
       ,STYPE     = anyarray
       ,INITCOND  = '{}'
    );
    

    阅读此相关答案中的解释:
    Selecting data into a Postgres array

    For each anchor_id in the array that appears more than once: keep a random one and delete all other. This also means to remove the corresponding id from the id-array for every deleted anchor_id



    询问
    SELECT DISTINCT ON (p)
           p, groundtruth, array_agg_mult(ARRAY[ARRAY[anchor_id, id]]) AS ids
    FROM (
       SELECT DISTINCT ON (ps.p, m.groundtruth, m.anchor_id)
              ps.p, m.groundtruth, m.anchor_id, m.id
       FROM  (SELECT unnest(point_array) AS p) AS ps
       JOIN   measurement m ON ST_DWithin(ps.p, m.groundtruth, distance)
       ORDER  BY ps.p, m.groundtruth, m.anchor_id, random()
       ) x
    GROUP  BY p, groundtruth
    ORDER  BY p, random();
    
  • 子查询 x变得不同 anchor_id(p, groundtruth)如果有多个对等点,则随机选择一行。这样连接anchor_id - id保持完整。
  • 外部查询按照您的意愿聚合一个二维数组,按 anchor_id 排序。 .如果你想拥有 anchor_id随机排序,再次使用随机:
    array_agg_mult(ARRAY[ARRAY[anchor_id, id]] ORDER BY random())
    
  • 最后,DISTINCT ON仅选择 1 groundtruthp ,再随机。
  • 关于sql - 需要 SQL 优化(也许 DISTINCT ON 是原因?),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/15099242/

    相关文章:

    json - Postgres 加入 JSON 列表

    java - 2019年客户端-服务器之间的高效数据库同步

    sql - 一系列日期的最近记录

    java - PostgreSQL(奇怪): NoClassDefFoundError: com/mysql/jdbc/StringUtils

    sql - 最佳用户角色权限数据库设计实践?

    mysql - 优化 MySQL NOT IN( 查询

    java - 优化大表的选择查询?

    PHP 和 mySQL 不工作

    php - 无法创建具有空登录名的用户