PostgreSQL - How is the cost of a Sort node in a query plan calculated?

Tags: postgresql

I have the following query plan in PostgreSQL:

Unique  (cost=487467.14..556160.88 rows=361546 width=1093)
  ->  Sort  (cost=487467.14..488371.00 rows=361546 width=1093)
        Sort Key: (..)
        ->  Append  (cost=0.42..108072.53 rows=361546 width=1093)
              ->  Index Scan using (..)  (cost=0.42..27448.06 rows=41395 width=1093)
                    Index Cond: (..)
                    Filter: (..)
              ->  Seq Scan on (..)  (cost=0.00..77009.02 rows=320151 width=1093)
                    Filter: (..)

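I would just like to know exactly how the two cost values on the Sort node are calculated. I understand how the scans and the Append work, but I cannot find anything about how the sort cost is computed.

Something analogous to this formula for a Seq Scan: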

(disk pages read * seq_page_cost) + (rows scanned * cpu_tuple_cost)
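To make the Seq Scan formula above concrete, here is a minimal Python sketch. The function name and the page and row counts are hypothetical; the defaults seq_page_cost = 1.0 and cpu_tuple_cost = 0.01 are the stock PostgreSQL settings. It deliberately ignores the per-row cost of evaluating a Filter, which the planner adds on top.

```python
def seq_scan_cost(pages, rows, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    """Total cost of a sequential scan with no filter (simplified sketch)."""
    disk_cost = pages * seq_page_cost   # one sequential page fetch per page
    cpu_cost = rows * cpu_tuple_cost    # per-row processing charge
    return disk_cost + cpu_cost


# Hypothetical table: 10,000 pages holding 1,000,000 rows.
print(seq_scan_cost(10_000, 1_000_000))
```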

The planned query is basically this (not exactly, because it involves a view, but you get the idea):

SELECT * FROM (
  SELECT *, true AS storniert
    FROM auftragsposition
    WHERE mengestorniert > 0::numeric AND auftragbestaetigt = true
  UNION
  SELECT *, false AS storniert
    FROM auftragsposition
    WHERE mengestorniert < menge AND auftragbestaetigt = true
) as bla

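Best answer

It is implemented (and documented, since the source code is often the only documentation) in the cost_sort() function in src/backend/optimizer/path/costsize.c. The basic cost is about N*log(N) comparison operations for an in-memory sort; a disk-based sort is likely to be slower, and its cost is estimated as well.

This N*log(N) is to be expected (see https://en.wikipedia.org/wiki/Sorting_algorithm#Efficient_sorts: "general sorting algorithms are almost always based on an algorithm with average time complexity ... O(n log n)"):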

https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/optimizer/path/costsize.c#L1409

/*
 * cost_sort
 *    Determines and returns the cost of sorting a relation, including
 *    the cost of reading the input data.
 *
 * If the total volume of data to sort is less than sort_mem, we will do
 * an in-memory sort, which requires no I/O and about t*log2(t) tuple
 * comparisons for t tuples.
 *
 * If the total volume exceeds sort_mem, we switch to a tape-style merge
 * algorithm.  There will still be about t*log2(t) tuple comparisons in
 * total, but we will also need to write and read each tuple once per
 * merge pass.  We expect about ceil(logM(r)) merge passes where r is the
 * number of initial runs formed and M is the merge order used by tuplesort.c.
 * Since the average initial run should be about sort_mem, we have
 *      disk traffic = 2 * relsize * ceil(logM(p / sort_mem))
 *      cpu = comparison_cost * t * log2(t)
 *
 * If the sort is bounded (i.e., only the first k result tuples are needed)
 * and k tuples can fit into sort_mem, we use a heap method that keeps only
 * k tuples in the heap; this will require about t*log2(k) tuple comparisons.
 *
 * The disk traffic is assumed to be 3/4ths sequential and 1/4th random
 * accesses (XXX can't we refine that guess?)
 *
 * By default, we charge two operator evals per tuple comparison, which should
 * be in the right ballpark in most cases.  The caller can tweak this by
 * specifying nonzero comparison_cost; typically that's used for any extra
 * work that has to be done to prepare the inputs to the comparison operators.
 *
 * 'pathkeys' is a list of sort keys
 * 'input_cost' is the total cost for reading the input data
 * 'tuples' is the number of tuples in the relation
 * 'width' is the average tuple width in bytes
 * 'comparison_cost' is the extra cost per comparison, if any
 * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
 * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
 *
 * NOTE: some callers currently pass NIL for pathkeys because they
 * can't conveniently supply the sort keys.  Since this routine doesn't
 * currently do anything with pathkeys anyway, that doesn't matter...
 * but if it ever does, it should react gracefully to lack of key data.
 * (Actually, the thing we'd most likely be interested in is just the number
 * of sort keys, which all callers *could* supply.)
 */
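The comparison counts named in that comment can be sketched in a few lines of Python. The function names are hypothetical and the numbers only illustrate the formulas t*log2(t) and t*log2(k); they are not taken from PostgreSQL itself.

```python
import math


def comparisons_full(t):
    """About t*log2(t) comparisons to sort all t tuples."""
    return t * math.log2(t)


def comparisons_top_k(t, k):
    """About t*log2(k) comparisons when only the first k tuples are needed."""
    return t * math.log2(k)


# 1024 input tuples: a full sort versus keeping only the top 16.
print(comparisons_full(1024))       # 1024 * 10 = 10240.0
print(comparisons_top_k(1024, 16))  # 1024 * 4  = 4096.0
```

The bounded case grows with log2(k) rather than log2(t), which is why a small LIMIT makes a sort much cheaper in the estimate.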

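Here is part of the actual calculation: disk sort, heap sort, quicksort. It looks like there is no estimate for parallel sorts yet (https://wiki.postgresql.org/wiki/Parallel_Internal_Sort, https://wiki.postgresql.org/wiki/Parallel_External_Sort)?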

...
    path->rows = tuples;

    /*
     * We want to be sure the cost of a sort is never estimated as zero, even
     * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
     */
    if (tuples < 2.0)
        tuples = 2.0;

    /* Include the default cost-per-comparison */
    comparison_cost += 2.0 * cpu_operator_cost;

...
    if (output_bytes > sort_mem_bytes)
    {
...
        /*
         * We'll have to use a disk-based sort of all the tuples
         */
        /*
         * CPU costs
         *
         * Assume about N log2 N comparisons
         */
        startup_cost += comparison_cost * tuples * LOG2(tuples);


        /* Disk costs */

        /* Compute logM(r) as log(r) / log(M) */
        if (nruns > mergeorder)
            log_runs = ceil(log(nruns) / log(mergeorder));
        else
            log_runs = 1.0;
        npageaccesses = 2.0 * npages * log_runs;
        /* Assume 3/4ths of accesses are sequential, 1/4th are not */
        startup_cost += npageaccesses *
            (seq_page_cost * 0.75 + random_page_cost * 0.25);
    }
    else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
    {
        /*
         * We'll use a bounded heap-sort keeping just K tuples in memory, for
         * a total number of tuple comparisons of N log2 K; but the constant
         * factor is a bit higher than for quicksort.  Tweak it so that the
         * cost curve is continuous at the crossover point.
         */
        startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
    }
    else
    {
        /* We'll use plain quicksort on all the input tuples */
        startup_cost += comparison_cost * tuples * LOG2(tuples);
    }

    /*
     * Also charge a small amount (arbitrarily set equal to operator cost) per
     * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
     * doesn't do qual-checking or projection, so it has less overhead than
     * most plan nodes.  Note it's correct to use tuples not output_tuples
     * here --- the upper LIMIT will pro-rate the run cost so we'd be double
     * counting the LIMIT otherwise.
     */
    run_cost += cpu_operator_cost * tuples;
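As a cross-check, the excerpt can be transcribed into a short Python sketch and fed the Sort node's numbers from the question (input cost 108072.53, 361546 rows, width 1093). Several pieces are assumptions, not shown in the excerpt: the helper names, the default settings (work_mem = 4MB, cpu_operator_cost = 0.0025, seq_page_cost = 1.0, random_page_cost = 4.0), an 8 kB block size, a per-tuple size of MAXALIGN(width) plus a 24-byte aligned tuple header, and the merge-order constants as I recall them from the 9.6 tuplesort.c. Under those assumptions the result lands within about 0.01 of the plan's 487467.14..488371.00, which suggests the planner costed a disk-based sort here.

```python
import math

BLCKSZ = 8192        # assumed default block size, bytes
TUPLE_HEADER = 24    # assumed MAXALIGN(SizeofHeapTupleHeader)


def maxalign(n, align=8):
    """Round n up to the next multiple of align (assumed 8 on 64-bit)."""
    return (n + align - 1) // align * align


def merge_order(sort_mem_bytes):
    """Assumed to mirror tuplesort_merge_order() in 9.6 (recalled, not quoted)."""
    tape_overhead = 3 * BLCKSZ
    merge_buffer = 32 * BLCKSZ
    return max((sort_mem_bytes - tape_overhead) // (merge_buffer + tape_overhead), 6)


def sort_cost(input_cost, tuples, width, work_mem_kb=4096, limit_tuples=-1,
              cpu_operator_cost=0.0025, seq_page_cost=1.0, random_page_cost=4.0):
    """Return (startup_cost, total_cost) following the cost_sort() excerpt."""
    tuples = max(tuples, 2.0)                  # never take log of < 2 tuples
    comparison_cost = 2.0 * cpu_operator_cost  # default cost per comparison
    tuple_bytes = maxalign(width) + TUPLE_HEADER
    input_bytes = tuples * tuple_bytes
    sort_mem_bytes = work_mem_kb * 1024
    output_tuples = limit_tuples if 0 < limit_tuples < tuples else tuples
    output_bytes = output_tuples * tuple_bytes

    startup = input_cost
    if output_bytes > sort_mem_bytes:
        # Disk-based sort: CPU comparisons plus page traffic per merge pass.
        startup += comparison_cost * tuples * math.log2(tuples)
        npages = math.ceil(input_bytes / BLCKSZ)
        nruns = input_bytes / sort_mem_bytes
        morder = merge_order(sort_mem_bytes)
        if nruns > morder:
            log_runs = math.ceil(math.log(nruns) / math.log(morder))
        else:
            log_runs = 1.0
        npageaccesses = 2.0 * npages * log_runs
        startup += npageaccesses * (seq_page_cost * 0.75 + random_page_cost * 0.25)
    elif tuples > 2 * output_tuples or input_bytes > sort_mem_bytes:
        # Bounded heap sort keeping only output_tuples in memory.
        startup += comparison_cost * tuples * math.log2(2.0 * output_tuples)
    else:
        # Plain in-memory quicksort.
        startup += comparison_cost * tuples * math.log2(tuples)

    run = cpu_operator_cost * tuples           # small per-extracted-tuple charge
    return startup, startup + run


startup, total = sort_cost(108072.53, 361546, 1093)
print(f"Sort  (cost={startup:.2f}..{total:.2f})")
```

With these inputs the CPU term is only about 33,378 cost units, while the disk term (two merge passes over 49,431 pages) contributes about 346,017, so the I/O estimate dominates. Raising work_mem far enough to keep the sort in memory would remove that term from the estimate.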

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/43916451/
