sql - Count on join of big tables with conditions is slow

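This query ran in reasonable time while the tables were small. I am trying to identify the bottleneck, but I am not sure how to analyze the EXPLAIN results.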

SELECT
  COUNT(*)
FROM performance_analyses
INNER JOIN total_sales ON total_sales.id = performance_analyses.total_sales_id
WHERE
  (size > 0) AND
  total_sales.customer_id IN (
    SELECT customers.id FROM customers WHERE customers.active = 't'
    AND customers.visible = 't' AND customers.organization_id = 3
  ) AND
  total_sales.product_category_id IN (
    SELECT product_categories.id FROM product_categories
    WHERE product_categories.organization_id = 3
  ) AND
  total_sales.period_id = 193;

I have tried both approaches: INNER JOIN'ing the customers and product_categories tables, and using inner SELECTs as shown. Both took the same time.

Here is the link to the EXPLAIN output: https://explain.depesz.com/s/9lhr

Postgres version:

PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit

Tables and indexes:

CREATE TABLE total_sales (
  id serial NOT NULL,
  value double precision,
  start_date date,
  end_date date,
  product_category_customer_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  processed boolean,
  customer_id integer,
  product_category_id integer,
  period_id integer,
  CONSTRAINT total_sales_pkey PRIMARY KEY (id)
);
CREATE INDEX index_total_sales_on_customer_id ON total_sales (customer_id);
CREATE INDEX index_total_sales_on_period_id ON total_sales (period_id);
CREATE INDEX index_total_sales_on_product_category_customer_id ON total_sales (product_category_customer_id);
CREATE INDEX index_total_sales_on_product_category_id ON total_sales (product_category_id);
CREATE INDEX total_sales_product_category_period ON total_sales (product_category_id, period_id);
CREATE INDEX ts_pid_pcid_cid ON total_sales (period_id, product_category_id, customer_id);


CREATE TABLE performance_analyses (
  id serial NOT NULL,
  total_sales_id integer,
  status_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  size double precision,
  period_size integer,
  nominal_variation double precision,
  percentual_variation double precision,
  relative_performance double precision,
  time_ago_max integer,
  deseasonalized_series text,
  significance character varying,
  relevance character varying,
  original_variation double precision,
  last_level double precision,
  quantiles text,
  range text,
  analysis_method character varying,
  CONSTRAINT performance_analyses_pkey PRIMARY KEY (id)
);
CREATE INDEX index_performance_analyses_on_status_id ON performance_analyses (status_id);
CREATE INDEX index_performance_analyses_on_total_sales_id ON performance_analyses (total_sales_id);


CREATE TABLE product_categories (
  id serial NOT NULL,
  name character varying,
  organization_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  external_id character varying,
  CONSTRAINT product_categories_pkey PRIMARY KEY (id)
);
CREATE INDEX index_product_categories_on_organization_id ON product_categories (organization_id);


CREATE TABLE customers (
  id serial NOT NULL,
  name character varying,
  external_id character varying,
  region_id integer,
  organization_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  active boolean DEFAULT false,
  visible boolean DEFAULT false,
  segment_id integer,
  "group" boolean,
  group_id integer,
  ticket_enabled boolean DEFAULT true,
  CONSTRAINT customers_pkey PRIMARY KEY (id)
);
CREATE INDEX index_customers_on_organization_id ON customers (organization_id);    
CREATE INDEX index_customers_on_region_id ON customers (region_id);
CREATE INDEX index_customers_on_segment_id ON customers (segment_id);

Row counts:

customers - 6,970 rows

product_categories - 34 rows

performance_analyses - 1,012,346 rows

total_sales - 7,104,441 rows

Best answer

Your query, rewritten and 100 % equivalent:

SELECT count(*)
FROM   product_categories   pc 
JOIN   customers            c  USING (organization_id) 
JOIN   total_sales          ts ON ts.customer_id = c.id
JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
WHERE  pc.organization_id = 3
AND    c.active  -- boolean can be used directly
AND    c.visible
AND    ts.product_category_id = pc.id
AND    ts.period_id = 193
AND    pa.size > 0;

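Another answer advises moving all conditions into join clauses and ordering the tables in the FROM list. That may apply to some other RDBMS with a comparatively primitive query planner. It does not hurt in Postgres, but it has no effect on the performance of your query either, assuming default server configuration. The manual: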

Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN) is semantically the same as listing the input relations in FROM, so it does not constrain the join order.

Emphasis mine. There is more; read the manual.

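The key setting is join_collapse_limit (default 8). The Postgres query planner will rearrange your 4 tables in whatever way it expects to be fastest, no matter how you order the tables and whether you write conditions as WHERE or JOIN clauses. It makes no difference whatsoever. (The same is not true for some other join types that cannot be rearranged freely.)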
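To see the effect of join_collapse_limit for yourself, a minimal sketch follows. It assumes the tables from the question and only changes the setting inside a transaction that is rolled back. With the limit at 1, the planner keeps explicit JOINs in the order written, so comparing this plan with the default one shows whether reordering matters for this query.

```sql
SHOW join_collapse_limit;             -- 8 unless the server configuration changed it

BEGIN;
SET LOCAL join_collapse_limit = 1;    -- planner follows the written JOIN order
EXPLAIN
SELECT count(*)
FROM   product_categories   pc
JOIN   customers            c  USING (organization_id)
JOIN   total_sales          ts ON ts.customer_id = c.id
JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
WHERE  pc.organization_id = 3
AND    c.active
AND    c.visible
AND    ts.product_category_id = pc.id
AND    ts.period_id = 193
AND    pa.size > 0;
ROLLBACK;                             -- SET LOCAL ends with the transaction
```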

The important point is that these different join possibilities give semantically equivalent results but might have hugely different execution costs. Therefore, the planner will explore all of them to try to find the most efficient query plan.

Related:

Sample Query to show Cardinality estimation error in PostgreSQL

A: Slow fulltext search due to wildly inaccurate row estimates

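Finally, WHERE id IN (<subquery>) is not generally equivalent to a join. It does not multiply rows on the left side for duplicate matching values on the right side, and the columns of the subquery are not visible to the rest of the query. A join can multiply rows with duplicate values, and its columns are visible.
Your simple subqueries return a single unique column in both cases, so there is no effective difference here, except that IN (<subquery>) is generally (at least a bit) slower and more verbose. Use joins.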
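The row-multiplication difference can be shown with a small self-contained sketch. The CTE names l and r and their values are made up for illustration and have nothing to do with the tables in the question.

```sql
WITH l(id) AS (VALUES (1), (2)),        -- left side: unique ids
     r(id) AS (VALUES (1), (1), (3))    -- right side: id 1 appears twice
SELECT
  (SELECT count(*) FROM l WHERE l.id IN (SELECT r.id FROM r)) AS in_count,   -- 1
  (SELECT count(*) FROM l JOIN r ON r.id = l.id)              AS join_count; -- 2
```

The IN form counts the left row with id 1 once, while the join counts it once per matching right row. When the right-hand column is unique, as with the primary keys in the question, the two counts agree.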

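Your query

Indexes

product_categories has 34 rows. Unless you plan on adding many more, indexes do not help performance for this table. A sequential scan will always be faster. Drop index_product_categories_on_organization_id.

customers has 6,970 rows. Indexes start to make sense. But according to the EXPLAIN output, your query uses 4,988 of them. Only an index-only scan on an index much smaller than the table could help a bit. Assuming WHERE active AND visible are constant predicates, I suggest a partial multicolumn index: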

-- The posted schema already has an index with this name: drop it first or pick another name.
CREATE INDEX index_customers_on_organization_id ON customers (organization_id, id)
WHERE active AND visible;

I appended id to allow index-only scans. The column is otherwise useless in the index for this query.

total_sales has 7,104,441 rows. Indexes are very important. I suggest:

-- The posted schema already has an index with this name: drop it first or pick another name.
CREATE INDEX index_total_sales_on_product_category_customer_id
ON total_sales (period_id, product_category_id, customer_id, id);

Again, the aim is an index-only scan. This is the most important one.

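You can drop the completely redundant index index_total_sales_on_product_category_id: its only column is the leading column of total_sales_product_category_period.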
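The two index drops suggested above could look like the following sketch. Plain DROP INDEX briefly takes an exclusive lock on the table; on a busy system DROP INDEX CONCURRENTLY (which cannot run inside a transaction block) may be the safer choice.

```sql
-- Covered by total_sales_product_category_period (product_category_id, period_id)
DROP INDEX index_total_sales_on_product_category_id;

-- 34-row table: a sequential scan is cheaper than any index access
DROP INDEX index_product_categories_on_organization_id;
```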

performance_analyses has 1,012,346 rows. Indexes are very important.
I would suggest another partial index with the condition size > 0:

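-- The posted schema already has an index with this name: drop it first or pick another name.
CREATE INDEX index_performance_analyses_on_status_id
ON performance_analyses (total_sales_id)
WHERE size > 0;  -- no table alias allowed in an index predicate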

However:

Rows Removed by Filter: 0

It seems this condition serves no purpose. Are there any rows for which size > 0 is not true?

After creating these indexes you need to ANALYZE the tables.
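A minimal sketch of that step, followed by a re-check of the plan. It assumes the indexes above have been created. In the output, look for "Index Only Scan" nodes. A high "Heap Fetches" count on such a node suggests the table needs a VACUUM before index-only scans pay off.

```sql
ANALYZE customers;
ANALYZE total_sales;
ANALYZE performance_analyses;

-- Runs the query for real and reports actual row counts and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   product_categories   pc
JOIN   customers            c  USING (organization_id)
JOIN   total_sales          ts ON ts.customer_id = c.id
JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
WHERE  pc.organization_id = 3
AND    c.active
AND    c.visible
AND    ts.product_category_id = pc.id
AND    ts.period_id = 193
AND    pa.size > 0;
```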

Table statistics

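Generally, I see many bad estimates. Postgres underestimates the number of rows returned at almost every step. The nested loops we see would work much better for fewer rows. Unless this is an unlikely coincidence, your table statistics are badly outdated. You need to review your autovacuum settings, and probably also add per-table settings for your two big tables, performance_analyses and total_sales.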
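As a rough sketch of what that review might involve: first check when the two big tables were last analyzed, then tighten the per-table autovacuum thresholds. The scale factors below are illustrative assumptions, not tuned recommendations. The server defaults are 0.1 for analyze and 0.2 for vacuum.

```sql
-- When were statistics last refreshed, and how many dead rows have piled up?
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname IN ('performance_analyses', 'total_sales');

-- Per-table overrides (illustrative values): trigger autovacuum and
-- autoanalyze after a smaller fraction of the table has changed
ALTER TABLE total_sales
  SET (autovacuum_analyze_scale_factor = 0.01,
       autovacuum_vacuum_scale_factor  = 0.02);

ALTER TABLE performance_analyses
  SET (autovacuum_analyze_scale_factor = 0.01,
       autovacuum_vacuum_scale_factor  = 0.02);
```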

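You already ran VACUUM and ANALYZE, which made the query slower, according to your comment. That does not make a lot of sense. I would run VACUUM FULL on these two tables once (if you can afford an exclusive lock). Otherwise, try pg_repack.
With all the fishy statistics and bad plans, I would consider running a complete vacuumdb -fz yourdb on your database. That rewrites all tables and indexes in pristine condition, but it is not something to use on a regular basis. It is also expensive and will lock your database for an extended period of time!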
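Limited to the two big tables, the one-off rewrite could be sketched like this. Each statement holds an exclusive lock on its table for the duration and temporarily needs extra disk space for the rewritten copy, so schedule it for a quiet period.

```sql
-- Rewrite the table and its indexes, then refresh planner statistics
VACUUM (FULL, ANALYZE) total_sales;
VACUUM (FULL, ANALYZE) performance_analyses;
```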

While you are at it, have a look at the cost settings of your database as well.
Related:

Keep PostgreSQL from sometimes choosing a bad query plan

Postgres Slow Queries - Autovacuum frequency
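For the cost settings mentioned above, a hedged starting point is to inspect the current values and experiment within a single session. The value 1.1 is only an illustrative assumption, sometimes used when most data sits in cache or on SSD storage. Whether it suits this server is not known from the question.

```sql
SHOW seq_page_cost;          -- default 1.0
SHOW random_page_cost;       -- default 4.0
SHOW effective_cache_size;

-- Session-local experiment; other connections are unaffected
SET random_page_cost = 1.1;
-- ... re-run EXPLAIN (ANALYZE) on the query here and compare plans ...
RESET random_page_cost;
```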

Regarding sql - Count on join of big tables with conditions is slow, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38235142/
