我有一个复杂的查询:
SELECT DISTINCT ON (delivery.id)
delivery.id, dl_processing.pid
FROM mailer.mailer_message_recipient_rel AS delivery
JOIN mailer.mailer_message AS message ON delivery.message_id = message.id
JOIN mailer.mailer_message_recipient_rel_log AS dl_processing ON dl_processing.rel_id = delivery.id AND dl_processing.status = 1000
-- LEFT JOIN mailer.mailer_recipient AS r ON delivery.email = r.email
JOIN mailer.mailer_mailing AS mailing ON message.mailing_id = mailing.id
WHERE
NOT EXISTS (SELECT dl_finished.id FROM mailer.mailer_message_recipient_rel_log AS dl_finished WHERE dl_finished.rel_id = delivery.id AND dl_finished.status <> 1000) AND
dl_processing.date <= NOW() - (36000 * INTERVAL '1 second') AND
NOT EXISTS (SELECT ml.id FROM mailer.mailer_message_log AS ml WHERE ml.message_id = message.id) AND
-- (r.times_bounced < 5 OR r.times_bounced IS NULL) AND
NOT EXISTS (SELECT ur.id FROM mailer.mailer_unsubscribed_recipient AS ur WHERE ur.email = delivery.email AND ur.list_id = mailing.list_id)
ORDER BY delivery.id, dl_processing.id DESC
LIMIT 1000;
它运行非常缓慢,原因似乎是 Postgres 始终避免在其查询计划中使用合并连接,尽管我拥有为此所需的所有索引。看起来真的很郁闷:
http://explain.depesz.com/s/tVY
http://i.stack.imgur.com/Myw4R.png
为什么会这样?如何解决此类问题?
UPD:在@wildplasser 的帮助下,我重新设计了查询以修复性能(同时稍微改变了它的语义):
SELECT delivery.id, dl_processing.pid
FROM mailer.mailer_message_recipient_rel AS delivery
JOIN mailer.mailer_message AS message ON delivery.message_id = message.id
JOIN mailer.mailer_message_recipient_rel_log AS dl_processing ON dl_processing.rel_id = delivery.id AND dl_processing.status in (1000, 2, 5) AND dl_processing.date <= NOW() - (36000 * INTERVAL '1 second')
LEFT JOIN mailer.mailer_recipient AS r ON delivery.email = r.email
WHERE
(r.times_bounced < 5 OR r.times_bounced IS NULL) AND
NOT EXISTS (SELECT dl_other.id FROM mailer.mailer_message_recipient_rel_log AS dl_other WHERE dl_other.rel_id = delivery.id AND dl_other.id > dl_processing.id) AND
NOT EXISTS (SELECT ml.id FROM mailer.mailer_message_log AS ml WHERE ml.message_id = message.id) AND
NOT EXISTS (SELECT ur.id FROM mailer.mailer_unsubscribed_recipient AS ur JOIN mailer.mailer_mailing AS mailing ON message.mailing_id = mailing.id WHERE ur.email = delivery.email AND ur.list_id = mailing.list_id)
ORDER BY delivery.id
LIMIT 1000
它现在运行良好,但查询计划仍然使用这些可怕的嵌套循环连接 <_<:
http://explain.depesz.com/s/MTo3
我还是想知道这是为什么。
最佳答案
原因是 Postgres 实际上在做正确的事情,而我的数学很烂。假设表 A 有 N 行,表 B 有 M 行,并且它们通过一个列连接,它们都有一个 B 树索引。则以下为真:
- 嵌套循环连接的时间复杂度不是 O(MN),就像我天真地想的那样,但是O(M log N) or < em>O(N log M),取决于线性扫描哪个表。如果两者都被索引扫描,我们将分别得到 O(M log M log N) 或 O(N log M log N)。但由于仅当另一个连接需要特定的行顺序或由于
ORDER
子句时才需要这样做,正如我们将看到的那样,这根本不是一件坏事。 - 合并连接的时间复杂度是O(M log M + N log N),这意味着它输到嵌套循环连接,前提是渐近比例系数是相同的,据我所知,在大多数实现中它们都应该等于 1。由于两个表必须由相同的索引沿相同的方向迭代,如果需要不同的顺序,则需要额外的排序,这很容易使复杂性比嵌套循环排序的情况更糟。
所以基本上,尽管与我们都喜欢的归并排序相关联,归并联接几乎总是很糟糕。
我的第一个查询之所以这么慢,是因为它必须在应用限制之前执行排序,而且它在许多其他方面也很糟糕。在应用@wildplasser 的建议后,我设法减少了(仍然很昂贵的)嵌套循环的数量,并且还允许在没有排序的情况下进行限制,从而确保 Postgres 很可能不需要运行外部扫描来完成它,这是我获得大部分性能提升的地方。
关于sql - Postgres 一贯偏爱嵌套循环连接而不是合并连接,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/24034111/