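I know this has probably been asked before, but I can't find it with SO's search.
Say I have TABLE1 and TABLE2. How should I expect the performance of a query such as this:
SELECT * FROM TABLE1 WHERE id IN SUBQUERY_ON_TABLE2;
to degrade as the number of rows in TABLE1 and TABLE2 grows, given that id is the primary key on TABLE1?
Yes, I know that using IN is a n00b mistake, but TABLE2 has a generic relation (django generic relation) to multiple other tables, so I can't think of another way to filter the data. At (approximately) what number of rows in TABLE1 and TABLE2 should I expect to notice performance problems because of this? Will performance degrade linearly, exponentially, etc. with the number of rows?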
Best answer
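When the subquery returns a small number of rows and the main query also returns few rows, you just get a fast index lookup for each one. As the percentage of the data returned increases, each of the two will eventually switch to a sequential scan instead of an index scan, grabbing the whole table in one pass rather than piecing it together. This is not a simple linear or exponential decrease in performance; there are major discontinuities as the plan type changes. The row counts at which these happen depend on the size of the tables, so there is no useful rule of thumb for you either. You should build a simulation like the one I do below and see what happens on your own data set to get an idea of what the curve looks like.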
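One way to set up such a simulation is to generate the same IN query at a series of growing cutoffs and run each one through EXPLAIN ANALYZE. The following is a minimal sketch, not part of the original answer: the helper name `build_explain_queries` and its parameters are hypothetical, and it assumes the table and column names are trusted, hand-written identifiers (they are pasted into the SQL text without escaping).

```python
def build_explain_queries(outer_table, outer_key, inner_table, inner_key,
                          filter_column, cutoffs):
    """Return one EXPLAIN ANALYZE statement per cutoff.

    Hypothetical helper: identifiers are interpolated as-is, so only pass
    names you wrote yourself, never user input.
    """
    template = (
        "EXPLAIN ANALYZE SELECT * FROM {outer} WHERE {okey} IN "
        "(SELECT {ikey} FROM {inner} WHERE {fcol} < {cutoff});"
    )
    return [
        template.format(outer=outer_table, okey=outer_key, inner=inner_table,
                        ikey=inner_key, fcol=filter_column, cutoff=int(cutoff))
        for cutoff in cutoffs
    ]


if __name__ == "__main__":
    # Cutoffs chosen to widen the subquery by roughly an order of magnitude
    # each step; adjust them to the size of your own tables.
    for query in build_explain_queries("customers", "customerid",
                                       "orders", "customerid",
                                       "orderid", [2, 100, 1000, 10000]):
        print(query)
```

Paste the printed statements into psql (running each twice if you want cached timings) and note where the plan type changes.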
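Here is an example using a PostgreSQL 9.0 database loaded with the Dell Store 2 database. Once the subquery returns 1000 rows, the main table is read with a full table scan. Once the subquery is considering 10,000 records, the subquery's own scan of orders turns into a full table scan too. Each of these was run twice, so you are seeing cached performance. How performance changes between the cached and uncached states is another topic entirely: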
dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN
  (SELECT customerid FROM orders WHERE orderid<2);
 Nested Loop (cost=8.27..16.56 rows=1 width=268) (actual time=0.051..0.060 rows=1 loops=1)
   -> HashAggregate (cost=8.27..8.28 rows=1 width=4) (actual time=0.028..0.030 rows=1 loops=1)
         -> Index Scan using orders_pkey on orders (cost=0.00..8.27 rows=1 width=4) (actual time=0.011..0.015 rows=1 loops=1)
               Index Cond: (orderid < 2)
   -> Index Scan using customers_pkey on customers (cost=0.00..8.27 rows=1 width=268) (actual time=0.013..0.016 rows=1 loops=1)
         Index Cond: (customers.customerid = orders.customerid)
 Total runtime: 0.191 ms
dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN
  (SELECT customerid FROM orders WHERE orderid<100);
 Nested Loop (cost=10.25..443.14 rows=100 width=268) (actual time=0.488..2.591 rows=98 loops=1)
   -> HashAggregate (cost=10.25..11.00 rows=75 width=4) (actual time=0.464..0.661 rows=98 loops=1)
         -> Index Scan using orders_pkey on orders (cost=0.00..10.00 rows=100 width=4) (actual time=0.019..0.218 rows=99 loops=1)
               Index Cond: (orderid < 100)
   -> Index Scan using customers_pkey on customers (cost=0.00..5.75 rows=1 width=268) (actual time=0.009..0.011 rows=1 loops=98)
         Index Cond: (customers.customerid = orders.customerid)
 Total runtime: 2.868 ms
dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN
  (SELECT customerid FROM orders WHERE orderid<1000);
 Hash Semi Join (cost=54.25..800.13 rows=1000 width=268) (actual time=4.574..80.319 rows=978 loops=1)
   Hash Cond: (customers.customerid = orders.customerid)
   -> Seq Scan on customers (cost=0.00..676.00 rows=20000 width=268) (actual time=0.007..33.665 rows=20000 loops=1)
   -> Hash (cost=41.75..41.75 rows=1000 width=4) (actual time=4.502..4.502 rows=999 loops=1)
         Buckets: 1024 Batches: 1 Memory Usage: 24kB
         -> Index Scan using orders_pkey on orders (cost=0.00..41.75 rows=1000 width=4) (actual time=0.056..2.487 rows=999 loops=1)
               Index Cond: (orderid < 1000)
 Total runtime: 82.024 ms
dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN
  (SELECT customerid FROM orders WHERE orderid<10000);
 Hash Join (cost=443.68..1444.68 rows=8996 width=268) (actual time=79.576..157.159 rows=7895 loops=1)
   Hash Cond: (customers.customerid = orders.customerid)
   -> Seq Scan on customers (cost=0.00..676.00 rows=20000 width=268) (actual time=0.007..27.085 rows=20000 loops=1)
   -> Hash (cost=349.97..349.97 rows=7497 width=4) (actual time=79.532..79.532 rows=7895 loops=1)
         Buckets: 1024 Batches: 1 Memory Usage: 186kB
         -> HashAggregate (cost=275.00..349.97 rows=7497 width=4) (actual time=45.130..62.227 rows=7895 loops=1)
               -> Seq Scan on orders (cost=0.00..250.00 rows=10000 width=4) (actual time=0.008..20.979 rows=9999 loops=1)
                     Filter: (orderid < 10000)
 Total runtime: 167.882 ms
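
To see where the discontinuities fall without reading every plan by eye, the EXPLAIN ANALYZE output above can be reduced to three facts per run: the top-level node, whether any sequential scan appears, and the total runtime. The sketch below is an illustration rather than part of the original answer. The function name `summarize_plan` is hypothetical, and the patterns assume the text layout shown above, including the "Total runtime:" footer printed by PostgreSQL 9.0; other versions may label that line differently, so the pattern may need adjusting.

```python
import re

# Matches a plan node line such as "-> Seq Scan on customers (cost=...".
NODE_RE = re.compile(r"^\s*(?:->\s*)?(?P<node>.+?)\s+\(cost=")
# Matches the "Total runtime: 82.024 ms" footer (assumed 9.0-style label).
RUNTIME_RE = re.compile(r"Total runtime:\s*(?P<ms>[0-9.]+)\s*ms")


def summarize_plan(plan_text):
    """Return the root node, whether a Seq Scan appears, and runtime in ms."""
    root = None
    seq_scan = False
    runtime_ms = None
    for line in plan_text.splitlines():
        node_match = NODE_RE.match(line)
        if node_match:
            node = node_match.group("node")
            if root is None:
                root = node  # the first node line is the top of the plan
            if node.startswith("Seq Scan"):
                seq_scan = True
        runtime_match = RUNTIME_RE.search(line)
        if runtime_match:
            runtime_ms = float(runtime_match.group("ms"))
    return {"root": root, "seq_scan": seq_scan, "runtime_ms": runtime_ms}


if __name__ == "__main__":
    # Abbreviated copy of the orderid<1000 plan from the example above.
    sample = """\
Hash Semi Join (cost=54.25..800.13 rows=1000 width=268) (actual time=4.574..80.319 rows=978 loops=1)
  -> Seq Scan on customers (cost=0.00..676.00 rows=20000 width=268) (actual time=0.007..33.665 rows=20000 loops=1)
Total runtime: 82.024 ms
"""
    print(summarize_plan(sample))
```

Collecting these summaries across a range of cutoffs gives a rough picture of where the planner moves from nested-loop index lookups to hash joins over sequential scans on your own data.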
Regarding sql - IN statement performance in PostgreSQL (and in general), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/3002070/