sql - Postgres : Why did adding index slow down regexp queries?

Tags: sql postgresql database-performance postgresql-performance postgresql-11

I have a TEXT column keyvalues in Postgres:

select * from test5 limit 5;

 id |                      keyvalues
----+------------------------------------------------------
  1 | ^ first 1 | second 3
  2 | ^ first 1 | second 2 ^ first 2 | second 3
  3 | ^ first 1 | second 2 | second 3
  4 | ^ first 2 | second 3 ^ first 1 | second 2 | second 2
  5 | ^ first 2 | second 3 ^ first 1 | second 3

My queries must exclude the ^ character from the middle of a match, so I am using a regular expression:

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=78383.31..78383.32 rows=1 width=8) (actual time=7332.030..7332.030 rows=1 loops=1)
   ->  Gather  (cost=78383.10..78383.30 rows=2 width=8) (actual time=7332.021..7337.138 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=77383.10..77383.10 rows=1 width=8) (actual time=7328.155..7328.156 rows=1 loops=3)
               ->  Parallel Seq Scan on test5  (cost=0.00..77382.50 rows=238 width=0) (actual time=7328.146..7328.146 rows=0 loops=3)
                     Filter: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
                     Rows Removed by Filter: 1666668
 Planning Time: 0.068 ms
 Execution Time: 7337.184 ms

The query works (zero rows match), but it is far too slow: more than 7 seconds.

I thought indexing with trigrams would help, but no luck:

create extension if not exists pg_trgm;
create index on test5 using gin (keyvalues gin_trgm_ops);

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1484.02..1484.03 rows=1 width=8) (actual time=23734.646..23734.646 rows=1 loops=1)
   ->  Bitmap Heap Scan on test5  (cost=1480.00..1484.01 rows=1 width=0) (actual time=23734.641..23734.641 rows=0 loops=1)
         Recheck Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
         Rows Removed by Index Recheck: 5000005
         Heap Blocks: exact=47620
         ->  Bitmap Index Scan on test5_keyvalues_idx  (cost=0.00..1480.00 rows=1 width=0) (actual time=1756.158..1756.158 rows=5000005 loops=1)
               Index Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
 Planning Time: 0.412 ms
 Execution Time: 23734.722 ms

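With the trigram index the query is 3x slower! It still returns the correct result (zero rows). I expected the trigram index to figure out immediately that there is no second 0 string anywhere, and to be very fast.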
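The second plan shows the Bitmap Index Scan returning 5000005 rows, all of which are then removed by the recheck. One way to confirm that the slowdown comes from this index path, without dropping the index, is to disable bitmap scans for the current session and rerun the query. This is a minimal sketch, assuming the test5 table and index shown above; enable_bitmapscan is a standard planner setting, and because a GIN index can only be used through a bitmap scan, turning it off should push the planner back to the sequential scan. Timings will vary.

```sql
-- Discourage bitmap scans for this session only.
set enable_bitmapscan = off;

-- Same query as above; the plan should no longer use test5_keyvalues_idx.
explain analyze
select count(*) from test5
where keyvalues ~* '\^ first 1[^\^]+second 0';

-- Restore the default planner behaviour.
reset enable_bitmapscan;
```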

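(Motivation: I want to avoid normalizing keyvalues into another table, so I am trying to encode the matching logic in a single TEXT field using text indexing and regular expressions instead. The logic works but is too slow, as is JSONB.)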

Best answer

According to the OP, the correct answer was given by user @jjanes on DBA.SE:

I expected the trigram index to figure out immediately there's no second 0 string anywhere

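'second' and '0' are separate words, so the index cannot detect their joint absence. It seems like it could detect the absence of '0', but this comment from "contrib/pg_trgm/trgm_regexp.c" seems relevant: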

     * Note: Using again the example "foo bar", we will not consider the
     * trigram "  b", though this trigram would be found by the trigram
     * extraction code.  Since we will find " ba", it doesn't seem worth
     * trying to hack the algorithm to generate the additional trigram.

Since 0 is the last character in the pattern string, there will be no trigram of the form " 0a" either, so it just misses that opportunity.
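To get a feel for the space-padded trigrams discussed here, pg_trgm's show_trgm function can be run on a sample value and on the tail of the pattern. This is only a rough sketch: show_trgm lists the trigrams of plain text, not the set that the regexp analysis in trgm_regexp.c derives from a pattern, so it illustrates the padding behaviour rather than reproducing what the index scan actually looks up. The literal strings are taken from the question's sample data.

```sql
-- Requires the pg_trgm extension used in the question.
create extension if not exists pg_trgm;

-- Trigrams of one sample keyvalues value from test5.
select show_trgm('^ first 1 | second 3');

-- Trigrams of the plain text 'second 0'. The padded trigrams around the
-- one-character word '0' are the kind the answer suggests could, in
-- principle, reveal that '0' never occurs in the data.
select show_trgm('second 0');
```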

Even if it weren't for this limitation, your approach seems extremely fragile.

Regarding "sql - Postgres : Why did adding index slow down regexp queries?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56505822/
