sql - 如何让查询更快?

标签 sql postgresql full-text-search query-optimization trigram

我的 tracks 表包含大约 300 万条记录(每天增加 500 条),大约有 30 列,但我在 WHERE 子句中只使用了 15 列。查询平均需要 4800 毫秒,没有其他用户/进程使用数据库。如何让它更快?我希望看到接近 100 毫秒的结果。

寻找歌曲(轨道)的人填写表格:

  • 字符串 -> 表示“歌曲标题或艺术家姓名”
  • 字符串 -> 代表“流派”
  • 日期 -> 表示“发布时间”
  • 剩余 12 个参数的最小/最大值的几个整数

99% 的用例是 SELECT 查询:

SELECT
    "public"."tracks"."sys_id",
    "public"."tracks"."all_artists",
    "public"."tracks"."name",
    "public"."tracks"."genres",
    "public"."tracks"."release_date",
    "public"."tracks"."tempo",
    "public"."tracks"."popularity",
    "public"."tracks"."danceability",
    "public"."tracks"."energy",
    "public"."tracks"."speechiness",
    "public"."tracks"."acousticness",
    "public"."tracks"."instrumentalness",
    "public"."tracks"."liveness",
    "public"."tracks"."valence",
    "public"."tracks"."main_artist_popularity",
    "public"."tracks"."main_artist_followers",
    "public"."tracks"."key",
    "public"."tracks"."preview_url"
FROM
    "public"."tracks"
WHERE
    (
    "public"."tracks"."name" LIKE '%oultec%'
    OR "public"."tracks"."all_artists_string" LIKE '%oultec%'
    )
    AND ("public"."tracks"."genres_string" LIKE '%rum%')
    AND "public"."tracks"."tempo" >= '80'
    AND "public"."tracks"."tempo" <= '210'
    AND "public"."tracks"."popularity" >= '0'
    AND "public"."tracks"."popularity" <= '100'
    AND "public"."tracks"."main_artist_popularity" >= '1'
    AND "public"."tracks"."main_artist_popularity" <= '100'
    AND "public"."tracks"."main_artist_followers" >= '1'
    AND "public"."tracks"."main_artist_followers" <= '50000000'
    AND "public"."tracks"."danceability" >= '0'
    AND "public"."tracks"."danceability" <= '1000'
    AND "public"."tracks"."energy" >= '0'
    AND "public"."tracks"."energy" <= '1000'
    AND "public"."tracks"."speechiness" >= '0'
    AND "public"."tracks"."speechiness" <= '1000'
    AND "public"."tracks"."acousticness" >= '0'
    AND "public"."tracks"."acousticness" <= '1000'
    AND "public"."tracks"."instrumentalness" >= '0'
    AND "public"."tracks"."instrumentalness" <= '1000'
    AND "public"."tracks"."liveness" >= '0'
    AND "public"."tracks"."liveness" <= '1000'
    AND "public"."tracks"."valence" >= '0'
    AND "public"."tracks"."valence" <= '1000'
    AND "public"."tracks"."release_date" >= '2020-01-01'
    AND "public"."tracks"."key" = '10'
ORDER BY
    "public"."tracks"."release_date" DESC,
    "public"."tracks"."popularity" DESC,
    "public"."tracks"."sys_id" ASC
LIMIT 5 OFFSET 0;

索引(索引):

PRIMARY sys_id
UNIQUE  main_artist, name, duration_ms
INDEX   energy
INDEX   tempo, popularity, main_artist_popularity, main_artist_followers, danceability, energy, speechiness, acousticness, instrumentalness, liveness, valence, name, all_artists_string, genres_string, release_date, key

解释/分析:

Limit  (cost=308411.32..308411.33 rows=1 width=279) (actual time=4582.756..4584.010 rows=0 loops=1)
  ->  Sort  (cost=308411.32..308411.33 rows=1 width=279) (actual time=4546.306..4547.559 rows=0 loops=1)
        Sort Key: release_date DESC, popularity DESC, sys_id
        Sort Method: quicksort  Memory: 25kB
        ->  Gather  (cost=1000.00..308411.31 rows=1 width=279) (actual time=4546.121..4547.374 rows=0 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Parallel Seq Scan on tracks  (cost=0.00..307411.21 rows=1 width=279) (actual time=4473.923..4473.926 rows=0 loops=3)
                    Filter: ((genres_string ~~ '%rum%'::text) AND (tempo >= '80'::smallint) AND (tempo <= '210'::smallint) AND (popularity >= '0'::smallint) AND (popularity <= '100'::smallint) AND (main_artist_popularity >= '1'::smallint) AND (main_artist_popularity <= '100'::smallint) AND (main_artist_followers >= 1) AND (main_artist_followers <= 50000000) AND (danceability >= '0'::smallint) AND (danceability <= '1000'::smallint) AND (energy >= '0'::smallint) AND (energy <= '1000'::smallint) AND (speechiness >= '0'::smallint) AND (speechiness <= '1000'::smallint) AND (acousticness >= '0'::smallint) AND (acousticness <= '1000'::smallint) AND (instrumentalness >= '0'::smallint) AND (instrumentalness <= '1000'::smallint) AND (liveness >= '0'::smallint) AND (liveness <= '1000'::smallint) AND (valence >= '0'::smallint) AND (valence <= '1000'::smallint) AND (release_date >= '2020-01-01'::date) AND (key = '10'::smallint) AND ((name ~~ '%oultec%'::text) OR (all_artists_string ~~ '%oultec%'::text)))
                    Rows Removed by Filter: 1034268
Planning Time: 5.145 ms
JIT:
  Functions: 14
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 18.941 ms, Inlining 0.000 ms, Optimization 8.644 ms, Emission 110.864 ms, Total 138.449 ms
Execution Time: 4667.446 ms

从“官方”镜像运行 PostgreSQL:postgres:14.1-alpine:

  • Ubuntu-20.04-x86_64
  • 2 个 CPU
  • 2GB 内存
  • 20GB SSD 驱动器

system htop

表结构:

tracks table structure

运行查询的网站(通过 API/后端,更多带有最小/最大整数的字段,但此处未显示):

Form view for running the query

最佳答案

您的查询对日期和数字进行LIKE '%something%'全文搜索和范围扫描。但 BTREE 索引(默认情况下)只能处理一次范围扫描。而且,他们根本无法处理LIKE '%something%'。因此,您将对每个查询进行全表扫描。考虑到三个巨型行的 4.8 秒还算不错。

对于列 LIKE '%something%' 搜索,您可以尝试使用三元组索引,这是 postgreSQL 的一项功能。此代码将在 name 上创建三元组索引。这可能会缩小选择范围,从而使您必须扫描更少的数据。

CREATE EXTENSION pg_trgm;  -- you may or may not need this statement.
CREATE INDEX CONCURRENTLY tracks_name
     ON tracks
  USING GIN (name gin_trgm_ops);
CREATE INDEX CONCURRENTLY tracks_all_artists_string
     ON tracks
  USING GIN (all_artists_string gin_trgm_ops);
CREATE INDEX CONCURRENTLY tracks_genres_string
     ON tracks
  USING GIN (genres_string gin_trgm_ops);

但是您仍然需要扫描所有匹配的轨道。

如果您创建这些索引,然后重构 WHERE 子句的前几位以使用像这样的集合计算,您可能(或可能不会)获得更好的性能。

WHERE sys_id IN (
     (SELECT sys_id FROM tracks WHERE name LIKE '%oultec%'
       UNION 
      SELECT sys_id FROM tracks WHERE all_artists_string LIKE '%oultec%'
     )
     INTERSECT
     SELECT sys_id FROM tracks WHERE genres_string LIKE '%oultec%'
    )
  AND tempo >= '80' ... 

但事实是 SQL 并不适合所有这些范围扫描。

关于sql - 如何让查询更快?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/74047973/

相关文章:

database - 在服务器中实现自动完成的最佳方式是什么?

sql - 如何使用 Cube、Pivo​​t 或 Rollup SQL 转换列中的行

MYSQL 5.7.17 全文搜索

search - 提升完成建议器的输入字段

node.js - 解析 promise 数组 node.js

python - 如何在Django中获取导致IntegrityError的记录

sql - Postgres 中的正则表达式 - 没有做我想做的事(新手问题)

sql - 如何对不同表进行计数和求和 (SQL)

php - Yii 一个 actionIndex() 用于数据库中的 4 个表

mysql - Tricky sql query for popularity