SQL join 1-many: querying only the parent rows, without duplicates

Tags: sql postgresql count indexing pattern-matching

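I have two tables, invoices and invoiceitems. The relationship is 1-many. My application allows querying invoices by invoice item fields. Only invoices are returned, never the items.

For example, I want to fetch all invoices that have an item whose name contains ac, case-insensitively.
The output is paginated, so I run one query to get the count of invoices satisfying the condition and then another query to get the corresponding page of invoices.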

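Table sizes:

  • invoices - 65,000 records
  • invoiceitems - 3,281,518 records
  • terms - 5 records
  • reps - 5 records
  • shipVia - 5 records

Each invoice is linked to at most 100 invoice items.

My problem is that I cannot work out the best indexes for the queries.

Schema: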
    CREATE TABLE invoiceitems
    (
      id serial NOT NULL,
      invoice_id integer NOT NULL,
      name text NOT NULL,
      ...
      CONSTRAINT invoiceitems_pkey PRIMARY KEY (id),
      CONSTRAINT invoiceitems_invoice_id_fkey FOREIGN KEY (invoice_id)
          REFERENCES invoices (id) MATCH SIMPLE
          ON UPDATE NO ACTION ON DELETE NO ACTION
    );
    
    CREATE INDEX idx_lower_name
      ON invoiceitems
      USING btree
      (lower(name) COLLATE pg_catalog."default" text_pattern_ops);
    
    CREATE TABLE invoices
    (
      id serial NOT NULL,
      term_id integer,
      rep_id integer NOT NULL,
      ship_via_id integer,
      ...
      CONSTRAINT invoices_pkey PRIMARY KEY (id),
      CONSTRAINT invoices_rep_id_fkey FOREIGN KEY (rep_id)
          REFERENCES reps (id) MATCH SIMPLE
          ON UPDATE NO ACTION ON DELETE NO ACTION,
      CONSTRAINT invoices_ship_via_id_fkey FOREIGN KEY (ship_via_id)
          REFERENCES shipvia (id) MATCH SIMPLE
          ON UPDATE NO ACTION ON DELETE NO ACTION,
      CONSTRAINT invoices_term_id_fkey FOREIGN KEY (term_id)
          REFERENCES terms (id) MATCH SIMPLE
          ON UPDATE NO ACTION ON DELETE NO ACTION
    );
    

Count query:
    SELECT COUNT(DISTINCT(o.id))
    FROM invoices o
    JOIN invoiceitems items ON items.invoice_id = o.id
    LEFT JOIN terms t ON t.id = o.term_id
    LEFT JOIN reps r ON r.id = o.rep_id
    LEFT JOIN shipVia s ON s.id = o.ship_via_id
    WHERE LOWER(items.name) LIKE '%ac%';
    

Result:

    6518

Query plan:
    "Aggregate  (cost=107651.35..107651.36 rows=1 width=4)"
    "  ->  Hash Join  (cost=3989.50..106010.59 rows=656304 width=4)"
    "        Hash Cond: (items.invoice_id = o.id)"
    "        ->  Seq Scan on invoiceitems items  (cost=0.00..85089.77 rows=656304 width=4)"
    "              Filter: (lower(name) ~~ '%ac%'::text)"
    "        ->  Hash  (cost=2859.00..2859.00 rows=65000 width=16)"
    "              ->  Seq Scan on invoices o  (cost=0.00..2859.00 rows=65000 width=16)"
    

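It seems my functional index on the invoiceitems.name field is not used at all. I think that is because I am searching for a part of the name that is not a strict prefix. I am not sure, but it seems the invoices primary key index is not used here either.

My question is: can I optimize the count query and/or my schema to improve performance?

I must allow searching by a part of the name that is not a strict prefix, and I must also support case-insensitive search.

My query for returning the matching records is just as bad: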
    SELECT DISTINCT(o.id), t.terms, r.rep, s.ship_via, ...
    FROM invoices o
    JOIN invoiceitems items ON items.invoice_id = o.id
    LEFT JOIN terms t ON t.id = o.term_id
    LEFT JOIN reps r ON r.id = o.rep_id
    LEFT JOIN shipVia s ON s.id = o.ship_via_id
    WHERE LOWER(items.name) LIKE '%ac%'
    LIMIT 100;
    

And its plan:
    "Limit  (cost=901846.63..901854.13 rows=100 width=627)"
    "  ->  Unique  (cost=901846.63..951069.43 rows=656304 width=627)"
    "        ->  Sort  (cost=901846.63..903487.39 rows=656304 width=627)"
    "              Sort Key: o.id, t.terms, r.rep, s.ship_via, ..."
    "              ->  Hash Join  (cost=11509.54..286596.53 rows=656304 width=627)"
    "                    Hash Cond: (items.invoice_id = o.id)"
    "                    ->  Seq Scan on invoiceitems items  (cost=0.00..85089.77 rows=656304 width=4)"
    "                          Filter: (lower(name) ~~ '%ac%'::text)"
    "                    ->  Hash  (cost=5491.03..5491.03 rows=65000 width=627)"
    "                          ->  Hash Left Join  (cost=113.02..5491.03 rows=65000 width=627)"
    "                                Hash Cond: (o.ship_via_id = s.id)"
    "                                ->  Hash Left Join  (cost=75.35..4559.61 rows=65000 width=599)"
    "                                      Hash Cond: (o.rep_id = r.id)"
    "                                      ->  Hash Left Join  (cost=37.67..3628.19 rows=65000 width=571)"
    "                                            Hash Cond: (o.term_id = t.id)"
    "                                            ->  Seq Scan on invoices o  (cost=0.00..2859.00 rows=65000 width=543)"
    "                                            ->  Hash  (cost=22.30..22.30 rows=1230 width=36)"
    "                                                  ->  Seq Scan on terms t  (cost=0.00..22.30 rows=1230 width=36)"
    "                                      ->  Hash  (cost=22.30..22.30 rows=1230 width=36)"
    "                                            ->  Seq Scan on reps r  (cost=0.00..22.30 rows=1230 width=36)"
    "                                ->  Hash  (cost=22.30..22.30 rows=1230 width=36)"
    "                                      ->  Seq Scan on shipvia s  (cost=0.00..22.30 rows=1230 width=36)"
    

I am limited to PostgreSQL. Switching to SQL Server is not an option.

EDIT ====================================================================

I followed the very informative instructions provided by Erwin, and here is what I have.

Index:
    CREATE INDEX invoiceitems_name_gin_trgm_idx ON invoiceitems USING gin (name gin_trgm_ops);
    

Count query using JOIN, but without the extra tables:
    EXPLAIN ANALYZE SELECT COUNT(DISTINCT(o.id)) 
    FROM invoices o 
    JOIN invoiceitems items ON items.invoice_id = o.id 
    WHERE items.name ILIKE '%ac%';
    
    "Aggregate  (cost=78961.52..78961.53 rows=1 width=4) (actual time=5205.448..5205.450 rows=1 loops=1)"
    "  ->  Nested Loop  (cost=0.00..78960.73 rows=316 width=4) (actual time=0.396..5176.761 rows=6518 loops=1)"
    "        ->  Seq Scan on invoiceitems items  (cost=0.00..76885.98 rows=316 width=4) (actual time=0.021..4502.043 rows=6518 loops=1)"
    "              Filter: (name ~~* '%ac%'::text)"
    "              Rows Removed by Filter: 3275000"
    "        ->  Index Only Scan using invoices_pkey on invoices o  (cost=0.00..6.56 rows=1 width=4) (actual time=0.012..0.015 rows=1 loops=6518)"
    "              Index Cond: (id = items.invoice_id)"
    "              Heap Fetches: 6518"
    "Total runtime: 5205.509 ms"
    

Count query using a semi-join:
    EXPLAIN ANALYZE SELECT COUNT(1)
    FROM   invoices o
    WHERE EXISTS (
       SELECT 1
       FROM   invoiceitems i 
       WHERE  i.invoice_id = o.id
       AND    i.name ILIKE '%ac%'
       );
    
    "Aggregate  (cost=76920.43..76920.44 rows=1 width=0) (actual time=5713.597..5713.598 rows=1 loops=1)"
    "  ->  Nested Loop  (cost=76886.76..76919.64 rows=316 width=0) (actual time=5583.706..5703.801 rows=6518 loops=1)"
    "        ->  HashAggregate  (cost=76886.76..76886.82 rows=5 width=4) (actual time=5583.568..5594.977 rows=6518 loops=1)"
    "              ->  Seq Scan on invoiceitems i  (cost=0.00..76885.98 rows=316 width=4) (actual time=0.295..5148.801 rows=6518 loops=1)"
    "                    Filter: (name ~~* '%ac%'::text)"
    "                    Rows Removed by Filter: 3275000"
    "        ->  Index Only Scan using invoices_pkey on invoices o  (cost=0.00..6.56 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=6518)"
    "              Index Cond: (id = i.invoice_id)"
    "              Heap Fetches: 6518"
    "Total runtime: 5713.804 ms"
    

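The semi-join seems to have no effect. Why?

(I don't think it matters, but I dropped the original functional index on lower(invoiceitems.name).)

EDIT 2 ==================================================================

I want to focus on the fetch-rows query and provide a bit more context.

First, the user may ask to sort by any field of the invoice (but not of the invoice items).

In addition, the user may supply a list of filter statements involving both invoice and invoice item fields. These filter statements capture the semantics of filtering by string or numeric values; for example, a filter could be "the invoice item name contains 'ac' AND the invoice discount is higher than 5%".

It is clear to me that I am unlikely to index every field; I will probably index only the most common ones, such as the invoice item name and a few others.

Anyway, here are my current indexes on the invoices and invoiceitems tables: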

invoices
  • id as primary key

invoiceitems
  • id as primary key
  • CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems USING btree (invoice_id);
  • CREATE INDEX invoiceitems_name_gin_trgm_idx ON invoiceitems USING gin (name COLLATE pg_catalog."default" gin_trgm_ops);

Here is the analysis of the fetch-rows query using JOIN on invoice items:
    explain analyze
    SELECT DISTINCT(o.id), t.terms, r.rep, s.ship_via, ...
    FROM invoices o
    JOIN invoiceitems items ON items.invoice_id = o.id
    LEFT JOIN terms t ON t.id = o.term_id
    LEFT JOIN reps r ON r.id = o.rep_id
    LEFT JOIN shipVia s ON s.id = o.ship_via_id 
    WHERE (items.name ILIKE '%df%' AND items.name IS NOT NULL) LIMIT 100;
    
    "Limit  (cost=79100.70..79106.95 rows=100 width=312) (actual time=4637.195..4637.195 rows=0 loops=1)"
    "  ->  Unique  (cost=79100.70..79120.45 rows=316 width=312) (actual time=4637.190..4637.190 rows=0 loops=1)"
    "        ->  Sort  (cost=79100.70..79101.49 rows=316 width=312) (actual time=4637.186..4637.186 rows=0 loops=1)"
    "              Sort Key: o.id, o.customer, o.business_no, o.bill_to_name, o.bill_to_address1, o.bill_to_address2, o.bill_to_postal_code, o.ship_to_name, o.ship_to_address1, o.ship_to_address2, o.ship_to_postal_code, o.purchase_order_no, t.terms, r.rep, ((o.ship_date)::text), s.ship_via, o.delivery, o.hst_percents, o.sub_total, o.total_before_hst, o.total, o.total_discount, o.hst, o.item_count"
    "              Sort Method: quicksort  Memory: 25kB"
    "              ->  Hash Left Join  (cost=113.02..79087.58 rows=316 width=312) (actual time=4637.179..4637.179 rows=0 loops=1)"
    "                    Hash Cond: (o.ship_via_id = s.id)"
    "                    ->  Hash Left Join  (cost=75.35..79043.98 rows=316 width=284) (actual time=4637.123..4637.123 rows=0 loops=1)"
    "                          Hash Cond: (o.rep_id = r.id)"
    "                          ->  Hash Left Join  (cost=37.67..79001.96 rows=316 width=256) (actual time=4637.119..4637.119 rows=0 loops=1)"
    "                                Hash Cond: (o.term_id = t.id)"
    "                                ->  Nested Loop  (cost=0.00..78960.73 rows=316 width=228) (actual time=4637.115..4637.115 rows=0 loops=1)"
    "                                      ->  Seq Scan on invoiceitems items  (cost=0.00..76885.98 rows=316 width=4) (actual time=4637.108..4637.108 rows=0 loops=1)"
    "                                            Filter: ((name IS NOT NULL) AND (name ~~* '%df%'::text))"
    "                                            Rows Removed by Filter: 3281518"
    "                                      ->  Index Scan using invoices_pkey on invoices o  (cost=0.00..6.56 rows=1 width=228) (never executed)"
    "                                            Index Cond: (id = items.invoice_id)"
    "                                ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (never executed)"
    "                                      ->  Seq Scan on terms t  (cost=0.00..22.30 rows=1230 width=36) (never executed)"
    "                          ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (never executed)"
    "                                ->  Seq Scan on reps r  (cost=0.00..22.30 rows=1230 width=36) (never executed)"
    "                    ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (never executed)"
    "                          ->  Seq Scan on shipvia s  (cost=0.00..22.30 rows=1230 width=36) (never executed)"
    "Total runtime: 4637.731 ms"
    

Here is the analysis of the fetch-rows query using WHERE EXISTS instead of JOIN on invoice items:
    explain analyze
    SELECT o.id, t.terms, r.rep, s.ship_via, ...
    FROM invoices o
    LEFT JOIN terms t ON t.id = o.term_id
    LEFT JOIN reps r ON r.id = o.rep_id
    LEFT JOIN shipVia s ON s.id = o.ship_via_id 
    WHERE EXISTS (
       SELECT 1
       FROM   invoiceitems i 
       WHERE  i.invoice_id = o.id
       AND    i.name ILIKE '%df%'
       AND    i.name IS NOT NULL
       ) LIMIT 100;
    
    "Limit  (cost=0.19..43302.88 rows=100 width=610) (actual time=5771.852..5771.852 rows=0 loops=1)"
    "  ->  Nested Loop Left Join  (cost=0.19..136836.68 rows=316 width=610) (actual time=5771.848..5771.848 rows=0 loops=1)"
    "        ->  Nested Loop Left Join  (cost=0.19..135404.33 rows=316 width=582) (actual time=5771.844..5771.844 rows=0 loops=1)"
    "              ->  Nested Loop Left Join  (cost=0.19..134052.55 rows=316 width=554) (actual time=5771.841..5771.841 rows=0 loops=1)"
    "                    ->  Merge Semi Join  (cost=0.19..132700.78 rows=316 width=526) (actual time=5771.837..5771.837 rows=0 loops=1)"
    "                          Merge Cond: (o.id = i.invoice_id)"
    "                          ->  Index Scan using invoices_pkey on invoices o  (cost=0.00..3907.27 rows=65000 width=526) (actual time=0.017..0.017 rows=1 loops=1)"
    "                          ->  Index Scan using invoiceitems_invoice_id_idx on invoiceitems i  (cost=0.00..129298.19 rows=316 width=4) (actual time=5771.812..5771.812 rows=0 loops=1)"
    "                                Filter: ((name IS NOT NULL) AND (name ~~* '%df%'::text))"
    "                                Rows Removed by Filter: 3281518"
    "                    ->  Index Scan using terms_pkey on terms t  (cost=0.00..4.27 rows=1 width=36) (never executed)"
    "                          Index Cond: (id = o.term_id)"
    "              ->  Index Scan using reps_pkey on reps r  (cost=0.00..4.27 rows=1 width=36) (never executed)"
    "                    Index Cond: (id = o.rep_id)"
    "        ->  Index Scan using shipvia_pkey on shipvia s  (cost=0.00..4.27 rows=1 width=36) (never executed)"
    "              Index Cond: (id = o.ship_via_id)"
    "Total runtime: 5771.948 ms"
    

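I did not try the third option, which orders invoiceitems rows by distinct invoice_id, because that approach only seems feasible when no sort order is given, whereas usually the opposite holds: a sort order is present.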

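Best answer

Index

Trigram index

Use a trigram index, provided by the additional module pg_trgm. It provides operator classes for GIN and GiST indexes that support all LIKE (and ILIKE) patterns, not just left-anchored ones.

Find an overview of pattern matching and indexes in this related answer on dba.SE.
More about how to use trigram indexes in this related answer (among many others):
PostgreSQL LIKE query performance variations

Example: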

    CREATE EXTENSION pg_trgm;  -- only once per db
    
    CREATE INDEX invoiceitems_name_gist_trgm_idx
    ON invoiceitems USING gist (name gist_trgm_ops);
    

A GIN index will probably be faster, but also bigger. I quote the manual:

    As a rule of thumb, a GIN index is faster to search than a GiST index, but slower to build or update; so GIN is better suited for static data and GiST for often-updated data.



It all depends on your exact requirements.

Additional btree index

Of course, you also need a plain (default) btree index on invoiceitems.invoice_id:
    CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems (invoice_id);
    

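Multicolumn index for index-only scans

With Postgres 9.2 or later, you might get some additional benefit from making this index "covering" for an index-only scan. A GIN index normally makes no sense for an integer column like invoice_id. But to save additional heap lookups, it may be worth including it in a multicolumn GIN (or GiST) index. You will have to test.

For that you need the additional module btree_gin (or btree_gist, respectively). GIN example: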
    CREATE EXTENSION btree_gin;
    
    CREATE INDEX invoiceitems_name_gin_trgm_idx
    ON invoiceitems USING gin (name gin_trgm_ops, invoice_id);
    

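This would remove the need for the btree index above, but be sure to create that one anyway: it makes foreign key checks faster on its own, and it helps in many other cases.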
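
For the GiST side of the "btree_gin (or btree_gist)" remark, a minimal sketch of the same multicolumn index could look as follows. The index name is made up for illustration, and the sketch assumes pg_trgm is already installed; whether it beats the GIN version on this data is something to measure, not something this example shows.

```sql
-- btree_gist supplies GiST operator classes for plain scalar types
-- such as integer, which a multicolumn GiST index needs here.
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- once per database

-- Hypothetical index name; the columns come from the question's schema.
CREATE INDEX invoiceitems_name_invoice_id_gist_idx
ON invoiceitems USING gist (name gist_trgm_ops, invoice_id);
```

Comparing EXPLAIN ANALYZE output for the same query with each index in place is probably the simplest way to choose between the two.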

Queries

Count

For a ...

    query to get the count of the invoices

... leave out the other tables, which can only do harm (if anything):
    SELECT COUNT(DISTINCT item.invoice_id)
    FROM   invoiceitems item
    JOIN   invoices o ON item.invoice_id = o.id
    WHERE  item.name ILIKE '%ac%';

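Since your foreign key constraint guarantees referential integrity, you can even omit the table invoices from this query. Your shiny new index should kick in!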

(The updated table information made the EXISTS variant I proposed in my first draft obsolete.)
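
The remark about omitting invoices can be sketched like this. The first statement is the count taken from invoiceitems alone; the second is one way to check whether the planner picks the trigram index. The pattern '%acm%' is a made-up three-character example, not something from the question.

```sql
-- Count matching invoices from invoiceitems alone. The foreign key on
-- invoice_id guarantees that every value refers to an existing invoice.
SELECT count(DISTINCT invoice_id)
FROM   invoiceitems
WHERE  name ILIKE '%ac%';

-- Look at the chosen plan and the actual run time.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(DISTINCT invoice_id)
FROM   invoiceitems
WHERE  name ILIKE '%acm%';   -- hypothetical pattern with three characters
```

One thing to watch for when reading the plans: the pg_trgm documentation notes that a pattern with no extractable trigrams degenerates to a full-index scan, so a two-character pattern such as '%ac%' may still end up as a sequential scan even with the index in place.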

Return rows

For returning the rows, EXISTS is still a good choice:
    SELECT t.terms, r.rep, s.ship_via, ...
    FROM   invoices     o
    LEFT   JOIN terms   t ON t.id = o.term_id
    LEFT   JOIN reps    r ON r.id = o.rep_id
    LEFT   JOIN shipVia s ON s.id = o.ship_via_id
    WHERE EXISTS (
       SELECT 1
       FROM   invoiceitems i 
       WHERE  i.invoice_id = o.id
       AND    i.name ILIKE '%ac%'
       )
    -- ORDER BY ???
    LIMIT 100;
    

Or you can test this variant, which joins to a subselect of the query above. It might be faster:
    SELECT t.terms, r.rep, s.ship_via, ...
    FROM  (
       SELECT DISTINCT invoice_id
       FROM   invoiceitems
       WHERE  name ILIKE '%ac%'
       ORDER  BY invoice_id           -- order by id = cheapest with above index
       LIMIT  100                     -- LIMIT early!
       ) item
    JOIN   invoices     o ON o.id = item.invoice_id
    LEFT   JOIN terms   t ON t.id = o.term_id
    LEFT   JOIN reps    r ON r.id = o.rep_id
    LEFT   JOIN shipVia s ON s.id = o.ship_via_id
    -- ORDER BY ???
    ;
    

This example takes the first 100 by invoice_id (since you did not provide a sort order). It all depends on the details ...
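
When a sort order on an invoice column is required, as the question's second edit describes, the EXISTS variant is the one that extends naturally. A rough sketch, assuming the user sorts by o.total (a column that appears in the question's sort key) and that pages hold 100 rows:

```sql
SELECT o.id, t.terms, r.rep, s.ship_via, o.total
FROM   invoices     o
LEFT   JOIN terms   t ON t.id = o.term_id
LEFT   JOIN reps    r ON r.id = o.rep_id
LEFT   JOIN shipVia s ON s.id = o.ship_via_id
WHERE  EXISTS (
   SELECT 1
   FROM   invoiceitems i
   WHERE  i.invoice_id = o.id
   AND    i.name ILIKE '%ac%'
   )
ORDER  BY o.total DESC, o.id   -- assumed sort column; o.id breaks ties
LIMIT  100
OFFSET 0;                      -- page n would start at (n - 1) * 100
```

Adding the primary key as the last sort column keeps the order deterministic, so rows do not move between pages when several invoices share the same total.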

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/19106982/
