database - Batching/splitting a PostgreSQL database

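I am working on a project that processes data in batches and fills a PostgreSQL database (9.6, but I could upgrade). The processing currently happens in separate steps, and each step adds data to a table that it owns (two processes rarely write to the same table, and when they do, they write to different columns).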

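The data tends to become more and more fine-grained with each step. As a simplified example, I have one table that defines the data sources. There are very few of them (tens to a few hundred), but each data source generates batches of data samples (batches and samples are separate tables that store metadata). Each batch typically produces about 50k samples. Each of these data points is then processed step by step, and each data sample generates more data points in the next table.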

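This worked fine until we reached 1.5 million rows in the sample table (which is not a lot of data from our point of view). Now, filtering for a batch is getting slow (about 10 ms for each sample we retrieve). It is becoming a major bottleneck, because getting the data for one batch takes 5-10 minutes of execution time (the fetch itself takes milliseconds).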

We have B-tree indexes on all foreign keys involved in these queries.

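Since our computations target individual batches, I normally do not need to query across batches during the computation (which is where the query time currently hurts). However, for data analysis, ad-hoc queries across batches need to remain possible.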

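So a very simple solution would be to create a separate database for each batch and somehow query across these databases when needed. If I had only one batch per database, filtering for a single batch would obviously be instantaneous and my problem would be solved (for now). However, I would then end up with thousands of databases, and the data analysis would be painful.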

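Within PostgreSQL, is there a way to pretend that I have separate databases for some queries? Ideally, I would like to set this up for each batch when I "register" a new batch.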
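One built-in feature that comes close to "a separate database per batch" is declarative table partitioning. The following is only a sketch under stated assumptions: it requires PostgreSQL 11 or later (9.6 offers only inheritance-based partitioning), it assumes the batch table from the schema below already exists, and sample_part and its partition names are hypothetical names that do not appear in the original schema.

```sql
-- Hypothetical partitioned variant of the sample table (PostgreSQL 11+).
create table sample_part
(
    id uuid default uuid_generate_v1mc() not null,
    sample_pos integer,
    id_batch uuid not null references batch,
    -- the primary key of a partitioned table must include the partition key
    constraint sample_part_pk primary key (id_batch, id)
)
partition by list (id_batch);

-- Run once when a new batch is "registered": one partition per batch.
create table sample_part_batch_75c04b9c
    partition of sample_part
    for values in ('75c04b9c-e4b9-11e7-a93f-132baa27ac91');

-- A filter on the partition key only has to scan the matching partition.
select id, sample_pos
from sample_part
where id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91';

-- Ad-hoc analysis across batches still goes through the parent table.
select id_batch, count(*)
from sample_part
group by id_batch;
```

Note that a foreign key pointing at a partitioned table (as sample_representation.id_sample would) needs PostgreSQL 12 or later and would have to reference both id_batch and id, so this layout is not a drop-in replacement for the schema shown below.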

Outside the PostgreSQL world, is there another database I should try for my use case?

Edit: DDL/schema

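In our current implementation, sample_representation is the table that all processing results depend on. A batch is really defined by the tuple (batch.id, representation.id). The query that I tried and described above as slow (10 ms per sample, which adds up to about 5 minutes for 50k samples) is: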

SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'

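We currently have about 1.5 million samples, 2 representations, and 460 batches (of which 49 have been processed; the others have no samples associated with them), which means each processed batch has about 30k samples on average. Some have about 50k.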

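The schema is below. There is some metadata associated with all tables, but I am not querying it in this case. The actual sample data is stored separately on disk rather than in the database, in case that makes a difference.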

create table batch
(
    id uuid default uuid_generate_v1mc() not null
        constraint batch_pk
            primary key,
    path text not null
        constraint unique_batch_path
            unique,
    id_data_source uuid
)
;
create table sample
(
    id uuid default uuid_generate_v1mc() not null
        constraint sample_pk
            primary key,
    sample_pos integer,
    id_batch uuid
        constraint batch_fk
            references batch
                on update cascade on delete set null
)
;
create index sample_sample_pos_index
    on sample (sample_pos)
;
create index sample_id_batch_sample_pos_index
    on sample (id_batch, sample_pos)

;
create table representation
(
    id uuid default uuid_generate_v1mc() not null
        constraint representation_pk
            primary key,
    id_data_source uuid
)
;
create table data_source
(
    id uuid default uuid_generate_v1mc() not null
        constraint data_source_pk
            primary key
)
;
alter table batch
    add constraint data_source_fk
        foreign key (id_data_source) references data_source
            on update cascade on delete set null
;
alter table representation
    add constraint data_source_fk
        foreign key (id_data_source) references data_source
            on update cascade on delete set null
;
create table sample_representation
(
    id uuid default uuid_generate_v1mc() not null
        constraint sample_representation_pk
            primary key,
    id_sample uuid
        constraint sample_fk
            references sample
                on update cascade on delete set null,
    id_representation uuid
        constraint representation_fk
            references representation
                on update cascade on delete set null
)
;
create unique index sample_representation_id_sample_id_representation_uindex
    on sample_representation (id_sample, id_representation)
;
create index sample_representation_id_sample_index
    on sample_representation (id_sample)
;
create index sample_representation_id_representation_index
    on sample_representation (id_representation)
;

Best answer

After some fiddling around I found a solution. I am still not sure, though, why the original query takes so much time:

SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'

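Everything is indexed, but the tables are relatively large, with 1.5 million rows each in sample_representation and sample. My guess is that the tables are joined first and only then filtered with WHERE. But even if the join produces a large intermediate result, it should not take that long?!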
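A way to check this guess, rather than assume the join-then-filter order, would be to ask the planner for the plan it actually executes. This is a minimal sketch against the schema from the question; the two quoted placeholders must be replaced with real UUIDs, and refreshing the statistics first is a suggestion that the original post does not report trying.

```sql
-- Refresh planner statistics; stale row estimates after a bulk load are one
-- possible (unconfirmed) cause of a poor join strategy.
ANALYZE sample;
ANALYZE sample_representation;

-- Execute the query and report the chosen plan with timings and buffer usage.
-- Replace 'representation-uuid' and 'batch-uuid' with real UUID values.
EXPLAIN (ANALYZE, BUFFERS)
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid'
  AND sample.id_batch = 'batch-uuid';
```

In the output, a large gap between estimated and actual row counts, or a nested loop that runs tens of thousands of times, would point to where the time goes.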

Anyway, I tried using CTEs instead of joining the two "massive" tables. The idea was to filter early and join afterwards:

WITH sel_samplerepresentation AS (
  -- rows of one representation
  SELECT *
  FROM sample_representation
  WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a'
  ), sel_samples AS (
  -- samples of one batch
  SELECT *
  FROM sample
  WHERE id_batch='75c04b9c-e4b9-11e7-a93f-132baa27ac91'
)
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample

This query also takes forever. The reason here is clear: sel_samples and sel_samplerepresentation have 50k records each, and the join happens on non-indexed columns of the CTEs.
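For context, before PostgreSQL 12 every CTE is computed in full and acts as an optimization fence, so the indexes on the base tables cannot be used for the final join. The sketch below expresses the same early-filtering idea with inline subqueries, which the 9.6 planner is free to flatten. It has not been timed on this data, and the planner may well choose the same plan as for the original join, so treat it as a diagnostic rather than a fix. It filters on id_batch as in the posted schema.

```sql
SELECT s.sample_pos, sr.id
FROM (
    -- rows of one representation
    SELECT id, id_sample
    FROM sample_representation
    WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a'
) AS sr
JOIN (
    -- samples of one batch
    SELECT id, sample_pos
    FROM sample
    WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
) AS s ON s.id = sr.id_sample;
```

On PostgreSQL 12 or later, writing each CTE as `AS NOT MATERIALIZED (...)` has the same effect while keeping the WITH syntax.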

Since CTEs have no indexes, I reformulated them as materialized views, to which indexes can be added:

-- Snapshot of the rows belonging to one representation
CREATE MATERIALIZED VIEW sel_samplerepresentation AS (
  SELECT *
  FROM sample_representation
  WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a'
  );

-- Snapshot of the samples belonging to one batch
CREATE MATERIALIZED VIEW sel_samples AS (
  SELECT *
  FROM sample
  WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
);

-- Index the join columns of both snapshots
CREATE INDEX sel_samplerepresentation_sample_id_index ON sel_samplerepresentation (id_sample);
CREATE INDEX sel_samples_id_index ON sel_samples (id);

SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;

-- Clean up the snapshots once the result has been fetched
DROP MATERIALIZED VIEW sel_samplerepresentation;
DROP MATERIALIZED VIEW sel_samples;

This is more of a hack than a solution, but executing these queries takes 1 second! (down from 8 minutes)
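A variation on the same hack, sketched under the assumption that the intermediate results are only needed inside one transaction: temporary tables created with ON COMMIT DROP are removed automatically, so no explicit DROP statements are required. The tmp_ table names are hypothetical, and the filter uses id_batch as in the posted schema.

```sql
BEGIN;

-- Rows of one representation, keeping only the columns the join needs
CREATE TEMP TABLE tmp_sel_samplerepresentation ON COMMIT DROP AS
SELECT id, id_sample
FROM sample_representation
WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a';

-- Samples of one batch
CREATE TEMP TABLE tmp_sel_samples ON COMMIT DROP AS
SELECT id, sample_pos
FROM sample
WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91';

CREATE INDEX ON tmp_sel_samplerepresentation (id_sample);
CREATE INDEX ON tmp_sel_samples (id);

-- Autovacuum does not analyze temporary tables, so gather statistics by hand
ANALYZE tmp_sel_samplerepresentation;
ANALYZE tmp_sel_samples;

SELECT tmp_sel_samples.sample_pos, tmp_sel_samplerepresentation.id
FROM tmp_sel_samplerepresentation
JOIN tmp_sel_samples ON tmp_sel_samples.id = tmp_sel_samplerepresentation.id_sample;

COMMIT;  -- both temporary tables are dropped here
```

Because temporary tables are private to the session, several batches could be processed in parallel sessions without name clashes, which the fixed materialized view names above would not allow.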

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/47924749/
