regex - Update query too slow on Postgres 9.1

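My problem is a very slow update query on a table with 14 million rows. I have tried different ways of tuning my server, which gave good performance in general, but not for update queries.

I have two tables:

T1 has 4 columns and 3 indexes (530 rows)
T2 has 15 columns and 3 indexes (14 million rows)

I want to update the field vid (type integer) in T2 with the value of vid from T1, by joining the two tables on the text field stxt.

Here is my query and its output: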

explain analyse 
update T2 
  set vid=T1.vid 
from T1 
where stxt2 ~ stxt1 and T2.vid = 0;

Update on T2  (cost=0.00..9037530.59 rows=2814247 width=131) (actual time=25141785.741..25141785.741 rows=0 loops=1)
 ->  Nested Loop  (cost=0.00..9037530.59 rows=2814247 width=131) (actual time=32.636..25035782.995 rows=679354 loops=1)
             Join Filter: ((T2.stxt2)::text ~ (T1.stxt1)::text)
             ->  Seq Scan on T2  (cost=0.00..594772.96 rows=1061980 width=121) (actual time=0.067..5402.614 rows=1037809 loops=1)
                         Filter: (vid= 1)
             ->  Materialize  (cost=0.00..17.95 rows=530 width=34) (actual time=0.000..0.069 rows=530 loops=1037809)
                         ->  Seq Scan on T1  (cost=0.00..15.30 rows=530 width=34) (actual time=0.019..0.397 rows=530 loops=1)
Total runtime: 25141785.904 ms

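As you can see, the query took about 25141 seconds (roughly 7 hours). If I understand correctly, the planner estimated the execution time at 9037 seconds (~2.5 hours). Am I missing something here?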

Here is some information about my server configuration:

CentOS 5.8, 20GB RAM
shared_buffers = 12GB
work_mem = 64MB
maintenance_work_mem = 64MB
bgwriter_lru_maxpages = 500
checkpoint_segments = 64
checkpoint_completion_target = 0.9
effective_cache_size = 10GB

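I have run vacuum full and analyze on table T2 several times, but that still did not improve the situation much.

PS: If I set full_page_writes to off, the update query improves considerably, but I don't want to risk data loss. Do you have any suggestions?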

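Best answer

This is not a solution, but a data-modelling workaround.

Break the URLs down into {protocol,hostname,pathname} components.
Now you can join on the hostname part using exact matches, avoiding the leading % in the regex match.
The view is intended to demonstrate that the full_url can be reconstructed if needed.

The update could take a few minutes.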

SET search_path='tmp';

DROP TABLE IF EXISTS urls CASCADE;
CREATE TABLE urls
        ( id SERIAL NOT NULL PRIMARY KEY
        , full_url varchar
        , proto varchar
        , hostname varchar
        , pathname varchar
        );

INSERT INTO urls(full_url) VALUES
 ( 'ftp://www.myhost.com/secret.tgz' )
,( 'http://www.myhost.com/robots.txt' )
,( 'http://www.myhost.com/index.php' )
,( 'https://www.myhost.com/index.php' )
,( 'http://www.myhost.com/subdir/index.php' )
,( 'https://www.myhost.com/subdir/index.php' )
,( 'http://www.hishost.com/index.php' )
,( 'https://www.hishost.com/index.php' )
,( 'http://www.herhost.com/index.php' )
,( 'https://www.herhost.com/index.php' )
        ;

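-- Step 1: split at '://'; hostname temporarily holds "host/path".
UPDATE urls
SET proto = split_part(full_url, '://' , 1)
        , hostname = split_part(full_url, '://' , 2)
        ;

-- Step 2: split "host/path" at the first '/'. Both expressions read the
-- old value of hostname, so the order of the assignments does not matter.
UPDATE urls
SET pathname = substr(hostname, 1+strpos(hostname, '/' ))
        , hostname = split_part(hostname, '/' , 1)
        ;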

        -- the full_url field is now redundant: we can drop it
ALTER TABLE urls
        DROP column full_url
        ;
        -- and we could always reconstruct the full_url from its components.
CREATE VIEW vurls AS (
        SELECT id
        , proto || '://' || hostname || '/' || pathname AS full_url
        , proto
        , hostname
        , pathname
        FROM urls
        );

SELECT * FROM urls;
SELECT * FROM vurls;

Output:

INSERT 0 10
UPDATE 10
UPDATE 10
ALTER TABLE
CREATE VIEW
 id | proto |    hostname     |     pathname     
----+-------+-----------------+------------------
  1 | ftp   | www.myhost.com  | secret.tgz
  2 | http  | www.myhost.com  | robots.txt
  3 | http  | www.myhost.com  | index.php
  4 | https | www.myhost.com  | index.php
  5 | http  | www.myhost.com  | subdir/index.php
  6 | https | www.myhost.com  | subdir/index.php
  7 | http  | www.hishost.com | index.php
  8 | https | www.hishost.com | index.php
  9 | http  | www.herhost.com | index.php
 10 | https | www.herhost.com | index.php
(10 rows)

 id |                full_url                 | proto |    hostname     |     pathname     
----+-----------------------------------------+-------+-----------------+------------------
  1 | ftp://www.myhost.com/secret.tgz         | ftp   | www.myhost.com  | secret.tgz
  2 | http://www.myhost.com/robots.txt        | http  | www.myhost.com  | robots.txt
  3 | http://www.myhost.com/index.php         | http  | www.myhost.com  | index.php
  4 | https://www.myhost.com/index.php        | https | www.myhost.com  | index.php
  5 | http://www.myhost.com/subdir/index.php  | http  | www.myhost.com  | subdir/index.php
  6 | https://www.myhost.com/subdir/index.php | https | www.myhost.com  | subdir/index.php
  7 | http://www.hishost.com/index.php        | http  | www.hishost.com | index.php
  8 | https://www.hishost.com/index.php       | https | www.hishost.com | index.php
  9 | http://www.herhost.com/index.php        | http  | www.herhost.com | index.php
 10 | https://www.herhost.com/index.php       | https | www.herhost.com | index.php
(10 rows)
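
With the URL broken into components, the regex join from the question could be replaced by an equality join. In the plan shown in the question, the nested loop evaluates the ~ join filter for every combination of the 1,037,809 filtered T2 rows and the 530 T1 rows, which is roughly 550 million regex evaluations. The following is only a sketch under assumptions: it supposes that both T1 and T2 have been given a hostname column populated the same way as urls.hostname above (these columns are hypothetical and do not exist in the original tables), and that each hostname maps to a single vid in T1. Since = can be used for hash and merge joins, the planner is no longer restricted to a nested loop.

```sql
-- Hypothetical columns: T1.hostname and T2.hostname are NOT part of the
-- original tables; they are assumed to be populated like urls.hostname.
-- Plain EXPLAIN (without ANALYSE) only shows the plan and changes no rows.
EXPLAIN
UPDATE T2
SET vid = T1.vid
FROM T1
WHERE T2.hostname = T1.hostname   -- exact match instead of stxt2 ~ stxt1
  AND T2.vid = 0;
```

If the plan looks reasonable (for example a hash join on hostname), dropping the EXPLAIN keyword runs the actual update.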

A similar question about "regex - Update query too slow on Postgres 9.1" can be found on Stack Overflow: https://stackoverflow.com/questions/11381908/
