由于扫描了数百万条记录,我的查询非常慢。该查询搜索特定范围内的数字数量。
我有 2 个表:numbers_in_ranges 和 person 表
Create table numbers_in_ranges
( range_id number(9,0) ,
begin_range number(9,0),
end_range number(9,0)
) ;
Create table person
(
id integer,
a_number varchar(9),
first_name varchar(25),
last_name varchar(25)
);
numbers_in_ranges 的数据
range_id| begin_range | end_range
--------|------------------------
101 | 100000000 | 200000000
102 | 210000000 | 290000000
103 | 350000000 | 459999999
104 | 461000000 | 569999999
106 | 241000000 | 241999999
e.t.c.
人的数据
id | a_number | first_name | last_name
---|------------|------------|-----------
1 | 100000001 | Maria | Doe
2 | 100000999 | Emily | Davis
3 | 150000000 | Dave | Smith
4 | 461000000 | Jane | Jones
6 | 241000001 | John | Doe
7 | 100000002 | Maria | Doe
8 | 100009999 | Emily | Davis
9 | 150000010 | Dave | Smith
10 | 210000001 | Jane | Jones
11 | 210000010 | John | Doe
12 | 281000000 | Jane | Jones
13 | 241000000 | John | Doe
14 | 460000001 | Maria | Doe
15 | 500000999 | Emily | Davis
16 | 550000010 | Dave | Smith
17 | 461000010 | Jane | Jones
18 | 241000020 | John | Doe
e.t.c.
我们通过数据库链接从远程数据库获取范围数据并将其存储在物化 View 中。
查询
select nums.range_id, count(p. a_number) as a_count
from number_in_ranges nums
left join person p on to_number(p. a_number)
between nums.begin_range and nums.end_range
group by nums.range_id;
结果是这样的
range_id| a_count
--------|------------------------
101 | 6
102 | 5
103 | 2
104 | 3
e.t.c
正如我所说,这个查询非常慢。
这是解释计划
Plan hash value: 3785994407
---------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9352 | 264K| | 42601 (31)| 00:00:02 | | | |
| 1 | PX COORDINATOR | | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,02 | P->S | QC (RAND) |
| 3 | HASH GROUP BY | | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,02 | PCWP | |
| 4 | PX RECEIVE | | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,02 | PCWP | |
| 5 | PX SEND HASH | :TQ10001 | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,01 | P->P | HASH |
| 6 | HASH GROUP BY | | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,01 | PCWP | |
| 7 | MERGE JOIN OUTER | | 2084M| 56G| | 37793 (23)| 00:00:02 | Q1,01 | PCWP | |
| 8 | SORT JOIN | | 9352 | 173K| | 3 (34)| 00:00:01 | Q1,01 | PCWP | |
| 9 | PX BLOCK ITERATOR | | 9352 | 173K| | 2 (0)| 00:00:01 | Q1,01 | PCWC | |
| 10 | MAT_VIEW ACCESS FULL | NUMBERS_IN_RANGES | 9352 | 173K| | 2 (0)| 00:00:01 | Q1,01 | PCWP | |
|* 11 | FILTER | | | | | | | Q1,01 | PCWP | |
|* 12 | SORT JOIN | | 89M| 850M| 2732M| 29681 (1)| 00:00:02 | Q1,01 | PCWP | |
| 13 | BUFFER SORT | | | | | | | Q1,01 | PCWC | |
| 14 | PX RECEIVE | | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,01 | PCWP | |
| 15 | PX SEND BROADCAST | :TQ10000 | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,00 | P->P | BROADCAST |
| 16 | PX BLOCK ITERATOR | | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,00 | PCWC | |
| 17 | INDEX FAST FULL SCAN| PERSON_AN_IDX | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,00 | PCWP | |
---------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
11 - filter("NUMS"."END_RANGE">=TO_NUMBER("P"."A_NUMBER"(+)))
12 - access("NUMS"."BEGIN_RANGE"<=TO_NUMBER("P"."A_NUMBER"(+)))
filter("NUMS"."BEGIN_RANGE"<=TO_NUMBER("P"."A_NUMBER"(+)))
Note
-----
- automatic DOP: Computed Degree of Parallelism is 16 because of degree limit
我尝试运行该月的 deltas
,然后将它们附加到表中,例如:
如果找到新的 range_id
然后
插入
如果找到 range_id then
更新
所以我们不必扫描整个表。
但是这个解决方案并没有奏效,因为一些范围被更新了,并且发生了拼接,例如:
我们创建一个新的 range_id = 110
,范围在 100110000
和 210000001
之间
然后将range_id = 101
拼接为100000000
和100110000
而range_id = 102
拼接为100110001
和210000000
;
现在我想创建一个触发器,用于在创建或更新新范围时更新该表;然而,这是不可能的,因为我们是从将数据存储到物化 View 中的远程数据库获取此数据,并且我们无法在只读物化 View 上放置触发器。
我的问题是还有其他方法可以执行此操作或优化此查询吗?
谢谢!
最佳答案
问题是 Oracle 尝试广播包含所有 ID 的表,对于这种情况看起来很奇怪。
但是,由于您只需要计算行数并且(看起来)间隔不重叠,您可以提高性能并避免使用技巧join
两个数据集:将数据转换为事件流,其中每个开始和结束值标识系列的开始和结束,然后计算该系列中的事件数。这样你就可以使用 match_recognize
这比 join
快得多。
查询将是:
with ranges_unpivot as (
/*Transform from_ ... to_... to the event-like structure*/
select
id
, val
, val_type
from ranges_table
unpivot(
val for val_type in (from_num as '01_START', to_num as '03_END')
)
union all
/*Append the rest of the data to the event stream*/
select
null,
id,
/*
This should be ordered between START mark and END mark
to process edge cases correctly
*/
'02_val'
from other_table
where id <= (select max(to_num) from ranges_table)
)
select /*+parallel(4) gather_plan_statistics*/ *
from ranges_unpivot
match_recognize (
order by val asc, val_type asc
measures
start_.id as range_id,
count(values_.val) as count_
pattern (start_ values_* end_)
define
start_ as val_type = '01_START',
values_ as val_type = '02_val',
end_ as val_type = '03_END'
)
此时在查询计划中显示:
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.33
与join
查询相比:
select /*+gather_plan_statistics*/
rt.id as range_id,
count(ot.id) as count_
from ranges_table rt
left join other_table ot
on rt.from_num <= ot.id
and rt.to_num >= ot.id
group by rt.id
显示:
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:13.84 |
参见 db<>fiddle .
关于sql - 通过 Oracle SQL 检索范围内的数字,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/73656086/