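I created the following stored procedure to pull data from a source table and insert only the "new" records into a target table. I need to run it every minute. The challenge is that the table stores images and has more than 2 million rows.
When I ran the stored procedure, it was still running after more than 1 hour 22 minutes. I have already moved all of the images into this table, so there are no new images to pull over; I assume it is still reading the table looking for new records. The job is not blocked and is still in a runnable state.
Is there any way to optimize this stored procedure so that it pulls only new records? My goal is to optimize it so that we can run it on a recurring schedule. The business requirement is to run it every minute, but based on the current results that is not possible.
I have included the source and target table schemas so you can see all the columns I have to work with. I am not sure whether I can use created_date in the source table to filter the data so that it does not scan everything. The source table also has an identity column (acc_image_id). I am not sure how I should modify the stored procedure to pull data faster based on max(identity).
FYI, I can add columns to the target table if that would help filter the data further so that it only looks for recent images.
Source table: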
CREATE TABLE [dbo].[acc_image]
(
[acc_image_id] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
[acc_id] [int] NULL,
[image_type_id] [int] NOT NULL,
[data_format] [varchar](10) NOT NULL,
[label] [varchar](50) NOT NULL,
[description] [varchar](255) NULL,
[image_width] [smallint] NULL,
[image_height] [smallint] NULL,
[image_color_depth] [tinyint] NULL,
[image_thumbnail] [image] NOT NULL,
[data] [image] NOT NULL,
[notes] [text] NULL,
[acc_specimen_id] [int] NULL,
[acc_slide_id] [int] NULL,
[include_in_report] [char](1) NOT NULL,
[include_in_internet] [char](1) NOT NULL,
[created_date] [datetime] NOT NULL,
[row_version] [timestamp] NOT NULL,
[report_section_number] [smallint] NULL,
[external_notes] [text] NULL,
[source_filename] [varchar](255) NULL,
[page_count] [int] NOT NULL,
[include_as_attachment] [char](1) NOT NULL,
[sort_order] [smallint] NULL,
[acc_parent_image_id] [int] NULL,
[created_by_id] [int] NULL,
[annotation] [text] NULL,
[specimen_results_enabled] [char](1) NOT NULL,
[dis_imageserver_id] [int] NULL,
[external_slide_image_id] [varchar](80) NULL,
[external_report_image_id] [varchar](80) NULL,
)
Here is the target table schema. It has an insert_date column, which I don't know whether I can use for optimization:
CREATE TABLE [dbo].[acc_image]
(
[acc_image_id] [int] NOT NULL,
[acc_id] [int] NULL,
[image_type_id] [int] NOT NULL,
[data_format] [varchar](10) NOT NULL,
[label] [varchar](50) NOT NULL,
[description] [varchar](255) NULL,
[image_width] [int] NULL,
[image_height] [int] NULL,
[image_color_depth] [tinyint] NULL,
[image_thumbnail] [image] NOT NULL,
[data] [image] NOT NULL,
[image_guid] [uniqueidentifier] NULL,
[created_date] [datetime] NOT NULL,
[row_version] [varbinary](12) NOT NULL,
[sort_order] [int] NULL,
[insert_date] [datetime] NOT NULL,
[updated_date] [datetime] NULL,
)
Here is the stored procedure code:
ALTER PROCEDURE [dbo].[get_image]
AS
BEGIN
DECLARE @RecCt AS int = 0
BEGIN TRY
INSERT INTO connect_onprem.dbo.acc_image (acc_image_id, acc_id, image_type_id,
data_format, label, description,
image_width, image_height,
image_color_depth, image_thumbnail,
data, created_date, row_version, sort_order)
SELECT
src.acc_image_id, a.id, src.image_type_id,
src.data_format, src.label, src.description,
src.image_width, src.image_height,
src.image_color_depth, src.image_thumbnail,
src.data, src.created_date, src.row_version, src.sort_order
FROM
[ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].accession_2 a
INNER JOIN
[ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_specimen s ON a.primary_specimen_id = s.id
INNER JOIN
[ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_slide ass ON s.id = ass.acc_specimen_id
--source table
INNER JOIN
[ARKPPTEST\POWERPATHTEST].[Powerpath_Test].[dbo].acc_image src ON ass.id = src.acc_slide_id
--target table
LEFT JOIN
connect_onprem.dbo.acc_image tgt ON src.acc_image_id = tgt.acc_image_id
WHERE
tgt.acc_image_id IS NULL
AND a.acc_type_id <> 134
AND a.status_final = 'Y'
ORDER BY
acc_image_id
SET @RecCt = @@ROWCOUNT
IF @RecCt > 0
BEGIN
INSERT INTO connect_onprem.dbo.ErrorLog (UserName, ErrorNumber, ErrorState,
ErrorSeverity, ErrorLine, ErrorProcedure, ErrorMsg, ErrorDateTime)
Values ('RecTrack', @RecCt, 0, 0, 0, 'get_image',
'connect_onprem.dbo.acc_image Inserted Records', GETDATE());
END
END TRY
BEGIN CATCH
INSERT INTO connect_onprem.dbo.ErrorLog (UserName, ErrorNumber, ErrorState, ErrorSeverity,
ErrorLine, ErrorProcedure, ErrorMsg, ErrorDateTime)
VALUES (SUSER_SNAME(), ERROR_NUMBER(), ERROR_STATE(), ERROR_SEVERITY(), ERROR_LINE(),
ERROR_PROCEDURE(), ERROR_MESSAGE(), GETDATE());
END CATCH
END
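Goal: optimize this to run as fast as possible. It reads the source table through a linked server across the network and inserts the "new" records into the target table.
Currently the stored procedure runs for more than 1 hour 22 minutes. I need to change it somehow so that it can run at least once every minute, if possible.
Best answer
I suspect that the remote part of the query is retrieving and transferring all of the selected data (including the potentially large data and image_thumbnail values) from the remote database before checking locally whether a local copy already exists.
I assume that once the initial load is complete, most of the retrieved data is discarded as duplicates. For example, your remote source may hold 1 million records while only 1,000 of them need to be inserted into the target table as new rows.
A solution might be to first select only the ID values into a temp table (or table variable), then use that in a second query to select and insert the final data. Retrieving a million ID values to pre-check should be far faster than retrieving a million complete rows. After the IDs have been filtered, retrieving just those 1,000 rows should complete relatively quickly.
Something like: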
DECLARE @Selected_Ids TABLE(acc_image_id int not null)
-- Pre-select IDs
INSERT INTO @Selected_Ids(acc_image_id)
SELECT
src.acc_image_id
FROM [RemoteServer].[Powerpath_Test].[dbo].accession_2 a
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_specimen s ON
a.primary_specimen_id = s.id
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_slide ass ON
s.id = ass.acc_specimen_id
--source table
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_image src on
ass.id = src.acc_slide_id
--target table
LEFT JOIN connect_onprem.dbo.acc_image tgt ON src.acc_image_id = tgt.acc_image_id
WHERE tgt.acc_image_id is null
AND a.acc_type_id <> 134
AND a.status_final = 'Y'
-- order by acc_image_id
-- Main select
INSERT INTO connect_onprem.dbo.acc_image(acc_image_id, acc_id, image_type_id,
data_format, label, description, image_width, image_height, image_color_depth, image_thumbnail,
data, created_date, row_version, sort_order)
SELECT
src.acc_image_id,
a.id,
src.image_type_id,
src.data_format,
src.label,
src.description,
src.image_width,
src.image_height,
src.image_color_depth,
src.image_thumbnail,
src.data,
src.created_date,
src.row_version,
src.sort_order
FROM [RemoteServer].[Powerpath_Test].[dbo].accession_2 a
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_specimen s ON
a.primary_specimen_id = s.id
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_slide ass ON
s.id = ass.acc_specimen_id
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_image src on
ass.id = src.acc_slide_id
WHERE src.acc_image_id IN (SELECT sel.acc_image_id FROM @Selected_Ids sel)
--order by src.acc_image_id
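If you prefer the temp table mentioned above over a table variable, the pre-select could look roughly like the following. This is only a sketch: #Selected_Ids is a hypothetical name, [RemoteServer] is the same placeholder used above, and the DISTINCT is my own addition so that the primary key cannot fail if the joins ever return the same image twice. A temp table has statistics, so the optimizer may estimate the second query better than it would with a table variable, but that is something to confirm in the execution plan.

```sql
-- Sketch only: temp-table variant of the pre-select.
-- #Selected_Ids is a hypothetical name; [RemoteServer] is a placeholder.
IF OBJECT_ID('tempdb..#Selected_Ids') IS NOT NULL
    DROP TABLE #Selected_Ids;

CREATE TABLE #Selected_Ids
(
    acc_image_id int NOT NULL PRIMARY KEY
);

-- Pre-select only the narrow ID column; no image data crosses the network here.
INSERT INTO #Selected_Ids (acc_image_id)
SELECT DISTINCT   -- assumption: guards the primary key against duplicate IDs
    src.acc_image_id
FROM [RemoteServer].[Powerpath_Test].[dbo].accession_2 a
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_specimen s
    ON a.primary_specimen_id = s.id
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_slide ass
    ON s.id = ass.acc_specimen_id
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_image src
    ON ass.id = src.acc_slide_id
LEFT JOIN connect_onprem.dbo.acc_image tgt
    ON src.acc_image_id = tgt.acc_image_id
WHERE tgt.acc_image_id IS NULL
  AND a.acc_type_id <> 134
  AND a.status_final = 'Y';
```

The main select would then filter on `src.acc_image_id IN (SELECT sel.acc_image_id FROM #Selected_Ids sel)` in place of the table variable.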
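It looks like you can eliminate several joins in the second query by saving the accession ID during the pre-select. Since all filters have already been applied and the remaining data comes from the acc_image table, the other tables no longer need to be referenced.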
DECLARE @Selected_Ids TABLE(acc_image_id int not null, acc_id int not null)
-- Pre-select IDs
INSERT INTO @Selected_Ids(acc_image_id, acc_id)
SELECT
src.acc_image_id,
a.id
FROM [RemoteServer].[Powerpath_Test].[dbo].accession_2 a
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_specimen s ON
a.primary_specimen_id = s.id
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_slide ass ON
s.id = ass.acc_specimen_id
--source table
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_image src on
ass.id = src.acc_slide_id
--target table
LEFT JOIN connect_onprem.dbo.acc_image tgt ON src.acc_image_id = tgt.acc_image_id
WHERE tgt.acc_image_id is null
AND a.acc_type_id <> 134
AND a.status_final = 'Y'
-- order by acc_image_id
-- Main select
INSERT INTO connect_onprem.dbo.acc_image(acc_image_id, acc_id, image_type_id,
data_format, label, description, image_width, image_height, image_color_depth, image_thumbnail,
data, created_date, row_version, sort_order)
SELECT
src.acc_image_id,
sel.acc_id,
src.image_type_id,
src.data_format,
src.label,
src.description,
src.image_width,
src.image_height,
src.image_color_depth,
src.image_thumbnail,
src.data,
src.created_date,
src.row_version,
src.sort_order
FROM @Selected_Ids sel
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_image src on
sel.acc_image_id = src.acc_image_id
--order by src.acc_image_id
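The query engine might perform a loop join that retrieves the remote acc_image rows one at a time, or it might send the list of all pre-selected IDs to the remote server for bulk retrieval. You should run performance tests on both variants and examine the resulting execution plans.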
You could even try the following hybrid of the two variants in the second query:
...
FROM [RemoteServer].[Powerpath_Test].[dbo].acc_image src
INNER JOIN @Selected_Ids sel on
src.acc_image_id = sel.acc_image_id
WHERE src.acc_image_id IN (SELECT sel1.acc_image_id FROM @Selected_Ids sel1)
(Note that I commented out the order by, since I believe it is unnecessary.)
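On the max(identity) idea raised in the question: a high-water-mark filter could shrink the pre-select further, because no join to the local target table is needed and the whole query may be eligible to run on the remote server. The sketch below rests on one important assumption that I cannot verify from the question: that an image never becomes eligible (for example, its accession turning status_final = 'Y') after images with higher IDs have already been copied. If that can happen, this filter alone would skip rows, and the anti-join pre-select above would still be needed, at least as a periodic catch-up.

```sql
-- Sketch only: high-water-mark pre-select based on the identity column.
-- Assumption: rows never become eligible after higher IDs were already copied.
DECLARE @MaxId int;

-- Highest ID already present in the local target table (0 if it is empty).
SELECT @MaxId = ISNULL(MAX(tgt.acc_image_id), 0)
FROM connect_onprem.dbo.acc_image tgt;

DECLARE @Selected_Ids TABLE (acc_image_id int NOT NULL, acc_id int NOT NULL);

INSERT INTO @Selected_Ids (acc_image_id, acc_id)
SELECT
    src.acc_image_id,
    a.id
FROM [RemoteServer].[Powerpath_Test].[dbo].accession_2 a
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_specimen s
    ON a.primary_specimen_id = s.id
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_slide ass
    ON s.id = ass.acc_specimen_id
INNER JOIN [RemoteServer].[Powerpath_Test].[dbo].acc_image src
    ON ass.id = src.acc_slide_id
WHERE src.acc_image_id > @MaxId   -- only rows newer than the local maximum
  AND a.acc_type_id <> 134
  AND a.status_final = 'Y';
```

The second query from above can then be run unchanged against @Selected_Ids.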