tsql - T-SQL 中的编辑距离

标签 tsql edit-distance levenshtein-distance

我对 T-SQL 计算 Levenshtein 距离的算法感兴趣。

最佳答案

我在 TSQL 中实现了标准 Levenshtein 编辑距离函数,并进行了多项优化,与我所知的其他版本相比,速度有所提高。如果两个字符串的开头有共同的字符(共享前缀),结尾有共同的字符(共享后缀),并且当字符串很大并且提供了最大编辑距离时,速度的提高是显着的。例如,当输入是两个非常相似的 4000 个字符串,并且指定最大编辑距离为 2 时,这几乎比接受的答案中的 edit_distance_within 函数快三个数量级,返回在 0.073 秒(73 毫秒)内给出答案,而在 55 秒内给出答案。它的内存效率也很高,使用的空间等于两个输入字符串中较大的一个加上一些常量空间。它使用表示列的单个 nvarchar“数组”,并在其中就地进行所有计算,以及一些辅助 int 变量。

优化:

  • 跳过共享前缀和/或后缀的处理
  • 如果较大的字符串以整个较小的字符串开头或结尾,则提前返回
  • 如果尺寸差异保证超出最大距离,请提前返回
  • 仅使用表示矩阵中的列的单个数组(实现为 nvarchar)
  • 当给定最大距离时,时间复杂度从 (len1*len2) 到 (min(len1,len2)),即线性
  • 给定最大距离后,一旦已知无法实现最大距离限制,就会尽早返回

这里是代码(2014 年 1 月 20 日更新以加快速度):

-- =============================================
-- Computes and returns the Levenshtein edit distance between two strings, i.e. the
-- number of insertion, deletion, and sustitution edits required to transform one
-- string to the other, or NULL if @max is exceeded. Comparisons use the case-
-- sensitivity configured in SQL Server (case-insensitive by default).
-- 
-- Based on Sten Hjelmqvist's "Fast, memory efficient" algorithm, described
-- at http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm,
-- with some additional optimizations.
-- =============================================
CREATE FUNCTION [dbo].[Levenshtein](
    @s nvarchar(4000)
  , @t nvarchar(4000)
  , @max int
)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @distance int = 0 -- return variable
          , @v0 nvarchar(4000)-- running scratchpad for storing computed distances
          , @start int = 1      -- index (1 based) of first non-matching character between the two string
          , @i int, @j int      -- loop counters: i for s string and j for t string
          , @diag int          -- distance in cell diagonally above and left if we were using an m by n matrix
          , @left int          -- distance in cell to the left if we were using an m by n matrix
          , @sChar nchar      -- character at index i from s string
          , @thisJ int          -- temporary storage of @j to allow SELECT combining
          , @jOffset int      -- offset used to calculate starting value for j loop
          , @jEnd int          -- ending value for j loop (stopping point for processing a column)
          -- get input string lengths including any trailing spaces (which SQL Server would otherwise ignore)
          , @sLen int = datalength(@s) / datalength(left(left(@s, 1) + '.', 1))    -- length of smaller string
          , @tLen int = datalength(@t) / datalength(left(left(@t, 1) + '.', 1))    -- length of larger string
          , @lenDiff int      -- difference in length between the two strings
    -- if strings of different lengths, ensure shorter string is in s. This can result in a little
    -- faster speed by spending more time spinning just the inner loop during the main processing.
    IF (@sLen > @tLen) BEGIN
        SELECT @v0 = @s, @i = @sLen -- temporarily use v0 for swap
        SELECT @s = @t, @sLen = @tLen
        SELECT @t = @v0, @tLen = @i
    END
    SELECT @max = ISNULL(@max, @tLen)
         , @lenDiff = @tLen - @sLen
    IF @lenDiff > @max RETURN NULL

    -- suffix common to both strings can be ignored
    WHILE(@sLen > 0 AND SUBSTRING(@s, @sLen, 1) = SUBSTRING(@t, @tLen, 1))
        SELECT @sLen = @sLen - 1, @tLen = @tLen - 1

    IF (@sLen = 0) RETURN @tLen

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1)) 
        SELECT @start = @start + 1
    IF (@start > 1) BEGIN
        SELECT @sLen = @sLen - (@start - 1)
             , @tLen = @tLen - (@start - 1)

        -- if all of shorter string matches prefix and/or suffix of longer string, then
        -- edit distance is just the delete of additional characters present in longer string
        IF (@sLen <= 0) RETURN @tLen

        SELECT @s = SUBSTRING(@s, @start, @sLen)
             , @t = SUBSTRING(@t, @start, @tLen)
    END

    -- initialize v0 array of distances
    SELECT @v0 = '', @j = 1
    WHILE (@j <= @tLen) BEGIN
        SELECT @v0 = @v0 + NCHAR(CASE WHEN @j > @max THEN @max ELSE @j END)
        SELECT @j = @j + 1
    END

    SELECT @jOffset = @max - @lenDiff
         , @i = 1
    WHILE (@i <= @sLen) BEGIN
        SELECT @distance = @i
             , @diag = @i - 1
             , @sChar = SUBSTRING(@s, @i, 1)
             -- no need to look beyond window of upper left diagonal (@i) + @max cells
             -- and the lower right diagonal (@i - @lenDiff) - @max cells
             , @j = CASE WHEN @i <= @jOffset THEN 1 ELSE @i - @jOffset END
             , @jEnd = CASE WHEN @i + @max >= @tLen THEN @tLen ELSE @i + @max END
        WHILE (@j <= @jEnd) BEGIN
            -- at this point, @distance holds the previous value (the cell above if we were using an m by n matrix)
            SELECT @left = UNICODE(SUBSTRING(@v0, @j, 1))
                 , @thisJ = @j
            SELECT @distance = 
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1)) THEN @diag                    --match, no change
                     ELSE 1 + CASE WHEN @diag < @left AND @diag < @distance THEN @diag    --substitution
                                   WHEN @left < @distance THEN @left                    -- insertion
                                   ELSE @distance                                        -- deletion
                                END    END
            SELECT @v0 = STUFF(@v0, @thisJ, 1, NCHAR(@distance))
                 , @diag = @left
                 , @j = case when (@distance > @max) AND (@thisJ = @i + @lenDiff) then @jEnd + 2 else @thisJ + 1 end
        END
        SELECT @i = CASE WHEN @j > @jEnd + 1 THEN @sLen + 1 ELSE @i + 1 END
    END
    RETURN CASE WHEN @distance <= @max THEN @distance ELSE NULL END
END

正如该函数的注释中所提到的,字符比较的区分大小写将遵循有效的排序规则。默认情况下,SQL Server 的排序规则将导致不区分大小写的比较。 修改此函数以始终区分大小写的一种方法是向比较字符串的两个位置添加特定的排序规则。但是,我还没有对此进行彻底测试,特别是对于数据库使用非默认排序规则时的副作用。 以下是如何更改这两行以强制区分大小写的比较:

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS) 

            SELECT @distance = 
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS) THEN @diag                    --match, no change

关于tsql - T-SQL 中的编辑距离,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/560709/

相关文章:

tsql - 在转换为 int 的列上将 varchar 转换为 int 失败

sql - 根据成员(member)资格、成员(member)级别以及到期日期或成员(member)资格对用户进行重复数据删除

c# - Entity Framework Code First - 获取具有特定标签的博客文章

SQL Server 2008 : Why won't this round to two decimal places?

python - 如何将一列的不同行与 pandas 中的 Levenshtein 距离度量进行比较?

java - 与Levenshtein的快速比较

algorithm - 使用后缀树近似子字符串匹配

string - Levenshtein 编辑距离和不同的编辑集

elasticsearch - 从结果中排除相似的文件(重复项)

python - 编辑距离,例如 Levenshtein 考虑到键盘上的接近度