algorithm - 那里有用于 Delphi 2010 String (UnicodeString) 的 Boyer-Moore 字符串搜索和快速搜索和替换功能以及快速字符串计数吗?

标签 algorithm delphi delphi-2010 replace boyer-moore

我需要三个快速处理大字符串的函数:快速搜索、快速搜索和替换以及快速计算字符串中的子字符串。

我在 C++ 和 Python 中遇到过 Boyer-Moore 字符串搜索,但我发现的唯一用于实现快速搜索和替换的 Delphi Boyer-Moore 算法是 Peter Morris 的 FastStrings 的一部分,他以前是 DroopyEyes 软件,他的网站和电子邮件不再有效。

我已经移植了FastStrings继续在 Delphi 2009/2010 中为 AnsiStrings 工作,其中一个字节等于一个 AnsiChar,但让它们也与 Delphi 2010 中的字符串 (UnicodeString) 一起工作似乎很重要。

使用这个 Boyer-Moore 算法,应该可以轻松地进行不区分大小写的搜索,以及不区分大小写的搜索和替换,而无需任何临时字符串(使用 StrUpper 等),并且无需调用速度较慢的 Pos()当需要对同一文本进行重复搜索时,比 Boyer-Moore 搜索要好。

(编辑:我有一个部分解决方案,写成这个问题的答案,它几乎 100% 完成,它甚至有一个快速的字符串替换功能。我相信它一定有错误,特别是因为它假装要具备 Unicode 能力,就必须存在由于未实现的 Unicode promise 而导致的故障。)

(编辑 2:有趣且意外的结果;堆栈上 unicode 代码点表的大堆栈大小 - 下面代码中的 SkipTable 严重阻碍了您可以在这里进行的双赢优化的数量unicode 字符串 boyer-moore 字符串搜索。感谢 Florent Ouchet 指出我应该立即注意到的内容。)

最佳答案

这个答案现在已经完成并且适用于区分大小写的模式,但不适用于不区分大小写的模式,并且可能还有其他错误,因为它没有经过很好的单元测试,并且可能会进一步优化,例如我重复了本地函数 __SameChar 而不是使用会更快的比较函数回调,实际上,允许用户为所有这些传递比较函数对于想要提供一些额外逻辑(等效的 Unicode 字形集)的 Unicode 用户来说非常好对于某些语言)。

基于 Dorin Dominica 的代码,我构建了以下内容。

{ _FindStringBoyer:
  Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM.
  Credited to Dorin Duminica.
}
function _FindStringBoyer(const sString, sPattern: string;
  const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;

    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if bCaseSensitive then
        Result := (sString[StringIndex] = sPattern[PatternIndex])
      else
        Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

var
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
begin
  if fromPos < 1 then
    raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
  LengthPattern := Length(sPattern);
  LengthString := Length(sString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[sPattern[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[sPattern[LengthPattern]];
  SkipTable[sPattern[LengthPattern]] := Large;
  Index := fromPos + LengthPattern -1;
  Result := 0;
  while Index <= LengthString do begin
    repeat
      Index := Index + SkipTable[sString[Index]];
    until Index > LengthString;
    if Index <= Large then
      Break
    else
      Index := Index - Large;
    kIndex := 1;
    while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
      Inc(kIndex);
    if kIndex = LengthPattern then begin
      // Found, return.
      Result := Index - kIndex + 1;
      Index := Index + LengthPattern;
      exit;
    end else begin
      if __SameChar(Index, LengthPattern) then
        Index := Index + LastMarker
      else
        Index := Index + SkipTable[sString[Index]];
    end; // if kIndex = LengthPattern then begin
  end; // while Index <= LengthString do begin
end;

{ Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
  a substring inside the main string, at a much faster rate than we
  could have done otherwise.  Another thing that would be great is
  to have a function that returns an array of find-locations,
  which would be way faster to do than repeatedly calling Pos.
}
function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
var
  foundPos:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := 0;
  foundPos := 1;
  fromPos := 1;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin
        // Found, return.
        //Result := Index - kIndex + 1;
        Index := Index + LengthPattern;
        fromPos := Index;
        Inc(Result);
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin

  end;
end; 

这真的是一个很好的算法,因为:

  • 用这种方法计算字符串 Y 中的子字符串 X 的实例要快得多,真是太棒了。
  • 仅仅替换 Pos() _FindStringBoyer() 比 FastCode 项目人员贡献给 Delphi 的 Pos() 的纯 asm 版本更快,它目前用于 Pos,如果你需要不区分大小写,你可以想象一下当我们不必在 100 兆字节的字符串上调用 UpperCase 时的性能提升。 (好吧,你的字符串不会那么大。但是,高效的算法仍然是一件美事。)

好吧,我写了一个 Boyer-Moore 风格的字符串替换:

function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
var
  errors:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
  CaseSensitive:Boolean;
  foundAt:Integer;
  lastFoundAt:Integer;
  copyStartsAt:Integer;
  copyLen:Integer;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := '';
  lastFoundAt := 0;
  fromPos := 1;
  errors := 0;
  CaseSensitive := rfIgnoreCase in Flags;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    foundAt := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin


        foundAt := Index - kIndex + 1;
        Index := Index + LengthPattern;
        //fromPos := Index;
        fromPos := (foundAt+LengthPattern);
        if lastFoundAt=0 then begin
                copyStartsAt := 1;
                copyLen := foundAt-copyStartsAt;
        end else begin
                copyStartsAt := lastFoundAt+LengthPattern;
                copyLen := foundAt-copyStartsAt;
        end;

        if (copyLen<=0)or(copyStartsAt<=0) then begin
                Inc(errors);
        end;

        Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString;
        lastFoundAt := foundAt;
        if not (rfReplaceAll in Flags) then
                 fromPos := 0; // break out of outer while loop too!
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin
  end;
  if (lastFoundAt=0) then
  begin
     // nothing was found, just return whole original string
      Result := aSourceString;
  end
  else
  if (lastFoundAt+LengthPattern < Limit) then begin
     // the part that didn't require any replacing, because nothing more was found,
     // or rfReplaceAll flag was not specified, is copied at the
     // end as the final step.
    copyStartsAt := lastFoundAt+LengthPattern;
    copyLen := Limit; { this number can be larger than needed to be, and it is harmless }
    Result := Result + Copy(aSourceString, copyStartsAt, copyLen );
  end;

end;

好的,问题:堆栈占用空间:

var
  skiptable : array [Char] of Integer;  // 65536*4 bytes stack usage on Unicode delphi

再见 CPU hell ,你好堆栈 hell 。如果我选择动态数组,则必须在运行时调整它的大小。所以这个东西基本上很快,因为你计算机上的虚拟内存系统不会在 256K 进入堆栈时闪烁,但这并不总是最佳代码段。尽管如此,我的电脑不会对这样的大堆东西眨眼。它不会成为 Delphi 标准库的默认设置,也不会在未来赢得任何 fastcode 挑战,因为它有那么大的足迹。我认为重复搜索是一种情况,上面的代码应该写成一个类,而skiptable应该是那个类中的一个数据字段。然后,您可以一次构建 boyer-moore 表,随着时间的推移,如果字符串不变,则重复使用该对象进行快速查找。

关于algorithm - 那里有用于 Delphi 2010 String (UnicodeString) 的 Boyer-Moore 字符串搜索和快速搜索和替换功能以及快速字符串计数吗?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/3310865/

相关文章:

java - 除大整数以获得精确值

delphi - 有没有一种简单的方法可以解决 Delphi utf8 文件缺陷?

c++ - 如何使用 WM_* 消息调整窗口大小

Delphi正确使用TMultiView

德尔福2010 : Where is DockForm. 过吗?

javascript - 如何为树中每个深度嵌套的节点管理状态 'collapsed'

algorithm - 通过使用 Select 算法中的枢轴重复出现

delphi - 需要创建一个已编译的delphi应用程序,可以制作单独的编译应用程序

performance - 面试解决方案 - 修剪间隙缓冲区

delphi - Delphi 2010 RTTI:如何集成/探索包含自定义属性的枚举