regex - How to skip quoted text in a regular expression (or how to use HyperStr ParseWord with Unicode text?)

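I need regex help creating a Delphi function to replace the HyperString ParseWord function in RAD Studio XE2. HyperString is a very useful string library, but it never made the jump to Unicode. I have it mostly working, but it does not respect quote delimiters at all. I need it to exactly match the function described below: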

function ParseWord(const Source,Table:String;var Index:Integer):String;

Sequential, left to right token parsing using a table of single character delimiters. Delimiters within quoted strings are ignored. Quote delimiters are not allowed in Table.

Index is a pointer (initialize to '1' for first word) updated by the function to point to next word. To retrieve the next word, simply call the function again using the prior returned Index value.

Note: If Length(Resultant) = 0, no additional words are available. Delimiters within quoted strings are ignored. (my emphasis)

Here is what I have so far:

function ParseWord( const Source, Table: String; var Index: Integer):string;
var
  RE : TRegEx;
  match : TMatch;
  Table2,
  chars : string;
begin
  if index = length(Source) then
  begin
    result:= '';
    exit;
  end;

  // escape the special characters and wrap them in a character class
  Table2 :='['+TRegEx.Escape(Table, false)+']';
  RE := TRegEx.create(Table2);
  match := RE.Match(Source,Index);
  if match.success then
  begin
    result := copy( Source, Index, match.Index - Index);
    Index := match.Index+match.Length;
  end
  else
  begin
    result := copy(Source, Index, length(Source)-Index+1);
    Index := length(Source);
  end;
end;

  while ( Length(result)= 0) and (Index<length(Source)) do
  begin
    Inc(Index);
    result := ParseWord(Source,Table, Index);
  end;

Cheers, and thanks.

Best answer

I would try this regular expression for Table2:

Table2 := '''[^'']+''|"[^"]+"|[^' + TRegEx.Escape(Table, false) + ']+';
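A minimal sketch of how this pattern might be dropped into the function from the question. It assumes TRegEx.Match(Input, StartPos) reports 1-based positions relative to the whole string, which the question's code already relies on. Because the pattern matches tokens rather than delimiters, the result is the match itself and Index moves to the character just past it. The quotes are kept in the returned token; whether HyperString strips them is not stated in the quoted documentation, so that is an open choice. The program name, sample string and variable names are made up for illustration.

```delphi
program ParseWordSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Sketch only: returns the next token at or after Index, or '' when none is left.
function ParseWord(const Source, Table: string; var Index: Integer): string;
var
  RE: TRegEx;
  M: TMatch;
begin
  Result := '';
  if (Index < 1) or (Index > Length(Source)) then
    Exit;

  // single-quoted string | double-quoted string | run of non-delimiters
  RE := TRegEx.Create('''[^'']+''|"[^"]+"|[^' + TRegEx.Escape(Table, False) + ']+');
  M := RE.Match(Source, Index);
  if M.Success then
  begin
    Result := M.Value;
    Index := M.Index + M.Length;   // first character after the token
  end
  else
    Index := Length(Source) + 1;   // only delimiters were left
end;

var
  Src, Token: string;
  Idx: Integer;
begin
  Src := 'one "two three"|four';   // hypothetical sample input
  Idx := 1;
  Token := ParseWord(Src, ' |', Idx);
  while Token <> '' do
  begin
    Writeln(Token);
    Token := ParseWord(Src, ' |', Idx);
  end;
end.
```

Leading delimiters are skipped implicitly, since the search simply moves forward until one of the three alternatives matches, so the caller does not need a separate loop to step over empty words.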

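Demo:
This demo is more of a proof of concept, since I could not find an online Delphi regex tester.

The delimiters are the space (ASCII code 32) and pipe (ASCII code 124) characters.
The test sentence is: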

toto titi "alloa toutou" 'dfg erre' 1245|coucou "nestor|delphi" "" ''

http://regexr.com?32i81

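Discussion:
I am assuming that a quoted string is a string enclosed in either two single quotes (') or two double quotes ("). Correct me if I am wrong.

The regular expression will match either:

a single-quoted string
a double-quoted string
a string made up of characters that are not any of the passed delimiters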
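To see the three alternatives at work without an online tester, the test sentence can be run through TRegEx directly. The following is a small sketch using the space and pipe delimiters from the demo; the program and constant names are illustrative only.

```delphi
program TokenDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

const
  Delims = ' |';   // space and pipe, as in the demo
  Sentence = 'toto titi "alloa toutou" ''dfg erre'' 1245|coucou "nestor|delphi" "" ''''';
var
  Pattern: string;
  M: TMatch;
begin
  Pattern := '''[^'']+''|"[^"]+"|[^' + TRegEx.Escape(Delims, False) + ']+';
  M := TRegEx.Match(Sentence, Pattern);
  while M.Success do
  begin
    Writeln('[', M.Value, ']');   // brackets make token boundaries visible
    M := M.NextMatch;
  end;
end.
```

One detail worth noting when tracing the pattern by hand: both quoted alternatives require at least one character between the quotes, so the empty pairs "" and '' at the end of the sentence are not matched as quoted strings. They fall through to the third alternative and are picked up as ordinary runs of non-delimiter characters.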

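Known bug:
Since I do not know how ParseWord handles escaped quotes inside a string, the regular expression does not support them.

For example:

How should 'foo''bar' be interpreted? => As two tokens, 'foo' and 'bar', or as one token, 'foo''bar'?

And what about this example: "foo""bar"? => Two tokens, "foo" and "bar", or one token, "foo""bar"?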
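If it turns out that ParseWord should treat a doubled quote as an escaped quote, as Delphi string literals do, the quoted alternatives could be widened as in the sketch below. This is purely an assumption about the desired behavior, not something the HyperString documentation confirms. BuildTokenPattern is a hypothetical helper name.

```delphi
program DoubledQuoteSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Hypothetical helper: builds a token pattern in which a doubled quote
// inside a quoted string counts as an escaped quote (an assumption).
function BuildTokenPattern(const Table: string): string;
begin
  Result := '''(?:[^'']|'''')*''' +                      // '...' with '' allowed inside
            '|"(?:[^"]|"")*"' +                          // "..." with "" allowed inside
            '|[^' + TRegEx.Escape(Table, False) + ']+';  // unquoted run
end;

var
  M: TMatch;
begin
  // Sample input: 'foo''bar' "a""b" x|y
  M := TRegEx.Match('''foo''''bar'' "a""b" x|y', BuildTokenPattern(' |'));
  while M.Success do
  begin
    Writeln('[', M.Value, ']');
    M := M.NextMatch;
  end;
end.
```

Under this reading 'foo''bar' and "a""b" each come back as a single token. Because the quoted alternatives now use * rather than +, empty quoted strings are also matched as quoted strings. If the two-token interpretation is the right one, the original pattern already behaves that way.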

For regex - How to skip quoted text in a regular expression (or how to use HyperStr ParseWord with Unicode text?), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/12966334/
