regex - Finding repeated tagged substrings

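I have a file whose lines consist of fields that are:

separated by alphanumeric tags that begin with a special character ("%" in my example below)
The tag text is terminated by a space
The field content is terminated by ','
The field content never contains % or ,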

Example line:

%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

The set of tags matters for the search -- here is the tag set for my example:

%t, %u, %v, %w, %x, %xx, %y, %z

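I want to find the content of a field whose tag is in the set, where the same content is repeated in a later field whose tag is also in the set. Here is my failed attempt: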

my $tagmrkr='%';
my $line='%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff';

my $searchtags = qr/t|u|v|w|x|xx|y|z/; # excludes q

print qq/The line:$line\n\n/;
for ($line =~ m/
    $tagmrkr$searchtags\ ([^\,]*,)
    .*?
    $tagmrkr$searchtags\ \1
    /gx) {
        print qq/First field contents:$1\n/;
        print qq/Entire match:$&\n/;
        print qq/\n/;
        }

I expected:

The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

First field contents:this,
Entire match:%t this,%u that,%v this,

First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,

I got:

The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,

First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,

Question 1:
Why are $1 and $& for the first match replaced by the values from the second match?

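Question 2: What should I change to get what I want (below) rather than what I expected?

What I want is to be able to rewind the matching so that it also finds repeated fields when the matches overlap, that is, when the first field of the second match appears before the second field of the first match. Actually, for my immediate purposes, all I need is the repeated field content.

That is, I want 3 matches from the example: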

The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

First field contents:this
Entire match:%t this,%u that,%v this,

First field contents:that
Entire match:%u that,%v this,%t that,

First field contents:the other
Entire match:%x the other,%xx only once,%q the other,%z the other,

Best answer

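One way to allow overlapping matches is to assert the presence of the rest of the phrase with a lookahead. That part is then not consumed, and the engine continues from before it, so it can match it again.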

use warnings;
use strict;
use feature 'say';

my $s = q(%a astuff,%b bstuff,%t this,%u that,%v this,%t that,)
      . q(%x the other,%xx only once,%q the other,%z the other,%c cstuff); 

my $m = qr/%/;
my $t = qr/(?:t|u|v|w|x|xx|y|z)/; 

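# The lookahead (?=...) checks for a later searched tag with the same content
# without consuming it, so the next match can start inside that stretch.
# (?:,|\z) ends the repeated field, so "this" cannot match a longer "thisx".
while ($s =~ / $m$t \s ([^,]+) , (?=(.*?$m$t\s\g{1}(?:,|\z))) /gx) {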
    say "capture: $1";
    say "  whole: $1,$2";
}

For a more detailed explanation of how a lookahead helps capture overlapping patterns, see this post

This prints

capture: this
  whole: this,%u that,%v this,
capture: that
  whole: that,%v this,%t that,
capture: the other
  whole: the other,%xx only once,%q the other,%z the other,
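
As for Question 1, the likely cause is the loop construct and not the pattern. A `for` loop evaluates `m//g` in list context, so every match is found before the loop body runs for the first time, and `$1` and `$&` then hold the values of the last successful match on every iteration. A `while` loop evaluates the match in scalar context, one match per iteration. The following is a minimal sketch of that difference. The shortened line and the simplified `%\w+` tag pattern are assumptions made for illustration and are not the asker's tag set.

```perl
use strict;
use warnings;

# Shortened sample line (illustrative, not the full line from the question).
my $line = '%t this,%u that,%v this,%t that,';

# Hypothetical simplified pattern: a tagged field whose content shows up
# again in a later tagged field. The lookahead does not consume the repeat.
my $re = qr/%\w+\s([^,]+),(?=.*?%\w+\s\1,)/;

# List context: all matches are collected first, then the body runs.
# Inside the body, $1 is whatever the last successful match captured.
my @seen_in_for;
for my $capture ($line =~ /$re/g) {
    push @seen_in_for, "$capture/$1";
}

# Scalar context: one match per iteration, so $1 belongs to that match.
my @seen_in_while;
while ($line =~ /$re/g) {
    push @seen_in_while, $1;
}

print "for:   @seen_in_for\n";    # for:   this/that that/that
print "while: @seen_in_while\n";  # while: this that
```

In the `for` version the loop variable still carries each individual capture, so if only the repeated field content is needed, iterating over the returned list and ignoring `$1` and `$&` would also work.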

Regarding regex - Finding repeated tagged substrings, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58069662/
