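I have a file whose lines consist of fields with these properties:
- fields are delimited by alphanumeric tags that begin with a special character ("%" in my example below)
- the tag text ends with a space
- the field content ends with ','
- the field content never contains % or ,
Example line: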
%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff
The set of tags matters for the search. This is the tag set for my example:
%t, %u, %v, %w, %x, %xx, %y, %z
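I want to find the content of every field whose tag is in the set and whose content is repeated in a later field that is also tagged from the set. This is my failed attempt: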
my $tagmrkr='%';
my $line='%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff';
my $searchtags = qr/t|u|v|w|x|xx|y|z/; # excludes q
print qq/The line:$line\n\n/;
for ($line =~ m/
        $tagmrkr$searchtags\ ([^\,]*,)
        .*?
        $tagmrkr$searchtags\ \1
    /gx) {
    print qq/First field contents:$1\n/;
    print qq/Entire match:$&\n/;
    print qq/\n/;
}
I expected:
The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff
First field contents:this,
Entire match:%t this,%u that,%v this,
First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,
I got:
The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff
First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,
First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,
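Question 1:
Why are the $1 and $& of the first match replaced by the values from the second match?
Question 2:
What should I change to get what I want (below) rather than what I expected?
I want to be able to rewind the match so that it also finds repeated fields when matches overlap, that is, when the first field of the second match occurs before the second field of the first match. For my immediate purposes, all I actually need is the repeated field content.
That is, I want 3 matches from the example: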
The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff
First field contents:this
Entire match:%t this,%u that,%v this,
First field contents:that
Entire match:%u that,%v this,%t that,
First field contents:the other
Entire match:%x the other,%xx only once,%q the other,%z the other,
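Best answer
One way to allow overlapping matches is to assert the presence of the rest of the phrase with a lookahead. That part is then not consumed, so the engine resumes the search from before it and can match it again.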
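use warnings;
use strict;
use feature 'say';

my $s = q(%a astuff,%b bstuff,%t this,%u that,%v this,%t that,)
      . q(%x the other,%xx only once,%q the other,%z the other,%c cstuff);

my $m = qr/%/;                       # tag marker
my $t = qr/(?:t|u|v|w|x|xx|y|z)/;    # tags to search for (excludes q)

# Consume only the first field. The lookahead captures the rest of the
# phrase into $2 without consuming it, so a later match may overlap it.
while ($s =~ / $m$t \s ([^,]+) , (?=(.*?$m$t\s\g{1},?)) /gx) {
    say "capture: $1";
    say " whole: $1,$2";
}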
For a more detailed explanation of how lookahead helps capture overlapping patterns, see this post.
This prints:
capture: this
 whole: this,%u that,%v this,
capture: that
 whole: that,%v this,%t that,
capture: the other
 whole: the other,%xx only once,%q the other,%z the other,
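On Question 1, the likely cause is context. In `for (LIST)` the match runs in list context, where `m//g` finds every match first and returns all the captures at once, so by the time the loop body runs `$1` and `$&` hold the values from the last successful match. A `while` loop tests the match in scalar context, one match per iteration. The sketch below contrasts the two and also captures the leading tag, so the whole match can be printed in the form the question asks for. The names `$mark`, `$tags`, `@captures` and `@found` are illustrative, and the `(?:,|\z)` ending is an assumed tightening (so that "this" does not match a longer field such as "this one"), not part of the original answer.

```perl
use strict;
use warnings;

my $line = '%a astuff,%b bstuff,%t this,%u that,%v this,%t that,'
         . '%x the other,%xx only once,%q the other,%z the other,%c cstuff';

my $mark = qr/%/;                         # tag marker
my $tags = qr/(?:t|u|v|w|x|xx|y|z)/;      # tags to search for (excludes q)

# List context: m//g runs to completion and returns every $1 at once.
# A for loop over this list would see $1 and $& from the final match only.
my @captures = $line =~ /$mark$tags ([^,]*,).*?$mark$tags \1/g;

# Scalar context: each test of the while condition finds one match, so the
# capture variables are current inside the body. $1 holds the leading tag,
# $2 the field content, and $3 the rest of the phrase, which sits in a
# lookahead and is therefore not consumed (later matches may overlap it).
my @found;
while ($line =~ / ( $mark $tags \s ) ( [^,]+ ) ,
                  (?= ( .*? $mark $tags \s \g{2} (?:,|\z) ) ) /gx) {
    push @found, { content => $2, whole => "$1$2,$3" };
}

print "list-context captures: @captures\n\n";
print "First field contents:$_->{content}\nEntire match:$_->{whole}\n\n"
    for @found;
```

Run against the sample line, the `while` loop should report three repeated contents (this, that, the other), while the list-context match yields only the two non-overlapping captures.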
Regarding "regex - finding substrings with repeated tags", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58069662/