Perl: How to count occurrences of a 3-word phrase (with gaps) within an N-word window

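I am trying to count how many times a 3-word phrase occurs within a 12-word window in a document. The difficulty is that the keywords I am searching for can be spread out anywhere within the window.

For example:

I want to find the phrase "expect bad weather" within a 12-word window, where other words may be inserted between the 3 required words, as long as the whole phrase containing the 3 words is no longer than 12 words.

Phrases that would work:

I expect there will be bad weather.
They expect bad and windy weather.
I expect, although no one has confirmed this, that bad weather is on the way.

I have been struggling to figure out how to do this. I know how to count occurrences of a two-word phrase when there may be a gap between the words. For example, to count how often "expect" and "weather" occur within a 12-word phrase, I can do: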

$mycount =()= $text =~ /\b(?:expect\W+(?:\w+\W+){0,10}?weather)\b/gi;
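For reference, the one-liner above can be tried as a small self-contained script. This is only a sketch: the sample text is made up for illustration and is not from the original document.

```perl
use strict;
use warnings;

# Illustrative sample text (an assumption, not the asker's data)
my $text = 'We expect bad weather. Nobody should expect that the weather will improve.';

# The empty list () puts the match in list context, so /g returns all matches;
# assigning that list to a scalar then yields the number of matches.
my $count = () = $text =~ /\b(?:expect\W+(?:\w+\W+){0,10}?weather)\b/gi;

print "$count\n";   # prints 2
```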

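However, it is not as simple when I want to do this with 3 words, because I end up with 2 gaps that must add up so that my window does not exceed 12 words. Ideally, I could do something like: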

$mycount =()= $text =~ /\b(?:expect\W+(?:\w+\W+){0,$Gap1}?bad\W+(?:\w+\W+){0,$Gap2}?weather)\b/gi;

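where $Gap2 = 9 - $Gap1, but I don't think there is a way to do that.

I also thought of writing a loop so that in one iteration $Gap1=0 and $Gap2=9, in the second iteration $Gap1=1 and $Gap2=8, and so on, and then adding up the counts from all iterations. However, doing so would count some instances of the phrase more than once.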
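The double counting can be seen in a short sketch of that loop. The sample sentence is an assumption chosen for illustration. Because each quantifier allows anything from 0 up to its limit, a phrase with no gaps at all matches in every iteration:

```perl
use strict;
use warnings;

# Illustrative text containing the phrase exactly once, with no gaps
my $text = 'We expect bad weather today.';

my $total = 0;
for my $gap1 (0 .. 9) {
    my $gap2 = 9 - $gap1;
    # {0,$gap1} and {0,$gap2} also accept shorter gaps, so the same
    # occurrence is matched again in each iteration of the loop
    my $count = () = $text
        =~ /\b(?:expect\W+(?:\w+\W+){0,$gap1}?bad\W+(?:\w+\W+){0,$gap2}?weather)\b/gi;
    $total += $count;
}

print "$total\n";   # prints 10, although the phrase occurs only once
```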

I am at a loss. Does anyone have any ideas? I could not find any relevant examples anywhere.

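Best answer

Note: this post addresses the problem of finding words spread out within a window, as asked. It does not consider the more complex issues of general text parsing or language analysis.

The code below searches for the first word and then continues with another regex to search for the other two. There it scans the text word by word and keeps a counter, so that it can stop at 12 words. It uses pos to control where the search should continue after a window has been checked.

The 12-word window starts with the word expect, as clarified in the comments. Once a phrase is found, the search for the next one continues after the completed phrase.

If the phrase is not found within the following 11 words, the engine is returned to the position right after expect to continue the search (since there may be another expect within the 11 words just checked).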

use warnings;
use strict;
use feature 'say';

my $s = q(I expect, although no one confirmed, that bad weather is on the way.)
      . q( Expect that we cannot expect to escape the bad, bad weather.);

my $word_range = 12;
my ($w1, $w2, $w3) = qw(expect bad weather);

FIRST_WORD: while ($s =~ /\b($w1)\b/gi) {
    #say "SEARCH, at ", pos $s;
    my ($one, $pos_one) = ($1, pos $s);

    my ($two, $three, $cnt);

    while ($s =~ /(\w+)/g) {
        my $word = $1; 
        #say "\t$word  ... (at ", pos $s, ")";

        $two = $1  if $word =~ /\b($w2)\b/i; 
        
        if ( $two and (($three) = $word =~ /\b($w3)\b/i) ) { 
            say "$one + $two + $three  (pos ", pos $s, ')';
            next FIRST_WORD;
        }
        last if ++$cnt == $word_range-1;  # failed (these 11 + 'expect') 
    }
    pos $s = $pos_one;         # return to position in string after 'expect'
}

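Note that the match (for $one) cannot be assigned inside the loop condition, since that would put the match in list context, which disrupts the needed behavior of /g and pos.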
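The difference between the two contexts can be checked with a small sketch; the string and variable names here are made up for illustration. In scalar context a /g match advances pos one match at a time, while in list context it returns all matches at once and leaves pos reset:

```perl
use strict;
use warnings;

my $s = 'expect bad weather';

# Scalar context: /g finds one match and records where it ended
my $found      = $s =~ /(\w+)/g;
my $pos_scalar = pos $s;              # 6, right after 'expect'

pos($s) = undef;                      # start over from the beginning

# List context: /g returns every match at once, then pos is reset
my ($first)  = $s =~ /(\w+)/g;
my $pos_list = pos $s;                # undef

print "scalar context: pos = $pos_scalar\n";
print 'list context:   pos is ', (defined $pos_list ? $pos_list : 'undef'), "\n";
```

A loop condition written as a list assignment would therefore restart the search from the beginning of the string on every iteration instead of stepping through it.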

The commented-out prints in the program can be used to trace its operation. As it stands, the program prints

expect + bad + weather  (pos 53)
Expect + bad + weather  (pos 128)

I extend the string to test multiple occurrences of the phrase. The operation with failed matches can be tested by crippling keywords and tracking the position in the search.

A possible extra keyword inside the phrase, as in the second sentence, is ignored and the phrase is still accepted. The question does not specify this, but it seems implied. This is easily changed.

If there were more words in the phrase they would all be sought in the inner while loop, in the same way as the last two are now, by matching them sequentially (requiring for each word that all preceding words had been found). The outer while loop is needed only to start the window.
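A minimal sketch of that generalization follows, assuming the phrase is given as a list of at least two plain words. The name count_phrase and its argument order are hypothetical, and it returns a count instead of printing each hit. It keeps the same outer and inner loops and the same use of pos, but tracks the next word to find with an index:

```perl
use strict;
use warnings;

# Hypothetical helper: count windows of at most $range words that start with
# the first phrase word and contain the remaining phrase words in order.
sub count_phrase {
    my ($text, $range, @words) = @_;
    die "need at least two words\n" if @words < 2;
    my ($first, @rest) = @words;
    my $count = 0;

    FIRST_WORD: while ($text =~ /\b\Q$first\E\b/gi) {
        my $pos_first = pos $text;    # position right after the first word
        my $next = 0;                 # index in @rest of the word sought next
        my $seen = 0;                 # words scanned in this window so far

        while ($text =~ /(\w+)/g) {
            my $word = $1;
            $next++ if lc $word eq lc $rest[$next];
            if ($next == @rest) {     # all remaining words found, in order
                $count++;
                next FIRST_WORD;      # continue after the completed phrase
            }
            last if ++$seen == $range - 1;   # window exhausted
        }
        pos($text) = $pos_first;      # failed: resume right after the first word
    }
    return $count;
}

my $s = q(I expect, although no one confirmed, that bad weather is on the way.)
      . q( Expect that we cannot expect to escape the bad, bad weather.);

print count_phrase($s, 12, qw(expect bad weather)), "\n";   # prints 2
```

Comparing lowercased whole words instead of running a regex on each word is a simplification that works only because the inner loop already extracts \w+ tokens.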

After a failed window-scan the outer while continues its search for expect from the position of the window beginning, thus scanning the same 11 words again.

This repeated search through the text can be reduced by also checking for expect during the window scan, and then scanning afresh from that position with the inner while:

# First sentence shortened and now does not contain the phrase
my $s = q(I expect, although no one confirmed, that bad expect.)
      . q( Expect that we cannot expect to escape the bad, bad weather.);    
...
FIRST_WORD: while ($s =~ /\b($w1)\b/gi) {
    my ($one, $pos_one) = ($1, pos $s);

    my ($two, $three, $cnt, $pos_one_new);

    while ($s =~ /(\w+)/g) {
        my $word = $1;
        #say "\t$word  ... (at ", pos $s, ")";

        $pos_one_new = pos $s
            if not $pos_one_new and $word =~ /\b$w1\b/i;

        $two = $1  if $word =~ /\b($w2)\b/i;

        if ( $two and (($three) = $word =~ /\b($w3)\b/i) ) {
            say "$one + $two + $three  (pos ", pos $s, ')';
            next FIRST_WORD;
        } 

        if (++$cnt == $word_range-1) {
            last  if not $pos_one_new;
     
            #say "Scan window anew from $pos_one_new";
            pos $s   = $pos_one_new;
            $pos_one = $pos_one_new;
            $pos_one_new = 0;
            $two = $three = '';
            $cnt = 0;
        }
    }
    pos $s = $pos_one;
}

This prints

expect + bad + weather  (pos 113)

Note that the first occurrence of expect in the window is used.

Regarding "Perl: How to count occurrences of a 3-word phrase (with gaps) within an N-word window", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48454477/
