linux - Remove the first capitalized word after a period

Tags: linux bash awk

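I want to drop the first capitalized word after a period. The goal is to drop the first capitalized word of every sentence, even when there are two sentences on the same line. As the example below shows, the first word of the line is already omitted, but the first word of the second sentence still appears in the output.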

For the first sentence on each line, I worked around the problem by starting the for loop at 2 instead of 1.

Here is the code:

BEGIN { FS="[^[:alpha:]']+"; OFS=" "} 
{
   parola=" "
   max_nr=0

   prec=""

   for (i=2; i<=NF; i++) {
        if ($i ~ /[[:punct:][:digit:]]+[:space:]*[A-Z][']{0,1}[A-Z]{0,1}[a-z]+/){
            continue
        }
        else{
            if ($i ~ /[A-Z][']{0,1}[A-Z]{0,1}[a-z]+/){

                if(!(prec=="")){

                    prec=prec" "$i
                }
                else{
                    prec=$i              
                }
            }     
            else {

                if(!(prec=="")){

                    words[prec]
                    prec=""    
                  }
            }

            if (i==NF) {
                max_nr=max_nr+1  
                for (word1 in words) {
                    for (word2 in words) {
                        if (word1 != word2) {
                            print parola"" word1","word2
                        }
                    }

                    delete words[word1]
                }                
            }
            }
}  
}   
END{
    print FILENAME" "FNR
    print i
    print max_nr
}

These are the contents of test.txt:

Today Jonathan played soccer with Martin. After the game, Martin and Jonathan were thirsty and then drank a fresh Lemon Soda. 
Paolo went to Lisbon with an Easyjet plane. During the trip he met two of his dear friends, Peter and John.

This is the command and its result:

awk -f script.awk test.txt > output.csv

Lisbon,During
Lisbon,John
Lisbon,Peter
Lisbon,Easyjet
During,John
During,Peter
During,Easyjet
John,Peter
John,Easyjet
Peter,Easyjet
Jonathan,Martin After
Jonathan,Lemon Soda
Jonathan,Martin
Martin After,Lemon Soda
Martin After,Martin
Lemon Soda,Martin

The expected output is:

Lisbon,John
Lisbon,Peter
Lisbon,Easyjet
John,Peter
John,Easyjet
Peter,Easyjet
Jonathan,Martin
Martin,Lemon Soda
Jonathan,Lemon Soda

Any suggestions?

Best answer

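Without trying to do the whole job for you (I provided a solution for that previously), here is an answer to just the specific question you asked:

You are using FS="[^[:alpha:]']+", so there is no way to tell whether the separator before any given field ("word") was a . or something else. Use FS='[.]' or similar as your starting point instead. Then you know that the separator before each field is either the start of the line or a ., and you can use split($i,f,/[^[:alpha:]']+/) to isolate each subfield ("word") within that field ("sentence"). For example: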

$ cat tst.awk
BEGIN { FS="[[:space:]]*[.][[:space:]]*" }
{
    for (sentenceNr=1; sentenceNr<=NF; sentenceNr++) {
        sentence = $sentenceNr
        numWords = split(sentence,words,/[^[:alpha:]\047]+/)
        for (wordNr=2; wordNr<=numWords; wordNr++) {
            word = words[wordNr]
            if ( word ~ /^[[:upper:]]/ ) {
                print NR, sentenceNr, wordNr, word
            }
        }
    }
}

$ awk -f tst.awk file
1 1 2 Jonathan
1 1 6 Martin
1 2 4 Martin
1 2 6 Jonathan
1 2 14 Lemon
1 2 15 Soda
2 1 4 Lisbon
2 1 7 EasyJet
2 2 11 Peter
2 2 13 John

Note that given this input:

$ cat file
Today Jonathan played soccer with Martin. After the game, Martin and Jonathan were thirsty and then drank a fresh Lemon Soda.
Paolo went to Lisbon with an EasyJet plane. During the trip he met two of his dear friends, Peter and John.
May lost her home. 10 Downing St is where the PM lives.

the script above will output:

$ awk -f tst.awk file
1 1 2 Jonathan
1 1 6 Martin
1 2 4 Martin
1 2 6 Jonathan
1 2 14 Lemon
1 2 15 Soda
2 1 4 Lisbon
2 1 7 EasyJet
2 2 11 Peter
2 2 13 John
3 2 2 Downing
3 2 3 St
3 2 7 PM
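
"Downing" is reported as word 2 rather than word 1 because split() yields an empty first element when the string begins with a separator, which here is "10 ". The sketch below shows that in isolation. The sample string comes from the input above; the function name show_split is hypothetical and used only for illustration:

```shell
# Hypothetical helper: print every element split() produces for one sentence.
show_split() {
    printf '%s\n' "$1" |
    awk '{
        # Same separator as tst.awk: any run of characters that are
        # neither letters nor an apostrophe (\047).
        n = split($0, words, /[^[:alpha:]\047]+/)
        for (i = 1; i <= n; i++) {
            printf "%d <%s>\n", i, words[i]
        }
    }'
}

show_split "10 Downing St is where the PM lives"
```

The first line printed is "1 <>", the empty element in front of the leading separator, followed by "2 <Downing>". That is why a loop starting at wordNr=2 does not skip "Downing".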

If "Downing" should not be there, change the code to:

$ cat tst.awk
BEGIN { FS="[[:space:]]*[.][[:space:]]*" }
{
    for (sentenceNr=1; sentenceNr<=NF; sentenceNr++) {
        numWords = split($sentenceNr,words,/[^[:alpha:]\047]+/)
        isSubsequent = 0
        for (wordNr=1; wordNr<=numWords; wordNr++) {
            word = words[wordNr]
            if ( word ~ /^[[:upper:]]/ ) {
                if ( isSubsequent++ ) {
                    print NR, sentenceNr, wordNr, word
                }
            }
        }
    }
}

$ awk -f tst.awk file
1 1 2 Jonathan
1 1 6 Martin
1 2 4 Martin
1 2 6 Jonathan
1 2 14 Lemon
1 2 15 Soda
2 1 4 Lisbon
2 1 7 EasyJet
2 2 11 Peter
2 2 13 John
3 2 3 St
3 2 7 PM
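
The question ultimately wants pairs of names per line, so here is one possible way to build on the script above. This is a rough sketch and not part of the original answer. It assumes that adjacent capitalized words form a single name ("Lemon Soda") and that each name should count once per line. The function name pairs and the variables names, seen and numNames are made up for this example. Adjacency is judged by word index only, so two capitalized words separated by just a comma would also be joined.

```shell
# Hypothetical helper: print each pair of distinct names found on a line.
pairs() {
    awk '
    BEGIN { FS = "[[:space:]]*[.][[:space:]]*" }
    {
        numNames = 0
        split("", seen)    # portable way to empty an array
        for (sentenceNr = 1; sentenceNr <= NF; sentenceNr++) {
            numWords = split($sentenceNr, words, /[^[:alpha:]\047]+/)
            isSubsequent = 0
            name = ""
            # One extra iteration with an empty word flushes a name
            # that ends the sentence.
            for (wordNr = 1; wordNr <= numWords + 1; wordNr++) {
                word = (wordNr <= numWords ? words[wordNr] : "")
                if (word ~ /^[[:upper:]]/ && isSubsequent++) {
                    # Capitalized and not the first such word: extend the name.
                    name = (name == "" ? word : name " " word)
                } else if (name != "") {
                    # The name just ended: record it once per line.
                    if (!(name in seen)) {
                        seen[name]
                        names[++numNames] = name
                    }
                    name = ""
                }
            }
        }
        # Print every unordered pair, in order of first appearance.
        for (i = 1; i <= numNames; i++)
            for (j = i + 1; j <= numNames; j++)
                print names[i] "," names[j]
    }' "$@"
}
```

Called as pairs test.txt, this should print the same set of pairs as the expected output in the question, although the line order and the order within each pair may differ.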

This Q&A about "linux - Remove the first capitalized word after a period" is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/56583235/
