bash - How can I quickly delete lines in a file that contain items from a list in another file?

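I have a file named words.txt that contains a list of words. I also have a file named file.txt with one sentence per line. I need to quickly delete every line in file.txt that contains one of the lines from words.txt, but only if the match occurs between { and }.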

Example file.txt:

Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.

Example words.txt:

cat
mice

Example output:

Once upon a time there was a cat.

The other two lines are removed because "cat" is found in both of them and the word is also between { and }.

The following script does the job:

while read -r line
do
    sed -i "/{.*$line.*}/d" file.txt
done < words.txt

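This script is slow. words.txt sometimes contains thousands of entries, so the while loop takes several minutes. I tried the sed -f option, which seems to allow reading from a file, but I couldn't find any manual that explains how to use it.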

How can I make the script faster?
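
On the sed -f option mentioned above: it makes sed read its commands from a script file, one command per line, so all deletions can happen in a single pass. The following is only a minimal sketch under some assumptions: the words contain no regex metacharacters or slashes, words.txt has no empty lines, and the script file name delete.sed and the scratch directory are made up for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so no real file is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the sample files from the question.
printf '%s\n' 'Once upon a time there was a cat.' \
              '{The cat} lived in the forest.' \
              'The {cat really liked to} eat mice.' > file.txt
printf '%s\n' 'cat' 'mice' > words.txt

# Turn every word into one sed delete command, e.g. /{.*cat.*}/d
# (delete.sed is a hypothetical file name).
sed 's|.*|/{.*&.*}/d|' words.txt > delete.sed

# Apply all delete commands in one pass over file.txt.
sed -f delete.sed file.txt > result.txt
cat result.txt
```

With the sample data this prints only the first line. The output goes to result.txt here rather than back into file.txt, so the original stays untouched while experimenting.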

Best answer

An awk solution:

awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt

It rewrites file.txt in place, leaving the expected output:

Once upon a time there was a cat.

Expanded version:

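awk '
    # First file (words.txt): store one regex per word as an array key.
    NR == FNR {
        a["{[^{}]*" $0 "[^{}]*}"]++
        next
    }
    # Second file (file.txt): skip the line if any regex matches, otherwise buffer it.
    {
        for (i in a)
            if ($0 ~ i)
                next
        b[j++] = $0
    }
    # Truncate file.txt, then write the buffered lines back in order.
    END {
        printf "" > FILENAME
        for (i = 0; i in b; ++i)
            print b[i] > FILENAME
    }
' words.txt file.txt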
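
The regex built for each word uses [^{}]* where the question's sed pattern uses .*, which means the word has to sit inside a single pair of braces. A small sketch of the difference, using a made-up input line and writing the braces as [{] and [}] (an assumption made here for portability across awk implementations, not something the answer does):

```bash
#!/usr/bin/env bash
# Hypothetical input: "cat" appears between two brace groups, not inside one.
line='{The dog} chased a cat {up a tree}.'

# The question's style of pattern: .* may run across the closing and opening braces.
loose=$(printf '%s\n' "$line" | awk '/[{].*cat.*[}]/ { print "match" }')

# The answer's style of pattern: [^{}]* cannot cross a brace,
# so cat must lie inside one pair of braces.
strict=$(printf '%s\n' "$line" | awk '/[{][^{}]*cat[^{}]*[}]/ { print "match" }')

echo "loose:  ${loose:-no match}"
echo "strict: ${strict:-no match}"
```

The loose pattern reports a match for this line and the strict one does not, so the awk version keeps lines where the word only appears outside the braces.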

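If file.txt is expected to grow too large for awk to hold in memory (the in-place version buffers every kept line in the array b), the only option is to print the result to stdout. In that case the file cannot be modified directly: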

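awk '
    # First file (words.txt): store one regex per word as an array key.
    NR == FNR {
        a["{[^{}]*" $0 "[^{}]*}"]++
        next
    }
    # Second file (file.txt): skip the line if any regex matches.
    {
        for (i in a)
            if ($0 ~ i)
                next
    }
    # Print every line that was not skipped.
    1
' words.txt file.txt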
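
If the file still has to end up modified, a common pattern is to send the streamed output to a temporary file and then move it over the original. This is a minimal sketch of that idea, not part of the answer itself: the temporary name file.txt.tmp and the scratch directory are assumptions, and the braces are written as [{] and [}] for portability across awk implementations.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so no real file is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the sample files from the question.
printf '%s\n' 'Once upon a time there was a cat.' \
              '{The cat} lived in the forest.' \
              'The {cat really liked to} eat mice.' > file.txt
printf '%s\n' 'cat' 'mice' > words.txt

# Stream the kept lines into a temporary file (hypothetical name),
# then replace the original only if awk succeeded.
awk '
    NR == FNR { a["[{][^{}]*" $0 "[^{}]*[}]"]++; next }
    { for (i in a) if ($0 ~ i) next }
    1
' words.txt file.txt > file.txt.tmp && mv file.txt.tmp file.txt

cat file.txt
```

Only one line of file.txt is in memory at a time, at the cost of needing enough disk space for a second copy of the file.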

A similar question, "bash - How can I quickly delete lines in a file that contain items from a list in another file?", can be found on Stack Overflow: https://stackoverflow.com/questions/24081493/
