ubuntu - AWK: extract the first two lines that share the same word in a column

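I have a large multi-column file (500 MB to 1 GB, tab-delimited) and a list of more than 100,000 words that appear in one specific column. For each word, I need to extract the first two lines that contain it.
Right now I am using a loop like this: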

while read -r GREP
do
    grep -m 2 "${GREP}" input.txt
done < list_of_words.txt > output.txt

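But this takes far too long (I need to do this for many files), so I am looking for an alternative.
A simple fgrep -f -m2 does not work, because -m limits the combined output of all patterns rather than the hits for each word.
I suspect awk could be a solution, but I could not find any help online.
For example, if the input file is: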

Dog Bird House
Mouse Giraffe Cat
Mouse Rhino House
Lion Horse House
Dog Rat Cat
Dog Mice Cat

I would like an output file like this (the order of the lines does not matter):

Dog Bird House
Mouse Rhino House
Mouse Giraffe Cat
Dog Rat Cat

Right now I am using a word list like this:

House
Cat

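However, a way to simply keep the first two lines for each distinct word in the third column would be even better!
Note: the words in the third column are unique to that column and do not appear in any other column!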

Best answer

Something like this?

$ awk -F"\t" 'NR==FNR{a[$0]=2;next}($3 in a)&&--a[$3]>=0' list input

Output:

Dog     Bird    House
Mouse   Giraffe Cat
Mouse   Rhino   House
Dog     Rat     Cat
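
To try this yourself, here is a small sketch that rebuilds the sample data with real tab separators and runs the same one-liner. The temporary directory and the file names list and input are just placeholders for the demo, and the sample data is the one from the question.

```shell
# Recreate the sample files with real tab separators (file names are placeholders).
tmp=$(mktemp -d)
printf '%s\t%s\t%s\n' \
    Dog Bird House \
    Mouse Giraffe Cat \
    Mouse Rhino House \
    Lion Horse House \
    Dog Rat Cat \
    Dog Mice Cat > "$tmp/input"
printf '%s\n' House Cat > "$tmp/list"

# Same one-liner as above: keep at most two lines per listed third-column word.
result=$(awk -F"\t" 'NR==FNR{a[$0]=2;next}($3 in a)&&--a[$3]>=0' "$tmp/list" "$tmp/input")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because awk reads the input file only once, the cost no longer grows with the number of words the way the grep loop does.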

Explanation:

$ awk -F"\t" '           # yes awk yes, fields tab delimited
NR==FNR {                # process the first file, list of words
    a[$0]=2              # hash every word, set initial value to 2
    next                 # on to the next word
}                        # process second file below this point
($3 in a) && --a[$3]>=0  # 3rd field is in a and was printed fewer than twice: output
' list input             # mind the file order
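
If the word list is nothing more than every value that occurs in the third column, the list file may not be needed at all, which is what the question hints at. A minimal sketch under that assumption: count how often each third-field value has been seen and print a line only for the first two occurrences. The array name seen and the file names are illustrative, not part of the answer above.

```shell
# Sample data from the question, written with real tabs (file name is a placeholder).
tmp=$(mktemp -d)
printf '%s\t%s\t%s\n' \
    Dog Bird House \
    Mouse Giraffe Cat \
    Mouse Rhino House \
    Lion Horse House \
    Dog Rat Cat \
    Dog Mice Cat > "$tmp/input"

# ++seen[$3] counts occurrences of each third-field value;
# the line is printed only while that count is 1 or 2.
result=$(awk -F'\t' '++seen[$3] <= 2' "$tmp/input")
printf '%s\n' "$result"
rm -rf "$tmp"
```

On the sample data this prints the same four lines as the list-based version. The trade-off is that seen holds one counter for every distinct third-column value in the file, not only for the words you care about.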

This question, "ubuntu - AWK: extract the first two lines that share the same word in a column", corresponds to a similar question found on Stack Overflow: https://stackoverflow.com/questions/64121772/
