bash - Find and print lines where a column value is repeated n times

Tags: bash shell awk

I have a file:

scaffold_0      11498
scaffold_0      11501
scaffold_0      11728   "RHOH"
scaffold_0      12144   "RHOH"
scaffold_0      20708   "RHOH"
scaffold_0      23579   "RHOH"
scaffold_0      130818
scaffold_0      200485  "NSUN7"
scaffold_0      209928  "NSUN7"
scaffold_0      212965  "NSUN7"
scaffold_0      214055  "APBB2"
scaffold_0      223404
scaffold_0      223686  "APBB2"
scaffold_0      227687  "APBB2"
scaffold_0      306105  "APBB2"
scaffold_0      307000  "APBB2"
scaffold_0      391742
scaffold_0      399332  "UCHL1"
scaffold_0      406726  "UCHL1"
scaffold_0      482215
scaffold_0      484921
scaffold_0      538855  "LIMCH1"
scaffold_0      539051  "LIMCH1"
scaffold_0      539819
scaffold_0      543347  "LIMCH1"
scaffold_0      568182  "LIMCH1"
scaffold_0      570321
scaffold_0      570325
scaffold_0      577502  "LIMCH1"
scaffold_0      578933  "LIMCH1"
scaffold_0      621330  "PHOX2B"
scaffold_0      623303  "PHOX2B"
scaffold_0      640271
scaffold_0      667510  "gene3"
scaffold_0      679096
scaffold_0      698659  "TMEM33"
scaffold_0      700427  "TMEM33"

I want to print only the lines whose third-column value occurs 3 or more times in the file, so that these lines are removed:

scaffold_0      399332  "UCHL1"
scaffold_0      406726  "UCHL1"
scaffold_0      621330  "PHOX2B"
scaffold_0      623303  "PHOX2B"
scaffold_0      667510  "gene3"
scaffold_0      698659  "TMEM33"
scaffold_0      700427  "TMEM33"

I would like to keep the original order of the file and also keep the lines where the third column is empty. I tried:

sort -k3 file.txt | awk 'a[$3]++{ if(a[$3]>=2){ print b }; print $0}; {b=$0}'

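Best answer

This awk reads the whole file into memory, keyed by record number, then prints the qualifying lines in their original order. Lines with an empty third column are all counted under the same empty key, so they are kept here only because the file contains at least three of them.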

$ awk '{
    a[NR]=$0              # hash to a using record number as the key for order
    c[$3]++               # $3 counter
}
END {                     # after file records have been hashed
    for(i=1;i<=NR;i++) {  # iterate in order
        split(a[i],b)     # get the 3rd column
        if(c[b[3]]>=3)    # output if count is right
            print a[i]
    }
}' file
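If the file is too large to hold in memory, the same idea could be sketched as a two-pass variant: awk is given the file twice, counting on the first pass and printing on the second. This is only an illustration, not the answer's method. The threshold variable `n`, the file name `sample.txt` and its contents are assumptions made up for the example. Lines without a third column are kept explicitly via `NF < 3` rather than depending on how many such lines exist.

```shell
# Build a small sample file (hypothetical data in the same layout as the question).
printf '%s\n' \
  'scaffold_0 1' \
  'scaffold_0 2 "A"' \
  'scaffold_0 3 "A"' \
  'scaffold_0 4 "B"' \
  'scaffold_0 5 "A"' \
  'scaffold_0 6 "B"' > sample.txt

# Pass 1 (NR == FNR): count each third-column value.
# Pass 2: print lines with no third column, or whose value occurs at least n times.
out=$(awk -v n=3 '
    NR == FNR { c[$3]++; next }
    NF < 3 || c[$3] >= n
' sample.txt sample.txt)

printf '%s\n' "$out"
```

Here "A" occurs three times and is kept, "B" occurs twice and is dropped, and the line with no third column is printed unchanged. The file order is preserved because the second pass reads the lines in their original sequence.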

Sample output:

...
scaffold_0      306105  "APBB2"
scaffold_0      307000  "APBB2"
scaffold_0      391742
scaffold_0      482215
scaffold_0      484921
scaffold_0      538855  "LIMCH1"
scaffold_0      539051  "LIMCH1"
...
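To check which values are responsible for the dropped lines, a quick count per third-column value may help. The sketch below runs on hypothetical sample data (`counts_sample.txt` and its contents are made up for the example) and lists every value that occurs fewer than 3 times. The output is piped through `sort` because awk does not guarantee the order of `for (k in c)`.

```shell
# Hypothetical sample data in the same layout as the question.
printf '%s\n' \
  'scaffold_0 1' \
  'scaffold_0 2 "A"' \
  'scaffold_0 3 "B"' \
  'scaffold_0 4 "A"' \
  'scaffold_0 5 "C"' \
  'scaffold_0 6 "A"' \
  'scaffold_0 7 "B"' > counts_sample.txt

# Count third-column values on lines that have one, then report those below 3.
low=$(awk '
    NF >= 3 { c[$3]++ }
    END { for (k in c) if (c[k] < 3) print k, c[k] }
' counts_sample.txt | sort)

printf '%s\n' "$low"
```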

This is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/52462981/
