I have a file:
scaffold_0 11498
scaffold_0 11501
scaffold_0 11728 "RHOH"
scaffold_0 12144 "RHOH"
scaffold_0 20708 "RHOH"
scaffold_0 23579 "RHOH"
scaffold_0 130818
scaffold_0 200485 "NSUN7"
scaffold_0 209928 "NSUN7"
scaffold_0 212965 "NSUN7"
scaffold_0 214055 "APBB2"
scaffold_0 223404
scaffold_0 223686 "APBB2"
scaffold_0 227687 "APBB2"
scaffold_0 306105 "APBB2"
scaffold_0 307000 "APBB2"
scaffold_0 391742
scaffold_0 399332 "UCHL1"
scaffold_0 406726 "UCHL1"
scaffold_0 482215
scaffold_0 484921
scaffold_0 538855 "LIMCH1"
scaffold_0 539051 "LIMCH1"
scaffold_0 539819
scaffold_0 543347 "LIMCH1"
scaffold_0 568182 "LIMCH1"
scaffold_0 570321
scaffold_0 570325
scaffold_0 577502 "LIMCH1"
scaffold_0 578933 "LIMCH1"
scaffold_0 621330 "PHOX2B"
scaffold_0 623303 "PHOX2B"
scaffold_0 640271
scaffold_0 667510 "gene3"
scaffold_0 679096
scaffold_0 698659 "TMEM33"
scaffold_0 700427 "TMEM33"
and I want to print only the lines where the item in the third column is repeated 3 or more times, so that these lines are removed:
scaffold_0 399332 "UCHL1"
scaffold_0 406726 "UCHL1"
scaffold_0 621330 "PHOX2B"
scaffold_0 623303 "PHOX2B"
scaffold_0 667510 "gene3"
scaffold_0 698659 "TMEM33"
scaffold_0 700427 "TMEM33"
I would like to keep the order of the file, and to keep the lines where the third column is empty. I tried:
sort -k3 file.txt | awk 'a[$3]++{ if(a[$3]>=2){ print b }; print $0}; {b=$0}'
Best answer
This awk reads the whole file and hashes it into memory:
$ awk '{
    a[NR]=$0                      # hash to a using record number as the key for order
    c[$3]++                       # $3 counter
}
END {                             # after file records have been hashed
    for(i=1;i<=NR;i++) {          # iterate in order
        n=split(a[i],b)           # get the 3rd column; n is the number of fields
        if(n<3 || c[b[3]]>=3)     # output if 3rd column is empty or count is right
            print a[i]
    }
}' file
Sample output:
...
scaffold_0 306105 "APBB2"
scaffold_0 307000 "APBB2"
scaffold_0 391742
scaffold_0 482215
scaffold_0 484921
scaffold_0 538855 "LIMCH1"
scaffold_0 539051 "LIMCH1"
...
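If holding the whole file in memory is a concern, the same idea can be sketched as a two-pass awk that reads the file twice: the first pass only counts the third-column values, and the second pass prints the rows to keep in their original order. This is a minimal sketch, not part of the original answer. The function name `filter_repeats` and its threshold argument `n` are hypothetical names introduced here for illustration, and the input is assumed to be a regular file (not a pipe), since it has to be read twice.

```shell
# Hypothetical helper: print rows whose 3rd column occurs at least n times,
# plus rows that have no 3rd column at all. Input order is preserved.
# Usage: filter_repeats FILE [N]     (N defaults to 3, as in the question)
filter_repeats() {
    file=$1
    n=${2:-3}
    awk -v n="$n" '
        NR == FNR {                  # first pass over the file: count only
            if (NF >= 3) count[$3]++
            next
        }
        NF < 3 || count[$3] >= n     # second pass: print the rows to keep
    ' "$file" "$file"
}
```

Calling `filter_repeats file.txt 3` on the sample from the question should drop the UCHL1, PHOX2B, gene3 and TMEM33 rows, since each of those values occurs fewer than 3 times, while leaving the rows with an empty third column in place. Only the counts are stored in memory, at the cost of reading the file twice.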
Regarding "bash - Find and print lines where a column value is repeated n times", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52462981/