I have a VCF file like this:
##bcftools_annotateVersion=1.3.1+htslib-1.3.1
##bcftools_annotateCommand=annotate
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG005
chr1 817186 rs3094315 G A 50 PASS platforms=2;platformnames=Illumina,CG;datasets=3;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair;callsets=5;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=IonExome,SolidSE75bp;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2 GT:PS:DP:ADALL:AD:GQ 1/1:.:809:0,363:78,428:237
chr1 817341 rs3131972 A G 50 PASS platforms=3;platformnames=Illumina,CG,Solid;datasets=4;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair,SolidSE75bp;callsets=6;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes,SolidSE75GATKHC;datasetsmissingcall=IonExome;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2 GT:PS:DP:ADALL:AD:GQ 1/1:.:732:1,330:99,391:302
I need to extract the ID column and the AN value from the INFO column, to get:
ID INFO
rs3094315 2
rs3131972 2
I am trying something like this:
awk '/^[^#]/{ print $3, gsub(/^[^AN=])/,"",$8)}' file.vcf
but I still don't get the desired result.
Best answer
You can try this awk:
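awk 'BEGIN{OFS="\t"}
/^##/{next}                # skip meta-information lines
/^#/{print $3,$8; next}    # header line: prints "ID" and "INFO"
{
  an=""                    # stays empty if INFO has no AN key
  n=split($8,a,";")        # split INFO into key=value pairs
  for(i=1;i<=n;i++) if (a[i]~/^AN=/) {an=a[i]; sub(/^AN=/,"",an); break}
  print $3, an
}
' file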
On the example, this prints:
ID INFO
rs3094315 2
rs3131972 2
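As for why the attempt in the question does not work: gsub() edits its target in place and returns the number of substitutions, not the modified string, so printing its return value prints a count. The pattern /^[^AN=])/ also has an unbalanced ")", which awk implementations typically reject, and [^AN=] is a bracket expression that matches one character that is not A, N or =, rather than "everything before AN=".

A variant that avoids the loop is sketched below. It uses match() to find AN= only at the start of INFO or right after a ";", so a key that merely ends in "AN" is not picked up. The file name sample.vcf, the function name extract_an, the shortened INFO strings and the "." placeholder for a missing AN are assumptions made for the illustration, not part of the original answer.

```shell
# Build a small tab-separated sample (INFO shortened; sample.vcf is a made-up name).
printf '##bcftools_annotateCommand=annotate\n' > sample.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG005\n' >> sample.vcf
printf 'chr1\t817186\trs3094315\tG\tA\t50\tPASS\tplatforms=2;AN=2;AF=1;AC=2\tGT\t1/1\n' >> sample.vcf
printf 'chr1\t817341\trs3131972\tA\tG\t50\tPASS\tAN=2;AF=1;AC=2\tGT\t1/1\n' >> sample.vcf

# extract_an FILE: print the ID column and the AN value from INFO.
extract_an() {
  awk 'BEGIN { FS = OFS = "\t" }
  /^##/ { next }                    # skip meta-information lines
  /^#/  { print $3, $8; next }      # header line: ID and INFO
  {
    an = "."                        # placeholder when AN is absent (assumption)
    # match AN= only at the start of INFO or directly after a ";"
    if (match($8, /(^|;)AN=[^;]*/)) {
      an = substr($8, RSTART, RLENGTH)
      sub(/^;?AN=/, "", an)         # drop the optional ";" and the key
    }
    print $3, an
  }' "$1"
}

extract_an sample.vcf
```

Setting FS to a tab assumes a real, tab-separated VCF; if the file was pasted with spaces instead, drop the FS assignment and let awk split on whitespace.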
Regarding "bash - Extract a substring from a VCF column with awk", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76381570/