file - How to extract rows based on multiple codes in Linux?

I have a sample.txt file that looks like this:

chr1    StringTie       transcript      10001   10390   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10179   10390   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      10001   10467   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10173   10224   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       exon    10391   10467   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "3";
chr1    StringTie       transcript      10001   10467   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10173   10249   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       exon    10398   10467   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "3";
chr1    StringTie       transcript      10005   10467   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10005   10178   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10361   10467   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      10011   10467   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10011   10178   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10405   10467   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      57598   58856   .       +       .       transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; gene_name "OR4G11P"; xloc "XLOC_000002"; ref_gene_id "ENSG00000240361.2"; cmp_ref "ENST00000642116.1"; class_code "c"; tss_id "TSS2";
chr1    StringTie       exon    57598   57653   .       +       .       transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "1";
chr1    StringTie       exon    58700   58856   .       +       .       transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "2";
chr1    StringTie       transcript      65419   71585   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; ref_gene_id "ENSG00000186092.5"; cmp_ref "ENST00000641515.1"; class_code "="; tss_id "TSS3";
chr1    StringTie       exon    65419   65433   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "1";
chr1    StringTie       exon    65520   65573   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "2";
chr1    StringTie       exon    69037   71585   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "3";
chr1    StringTie       transcript      65572   75288   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; cmp_ref "ENST00000641515.1"; class_code "j"; tss_id "TSS4";
chr1    StringTie       exon    65572   65573   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "1";
chr1    StringTie       exon    69037   69093   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "2";
chr1    StringTie       exon    74913   75288   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "3";
chr1    StringTie       transcript      69055   71585   .       +       .       transcript_id "ENST00000335137.4"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; ref_gene_id "ENSG00000186092.5"; contained_in "ENST00000641515.1"; cmp_ref "ENST00000641515.1"; class_code "c"; tss_id "TSS5";
chr1    StringTie       exon    69055   71585   .       +       .       transcript_id "ENST00000335137.4"; gene_id "MSTRG.7563"; exon_number "1";
chr1    StringTie       transcript      83779   84926   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; xloc "XLOC_000004"; class_code "u"; tss_id "TSS6";
chr1    StringTie       exon    83779   83829   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "1";
chr1    StringTie       exon    83854   84926   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "2";
chr1    StringTie       transcript      89710   90455   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; gene_name "AL627309.3"; xloc "XLOC_000005"; cmp_ref "ENST00000495576.1"; class_code "s"; tss_id "TSS7";
chr1    StringTie       exon    89710   90050   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "1";
chr1    StringTie       exon    90287   90455   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "2";

I tried to extract the transcript rows with class_code "u", together with the exons that share their transcript_id, as follows:

awk -F "\t" '/class_code "u"/ {print $0}' sample.txt > new_filename.txt

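The awk command above only returns the rows that have "transcript" in the third column; their exons do not appear in new_filename.txt, because the exon rows carry no class_code attribute. What I actually want is to extract the transcripts for several class_codes along with their exons. How can I do this with awk?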

I need the transcripts with class_code u, s, or j, together with their exons.

The output should look like this:

chr1    StringTie       transcript      10001   10390   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10179   10390   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      10001   10467   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10173   10224   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       exon    10391   10467   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "3";
chr1    StringTie       transcript      10001   10467   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10173   10249   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       exon    10398   10467   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "3";
chr1    StringTie       transcript      10005   10467   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10005   10178   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10361   10467   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      10011   10467   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10011   10178   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10405   10467   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      65572   75288   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; cmp_ref "ENST00000641515.1"; class_code "j"; tss_id "TSS4";
chr1    StringTie       exon    65572   65573   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "1";
chr1    StringTie       exon    69037   69093   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "2";
chr1    StringTie       exon    74913   75288   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "3";
chr1    StringTie       transcript      83779   84926   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; xloc "XLOC_000004"; class_code "u"; tss_id "TSS6";
chr1    StringTie       exon    83779   83829   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "1";
chr1    StringTie       exon    83854   84926   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "2";
chr1    StringTie       transcript      89710   90455   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; gene_name "AL627309.3"; xloc "XLOC_000005"; cmp_ref "ENST00000495576.1"; class_code "s"; tss_id "TSS7";
chr1    StringTie       exon    89710   90050   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "1";
chr1    StringTie       exon    90287   90455   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "2";

Best answer

You can use the following command:

awk -F "\t" '/class_code/{p=/class_code "[usj]"/}p' input > output

Explained as a script:

filter.awk

BEGIN {
    FS="\t"
}

# When a line contains 'class_code' ...
/class_code/ {
    # ... set the flag 'p' to 1 if the regexp
    # /class_code "[usj]"/ matches, and to 0 otherwise.
    # Note: the flag keeps its value for the exon rows that follow,
    # so this relies on each transcript row preceding its exons.
    p=/class_code "[usj]"/
}

# A bare pattern with no action: when 'p' is true (1), awk
# prints the current line; when it is 0, nothing is printed.
p
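
The flag approach above depends on every transcript row appearing directly before its exons, which holds for the sample. If that ordering cannot be guaranteed, one possible variant is a two-pass awk that keys on transcript_id instead. The following is only a sketch, not part of the original answer: the function name filter_by_class_code is hypothetical, and it assumes that class_code appears only on transcript rows and that every row carries a transcript_id attribute, as in the sample.

```shell
#!/bin/sh
# filter_by_class_code FILE
# Hypothetical helper (not from the original answer): prints every row whose
# transcript_id belongs to a transcript with class_code u, s, or j,
# regardless of the order of the rows in FILE.
filter_by_class_code() {
    awk '
        # Return the full transcript_id "..." attribute of a line, or "" if absent.
        function tid(line) {
            return match(line, /transcript_id "[^"]+"/) ? substr(line, RSTART, RLENGTH) : ""
        }
        # Pass 1 (first read of the file): remember the wanted transcript IDs.
        NR == FNR {
            if ($0 ~ /class_code "[usj]"/) keep[tid($0)] = 1
            next
        }
        # Pass 2 (second read): print rows whose transcript_id was remembered.
        (tid($0) in keep)
    ' "$1" "$1"
}

# Example: filter_by_class_code sample.txt > new_filename.txt
```

Because the whole transcript_id "..." attribute is used as the lookup key, "MSTRG.6917.1" cannot be confused with an ID such as "MSTRG.6917.10". The trade-off is that the file is read twice, so this variant needs a regular file rather than a pipe.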

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64865378/
