file - How to extract lines based on multiple codes in Linux?

Tags: file awk sed extract

I have a sample.txt file that looks like this:

chr1    StringTie       transcript      10001   10390   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10179   10390   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      10001   10467   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10173   10224   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       exon    10391   10467   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "3";
chr1    StringTie       transcript      10001   10467   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10173   10249   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       exon    10398   10467   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "3";
chr1    StringTie       transcript      10005   10467   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10005   10178   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10361   10467   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      10011   10467   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10011   10178   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10405   10467   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      57598   58856   .       +       .       transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; gene_name "OR4G11P"; xloc "XLOC_000002"; ref_gene_id "ENSG00000240361.2"; cmp_ref "ENST00000642116.1"; class_code "c"; tss_id "TSS2";
chr1    StringTie       exon    57598   57653   .       +       .       transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "1";
chr1    StringTie       exon    58700   58856   .       +       .       transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "2";
chr1    StringTie       transcript      65419   71585   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; ref_gene_id "ENSG00000186092.5"; cmp_ref "ENST00000641515.1"; class_code "="; tss_id "TSS3";
chr1    StringTie       exon    65419   65433   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "1";
chr1    StringTie       exon    65520   65573   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "2";
chr1    StringTie       exon    69037   71585   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "3";
chr1    StringTie       transcript      65572   75288   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; cmp_ref "ENST00000641515.1"; class_code "j"; tss_id "TSS4";
chr1    StringTie       exon    65572   65573   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "1";
chr1    StringTie       exon    69037   69093   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "2";
chr1    StringTie       exon    74913   75288   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "3";
chr1    StringTie       transcript      69055   71585   .       +       .       transcript_id "ENST00000335137.4"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; ref_gene_id "ENSG00000186092.5"; contained_in "ENST00000641515.1"; cmp_ref "ENST00000641515.1"; class_code "c"; tss_id "TSS5";
chr1    StringTie       exon    69055   71585   .       +       .       transcript_id "ENST00000335137.4"; gene_id "MSTRG.7563"; exon_number "1";
chr1    StringTie       transcript      83779   84926   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; xloc "XLOC_000004"; class_code "u"; tss_id "TSS6";
chr1    StringTie       exon    83779   83829   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "1";
chr1    StringTie       exon    83854   84926   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "2";
chr1    StringTie       transcript      89710   90455   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; gene_name "AL627309.3"; xloc "XLOC_000005"; cmp_ref "ENST00000495576.1"; class_code "s"; tss_id "TSS7";
chr1    StringTie       exon    89710   90050   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "1";
chr1    StringTie       exon    90287   90455   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "2";

I tried to extract the transcript rows and their exons by matching class_code "u", as follows:

awk -F "\t" '/class_code "u"/ {print $0}' sample.txt > new_filename.txt

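The awk command above only returns the rows that have transcript in the third column; their exons do not appear in new_filename.txt. What I actually want is to extract the transcripts for several class_codes together with their exons. How can I do this with awk?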

I need the transcripts with class_codes u, s, and j, along with their exons.

The output should look like this:

chr1    StringTie       transcript      10001   10390   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10179   10390   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      10001   10467   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10173   10224   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       exon    10391   10467   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "3";
chr1    StringTie       transcript      10001   10467   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10173   10249   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       exon    10398   10467   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "3";
chr1    StringTie       transcript      10005   10467   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10005   10178   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10361   10467   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      10011   10467   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10011   10178   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10405   10467   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      65572   75288   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; cmp_ref "ENST00000641515.1"; class_code "j"; tss_id "TSS4";
chr1    StringTie       exon    65572   65573   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "1";
chr1    StringTie       exon    69037   69093   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "2";
chr1    StringTie       exon    74913   75288   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "3";
chr1    StringTie       transcript      83779   84926   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; xloc "XLOC_000004"; class_code "u"; tss_id "TSS6";
chr1    StringTie       exon    83779   83829   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "1";
chr1    StringTie       exon    83854   84926   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "2";
chr1    StringTie       transcript      89710   90455   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; gene_name "AL627309.3"; xloc "XLOC_000005"; cmp_ref "ENST00000495576.1"; class_code "s"; tss_id "TSS7";
chr1    StringTie       exon    89710   90050   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "1";
chr1    StringTie       exon    90287   90455   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "2";

Best answer

You can use the following command:

awk -F "\t" '/class_code/{p=/class_code "[usj]"/}p' input > output

The same logic, explained as a script:

filter.awk

BEGIN {
    FS="\t"
}

# When a line contains 'class_code' ...
/class_code/ {
    # ... set a flag 'p' to 1 or 0 if the regexp
    # /class_code "[usj]"/ matches or not.
    # Note: this flag will remain set / unset for the following
    # exon rows too
    p=/class_code "[usj]"/
}

# A bare pattern with no action prints the current line when it is
# true. Here p is 1 or 0, so the line is printed only while p is 1.
p
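
To check the flag technique on a small input, the following sketch may help. The file names mini_sample.gtf and mini_filtered.gtf, the shortened rows, the variable name codes, and the flag name keep are all made up for illustration; only the flag logic comes from the script above. It assumes, as in sample.txt, that every transcript row carries a class_code attribute and that each transcript's exon rows follow it directly.

```shell
#!/bin/sh
# Hypothetical miniature input modeled on sample.txt (tab-separated).
{
  printf 'chr1\tStringTie\ttranscript\t1\t9\t.\t+\t.\ttranscript_id "T1"; class_code "u";\n'
  printf 'chr1\tStringTie\texon\t1\t9\t.\t+\t.\ttranscript_id "T1"; exon_number "1";\n'
  printf 'chr1\tStringTie\ttranscript\t20\t40\t.\t+\t.\ttranscript_id "T2"; class_code "c";\n'
  printf 'chr1\tStringTie\texon\t20\t40\t.\t+\t.\ttranscript_id "T2"; exon_number "1";\n'
  printf 'chr1\tStringTie\ttranscript\t50\t90\t.\t+\t.\ttranscript_id "T3"; class_code "j";\n'
  printf 'chr1\tStringTie\texon\t50\t60\t.\t+\t.\ttranscript_id "T3"; exon_number "1";\n'
  printf 'chr1\tStringTie\texon\t80\t90\t.\t+\t.\ttranscript_id "T3"; exon_number "2";\n'
} > mini_sample.gtf

# Class codes to keep, passed in as a variable instead of being
# hard-coded in the regexp (an illustrative variation, not part of
# the original answer).
codes='usj'

awk -v codes="$codes" '
    # On every row that has a class_code (the transcript rows),
    # update the flag: 1 if the code is one of the wanted ones, else 0.
    /class_code/ { keep = ($0 ~ ("class_code \"[" codes "]\"")) }
    # Print the row while the flag is set; exon rows inherit the
    # flag from the transcript row above them.
    keep
' mini_sample.gtf > mini_filtered.gtf

cat mini_filtered.gtf
```

The T2 transcript (class_code "c") and its exon are dropped, while T1 and T3 are kept with their exons. If the exon rows were not guaranteed to follow their transcript row, this flag approach would not be sufficient and the rows would have to be matched on transcript_id instead.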

Regarding "file - How to extract lines based on multiple codes in Linux?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64865378/
