I have a file, sample.txt, that looks like this:
chr1 StringTie transcript 10001 10390 . + . transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10001 10101 . + . transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10179 10390 . + . transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie transcript 10001 10467 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10001 10101 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10173 10224 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie exon 10391 10467 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "3";
chr1 StringTie transcript 10001 10467 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10001 10101 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10173 10249 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie exon 10398 10467 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "3";
chr1 StringTie transcript 10005 10467 . + . transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10005 10178 . + . transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10361 10467 . + . transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie transcript 10011 10467 . + . transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10011 10178 . + . transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10405 10467 . + . transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie transcript 57598 58856 . + . transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; gene_name "OR4G11P"; xloc "XLOC_000002"; ref_gene_id "ENSG00000240361.2"; cmp_ref "ENST00000642116.1"; class_code "c"; tss_id "TSS2";
chr1 StringTie exon 57598 57653 . + . transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "1";
chr1 StringTie exon 58700 58856 . + . transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "2";
chr1 StringTie transcript 65419 71585 . + . transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; ref_gene_id "ENSG00000186092.5"; cmp_ref "ENST00000641515.1"; class_code "="; tss_id "TSS3";
chr1 StringTie exon 65419 65433 . + . transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "1";
chr1 StringTie exon 65520 65573 . + . transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "2";
chr1 StringTie exon 69037 71585 . + . transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "3";
chr1 StringTie transcript 65572 75288 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; cmp_ref "ENST00000641515.1"; class_code "j"; tss_id "TSS4";
chr1 StringTie exon 65572 65573 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "1";
chr1 StringTie exon 69037 69093 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "2";
chr1 StringTie exon 74913 75288 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "3";
chr1 StringTie transcript 69055 71585 . + . transcript_id "ENST00000335137.4"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; ref_gene_id "ENSG00000186092.5"; contained_in "ENST00000641515.1"; cmp_ref "ENST00000641515.1"; class_code "c"; tss_id "TSS5";
chr1 StringTie exon 69055 71585 . + . transcript_id "ENST00000335137.4"; gene_id "MSTRG.7563"; exon_number "1";
chr1 StringTie transcript 83779 84926 . + . transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; xloc "XLOC_000004"; class_code "u"; tss_id "TSS6";
chr1 StringTie exon 83779 83829 . + . transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "1";
chr1 StringTie exon 83854 84926 . + . transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "2";
chr1 StringTie transcript 89710 90455 . + . transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; gene_name "AL627309.3"; xloc "XLOC_000005"; cmp_ref "ENST00000495576.1"; class_code "s"; tss_id "TSS7";
chr1 StringTie exon 89710 90050 . + . transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "1";
chr1 StringTie exon 90287 90455 . + . transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "2";
I tried to extract the transcript lines and their exons, matching on transcript_id with class_code "u", like this:
awk -F "\t" '/class_code "u"/ {print $0}' sample.txt > new_filename.txt
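The awk command above returns only the lines that have "transcript" in the third column; their exons do not appear in new_filename.txt. What I actually want is to extract the transcripts for several class_codes together with their exons. How can I do this with awk?
I need the transcripts with class_codes u, s, j and their exons.
The output should look like this: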
chr1 StringTie transcript 10001 10390 . + . transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10001 10101 . + . transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10179 10390 . + . transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie transcript 10001 10467 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10001 10101 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10173 10224 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie exon 10391 10467 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "3";
chr1 StringTie transcript 10001 10467 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10001 10101 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10173 10249 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie exon 10398 10467 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "3";
chr1 StringTie transcript 10005 10467 . + . transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10005 10178 . + . transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10361 10467 . + . transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie transcript 10011 10467 . + . transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10011 10178 . + . transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10405 10467 . + . transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie transcript 65572 75288 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; cmp_ref "ENST00000641515.1"; class_code "j"; tss_id "TSS4";
chr1 StringTie exon 65572 65573 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "1";
chr1 StringTie exon 69037 69093 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "2";
chr1 StringTie exon 74913 75288 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "3";
chr1 StringTie transcript 83779 84926 . + . transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; xloc "XLOC_000004"; class_code "u"; tss_id "TSS6";
chr1 StringTie exon 83779 83829 . + . transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "1";
chr1 StringTie exon 83854 84926 . + . transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "2";
chr1 StringTie transcript 89710 90455 . + . transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; gene_name "AL627309.3"; xloc "XLOC_000005"; cmp_ref "ENST00000495576.1"; class_code "s"; tss_id "TSS7";
chr1 StringTie exon 89710 90050 . + . transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "1";
chr1 StringTie exon 90287 90455 . + . transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "2";
Best answer
You can use the following command:
awk -F "\t" '/class_code/{p=/class_code "[usj]"/}p' input > output
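If the set of class codes changes from run to run, the same flag idea can be driven by a variable instead of a hard-coded bracket expression. The following is only a sketch under assumptions: the function name filter_by_codes, the variables codes and re, the row helper, and the temporary file are illustrative and not part of the original answer, and the sample rows are abbreviated from the question. It assumes the codes are plain characters such as u, s, j; characters with a special meaning inside a bracket expression (for example ^ or ]) would need extra care.

```shell
#!/bin/sh
# Sketch: pass the wanted class codes to awk with -v instead of hard-coding them.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Helper (illustrative): print one tab-separated GTF row.
row() { printf 'chr1\tStringTie\t%s\t%s\t%s\t.\t+\t.\t%s\n' "$1" "$2" "$3" "$4"; }

# Abbreviated sample: three transcripts (codes u, c, s), one exon each.
{
    row transcript 10001 10390 'transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; class_code "u";'
    row exon 10001 10101 'transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";'
    row transcript 57598 58856 'transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; class_code "c";'
    row exon 57598 57653 'transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "1";'
    row transcript 89710 90455 'transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; class_code "s";'
    row exon 89710 90050 'transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "1";'
} > "$tmp/sample.txt"

# filter_by_codes CODES FILE: keep transcripts whose class_code is one of
# the characters in CODES, plus the exon rows that follow each of them.
filter_by_codes() {
    awk -F '\t' -v codes="$1" '
        # Build the regexp once, e.g. class_code "[usj]"
        BEGIN { re = "class_code \"[" codes "]\"" }
        # Transcript rows switch the flag on or off ...
        /class_code/ { p = ($0 ~ re) }
        # ... and every row is printed while the flag is on.
        p
    ' "$2"
}

result=$(filter_by_codes usj "$tmp/sample.txt")
only_s=$(filter_by_codes s "$tmp/sample.txt")
printf '%s\n' "$result"
```

With the abbreviated sample, result holds the u and s transcripts with their exons, and only_s holds just the s transcript and its exon.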
Explanation, written out as a script:
filter.awk
BEGIN {
    FS = "\t"
}

# When a line contains 'class_code' (i.e. a transcript row) ...
/class_code/ {
    # ... set the flag 'p' to 1 if the regexp /class_code "[usj]"/
    # matches the line, and to 0 if it does not.
    # Note: the flag keeps its value for the exon rows that follow,
    # because those rows do not contain 'class_code'.
    p = /class_code "[usj]"/
}

# A bare pattern: awk prints the current line when 'p' is true (1)
# and skips it when 'p' is false (0).
p
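The flag approach relies on each transcript row being followed directly by its own exon rows, which holds for the sample in the question. If a file were ordered differently, one possible alternative is a two-pass awk that matches rows by transcript_id. This is a sketch under assumptions, not part of the original answer: the helper names tid and row, the variable names, and the reordered sample are illustrative, and it assumes every row carries a transcript_id "..." attribute.

```shell
#!/bin/sh
# Sketch: select exon rows by transcript_id instead of by position in the file.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Helper (illustrative): print one tab-separated GTF row.
row() { printf 'chr1\tStringTie\t%s\t%s\t%s\t.\t+\t.\t%s\n' "$1" "$2" "$3" "$4"; }

# Reordered sample: the exon of MSTRG.6917.1 comes after another transcript row.
{
    row transcript 10001 10390 'transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; class_code "u";'
    row transcript 57598 58856 'transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; class_code "c";'
    row exon 10001 10101 'transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";'
    row exon 57598 57653 'transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "1";'
    row transcript 65572 75288 'transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; class_code "j";'
    row exon 65572 65573 'transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "1";'
} > "$tmp/sample.txt"

# The same file is given twice: NR == FNR is true only while reading the first copy.
result=$(awk -F '\t' '
    # Return the transcript_id "..." attribute of the current line, or "".
    function tid() {
        return match($0, /transcript_id "[^"]+"/) ? substr($0, RSTART, RLENGTH) : ""
    }
    # Pass 1: remember the IDs of transcripts with class_code u, s or j.
    NR == FNR {
        if (/class_code "[usj]"/ && (id = tid()) != "") keep[id] = 1
        next
    }
    # Pass 2: print every row whose transcript_id was remembered.
    (id = tid()) != "" && (id in keep)
' "$tmp/sample.txt" "$tmp/sample.txt")

printf '%s\n' "$result"
```

The trade-off is that the file is read twice and the selected IDs are held in memory, in exchange for not depending on row order.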
Source: Stack Overflow, "How to extract lines based on multiple codes in Linux?": https://stackoverflow.com/questions/64865378/