linux - grep or awk: search in the first N characters of each line and output the lines surrounding the match

Tags: linux awk grep pattern-matching sequencing

I have RNA-seq data that looks like this:

@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--
@J00157:85:HNNJLBBXX:5:1101:12550:15574 1:N:0:ATTACTCG+TATAGCCT
GCTCTTCCGATCTGCTATTGATGACTGTCCTCTGTTCTTTCTTTCACAGTAGACGAGGACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTC
+
AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--

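If we treat everything from an @ line onward as one section, only the second line holds the actual sequence; lines 1, 3, 4, and 5 carry bookkeeping and quality information.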

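Goal: extract the sequences (the second line of each section) that contain "GCTGCA" within their first N (N=35) characters, and also output the surrounding lines (1 line before and 3 lines after the matching line).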

An example of the desired output:

@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--

What I have tried:

awk 'substr($0, 1, 35) ~ "GCTGCA"' filename.fastq > newfile.fastq
grep -B 1 -A 2 -E GCTGCA filename.fastq > newfile.fastq
awk '{a[++i]=$0;}{substr(a[++i], 1, 35) ~ "GCTGCA"}{for(j=NR-1;j<=NR+2;j++)print a[j];}' filename.fastq > newfile.fastq

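The first cannot output the surrounding lines. The second cannot restrict the pattern match to the first 35 characters of each line. The third should work, but it gives me weird output (which is clearly wrong):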

@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
+
+
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
--
--
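
As an aside on the second attempt, grep can be confined to a prefix by anchoring the regex at the start of the line and bounding how far in the motif may begin. The following is only a sketch: `grep_prefix` and its three arguments are hypothetical names, `-B`/`-A` are GNU/BSD grep options rather than POSIX, and it assumes the motif never appears that early in a header or quality line.

```shell
# Hypothetical helper: grep_prefix MOTIF N FILE
# Prints each line whose first N characters contain MOTIF,
# plus 1 line before and 2 lines after it.
grep_prefix() {
    motif=$1
    # The motif may start at offset 0 .. N - length(motif)
    # and still end within the first N characters.
    max_offset=$(( $2 - ${#motif} ))
    grep -B 1 -A 2 -E "^.{0,${max_offset}}${motif}" "$3"
}
```

For the case in the question this would be called as `grep_prefix GCTGCA 35 filename.fastq`, which expands to the pattern `^.{0,29}GCTGCA`.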

Best answer

Use gawk, which supports a multi-character RS.

awk -v RS='\n--\n' -F'\n' 'substr($2,1,35)~/GCTGCA/{print $0 "\n--"}' file

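You define what a record is by setting the record separator; with a newline as the field separator, each line of the record then becomes a field.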
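
If gawk is not available, the same idea can be sketched line by line in plain awk by counting the position of each line within its record. This is a minimal sketch under assumptions, not the answer's method: `filter_reads` and its arguments are hypothetical names, every record is assumed to be exactly four lines (header, sequence, +, quality), optionally followed by a `--` line, and the motif is matched as a fixed string.

```shell
# Hypothetical helper: filter_reads MOTIF N FILE
# Prints every 4-line record whose sequence line contains MOTIF
# within its first N characters, followed by a "--" line.
filter_reads() {
    awk -v pat="$1" -v n="$2" '
        # After the 4th line a new record starts; skip a "--" separator if present.
        pos == 4 { pos = 0; if ($0 == "--") next }
        { pos++ }
        # Line 1: remember the header until we know whether the read matches.
        pos == 1 { header = $0; next }
        # Line 2: test only the first n characters of the sequence.
        pos == 2 {
            keep = (index(substr($0, 1, n), pat) > 0)
            if (keep) print header "\n" $0
            next
        }
        # Lines 3 and 4: print them for matching reads, then close the record.
        keep { print; if (pos == 4) print "--" }
    ' "$3"
}
```

For the question's data this would be run as `filter_reads GCTGCA 35 filename.fastq > newfile.fastq`. Because the separator is only looked for after the fourth line, the sketch should also accept a plain FASTQ file that has no `--` lines.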

Regarding "linux - grep or awk: search in the first N characters of each line and output the lines surrounding the match", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50008431/
