我有一个 fastq 文件,其中包含超过 1 亿个读数和长度为 10000 的基因组序列
我想从 fastq 文件中取出序列并在基因组序列中搜索,允许 3 个不匹配
我尝试使用 awk 以这种方式从 fastq 文件中获取序列:
1.fq(几行)
@DH1DQQN1:269:C1UKCACXX:1:1101:1207:2171 1:N:0:TTAGGC NATCCCCATCCTCTGCTTGCTTTTCGGGATATGTTGTAGGATTCTCAGC
+
1=ADBDDHD;F>GF@FFEFGGGIAEEI?D9DDHHIGAAF:BG39?BB
@DH1DQQN1:269:C1UKCACXX:1:1101:1095:2217 1:N:0:TTAGGC TAGGATTTCAAATGGGTCGAGGTGGTCCGTTAGGTATAGGGGCAACAGG
+
??AABDD4C:DDDI+C:C3@:C):1?*):?)?################
$ awk 'NR%4==2' 1.fq
NATCCCCATCCTCTGCTTGCTTTTCGGGATATGTTGTAGGATTCTCAGC TAGGATTTCAAATGGGTCGAGGTGGTCCGTTAGGTATAGGGGCAACAGG
我的文件中有所有序列,现在我想获取每一行序列并在基因组序列中搜索,允许 3 个不匹配,如果找到则打印序列
示例:
基因组序列文件:
GGGGAGGAATATGATTTACAGTTTATTTTTCAACTGTGCAAAATAACCTTAACTGCAGACGTTATGACATACATACATTCTATGAATTCCACTATTTTGGAGGACTGGAATTTTGGTCTACAACCTCCCCCAGGAGGCACACTAGAAGATACTTATAGGTTTGTAACCCAGGCAATTGCTTGTCAAAAACATACA
搜索序列文件:
GGGGAGGAATATGAT
GGGGAGGAATATGAA
GGGGAGGAATATGCC
TCAAAAACATAGG
TCAAAAACATGGG
输出文件:
GGGGAGGAATATGAT 0 # 0 mismatch exact sequence
GGGGAGGAATATGAA 1 # 1 mismatch
GGGGAGGAATATGCC 2 # 2 mismatch
TCAAAAACATAGG 2 # 2 mismatch
TCAAAAACATGGG 3 # 3 mismatch
最佳答案
类似的东西?
use 5.012;
use strict;
use warnings;
use String::Approx qw(aslice);
use File::Slurp;
use Data::Dumper;
my $genseq = "gseq.txt"; #the long sequence
$_ = read_file($genseq);
#read small patterns from stdin
while(my $patt = <>) {
chomp $patt;
my $len = length($patt);
my($index, $size, $distance) = aslice($patt, ["3D0S3", "minimal_distance"]);
say "$patt matched approx. at $index with mismatch $distance" if $distance <= 3;
}
为您输入生成:
GGGGAGGAATATGAT matched approx. at 0 with mismatch 0
GGGGAGGAATATGAA matched approx. at 0 with mismatch 1
GGGGAGGAATATGCC matched approx. at 0 with mismatch 2
TCAAAAACATAGG matched approx. at 179 with mismatch 2
TCAAAAACATGGG matched approx. at 179 with mismatch 3
老实说,不知道如何使用 10000 个字符长的 genseq...
关于python - 搜索基因组中不匹配的序列,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/16343985/