python - 将序列从 FASTA 文件提取到多个文件,文件基于单独文件中的 header_ID

标签 python regex biopython fasta

我正在寻找一个python解决方案,根据与列表的匹配将多个序列从FASTA文件提取到多个文件中 header ID 位于单独的文件中。

这是发布在 Extract sequences from a FASTA file based on entries in a separate file 上的问题的稍微复杂的版本。和 https://www.biostars.org/p/2822/它只输出所有匹配的单个文件。

我是 python 新手,正在尝试找到一种方法:

  • 获取一个包含将出现在 fasta header 中的字符串的文件
  • 将与字符串匹配的所有记录写入单独的 fasta 文件

header_ID_strings 文件如下所示:

CAP357_2030_09WPI、CAP357_2040_11WPI、CAP357_2050_13WPI 等...

我的 fasta 文件示例如下所示:

>CAP357_2030_009wpi_v1v3_1_056_00002_000.4
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGG
>CAP357_2040_011wpi_v1v3_1_008_00006_001.1
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
>CAP357_2040_011wpi_v1v3_1_030_00002_000.4
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
>CAP357_2040_011wpi_v1v3_1_004_00001_000.2
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
>CAP357_2050_013wpi_v1v3_1_047_00002_000.4
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT

预期输出

文件1:CAP357_2030_009wpi_v1v3.fasta
>CAP357_2030_009wpi_v1v3_1_056_00002_000.4
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGG

文件2:CAP357_2040_011wpi_v1v3.fasta
>CAP357_2040_011wpi_v1v3_1_008_00006_001.1
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
>CAP357_2040_011wpi_v1v3_1_030_00002_000.4
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
>CAP357_2040_011wpi_v1v3_1_004_00001_000.2
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
等等...

此代码来自上面的链接,但我想要:
* 匹配写入单独的输出文件
* 如果可能的话,我不必单独指定每个输出文件(我最多有 30 个输出文件)

#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

index = SeqIO.index(input_file, "fasta")
records = (index[r] for r in wanted)
count = SeqIO.write(records, output_file, "fasta")
assert count == len(wanted)

print "Saved %i records from %s to %s" % (count, input_file, output_file)

到目前为止,这就是我想到的(下面的脚本),但不知道如何手动指定所有输出文件和变量(我在这里只包含了三个)

from Bio import SeqIO
import pandas as pd
import sys

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file2020 = sys.argv[3]
output_file2030 = sys.argv[4]
output_file2040 = sys.argv[5]

colnames = ["2020", "2030", "2040"]
headerlist = pd.read_csv(id_file, names = colnames, header = None)
infile = list(SeqIO.parse(input_file, "fasta"))
2020_seq = tuple(headerlist.2020)
2030_seq = tuple(headerlist.2030)
2040_seq = tuple(headerlist.2040)

count2020 = 0
count2030 = 0
count2040 = 0
for record in infile:
    if record.id in 2020_seq:
        SeqIO.write([record], output_file2020, "fasta")
        countSU += 1
    elif record.id in PI_seq:
        SeqIO.write([record], output_file2030, "fasta")
        countPI += 1
    elif record.id in REC_seq:
        SeqIO.write([record], output_file2040, "fasta")
        countREC += 1
    else:
        print("no matches found")

print("number of SU is", count2020)
print("number of PI is", count2030)
print("number of REC is", count2040)

最佳答案

为了完整起见,这是“最终”脚本:

#!/usr/bin/env python
# a script to extract fasta records from a fasta file to multiple separate fasta files based on a particular ID (time point) in a particular field, for a given delimiter
# to run, navigate to file location with command prompt and enter: python split_fasta_by_collections.py infile.fasta
from Bio import SeqIO
import os
import sys

records = SeqIO.parse(sys.argv[1], "fasta")
collected = {}
for record in records:
    descr = record.description.split("_")[1] # "_" sets the delimeter, "1" sets the field where counting starts at 0 for the first field
    try:
        collected[descr].append(record)
    except KeyError:
        collected[descr] = [record ,]

file_name = "outfile%s.fasta" 
file_path = os.getcwd() #sets the output file path to your current working directory

for (descr, records) in collected.items():  
    with open(os.path.join(file_path, file_name % descr), 'w') as f:
        SeqIO.write(records, f, 'fasta')

关于python - 将序列从 FASTA 文件提取到多个文件,文件基于单独文件中的 header_ID,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/27469644/

相关文章:

Javascript 匹配分隔符之间的所有内容,包括换行符

python - 使用 NcbiblastxCommandline 自定义 blast 数据库

python - 计算k度的平均聚类

javascript - 带有捕获括号的 .split() 正则表达式在 IE8 中不起作用

javascript - 正则表达式 使用 OR ( | ) 时非捕获组

Python:如何比较来自 fasta 文件的多个序列?

python - 查找允许某些不匹配的子字符串的快速方法

python - QUIT pygame 事件未定义

python - 跳转到 for 循环内 readlines 的下一行

python - 如何使 Nose 测试使用python3