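I am very new to Python programming. I have FASTA files containing the protein sequences of some plant species.
I want to filter them by the number of amino acids each sequence contains. The criterion is to keep sequences longer than 20 amino acids.
Using resources from the Biopython cookbook, I was able to pick out the sequences longer than 20 amino acids. However, when I try to write them to a file I get an error that I cannot resolve. I would also like to include each sequence's ID in the output file. Please help!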
Code:
    import Bio
    from Bio import SeqIO
    for s_record in SeqIO.parse('arabidopsis_thaliana_proteome.ath.tfa','fasta'):
        name = s_record.id
        seq = s_record.seq
        seqLen = len(s_record)
        if seqLen >20:
            desired_proteins=seq
            output_file=SeqIO.write(desired_proteins, "filtered.fasta","fasta")
    output_file
Input file: Arabidopsis thaliana
>AT5G16970
MTATNKQVILKDYVSGFPTESDFDFTTTTVELRVPEGTNSVLVKNLYLSCDPYMRIRMGKPDPSTAALAQAYTPGQPIQGYGVSRIIESGHPDYKKGDLLWGIVAWEEYSVITPMTHAHFKIQHTDVPLSYYTGLLGMPGMTAYAGFYEVCSPKEGETVYVSAASGAVGQLVGQLAKMMGCYVVGSAGSKEKVDLLKTKFGFDDAFNYKEESDLTAALKRCFPNGIDIYFENVGGKMLDAVLVNMNMHGRIAVCGMISQYNLENQEGVHNLSNIIYKRIRIQGFVVSDFYDKYSKFLEFVLPHIREGKITYVEDVADGLEKAPEALVGLFHGKNVGKQVVVVARE*
>AT4G32100
MATNACKFLCLVLLFAFVTQGYGDDSYSLESLSVIQSKTGNMVENKPEWEVKVLNSSPCYFTHTTLSCVRFKSVTPIDSKVLSKSGDTCLLGNGDSIHDISFKYVWDTSFDLKVVDGYIACS*
Thanks in advance :)
Best answer
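According to the Biopython tutorial here:
http://biopython.org/wiki/SeqIO
the first argument to SeqIO.parse() is normally a file handle. Older Biopython releases required a handle; recent releases accept either a handle or a filename. The handle-based pattern looks like this:
    from Bio import SeqIO

    with open("example.fasta") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            print(record.id)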
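To see the handle-based pattern in isolation, here is a minimal sketch that parses from an in-memory handle instead of a file on disk. The record names seq1 and seq2 and their sequences are made up for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA text, held in memory instead of a file on disk.
fasta_text = ">seq1\nMTATNKQVILKDYVSGFPTESDF\n>seq2\nMATNACKF\n"

# StringIO provides a file-like handle that SeqIO.parse() can read from.
with StringIO(fasta_text) as handle:
    # Map each record ID to the length of its sequence.
    lengths = {record.id: len(record) for record in SeqIO.parse(handle, "fasta")}

print(lengths)
```

Each item yielded by SeqIO.parse() is a SeqRecord, which carries both the ID and the sequence, so the length check and the ID can come from the same object.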
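This should work. Note that SeqIO.write() expects SeqRecord objects rather than a bare Seq, so the matching records are collected and written in a single call, which also keeps each sequence ID in the output:
    from Bio import SeqIO

    # Keep whole SeqRecord objects (not just .seq) so the IDs are written too.
    desired_proteins = []
    with open('arabidopsis_thaliana_proteome.ath.tfa') as fh:
        for s_record in SeqIO.parse(fh, 'fasta'):
            if len(s_record) > 20:
                desired_proteins.append(s_record)

    # SeqIO.write() returns the number of records written.
    count = SeqIO.write(desired_proteins, "filtered.fasta", "fasta")
    print(count)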
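As a further sketch of the same filtering idea, the function below streams records through a generator expression and passes whole SeqRecord objects to SeqIO.write(), so each ID ends up in the output header. The function name filter_fasta_by_length, its parameters, and the demo records are hypothetical and chosen only for illustration; passing filenames directly assumes a reasonably recent Biopython release.

```python
import os
import tempfile

from Bio import SeqIO


def filter_fasta_by_length(in_path, out_path, min_len=20):
    """Write records longer than min_len to out_path and return how many were written."""
    # len(record) counts every character of the sequence, including a trailing "*".
    long_records = (
        record for record in SeqIO.parse(in_path, "fasta") if len(record) > min_len
    )
    # SeqIO.write() takes SeqRecord objects, so IDs are written as FASTA headers.
    return SeqIO.write(long_records, out_path, "fasta")


# Small self-check with made-up records in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, "input.fasta")
    dst = os.path.join(tmp_dir, "filtered.fasta")
    with open(src, "w") as handle:
        handle.write(">short_demo\nMKT\n>long_demo\n" + "A" * 25 + "\n")
    written = filter_fasta_by_length(src, dst)
    kept_ids = [record.id for record in SeqIO.parse(dst, "fasta")]

print(written, kept_ids)
```

For the file in the question, the call would be filter_fasta_by_length('arabidopsis_thaliana_proteome.ath.tfa', 'filtered.fasta'). Because the records are streamed, the whole proteome does not have to be held in memory at once.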
Source: the Stack Overflow question "python-2.7 - Filter FASTA file based on IDs using Biopython": https://stackoverflow.com/questions/40747196/