python - Parsing an XML file containing multi-FASTA BLAST results in Python

Tags: python xml bioinformatics biopython blast

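I am trying to parse an XML file containing multi-FASTA BLAST results - here is the link - about 400 kB in size. The program should return four sequence names. Each result should be the first hit (the one with the best alignment) that follows "< Iteration_iter-num > n < /Iteration_iter-num >", where n = 1, 2, 3, ...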

Like this:

< Iteration_iter-num >1< /Iteration_iter-num >

****Alignment****
sequence: gi|171864|gb|AAC04946.1| Yal011wp [Saccharomyces cerevisiae]

< Iteration_iter-num >2< /Iteration_iter-num >

****Alignment****
sequence: gi|330443384|ref|NP_009392.2| 

< Iteration_iter-num >3< /Iteration_iter-num >

****Alignment****
sequence: gi|6319310|ref|NP_009393.1|

< Iteration_iter-num >4< /Iteration_iter-num >

****Alignment****
sequence: gi|6319312|ref|NP_009395.1|

But my program returns this instead:

<Iteration_iter-num>1</Iteration_iter-num>
****Alignment****
sequence: gi|171864|gb|AAC04946.1| Yal011wp [Saccharomyces cerevisiae]

<Iteration_iter-num>2</Iteration_iter-num>
****Alignment****
sequence: gi|171864|gb|AAC04946.1| Yal011wp [Saccharomyces cerevisiae]

<Iteration_iter-num>3</Iteration_iter-num>
****Alignment****
sequence: gi|171864|gb|AAC04946.1| Yal011wp [Saccharomyces cerevisiae]

<Iteration_iter-num>4</Iteration_iter-num>
****Alignment****
sequence: gi|171864|gb|AAC04946.1| Yal011wp [Saccharomyces cerevisiae]

How can I get the other BLAST results from this XML file?

Here is my code:

from Bio.Blast import NCBIXML
from bs4 import BeautifulSoup

result = open ("BLAST_left.xml", "r")
records = NCBIXML.parse(result)
item = next(records)

file = open("BLAST_left.xml")
page = file.read()
soup = BeautifulSoup(page, "xml")
num_xml_array = soup.find_all('Iteration_iter-num')
i = 0
for records in records:
    for itemm in num_xml_array:
        print (itemm)
        for alignment in item.alignments:
            for hsp in alignment.hsps:
                print("\n\n****Alignment****")
                print("sequence:", alignment.title)
            break
        itemm = num_xml_array[i+1]
    break

//I know my English is not perfect, but I really don't know what to do and I have no one else to ask, so I chose you :)

Best answer

I think Biopython alone is the better choice for parsing this XML; there is no need for BeautifulSoup:

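from Bio.Blast import NCBIXML

# Each record yielded by the parser corresponds to one <Iteration> (one query).
with open("BLAST_left.xml") as result:
    # start=1 so the counter matches the Iteration_iter-num values in the file.
    for i, record in enumerate(NCBIXML.parse(result), start=1):
        for align in record.alignments:
            print("Iteration {}".format(i))
            print(align.hit_id)
            break  # Stop after the first alignment, i.e. the best hit for this query.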
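
For context, the code in the question repeats the same hit because `item` is bound once with `next(records)` and never updated inside the loop, so every pass prints the alignments of the first record. If Biopython is not available, the same "first hit per iteration" idea can be sketched with the standard library alone. The sample XML below is made up and heavily trimmed (it is not the asker's file), `best_hits` is a hypothetical helper name, and the sketch assumes hits are listed best-first inside each `Iteration`, as BLAST XML output normally does:

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed sample following the BLAST XML element names
# (Iteration, Iteration_iter-num, Iteration_hits, Hit, Hit_id, Hit_def).
SAMPLE_XML = """\
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_hits>
        <Hit><Hit_num>1</Hit_num><Hit_id>gi|111|ref|XX_000001.1|</Hit_id><Hit_def>hypothetical protein A</Hit_def></Hit>
        <Hit><Hit_num>2</Hit_num><Hit_id>gi|112|ref|XX_000009.1|</Hit_id><Hit_def>hypothetical protein Z</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_hits>
        <Hit><Hit_num>1</Hit_num><Hit_id>gi|222|ref|XX_000002.1|</Hit_id><Hit_def>hypothetical protein B</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>3</Iteration_iter-num>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def best_hits(root):
    """Yield (iteration number, Hit_id, Hit_def) for the first hit of each iteration."""
    for iteration in root.iter("Iteration"):
        num = int(iteration.findtext("Iteration_iter-num"))
        # find() returns the first matching element in document order, or None.
        hit = iteration.find("Iteration_hits/Hit")
        if hit is None:
            yield num, None, None  # this query produced no hits
        else:
            yield num, hit.findtext("Hit_id"), hit.findtext("Hit_def")


results = list(best_hits(ET.fromstring(SAMPLE_XML)))
for num, hit_id, hit_def in results:
    print("Iteration {}: {} {}".format(num, hit_id, hit_def))
```

For a real file, `ET.parse("BLAST_left.xml").getroot()` would replace `ET.fromstring(SAMPLE_XML)`. The Biopython loop above remains the more convenient route when HSP-level details such as scores or aligned sequences are needed.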

This page, "python - Parsing an XML file containing multi-FASTA BLAST results in Python", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/36644110/
