Python Dict and For Loop with a FASTA File

Tags: python for-loop dictionary fasta

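I have been given a FASTA-formatted file (such as those from this site: http://www.uniprot.org/proteomes/) that contains the protein-coding sequences of a particular bacterium. I have been asked to give a full count and the relative percentage of each single-letter amino acid code in the file, and to return results like this: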

L: 139002 (10.7%) 

A: 123885 (9.6%) 

G: 95475 (7.4%) 

V: 91683 (7.1%) 

I: 77836 (6.0%)

What I have so far:

#!/usr/bin/python
ecoli = open("/home/file_pathway").read()
counts = dict()
for line in ecoli:
    words = line.split()
    for word in words:
        if word in ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]:
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1

for key in counts:
    print key, counts[key]

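I believe this picks up every instance of those capital letters, not just the ones inside the protein amino acid strings. How can I restrict the count to the coding sequences? I am also having trouble writing the part that calculates each code's share of the total.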

Best answer

The only lines that do not contain what you want start with >, so just skip those:

from collections import defaultdict

# the 20 standard one-letter amino acid codes
amino_acids = set("ACDEFGHIKLMNPQRSTVWY")

counts = defaultdict(int)
with open("input.fasta") as ecoli:  # will close your file automatically
    for line in ecoli:  # iterate over the file object, no need to read all contents into memory
        if line.startswith(">"):  # skip header lines, which start with >
            continue
        for char in line:  # just iterate over the characters in the line
            if char in amino_acids:
                counts[char] += 1

total = float(sum(counts.values()))  # float so the division is not truncated on Python 2
for key, val in counts.items():
    print("{}: {} ({:.1%})".format(key, val, val / total))
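As an aside on why the attempt in the question also counts letters from the header lines: `read()` returns the whole file as one string, and looping over a string yields single characters rather than lines. The sketch below illustrates this with a made-up two-record sample (the header text and sequences are hypothetical, not taken from the original file).

```python
from collections import Counter

# Hypothetical FASTA sample, invented for illustration only.
sample = ">sp|P00001|EXAMPLE Some protein\nMKV\nLA\n>sp|P00002|OTHER\nGG\n"

# Looping over a string yields characters, so capital letters in the
# header lines are counted along with the sequence residues.
naive = Counter(ch for ch in sample if ch in "ACDEFGHIKLMNPQRSTVWY")

# Splitting into lines first makes it possible to drop the headers.
filtered = Counter()
for line in sample.splitlines():
    if line.startswith(">"):
        continue
    filtered.update(line.strip())

# "P" occurs only in the headers of this sample, never in a sequence.
print(naive["P"], filtered["P"])
```

Here the naive count reports `P` even though no sequence line contains it, which is the over-counting described in the question.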

You can also use a collections.Counter dict, provided the sequence lines contain only the codes you are interested in:

from collections import Counter

counts = Counter()
with open("input.fasta") as ecoli:  # will close your file automatically
    for line in ecoli:  # iterate over the file object, no need to read all contents into memory
        if line.startswith(">"):  # skip header lines, which start with >
            continue
        counts.update(line.rstrip())  # count every character, minus the trailing newline

total = float(sum(counts.values()))  # float so the division is not truncated on Python 2
for key, val in counts.items():
    print("{}: {} ({:.1%})".format(key, val, val / total))
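The sample output in the question appears to be ordered from the most to the least frequent residue. A minimal sketch of that ordering, assuming it is what is wanted, uses `Counter.most_common()`. The helper names `count_residues` and `format_report` and the in-memory sample lines are hypothetical, introduced here only for illustration.

```python
from collections import Counter

# The 20 standard one-letter amino acid codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def count_residues(lines):
    """Count amino acid codes in an iterable of FASTA lines, skipping headers."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):  # header line, not sequence data
            continue
        counts.update(ch for ch in line.strip() if ch in AMINO_ACIDS)
    return counts


def format_report(counts):
    """Return 'X: count (pct%)' rows, most frequent residue first."""
    total = float(sum(counts.values()))
    return ["{}: {} ({:.1%})".format(aa, n, n / total)
            for aa, n in counts.most_common()]


# Hypothetical sample lines; a real run would pass an open file object instead.
sample = [">header", "LLLA", "LG"]
for row in format_report(count_residues(sample)):
    print(row)
```

Because an open file object is itself an iterable of lines, `count_residues(open_file)` should work the same way on a real FASTA file without loading it into memory.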

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/27003225/
