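I was given a FASTA-formatted file (such as those from this site: http://www.uniprot.org/proteomes/) that contains the protein-coding sequences of a bacterium. I have been asked to give the full count and the relative percentage of each single-letter amino acid code in the file, and to return results like this: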
L: 139002 (10.7%)
A: 123885 (9.6%)
G: 95475 (7.4%)
V: 91683 (7.1%)
I: 77836 (6.0%)
What I have so far:
#!/usr/bin/python
ecoli = open("/home/file_pathway").read()
counts = dict()
for line in ecoli:
    words = line.split()
    for word in words:
        if word in ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]:
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1
for key in counts:
    print key, counts[key]
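I believe this retrieves every instance of those uppercase letters in the file, not just the ones inside the amino acid strings. How can I restrict it to the coding sequences? I am also having trouble writing the part that calculates each code's share of the total.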
Best answer
The only lines that do not contain what you want start with >, so just ignore those:
from collections import defaultdict

counts = defaultdict(int)
with open("input.fasta") as ecoli:  # will close your file automatically
    for line in ecoli:  # iterate over the file object, no need to read all contents into memory
        if line.startswith(">"):  # skip header lines that start with >
            continue
        for char in line:  # just iterate over the characters in the line
            if char in {"A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"}:
                counts[char] += 1
total = float(sum(counts.values()))  # float() keeps the division exact on Python 2 as well
for key, val in counts.items():
    print("{}: {} ({:.1%})".format(key, val, val / total))
You can also use a collections.Counter dict, since the remaining lines contain only what you are interested in:
from collections import Counter

counts = Counter()
with open("input.fasta") as ecoli:  # will close your file automatically
    for line in ecoli:  # iterate over the file object, no need to read all contents into memory
        if line.startswith(">"):  # skip header lines that start with >
            continue
        counts.update(line.rstrip())  # count every character on the line, minus the trailing newline
total = float(sum(counts.values()))  # float() keeps the division exact on Python 2 as well
for key, val in counts.items():
    print("{}: {} ({:.1%})".format(key, val, val / total))
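The expected output in the question is ordered from the most to the least frequent code, which neither loop above guarantees. Below is a minimal sketch of one way to get that ordering, assuming Python 3. The function names count_amino_acids and format_report are hypothetical, and the sample header is made up for illustration. It also filters against the 20 standard codes, so a stray letter such as X on a sequence line would not count toward the total.

```python
from collections import Counter

# The 20 standard single-letter amino acid codes (same set as in the answer).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def count_amino_acids(lines):
    """Count standard amino acid letters in FASTA lines, skipping '>' headers."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):  # header line, not sequence data
            continue
        # Keep only the standard codes; anything else is ignored.
        counts.update(ch for ch in line.strip() if ch in AMINO_ACIDS)
    return counts


def format_report(counts):
    """Return report rows sorted by descending count."""
    total = sum(counts.values())
    # most_common() yields (letter, count) pairs, highest count first.
    return ["{}: {} ({:.1%})".format(aa, n, n / total)
            for aa, n in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical in-memory sample; a real run would pass an open file object.
    sample = [">example header with LETTERS\n", "MLLAG\n", "LAX\n"]
    for row in format_report(count_amino_acids(sample)):
        print(row)
```

Because count_amino_acids accepts any iterable of lines, it can be given an open file directly, for example `with open("input.fasta") as ecoli: counts = count_amino_acids(ecoli)`.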
A similar question about Python dicts, for loops, and FASTA files can be found on Stack Overflow: https://stackoverflow.com/questions/27003225/