I have a sequence like this:
my_file_m= "TCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTA
TCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAA
GATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGG
AGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT"
I want to find the positions and counts of three specific triplets: TAA, TGA, and TAG. If any are present, I want to color them.
I started by loading the letters:
my_file = open(my_file_m)
mine = my_file.read()
print(mine)
I can't use .count or find, because I have three inputs. Any ideas on how to find them and highlight them?
Best answer
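Here is my solution to your problem:
Note: this code also finds overlapping sequences. If you do not want to allow overlaps, remove the '?=' from the pattern.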
import re

class bcolors:
    # ANSI escape codes for terminal text styling
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

my_file_m= '''TTCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTATCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAAGATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGGAGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT'''
# Very important: the lookahead '(?=...)' also finds overlapping matches; remove '?=' if you do not need overlaps.
# Note: AAT is matched here in addition to the three triplets named in the question; drop it to match only TAA, TGA, TAG.
pat = re.compile(r'(?=(TAA|AAT|TGA|TAG))')
matches = re.finditer(pat, my_file_m)
result1 = [match.start(1) for match in matches]  # starting position of every match
result2 = [range(x, x + 3) for x in result1]  # positions of all characters in each match (patterns have length 3; adjust for other lengths)
result3 = set().union(*result2)  # union of all positions to color
for chari in range(len(my_file_m)):  # colorize based on whether the index is part of a match
    if chari in result3:
        print(bcolors.OKGREEN + my_file_m[chari] + bcolors.ENDC, end='')
    else:
        print(my_file_m[chari], end='')
print()  # finish the line
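To see what the '?=' lookahead actually changes, here is a small sketch on a toy string. The string 'ATATA' and the pattern 'ATA' are made up for illustration and are not part of the question's data.

```python
import re

toy = "ATATA"  # hypothetical toy input, chosen so that two matches share a letter

# With a lookahead the match itself consumes no characters, so the engine
# can report a match starting at every index, including overlapping ones.
overlapping = [m.start(1) for m in re.finditer(r"(?=(ATA))", toy)]

# Without the lookahead each match consumes its three letters, so the search
# resumes after them and the overlapping second occurrence is skipped.
non_overlapping = [m.start() for m in re.finditer(r"ATA", toy)]

print(overlapping)      # [0, 2]
print(non_overlapping)  # [0]
```

For TAA, TGA, and TAG alone the two variants give the same result: each triplet starts with T and has no T in its second or third position, so two matches can never share a letter. The lookahead only matters once a pattern such as AAT is added, because TAAT then contains both TAA and AAT.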
A cleaner version:
import re
import sys
my_file_m= '''TAATTCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTATCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAAGATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGGAGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT'''
pat = re.compile(r'(?=(TAA|TGA|TAG))')  # Very important: remove '?=' if you do not need overlapping matches
lettersToColor = set().union(*[range(m.start(1), m.start(1) + 3) for m in re.finditer(pat, my_file_m)])
for chari in range(len(my_file_m)):  # colorize based on whether the index is part of a match
    if chari in lettersToColor:
        sys.stdout.write('\033[92m' + my_file_m[chari] + '\033[0m')
    else:
        sys.stdout.write(my_file_m[chari])
Output: the sequence is printed with every letter that belongs to a matched triplet shown in green.
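The question also asks for positions and counts, which the coloring loop does not report on its own. Here is a minimal sketch of one way to collect them, assuming non-overlapping matching is acceptable (which holds for TAA, TGA, and TAG). The function names find_triplets and highlight and the short demo sequence are hypothetical and not part of the original answer.

```python
import re
from collections import defaultdict

GREEN = "\033[92m"  # same ANSI codes as used in the answer
ENDC = "\033[0m"


def find_triplets(seq, triplets=("TAA", "TGA", "TAG")):
    """Return {triplet: [0-based start positions]} for each triplet found in seq."""
    pat = re.compile("|".join(map(re.escape, triplets)))
    positions = defaultdict(list)
    for m in pat.finditer(seq):
        positions[m.group()].append(m.start())
    return dict(positions)


def highlight(seq, triplets=("TAA", "TGA", "TAG")):
    """Return seq with every matched triplet wrapped in green ANSI codes."""
    pat = re.compile("|".join(map(re.escape, triplets)))
    return pat.sub(lambda m: GREEN + m.group() + ENDC, seq)


demo = "CCTAAGGTGATTTAGTAA"  # hypothetical short sequence for illustration
found = find_triplets(demo)
for triplet, starts in sorted(found.items()):
    print(triplet, "count:", len(starts), "positions:", starts)
print(highlight(demo))
```

Positions are 0-based string indices; add 1 to each if you prefer the 1-based coordinates common in sequence work. Building the highlighted string with a single substitution also avoids the per-character loop.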
Regarding python - How to find three letters in a sequence?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28481407/