python - Determining the length of a polypurine tract

Tags: python skbio

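How can I find the longest polypurine tract in a genome (consecutive As and Gs with no interspersed C or T, or vice versa)? This needs to be done on the E. coli genome. Should I first find all the polypurine tracts and then pick the longest one? Or should the introns and exons be spliced out of the DNA first? The E. coli genome is 4.6 million bp long, so I need some help breaking this problem down.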

Best answer

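I agree that the methodological side of this question is better suited to https://biology.stackexchange.com/ (i.e., whether introns/exons should be removed, etc.), but briefly, that depends entirely on the biological question you are trying to answer. If you care whether these stretches span intron/exon boundaries, then you should not split them first. However, I'm not sure this is relevant to E. coli sequences, since (as far as I know) introns and exons are specific to eukaryotes.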

To address the technical side of the question, here is some code illustrating how to do this with scikit-bio. (I've also posted this as a scikit-bio cookbook recipe.)

from __future__ import print_function
import itertools
from skbio import parse_fasta, NucleotideSequence

# Define our character sets of interest. We'll define the set of purines and pyrimidines here. 

purines = set('AG')
pyrimidines = set('CTU')


# Obtain a single sequence from a fasta file. 

id_, seq = list(parse_fasta(open('data/single_sequence1.fasta')))[0]
n = NucleotideSequence(seq, id=id_)


# Define a ``longest_stretch`` function that takes a ``BiologicalSequence`` object and the characters of interest, and returns the length of the longest contiguous stretch of the characters of interest, as well as the start position of that stretch of characters. (And of course you could compute the end position of that stretch by summing those two values, if you were interested in getting the span.)

def longest_stretch(sequence, characters_of_interest):
    # initialize some values
    current_stretch_length = 0
    max_stretch_length = 0
    current_stretch_start_position = 0
    max_stretch_start_position = -1

    # this recipe was developed while reviewing this SO answer:
    # http://stackoverflow.com/a/1066838/3424666
    for is_stretch_of_interest, group in itertools.groupby(sequence, 
                                                           key=lambda x: x in characters_of_interest):
        current_stretch_length = len(list(group))
        if is_stretch_of_interest:
            if current_stretch_length > max_stretch_length:
                max_stretch_length = current_stretch_length
                max_stretch_start_position = current_stretch_start_position
        # advance to the start of the next group only after this group's
        # start position has been recorded
        current_stretch_start_position += current_stretch_length
    return max_stretch_length, max_stretch_start_position
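
To make the return value concrete, here is a minimal sketch of the same groupby idea applied to a plain Python string, so it runs without scikit-bio. The name `longest_run` and the toy sequence are hypothetical and only for illustration. The start position is recorded before the running offset is advanced, so start + length gives the (exclusive) end of the stretch.

```python
import itertools


def longest_run(sequence, characters_of_interest):
    """Return (length, start) of the longest run of the given characters.

    The start is -1 if no character of interest occurs in the sequence.
    """
    position = 0                      # start index of the current group
    best_length, best_start = 0, -1
    for is_of_interest, group in itertools.groupby(
            sequence, key=lambda c: c in characters_of_interest):
        run_length = sum(1 for _ in group)
        if is_of_interest and run_length > best_length:
            best_length, best_start = run_length, position
        position += run_length        # move to the start of the next group
    return best_length, best_start


# Hypothetical toy sequence, not real genome data.
toy = "CTAGGATTCAGGGAAGC"
length, start = longest_run(toy, set("AG"))
end = start + length                  # exclusive end of the stretch
print(length, start, toy[start:end])
```

Slicing the sequence with the returned start and the computed end is a quick way to check that the reported stretch really consists only of the characters of interest.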


# We can apply this to find the longest stretch of purines...

longest_stretch(n, purines)


# We can apply this to find the longest stretch of pyrimidines...

longest_stretch(n, pyrimidines)


# Or the longest stretch of some other character or characters.

longest_stretch(n, set('N'))


# In this case, we try to find a stretch of a character that doesn't exist in the sequence.

longest_stretch(n, set('X'))
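
As an alternative that does not depend on scikit-bio, the same search can be sketched with the standard library's `re` module, assuming the whole genome has already been read into one Python string (a few megabytes for E. coli). The helper name `longest_match` and the toy sequence are hypothetical, and the `default=` argument to `max` requires Python 3.4 or later. Because a purine run on one strand pairs with a pyrimidine run on the other, running both patterns covers the "or vice versa" case from the question.

```python
import re


def longest_match(sequence, pattern):
    """Return (length, start) of the longest regex match, or (0, -1) if none."""
    best = max(re.finditer(pattern, sequence),
               key=lambda m: m.end() - m.start(),
               default=None)
    if best is None:
        return 0, -1
    return best.end() - best.start(), best.start()


# Hypothetical toy sequence; upper-cased so soft-masked (lowercase) bases match.
genome = "ttCTAGGATTCAGGGAAGCnn".upper()

purine_run = longest_match(genome, r"[AG]+")        # polypurine tract
pyrimidine_run = longest_match(genome, r"[CTU]+")   # polypyrimidine tract
print(purine_run, pyrimidine_run)
```

When several stretches tie for the longest, this sketch reports the first one, which matches the behavior of the strict `>` comparison in `longest_stretch`.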

This question, python - Determining the length of a polypurine tract, comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/25211905/
