python - 如何使用 Python 和 PIL 库从 DNA 序列创建条形码？

我是一个初学者，想使用 pyhton 代码从这个 DNA 序列中制作一个条形码。它应该读取每个 1024 个核苷酸并检查 mers(4 个核苷酸的组合，例如 AAAA、AAAC、AAAG、AAAT ..... TTTT)。每个 mer 在数组中保存一个索引(大小 = 256)，如果它在前 1024 个中找到 AAAA，则将其计数存储在其索引中，依此类推，然后存储到下一个 1024，直到完成整个序列。这将创建一个 2D 数组，该数组将被转换为灰度的 png。

我的问题是它只使用了前 1024 并将其显示在整个 1024X256 图像上。

DNA:https://1drv.ms/f/s!AuXxv7yqjA_FlS_ujYOMUvikWg8E

#read the DNA sequence
fasta_file = open(r'C:path\Escherichia_coli_ATCC_10798.fasta','r')
SE =fasta_file.read()
fasta_file.close()
seq = SE[177:]
dna_sequence = seq.replace("\n","")

# Sample size and mer length
#sample is the window that will go thorugh the whole sequance
sample_size = 1024
mer_length = 4

# Array to store the counts of each mer
barcode = [0] * 256

# Generate all possible 4-mers
mers = []
for i in range(256):
    mer = ""
    for j in range(4):
        mer += "ACGT"[i % 4]
        i //= 4
    mers.append(mer)


# Loop through the sample and count the occurrences of each mer
for w in range(sample_size):
    mer = dna_sequence[w:w+mer_length]
    barcode[mers.index(mer)] += 2

# Print the counts of each mer
    #print(mers[i], ":", barcode[i])
print(barcode)



# image
    # Python program to convert numpy array to image
    # import pillow library
from PIL import Image
import numpy as np

    # define a main function
    # Create the barcode array with the same shape as the desired image
code = np.array(barcode, dtype=np.uint8)

    # Create an Image object from the barcode array
image = Image.fromarray(code)

    # Reshape the image to the desired size (1024x4000)
image = image.resize((1024, 4000))

    # Save the image
image.save('Escherichia_coli.png')

    # Display the image (optional)
image.show()

深色图像就是我得到的另一个是我应该得到的 enter image description here my output

我不知道如何附加 DNA 序列。一些信息:genome_id =“531534bd23a542ae”atcc_catalog_number =“ATCC 10798”物种=“大肠杆菌”

链接到相似的基因组:

https://www.ncbi.nlm.nih.gov/assembly/GCA_028644645.1

最佳答案

我在这里看到两个问题:

my problem is that it took only the first 1024 and displayed it on the entire 1024X256 image.

您需要嵌套的 for 循环来覆盖每个 4 聚体的完整序列。代码中的 for 循环仅覆盖第一个窗口(1024 个核苷酸)。

the dark image is what i got the other one is what i was supposed to get

解决方案是按某个大值缩放计数数组。这会增加亮度。我在这里使用 20 作为示例。

我还建议使用 Biopython's SeqIO用于将 fasta 序列作为 SeqRecord 对象处理的接口(interface)。这是更新后的代码:

from Bio import SeqIO
from PIL import Image
import numpy as np

#read the DNA sequence
record = SeqIO.read(r'C:path\Escherichia_coli_ATCC_10798.fasta', "fasta")

# Sample size and mer length
#sample is the window that will go thorugh the whole sequance
window      = 1024
mer_length  = 4
n_windows   = int(np.ceil(len(record.seq)/window))
   
mers = []
for i in range(256):
    mer = ""
    for j in range(4):
        mer += "ACGT"[i % 4]
        i //= 4
    mers.append(mer)

# Array to store the counts of each mer 
# Each row represents 1024 nucleotides, last row counts are handled by SeqIO 
# even if the remaining window length is less than 1024
# 256 cols (each col represents a 4-mer)
image_array = np.zeros((n_windows, len(mers)), dtype=int)

# Loop through the sample and count the occurrences of each mer
for w in range(0,n_windows-1):
    for i in range(len(mers)):
        image_array[w,i] = 20*record.seq[1024*w:1024*(w+1)].count(mers[i])

# image
# Create an Image object from the image_array
# image = Image.fromarray(image_array)

image = Image.fromarray(image_array.astype(np.uint8))  # modded by me because of TypeError: Cannot handle this data type: (1, 1), <i8


# Reshape the image to the desired size (1024x4000)
image = image.resize((1024, 4000))
# Save the image
image.save('Escherichia_coli.png')
# Display the image (optional)
image.show()

上面的代码应该会给你类似这样的输出:

E. coli barcode image from genome

关于python - 如何使用 Python 和 PIL 库从 DNA 序列创建条形码？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/76317175/

python - 如何使用 Python 和 PIL 库从 DNA 序列创建条形码？

上一篇：r - 如何根据r中数据帧中的现有参数列添加新的正态分布数据列？

下一篇：powershell - 逐行分割字符串并进行变量化，即将其分配给 GITHUB_OUTPUT - 工作流程