我是一个初学者,想使用 pyhton 代码从这个 DNA 序列中制作一个条形码。它应该读取每个 1024 个核苷酸并检查 mers(4 个核苷酸的组合,例如 AAAA、AAAC、AAAG、AAAT ..... TTTT)。每个 mer 在数组中保存一个索引(大小 = 256),如果它在前 1024 个中找到 AAAA,则将其计数存储在其索引中,依此类推,然后存储到下一个 1024,直到完成整个序列。这将创建一个 2D 数组,该数组将被转换为灰度的 png。
我的问题是它只使用了前 1024 并将其显示在整个 1024X256 图像上。
DNA:https://1drv.ms/f/s!AuXxv7yqjA_FlS_ujYOMUvikWg8E
#read the DNA sequence
fasta_file = open(r'C:path\Escherichia_coli_ATCC_10798.fasta','r')
SE =fasta_file.read()
fasta_file.close()
seq = SE[177:]
dna_sequence = seq.replace("\n","")
# Sample size and mer length
#sample is the window that will go thorugh the whole sequance
sample_size = 1024
mer_length = 4
# Array to store the counts of each mer
barcode = [0] * 256
# Generate all possible 4-mers
mers = []
for i in range(256):
mer = ""
for j in range(4):
mer += "ACGT"[i % 4]
i //= 4
mers.append(mer)
# Loop through the sample and count the occurrences of each mer
for w in range(sample_size):
mer = dna_sequence[w:w+mer_length]
barcode[mers.index(mer)] += 2
# Print the counts of each mer
#print(mers[i], ":", barcode[i])
print(barcode)
# image
# Python program to convert numpy array to image
# import pillow library
from PIL import Image
import numpy as np
# define a main function
# Create the barcode array with the same shape as the desired image
code = np.array(barcode, dtype=np.uint8)
# Create an Image object from the barcode array
image = Image.fromarray(code)
# Reshape the image to the desired size (1024x4000)
image = image.resize((1024, 4000))
# Save the image
image.save('Escherichia_coli.png')
# Display the image (optional)
image.show()
深色图像就是我得到的 另一个是我应该得到的 enter image description here my output
我不知道如何附加 DNA 序列。一些信息:genome_id =“531534bd23a542ae”atcc_catalog_number =“ATCC 10798”物种=“大肠杆菌”
链接到相似的基因组:
最佳答案
我在这里看到两个问题:
my problem is that it took only the first 1024 and displayed it on the entire 1024X256 image.
您需要嵌套的 for 循环来覆盖每个 4 聚体的完整序列。代码中的 for 循环仅覆盖第一个窗口(1024 个核苷酸)。
the dark image is what i got the other one is what i was supposed to get
解决方案是按某个大值缩放计数数组。这会增加亮度。我在这里使用 20 作为示例。
我还建议使用 Biopython's SeqIO用于将 fasta 序列作为 SeqRecord
对象处理的接口(interface)。这是更新后的代码:
from Bio import SeqIO
from PIL import Image
import numpy as np
#read the DNA sequence
record = SeqIO.read(r'C:path\Escherichia_coli_ATCC_10798.fasta', "fasta")
# Sample size and mer length
#sample is the window that will go thorugh the whole sequance
window = 1024
mer_length = 4
n_windows = int(np.ceil(len(record.seq)/window))
mers = []
for i in range(256):
mer = ""
for j in range(4):
mer += "ACGT"[i % 4]
i //= 4
mers.append(mer)
# Array to store the counts of each mer
# Each row represents 1024 nucleotides, last row counts are handled by SeqIO
# even if the remaining window length is less than 1024
# 256 cols (each col represents a 4-mer)
image_array = np.zeros((n_windows, len(mers)), dtype=int)
# Loop through the sample and count the occurrences of each mer
for w in range(0,n_windows-1):
for i in range(len(mers)):
image_array[w,i] = 20*record.seq[1024*w:1024*(w+1)].count(mers[i])
# image
# Create an Image object from the image_array
# image = Image.fromarray(image_array)
image = Image.fromarray(image_array.astype(np.uint8)) # modded by me because of TypeError: Cannot handle this data type: (1, 1), <i8
# Reshape the image to the desired size (1024x4000)
image = image.resize((1024, 4000))
# Save the image
image.save('Escherichia_coli.png')
# Display the image (optional)
image.show()
上面的代码应该会给你类似这样的输出:
关于python - 如何使用 Python 和 PIL 库从 DNA 序列创建条形码?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/76317175/