python - multiprocessing, multiple processes reading the same file

Tags: python multiprocessing biopython pysam

I am trying to simulate some DNA sequencing reads, and in order to speed up the code I need to run it in parallel.

Basically, what I am trying to do is the following: I am sampling reads from the human genome, and I think that when two of the processes from the multiprocessing module try to get data from the same file (the human genome) at the same time, the data gets corrupted and I cannot get the DNA sequence I need. I have tried different approaches, but I am very new to parallel programming and cannot solve the problem.

When I run the script with a single core, it works fine.

This is how I call the function:

if __name__ == '__main__':
    jobs = []
    # init the processes
    for i in range(number_of_cores):
        length = 100
        lock = mp.Manager().Lock()
        p = mp.Process(target=simulations.sim_reads,
                       args=(lock, FastaFile, "/home/inigo/msc_thesis/genome_data/hg38.fa",
                             length, paired, results_dir, spawn_reads[i], temp_file_names[i]))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()

This is the function I use to get the read data; each process writes its data to a different file.

def sim_single_end(lc, fastafile, chr, chr_pos_start, chr_pos_end, read_length, unique_id):

    lc.acquire()
    left_split_read = fastafile.fetch(chr, chr_pos_end - (read_length / 2), chr_pos_end)
    right_split_read = fastafile.fetch(chr, chr_pos_start, chr_pos_start + (read_length / 2))
    reversed_left_split_read = left_split_read[::-1]
    total_read = reversed_left_split_read + right_split_read
    seq_id = "id:%s-%s|left_pos:%s-%s|right:%s-%s " % (unique_id, chr, int(chr_pos_end - (read_length / 2)), int(chr_pos_end), int(chr_pos_start), int(chr_pos_start + (read_length / 2)))
    quality = "I" * read_length
    fastq_string = "@%s\n%s\n+\n%s\n" % (seq_id, total_read, quality)
    lc.release()
    new_record = SeqIO.read(StringIO(fastq_string), "fastq")
    return new_record
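As an aside: the error in the traceback below reports lengths 0 and 100, which means `fetch()` returned an empty sequence while the quality string was still built from `read_length`. A minimal sketch (with a hypothetical helper name, `build_fastq`, not in the code above) that fails fast on an empty fetch instead of letting Biopython raise later:

```python
def build_fastq(seq_id, total_read, read_length):
    """Build one FASTQ record, failing fast if the fetched sequence
    is shorter than expected (hypothetical helper for illustration)."""
    if len(total_read) != read_length:
        # An empty string here is what later surfaces as Biopython's
        # "Lengths of sequence and quality values differs" ValueError.
        raise ValueError("fetched %d bases, expected %d"
                         % (len(total_read), read_length))
    quality = "I" * read_length
    return "@%s\n%s\n+\n%s\n" % (seq_id, total_read, quality)

record = build_fastq("demo", "ACGT", 4)
```

Checking the fetched length at the source makes the real failure point visible instead of a confusing parser error several frames away.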

Here is the traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/inigo/Dropbox/PycharmProjects/circ_dna/simulations.py", line 107, in sim_ecc_reads
    new_read = sim_single_end(lc, fastafile, chr, chr_pos_start, chr_pos_end, read_length, read_id)
  File "/home/inigo/Dropbox/PycharmProjects/circ_dna/simulations.py", line 132, in sim_single_end
    new_record = SeqIO.read(StringIO(fastq_string), "fastq")
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/__init__.py", line 664, in read
    first = next(iterator)
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/__init__.py", line 600, in parse
    for r in i:
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/QualityIO.py", line 1031, in FastqPhredIterator
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/QualityIO.py", line 951, in FastqGeneralIterator
    % (title_line, seq_len, len(quality_string)))
ValueError: Lengths of sequence and quality values differs for id:6-chr1_KI270707v1_random|left_pos:50511537-50511587|right:50511214-50511264 (0 and 100).

Best Answer

I am the OP of this question, which I asked about a year ago. The problem was that the package I was using to read the human genome file (pysam) was failing; the cause was a typo when calling multiprocessing.

From the pysam author's reply, this should work:

 p = mp.Process(target=get_fasta, args=(genome_fa,))

Note the "," to make sure you pass a tuple.
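The missing comma is easy to overlook. A quick sketch (with a hypothetical `get_fasta` and path, matching the answer's example) of why it matters: `Process` calls `target(*args)`, so without the comma a string argument gets unpacked character by character.

```python
def get_fasta(path):
    # Stand-in for a worker function taking one argument.
    return path

genome_fa = "hg38.fa"

args_wrong = (genome_fa)   # parentheses only group: this is still a str
args_right = (genome_fa,)  # trailing comma makes a 1-tuple

# multiprocessing.Process does target(*args), so the wrong form
# unpacks the string into 7 single-character arguments:
try:
    get_fasta(*args_wrong)
    unpack_failed = False
except TypeError:
    unpack_failed = True

result = get_fasta(*args_right)
```

Here `get_fasta(*args_wrong)` raises `TypeError` because the function receives seven positional arguments instead of one.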

See https://github.com/pysam-developers/pysam/issues/409 for more details.
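The linked issue also suggests a more robust pattern: rather than passing one open `FastaFile` object to every `Process`, let each worker open the file itself. A sketch of that pattern using a plain text file as a stand-in for the genome FASTA (the function name `fetch_region` is hypothetical; with pysam the open call inside the worker would be `pysam.FastaFile(path)`):

```python
import multiprocessing as mp
import os
import tempfile

def fetch_region(path, start, length):
    # Open the file *inside* the worker: each process gets its own
    # handle instead of sharing one inherited from the parent.
    with open(path) as fh:
        fh.seek(start)
        return fh.read(length)

# A plain text file as a stand-in for the genome FASTA.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as fh:
    fh.write("ACGTACGTACGT")

with mp.Pool(2) as pool:
    # Each worker opens its own handle and reads its own region.
    chunks = pool.starmap(fetch_region, [(path, 0, 4), (path, 4, 4)])

os.remove(path)
```

Because each process owns its file handle, no file position or internal buffer is shared, so concurrent fetches cannot corrupt each other.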

Regarding "python - multiprocessing, multiple processes reading the same file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42355202/
