bioinformatics - Nextflow:如何处理多个样本

标签 bioinformatics nextflow

我有几个用于几个示例的 fq.gz 文件。我正在尝试使用 nextflow 一次处理所有样本。但不知何故,我无法一次处理所有样本。但我可以一次处理一个样本。这是数据结构和我处理单个样本的代码。

enter image description here

我的下一个流程代码

params.sampleName="sample1"
params.fastq_path = "data/${params.sampleName}/*{1,2}.fq.gz"

fastq_files = Channel.fromFilePairs(params.fastq_path)

params.ref = "ab.fa"
ref = file(params.ref)

process foo {
    input:
    set pairId, file(reads) from fastq_files

    output:

    file("${pairId}.bam") into bamFiles_ch

    script:
    """
    echo ${reads[0].toRealPath().getParent().baseName}
    bwa-mem2 mem -t 8 ${ref} ${reads[0].toRealPath()} ${reads[1].toRealPath()} | samtools sort -@8 -o ${pairId}.bam
    samtools index -@8 ${pairId}.bam
    """
}

process samToolsMerge {
    publishDir "./aligned_minimap/", mode: 'copy', overwrite: 'false'

    input:
    file bamFile from bamFiles_ch.collect()

    output:
    file("**")

    script:
    """
    samtools merge ${params.sampleName}.bam ${bamFile}
    samtools index -@ 8 ${params.sampleName}.bam
    """
}

所以需要帮助来解决。提前致谢。

最佳答案

看起来您已经使用以下方法构建了设置目标示例名称的方法:

params.sampleName="sample1"
params.fastq_path = "data/${params.sampleName}/*{1,2}.fq.gz"

要使全局模式匹配所有个样本,您可以简单地在命令行上设置通配符,使用:

nextflow run main.nf --sampleName '*'

请注意上面的引号。如果忽略这些,则 glob 星号将在传递给 Nextflow 命令之前由您的 shell 展开。



简短的回答是,您需要一些简单的方法来从父目录中提取示例名称。然后,您需要某种方法按样本名称对坐标排序的 BAM 进行分组。下面,我使用了新的 Nextflow DSL 2但这并不是绝对必要的。我只是发现新的 DSL 2 代码更容易阅读和调试。下面只是一个示例,您需要对其进行调整以适合您的确切用例,但也就是说,它应该做非常相似的事情。它使用特殊的groupKey这样我们就可以在调用 groupTuple 之前动态指定每个元组中的预期元素数量。运算符(operator)。这使我们能够尽快传输收集到的值,以便每个样本在所有读取组对齐后都可以“合并”。如果没有这个,所有输入读取组都需要在合并开始之前完成对齐。

nextflow.config的内容:

process {

  shell = [ '/bin/bash', '-euo', 'pipefail' ]
}

main.nf的内容:

nextflow.enable.dsl=2

params.ref_fasta = "GRCh38.primary_assembly.genome.chr22.fa.gz"
params.fastq_files = "data/*/*.read{1,2}.fastq.gz"


process bwa_index {

    conda 'bwa-mem2'

    input:
    path fasta

    output:
    path "${fasta}.{0123,amb,ann,bwt.2bit.64,pac}"

    """
    bwa-mem2 index "${fasta}"
    """
}


process bwa_mem2 {

    tag { [sample, readgroup].join(':') }

    conda 'bwa-mem2 samtools'

    input:
    tuple val(sample), val(readgroup), path(reads)
    path bwa_index

    output:
    tuple val(sample), val(readgroup), path("${readgroup}.bam{,.bai}")

    script:
    def idxbase = bwa_index.first().baseName
    def out_files = [ "${readgroup}.bam", "${readgroup}.bam.bai" ].join('##idx##')
    def (r1, r2) = reads

    """
    bwa-mem2 mem \\
        -R '@RG\\tID:${readgroup}\\tSM:${sample}' \\
        -t ${task.cpus} \\
        "${idxbase}" \\
        "${r1}" \\
        "${r2}" |
    samtools sort \\
        --write-index \\
        -@ ${task.cpus} \\
        -o "${out_files}"
    """
}


process samtools_merge {

    tag { sample }

    conda 'samtools'

    input:
    tuple val(sample), path(indexed_bam_files)

    output:
    tuple val(sample), path("${sample}.bam{,.bai}")

    script:
    def out_files = [ "${sample}.bam", "${sample}.bam.bai" ].join('##idx##')
    def input_bam_files = indexed_bam_files
        .findAll { it.name.endsWith('.bam') }
        .collect { /"${it}"/ }
        .join(' \\\n'+' '*8)

    """
    samtools merge \\
        --write-index \\
        -o "${out_files}" \\
        ${input_bam_files}
    """
}


workflow {

    ref_fasta = file( params.ref_fasta )
    bwa_index( ref_fasta )

    Channel.fromFilePairs( params.fastq_files ) \
        | map { readgroup, reads ->
            def (sample_name) = reads*.parent.baseName as Set

            tuple( sample_name, readgroup, reads )
        } \
        | groupTuple() \
        | map { sample, readgroups, reads ->
            tuple( groupKey(sample, readgroups.size()), readgroups, reads )
        } \
        | transpose() \
        | set { sample_readgroups }

    bwa_mem2( sample_readgroups, bwa_index.out )

    sample_readgroups \
        | join( bwa_mem2.out, by: [0,1] ) \
        | map { sample_key, readgroup, reads, indexed_bam ->
            tuple( sample_key, indexed_bam )
        } \
        | groupTuple() \
        | map { sample_key, indexed_bam_files ->
            tuple( sample_key.toString(), indexed_bam_files.flatten() )
        } \
        | samtools_merge
}

运行方式:

nextflow run -ansi-log false main.nf

结果:

N E X T F L O W  ~  version 21.04.3
Launching `main.nf` [zen_gautier] - revision: dcde9efc8a
Creating Conda env: bwa-mem2 [cache /home/steve/working/stackoverflow/69702077/work/conda/env-8cc153b2eb20a5374bf435019a61c21a]
[63/73c96b] Submitted process > bwa_index
Creating Conda env: bwa-mem2 samtools [cache /home/steve/working/stackoverflow/69702077/work/conda/env-5c358e413a5318c53a45382790eecbd4]
[52/6a92d3] Submitted process > bwa_mem2 (HBR:HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22)
[8b/535b21] Submitted process > bwa_mem2 (UHR:UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[dc/03d949] Submitted process > bwa_mem2 (UHR:UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[e4/bfd08b] Submitted process > bwa_mem2 (HBR:HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22)
[d5/e2aa27] Submitted process > bwa_mem2 (UHR:UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[c2/23ce8a] Submitted process > bwa_mem2 (HBR:HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22)
Creating Conda env: samtools [cache /home/steve/working/stackoverflow/69702077/work/conda/env-912cee20caec78e112a5718bb0c00e1c]
[28/006c03] Submitted process > samtools_merge (HBR)
[3b/51311c] Submitted process > samtools_merge (UHR)

关于bioinformatics - Nextflow:如何处理多个样本,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/69702077/

相关文章:

docker - 为什么 docker hub 上的某些存储库没有 dockerfile?

python-3.x - nextflow 没有找到我所有的 python 模块

bioinformatics - 使用 Nextflow 时出现 "{"的语法冲突

docker - Nextflow配置文件问题

bash - 如何制作 wget 'alias' ?

Bash:尝试在函数输出中附加变量名

perl - 生成具有取代率的合成 DNA 序列

linux conda 安装 log4j-2.10.0 包问题