我有几个用于几个示例的 fq.gz
文件。我正在尝试使用 nextflow
一次处理所有样本。但不知何故,我无法一次处理所有样本。但我可以一次处理一个样本。这是数据结构和我处理单个样本的代码。
我的下一个流程代码
params.sampleName="sample1"
params.fastq_path = "data/${params.sampleName}/*{1,2}.fq.gz"
fastq_files = Channel.fromFilePairs(params.fastq_path)
params.ref = "ab.fa"
ref = file(params.ref)
process foo {
input:
set pairId, file(reads) from fastq_files
output:
file("${pairId}.bam") into bamFiles_ch
script:
"""
echo ${reads[0].toRealPath().getParent().baseName}
bwa-mem2 mem -t 8 ${ref} ${reads[0].toRealPath()} ${reads[1].toRealPath()} | samtools sort -@8 -o ${pairId}.bam
samtools index -@8 ${pairId}.bam
"""
}
process samToolsMerge {
publishDir "./aligned_minimap/", mode: 'copy', overwrite: 'false'
input:
file bamFile from bamFiles_ch.collect()
output:
file("**")
script:
"""
samtools merge ${params.sampleName}.bam ${bamFile}
samtools index -@ 8 ${params.sampleName}.bam
"""
}
所以需要帮助来解决。提前致谢。
最佳答案
看起来您已经使用以下方法构建了设置目标示例名称的方法:
params.sampleName="sample1"
params.fastq_path = "data/${params.sampleName}/*{1,2}.fq.gz"
要使全局模式匹配所有个样本,您可以简单地在命令行上设置通配符,使用:
nextflow run main.nf --sampleName '*'
请注意上面的引号。如果忽略这些,则 glob 星号将在传递给 Nextflow 命令之前由您的 shell 展开。
简短的回答是,您需要一些简单的方法来从父目录中提取示例名称。然后,您需要某种方法按样本名称对坐标排序的 BAM 进行分组。下面,我使用了新的 Nextflow DSL 2但这并不是绝对必要的。我只是发现新的 DSL 2 代码更容易阅读和调试。下面只是一个示例,您需要对其进行调整以适合您的确切用例,但也就是说,它应该做非常相似的事情。它使用特殊的groupKey这样我们就可以在调用 groupTuple 之前动态指定每个元组中的预期元素数量。运算符(operator)。这使我们能够尽快传输收集到的值,以便每个样本在所有读取组对齐后都可以“合并”。如果没有这个,所有输入读取组都需要在合并开始之前完成对齐。
nextflow.config
的内容:
process {
shell = [ '/bin/bash', '-euo', 'pipefail' ]
}
main.nf
的内容:
nextflow.enable.dsl=2
params.ref_fasta = "GRCh38.primary_assembly.genome.chr22.fa.gz"
params.fastq_files = "data/*/*.read{1,2}.fastq.gz"
process bwa_index {
conda 'bwa-mem2'
input:
path fasta
output:
path "${fasta}.{0123,amb,ann,bwt.2bit.64,pac}"
"""
bwa-mem2 index "${fasta}"
"""
}
process bwa_mem2 {
tag { [sample, readgroup].join(':') }
conda 'bwa-mem2 samtools'
input:
tuple val(sample), val(readgroup), path(reads)
path bwa_index
output:
tuple val(sample), val(readgroup), path("${readgroup}.bam{,.bai}")
script:
def idxbase = bwa_index.first().baseName
def out_files = [ "${readgroup}.bam", "${readgroup}.bam.bai" ].join('##idx##')
def (r1, r2) = reads
"""
bwa-mem2 mem \\
-R '@RG\\tID:${readgroup}\\tSM:${sample}' \\
-t ${task.cpus} \\
"${idxbase}" \\
"${r1}" \\
"${r2}" |
samtools sort \\
--write-index \\
-@ ${task.cpus} \\
-o "${out_files}"
"""
}
process samtools_merge {
tag { sample }
conda 'samtools'
input:
tuple val(sample), path(indexed_bam_files)
output:
tuple val(sample), path("${sample}.bam{,.bai}")
script:
def out_files = [ "${sample}.bam", "${sample}.bam.bai" ].join('##idx##')
def input_bam_files = indexed_bam_files
.findAll { it.name.endsWith('.bam') }
.collect { /"${it}"/ }
.join(' \\\n'+' '*8)
"""
samtools merge \\
--write-index \\
-o "${out_files}" \\
${input_bam_files}
"""
}
workflow {
ref_fasta = file( params.ref_fasta )
bwa_index( ref_fasta )
Channel.fromFilePairs( params.fastq_files ) \
| map { readgroup, reads ->
def (sample_name) = reads*.parent.baseName as Set
tuple( sample_name, readgroup, reads )
} \
| groupTuple() \
| map { sample, readgroups, reads ->
tuple( groupKey(sample, readgroups.size()), readgroups, reads )
} \
| transpose() \
| set { sample_readgroups }
bwa_mem2( sample_readgroups, bwa_index.out )
sample_readgroups \
| join( bwa_mem2.out, by: [0,1] ) \
| map { sample_key, readgroup, reads, indexed_bam ->
tuple( sample_key, indexed_bam )
} \
| groupTuple() \
| map { sample_key, indexed_bam_files ->
tuple( sample_key.toString(), indexed_bam_files.flatten() )
} \
| samtools_merge
}
运行方式:
nextflow run -ansi-log false main.nf
结果:
N E X T F L O W ~ version 21.04.3
Launching `main.nf` [zen_gautier] - revision: dcde9efc8a
Creating Conda env: bwa-mem2 [cache /home/steve/working/stackoverflow/69702077/work/conda/env-8cc153b2eb20a5374bf435019a61c21a]
[63/73c96b] Submitted process > bwa_index
Creating Conda env: bwa-mem2 samtools [cache /home/steve/working/stackoverflow/69702077/work/conda/env-5c358e413a5318c53a45382790eecbd4]
[52/6a92d3] Submitted process > bwa_mem2 (HBR:HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22)
[8b/535b21] Submitted process > bwa_mem2 (UHR:UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[dc/03d949] Submitted process > bwa_mem2 (UHR:UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[e4/bfd08b] Submitted process > bwa_mem2 (HBR:HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22)
[d5/e2aa27] Submitted process > bwa_mem2 (UHR:UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22)
[c2/23ce8a] Submitted process > bwa_mem2 (HBR:HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22)
Creating Conda env: samtools [cache /home/steve/working/stackoverflow/69702077/work/conda/env-912cee20caec78e112a5718bb0c00e1c]
[28/006c03] Submitted process > samtools_merge (HBR)
[3b/51311c] Submitted process > samtools_merge (UHR)
关于bioinformatics - Nextflow:如何处理多个样本,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/69702077/