python - 手动创建的snakemake通配符未使用/识别

标签 python wildcard snakemake wildcard-expansion

我一直在尝试通过导入制表符分隔的文件来手动创建 Snakemake 通配符,如下所示:

dataset sample species frr

PRJNA493818_GSE120639_SRP162872 SRR7942395_GSM3406786_sAML_Control_1 Homo_sapiens 1 PRJNA493818_GSE120639_SRP162872 SRR7942395_GSM3406786_sAML_Control_1 Homo_sapiens 2 PRJNA362883_GSE93946_SRP097621 SRR5195524_GSM2465521_KrasT_45649_NoDox Mus_musculus 1 PRJNA362883_GSE93946_SRP097621 SRR5195524_GSM2465521_KrasT_45649_NoDox Mus_musculus 2

这就是我的 Snakemake 文件的样子(最小示例):

import pandas as pd
import os

# --- Importing Configuration Files --- #
configfile: "/DATA/config/config.yaml"

table_cols = ['dataset','sample','species','frr']
table_samples = pd.read_table('/DATA/config/samples.tsv', header=0, sep='\t', names=table_cols)
DATASET = table_samples.dataset.values.tolist()
SAMPLE = table_samples['sample'].values.tolist()
SPECIES = table_samples.species.values.tolist()
FRR = table_samples.frr.values.tolist()
print(DATASET,SAMPLE,SPECIES,FRR)

rule all:
        input:
                expand(config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.html", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, frr=FRR)

## fastq files quality control
rule rawFastqc:
        input:
                rawread=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_{frr}.fastq.gz"
        output:
                zip=config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.zip",
                html=config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.html"
        threads:
                12
        params:
                path=config["project_path"]+"results/{dataset}/rawQC/"
        conda:
                "envs/bulkRNAseq.yaml"
        shell:
                """
                fastqc {input.rawread} --threads {threads} -o {params.path}
                """

当我运行时:

snakemake -s test --use-conda -n -p

这是输出:

['PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872', 'PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621'] ['SRR7942395_GSM3406786_sAML_Control_1', 'SRR7942395_GSM3406786_sAML_Control_1', 'SRR5195524_GSM2465521_KrasT_45649_NoDox', 'SRR5195524_GSM2465521_KrasT_45649_NoDox'] ['Homo_sapiens', 'Homo_sapiens', 'Mus_musculus', 'Mus_musculus'] [1, 2, 1, 2]
Building DAG of jobs...
Job counts:
    count   jobs
    1   all
    4   rawFastqc
    5

[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
    input: /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz
    output: /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1_fastqc.zip, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1_fastqc.html
    jobid: 3
    wildcards: dataset=PRJNA362883_GSE93946_SRP097621, sample=SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus, species=musculus, frr=1
    threads: 12


        fastqc /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz --threads 12 -o /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/
        

[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
    input: /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1.fastq.gz
    output: /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1_fastqc.zip, /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1_fastqc.html
    jobid: 1
    wildcards: dataset=PRJNA493818_GSE120639_SRP162872, sample=SRR7942395_GSM3406786_sAML_Control_1_Homo, species=sapiens, frr=1
    threads: 12


        fastqc /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1.fastq.gz --threads 12 -o /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/
        

[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
    input: /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz
    output: /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2_fastqc.zip, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2_fastqc.html
    jobid: 4
    wildcards: dataset=PRJNA362883_GSE93946_SRP097621, sample=SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus, species=musculus, frr=2
    threads: 12


        fastqc /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz --threads 12 -o /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/
        

[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
    input: /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2.fastq.gz
    output: /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2_fastqc.zip, /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2_fastqc.html
    jobid: 2
    wildcards: dataset=PRJNA493818_GSE120639_SRP162872, sample=SRR7942395_GSM3406786_sAML_Control_1_Homo, species=sapiens, frr=2
    threads: 12


        fastqc /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2.fastq.gz --threads 12 -o /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/
        

[Thu Aug 11 00:57:30 2022]
localrule all:
    input: /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1_fastqc.html, /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2_fastqc.html, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1_fastqc.html, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2_fastqc.html
    jobid: 0

Job counts:
    count   jobs
    1   all
    4   rawFastqc
    5
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

很明显,print(DATASET,SAMPLE,SPECIES,FRR) 生成了我想要的通配符值:

['PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872', 'PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621'] ['SRR7942395_GSM3406786_sAML_Control_1', 'SRR7942395_GSM3406786_sAML_Control_1', 'SRR5195524_GSM2465521_KrasT_45649_NoDox', 'SRR5195524_GSM2465521_KrasT_45649_NoDox'] ['Homo_sapiens', 'Homo_sapiens', 'Mus_musculus', 'Mus_musculus'] [1, 2, 1, 2]

然而,尽管我没有使用 glob_wildcards,但随后 Snakemake 并没有考虑这些并产生错误的通配符值。

我显然错过了一些东西,但我不知道我做错了什么。我还研究了以下帖子:Manually create snakemake wildcards

非常感谢您的帮助!

最佳答案

这是试图解释为什么snakemake 生成与用户值不匹配的通配符值。考虑这个小例子。

我们要创建文件a_b.A_B.txt这个 Snakefile 将完成这项工作:

FOO = ['a_b']
BAR = ['A_B']

rule all:
    input:
        expand('{foo}.{bar}.txt', zip, foo=FOO, bar=BAR),

rule one:
    output:
        '{x}_{y}.txt',
    shell:
        r"""
        touch {output}
        """

尝试使用 snakemake -p -j 1 -n :

Building DAG of jobs...
...
[Thu Aug 11 09:43:30 2022]
rule one:
    output: a_b.A_B.txt
    jobid: 1
    reason: Missing output files: a_b.A_B.txt
    wildcards: x=a_b.A, y=B
    resources: tmpdir=/tmp


        touch a_b.A_B.txt
...

它有效,但发生了一些违反直觉的事情:

  • 规则all中的通配符名称,{foo}{bar} ,不匹配规则one中的那些,{x}{y} 。 Snakemake 并不关心它,它只是看到可以取任何值的通配符。

  • 规则all要求文件 {wildcard1}.{wildcard2} (注意点 . )但规则 one 输出 {wildcard1}_{wildcard2} (注意下划线 _ )。似乎存在不匹配,脚本应该失败。相反,它之所以有效,是因为...

  • Snakemake 可以自由地查找将输入与输出相匹配的正则表达式值。在本例中,分配通配符:x=a_b.Ay=B将匹配请求的 a_b.A_B.txt 。 (在我看来这是违反直觉的)。

使用约束重写此蛇文件会导致不太令人惊讶的行为:

FOO = ['a_b']
BAR = ['A_B']

wildcard_constraints:
    x='|'.join([re.escape(x) for x in FOO]),
    y='|'.join([re.escape(x) for x in BAR]),

rule all:
    input:
        expand('{foo}.{bar}.txt', zip, foo=FOO, bar=BAR),

rule one:
    output:
        '{x}_{y}.txt',
    shell:
        r"""
        touch {output}
        """

失败:

snakemake -p -j 1 -n

Building DAG of jobs...
MissingInputException in line 8 of /home/dario/Downloads/Snakefile:
Missing input files for rule all:
    affected files:
        a_b.A_B.txt

此版本可以正常工作并使用预期的通配符值:

FOO = ['a_b']
BAR = ['A_B']

wildcard_constraints:
    x='|'.join([re.escape(x) for x in FOO]),
    y='|'.join([re.escape(x) for x in BAR]),

rule all:
    input:
        expand('{foo}.{bar}.txt', zip, foo=FOO, bar=BAR),

rule one:
    output:
        '{x}.{y}.txt', # Use dot, not underscore
    shell:
        r"""
        touch {output}
        """
snakemake -p -j 1 -n
...
rule one:
    output: a_b.A_B.txt
    jobid: 1
    reason: Missing output files: a_b.A_B.txt
    wildcards: x=a_b, y=A_B
    resources: tmpdir=/tmp


        touch a_b.A_B.txt
...

请注意,我们得到了预期的通配符值 x=a_by=A_B 。我还会使用一致的通配符命名(foo、bar 或 x、y),除非有充分的理由不这样做。

命令x = '|'.join([re.escape(x) for x in FOO])告诉snakemake通配符 x只能采用与正则表达式 '|'.join([re.escape(x) for x in FOO]) 匹配的值( | 是管道符号,而不是 I )。所以这个值a_b.A将不匹配并且脚本将失败。 escape是为了确保像 . 这样的特殊字符和*不被解释为正则表达式。


这是我的理解 - 我认为最好有一个专门的文档...

关于python - 手动创建的snakemake通配符未使用/识别,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/73313592/

相关文章:

python - snakemake 临时目录

python - Python Struct.Unpack 中的对齐/打包

java - 带菱形运算符的通配符

snakemake --use-conda 与一个已经创建的环境

python - 通过文件名通配符打开文件

java - 如何将通用子类型列表转换为特定子类型列表?

snakemake - 在snakemake中使用awk时出现NameError

python - 使用空字段和空白字段创建 Django 表单

python - Scrapy - 不会爬行

python - RestructuredText 中的文字 "*"