Running MPRAsnakeflow on an HPC cluster

Snakemake makes it possible to run MPRAsnakeflow in a cluster environment. Please check the Snakemake documentation for more information on how to set up a cluster environment. We use Snakemake resources to set the main resources per rule. Most resources are generic and can be used on multiple clusters, in other environments, or even locally. We provide a predefined workflow profile with resources in config.yaml:

---
configfile: config/example_config.yaml
software-deployment-method: conda
default-resources:
  slurm_partition: debug
  mem: 2G
  runtime: 60
# error: "logs/%x_%j_%N.err"
# output: "logs/%x_%j_%N.log"
##################
### ASSIGNMENT ###
##################
set-threads:
  assignment_mapping_bwa: 30
  assignment_mapping_bbmap: 30
  assignment_collect: 30
  assignment_collectBCs: 20
  assignment_merge: 10
set-resources:
  assigned_counts_combined_replicates_barcode_output:
    runtime: 60
  assignment_check_design:
    runtime: 240
    slurm_partition: medium
  assignment_hybridFWRead_get_reads_by_length:
    runtime: 1140
    mem: 2G
    slurm_partition: medium
  assignment_hybridFWRead_get_reads_by_cutadapt:
    runtime: 1200

We have used this workflow successfully in a SLURM environment with the SLURM executor plugin for Snakemake. For that reason the partition is set via slurm_partition; rename or remove it to fit your own SLURM configuration.
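One way to adapt the shipped profile is to rewrite or strip its slurm_partition entries with sed. This is only a sketch: the helper names set_partition and drop_partition and the partition name general are hypothetical, and the first helper maps every entry to a single partition, which may be too coarse if your cluster separates short and long queues.

```shell
# Replace the value of every slurm_partition entry (reads stdin, writes stdout).
# $1 is the partition name to use; it must not contain "/" or "&".
set_partition() {
  sed "s/^\([[:space:]]*slurm_partition:\).*/\1 $1/"
}

# Remove every slurm_partition entry so that the cluster default applies.
drop_partition() {
  sed '/^[[:space:]]*slurm_partition:/d'
}

# Demo on a two-line excerpt of the profile shown above.
printf 'default-resources:\n  slurm_partition: debug\n' | set_partition general
```

Assuming the profile lives in profiles/default/config.yaml, a copy for your own cluster could be produced with set_partition general < profiles/default/config.yaml > profiles/mycluster/config.yaml after creating the (hypothetical) target directory profiles/mycluster.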

Running with resources

The following command runs the workflow with 30 cores and 10 GB of memory:

snakemake --sdm conda --configfile config/config.yaml -c 30 --resources mem_mb=10000  --workflow-profile profiles/default

Performance tweaks: Running specific rules with different resources

Some of the rules will benefit from multithreading or more memory. This can be specified in your profile, in the workflow profile, or on the command line using --set-resources RULE_NAME:RESOURCE_NAME=VALUE or --set-threads RULE_NAME=VALUE. Before changing resources, make sure the rule is actually part of your run by doing a dry run that lists only the executed rules: snakemake -n --quiet rules.
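As a minimal sketch of the command-line form, the snippet below assembles both kinds of override for a single rule and prints the resulting command instead of running it. The rule and values follow the bwa recommendation given further down; the config and profile paths are the ones used elsewhere on this page and are assumptions about your setup.

```shell
# Per-rule overrides: threads use RULE=VALUE, resources use RULE:RESOURCE=VALUE.
RULE="assignment_mapping_bwa"
OVERRIDES="--set-threads ${RULE}=30 --set-resources ${RULE}:mem_mb=10000"

# Base command as used elsewhere on this page (adjust the paths to your setup).
CMD="snakemake --sdm conda --configfile config/config.yaml -c 30 --workflow-profile profiles/default ${OVERRIDES}"

# Print the command for inspection; once it looks right, run it with: eval "$CMD"
echo "$CMD"
```

Printing first makes it easy to spot a mistyped rule name or a wrong separator before any job is started.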

Rules that may be worth tweaking:

Assignment:
assignment_hybridFWRead_get_reads_by_cutadapt:

Only needed when using the linker option in the config. You can add more threads using --set-threads assignment_hybridFWRead_get_reads_by_cutadapt=4. The default is 1 thread.

assignment_mapping_bbmap:

Only needed when using bbmap for mapping. Memory and threads can be optimized, e.g. via --set-threads assignment_mapping_bbmap=30 --set-resources assignment_mapping_bbmap:mem_mb=10000. The default is 1 thread and 4 GB of memory, but we recommend 30 threads and 10 GB if available.

assignment_mapping_bwa:

Only needed when using bwa for mapping. Memory and threads can be optimized, e.g. via --set-threads assignment_mapping_bwa=30 --set-resources assignment_mapping_bwa:mem_mb=10000. The default is 1 thread, but we recommend 30 threads and 10 GB of memory if available.

assignment_collectBCs:

Threads can be optimized, e.g. via --set-threads assignment_collectBCs=30. The default is 1 thread, but we recommend 30 threads if available.

Experiment:
counts_onlyFW_raw_counts_by_cutadapt:

Only needed when you have only FW reads and use the adapter option. Threads can be optimized e.g. via --set-threads counts_onlyFW_raw_counts_by_cutadapt=30. Default is 1 thread.
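The per-rule recommendations above can also be kept in a workflow profile rather than repeated on the command line. The following sketch writes such a profile to a hypothetical directory profiles/tuned. The thread and memory values mirror the suggestions in this section and should be lowered to what your machine or cluster actually offers. Entries for rules that are not part of your run can simply be left out.

```shell
# profiles/tuned is a hypothetical location chosen for this example.
mkdir -p profiles/tuned

# Same set-threads / set-resources layout as the predefined profile.
cat > profiles/tuned/config.yaml <<'EOF'
set-threads:
  assignment_hybridFWRead_get_reads_by_cutadapt: 4
  assignment_mapping_bbmap: 30
  assignment_mapping_bwa: 30
  assignment_collectBCs: 30
  counts_onlyFW_raw_counts_by_cutadapt: 30
set-resources:
  assignment_mapping_bbmap:
    mem_mb: 10000
  assignment_mapping_bwa:
    mem_mb: 10000
EOF

# Show the result; pass it to Snakemake with --workflow-profile profiles/tuned
cat profiles/tuned/config.yaml
```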

Running on an HPC using SLURM

The following command uses the SLURM executor plugin to run up to 300 jobs in parallel:

snakemake --sdm conda --configfile config/config.yaml -j 300  --workflow-profile profiles/default --executor slurm

Snakemake 7 (not supported anymore)

Here we used the --cluster option, which is not available in Snakemake 8. You can also use the predefined config/sbatch.yaml, but it might be outdated, and we highly recommend using resources with the workflow profile instead.

snakemake --use-conda --configfile config/config.yaml --cluster "sbatch --nodes=1 --ntasks={cluster.threads} --mem={cluster.mem} -t {cluster.time} -p {cluster.queue} -o {cluster.output}" --jobs 100 --cluster-config config/sbatch.yaml

Please note that with the --cluster option the log folder of the cluster environment (see -o {cluster.output}) has to be created first, e.g.:

mkdir -p logs

Note

Please consult your cluster’s wiki page for cluster-specific commands and change the cluster options to reflect these specifications. Additionally, for large libraries, more memory can be specified in this location.