Running MPRAsnakeflow on HPC Cluster

Snakemake gives us the opportunity to run MPRAsnakeflow in a cluster environment. Please check the Snakemake documentation for more information on how to set up a cluster environment. We use snakemake resources to set the main resources per rule. Most resources are generic and can be used on multiple clusters, environments, or even locally. We have a predefined workflow profile with resources: config.yaml:

---
configfile: config/example_config.yaml
software-deployment-method: conda
default-resources:
  slurm_partition: debug
  mem: 2G
  runtime: 60
# error: "logs/%x_%j_%N.err"
# output: "logs/%x_%j_%N.log"
##################
### ASSIGNMENT ###
##################
set-threads:
  assignment_mapping_bwa: 30
  assignment_mapping_bbmap: 30
  assignment_collect: 30
  assignment_collectBCs: 20
  assignment_merge: 10
  assignment_hybridFWRead_get_reads_by_cutadapt: 10
  assignment_3prime_remove: 2
  assignment_5prime_remove: 2
set-resources:
  assigned_counts_combined_replicates_barcode_output:
    runtime: 60
  assignment_check_design:
    runtime: 240
    slurm_partition: medium
  assignment_hybridFWRead_get_reads_by_length:
    runtime: 1140
    mem: 2G

We used this workflow successfully in a SLURM environment using the slurm executor plugin from Snakemake. Therefore, the partition is set with slurm_partition and has to be renamed or removed to fit with your own SLURM configuration.

Running with resources

Example: Using 30 cores and 10GB of memory.

snakemake --sdm conda --configfile config/config.yaml -c 30 --resources mem_mb=10000 --workflow-profile profiles/default

Performance tweaks: Running specific rules with different resources

Some rules will benefit from multithreading or more memory. This can be specified within your profile, workflow profile, or in the command line interface using --set-resources RULE_NAME:RESOURCE_NAME=VALUE or --set-threads RULE_NAME=VALUE. Before changing resources, make sure that you really need the rule by running a dry run to get the list of executed rules only:

snakemake -n --quiet rules

Possible rules to tweak:

Assignment:
assignment_hybridFWRead_get_reads_by_cutadapt:

Only needed when using the linker option in the config. You can add more threads using --set-threads assignment_hybridFWRead_get_reads_by_cutadapt=4. Default is always 1 thread.

assignment_mapping_bbmap:

Only needed when using bbmap for mapping. Memory and threads can be optimized, e.g., via --set-threads assignment_mapping_bbmap=30 --set-resources assignment_mapping_bbmap:mem_mb=10000. Default is 1 thread and 4GB memory, but we recommend using 30 threads and 10GB if available.

assignment_mapping_bwa:

Only needed when using bwa for mapping. Memory and threads can be optimized, e.g., via --set-threads assignment_mapping_bwa=30 --set-resources assignment_mapping_bwa:mem_mb=10000. Default is 1 thread, but we recommend using 30 threads and 10GB if available.

assignment_collectBCs:

Threads can be optimized, e.g., via --set-threads assignment_collectBCs=30. Default is 1 thread, but we recommend using 30 threads if available.

Experiment:
counts_onlyFW_raw_counts_by_cutadapt:

Only needed when you have only FW reads and use the adapter option. Threads can be optimized, e.g., via --set-threads experiment_counts_onlyFW_raw_counts_by_cutadapt=30. Default is 1 thread.

Running on an HPC using SLURM

Using the SLURM executor plugin to run 300 jobs in parallel:

snakemake --sdm conda --configfile config/config.yaml -j 300 --workflow-profile profiles/default --executor slurm

Snakemake 7 (not supported anymore)

In Snakemake 7, we used the --cluster option, which is not available in Snakemake 8. You can also use the predefined config/sbatch.yaml, but this might be outdated. We highly recommend using resources with the workflow profile.

snakemake --use-conda --configfile config/config.yaml --cluster "sbatch --nodes=1 --ntasks={cluster.threads} --mem={cluster.mem} -t {cluster.time} -p {cluster.queue} -o {cluster.output}" --jobs 100 --cluster-config config/sbatch.yaml

Please note that with this --cluster option, the log folder of the cluster environment (see -o {cluster.output}) has to be generated first, e.g.:

mkdir -p logs

Note

Please consult your cluster’s wiki page for cluster-specific commands and change cluster options to reflect these specifications. Additionally, for large libraries, more memory can be specified in this location.