Getting started
As a first step, we highly recommend working through the MPRAsnakeflow Tutorial or the Basic assignment workflow and Basic Experiment workflow examples. This page gives a quick overview of what you need to start the workflow.
MPRAsnakeflow consists of two subworkflows, Assignment and Experiment (Count). This quickstart shows the configuration for both; if you only want to run one of them, leave out the part for the other.
Experiment workflow only: Create an experiment.csv in the format below, including the header. DNA_BC_F or RNA_BC_F is the name of the gzipped FASTQ file with the forward read of the DNA or RNA for the given condition and replicate. DNA_UMI or RNA_UMI is the corresponding index read with UMIs (excluding sample barcodes), and DNA_BC_R or RNA_BC_R is the reverse read. Multiple FASTQ files can be used for each column by separating them with ;. Right now a UMI has to be used. If you want to use MPRAsnakeflow without a UMI, please switch to MPRAflow or contact us.
Here is an example experiment.csv file (also available for download as experiment.csv):
experiment.csv
Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz
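Before starting a run, it can help to sanity-check the experiment file. The following is a minimal sketch, not part of MPRAsnakeflow: the function name check_experiment_csv is hypothetical, and it assumes a comma-separated file in which all six read columns are filled in and every file name ends in .gz, as in the example above.

```python
import csv
import io

# Column names taken from the experiment.csv description above.
REQUIRED_COLUMNS = [
    "Condition", "Replicate",
    "DNA_BC_F", "DNA_UMI", "DNA_BC_R",
    "RNA_BC_F", "RNA_UMI", "RNA_BC_R",
]


def check_experiment_csv(text):
    """Return a list of human-readable problems found in experiment.csv text.

    Hypothetical helper for illustration; MPRAsnakeflow does its own validation.
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS[2:]:
            # Several FASTQ files may be given per column, separated by ';'.
            files = [f.strip() for f in (row[column] or "").split(";") if f.strip()]
            if not files:
                problems.append(f"line {line_no}: no file given in {column}")
            for name in files:
                # Assumption: gzipped FASTQ files carry a .gz suffix.
                if not name.endswith(".gz"):
                    problems.append(f"line {line_no}: {name} in {column} is not gzipped")
    return problems


example = (
    "Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R\n"
    "HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,"
    "SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz\n"
)
print(check_experiment_csv(example))
```

An empty list means no problems were found. To check a real file, pass the result of open("experiment.csv").read() instead of the inline example.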
Experiment workflow only: If you would like each designed sequence to be coloured by a user-specified category (such as positive control, negative control, shuffled control, or putative enhancer) to assess the overall quality, you can create a label.tsv in the tab-separated format below that maps each oligo name to its category:
oligo_name_1	label1
oligo_name_2	label1
oligo_name_3	label2
The oligo_name_X must exactly match the header in the design FASTA file.
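Because the names in label.tsv have to match the design FASTA headers exactly, a quick comparison can catch typos early. This is a minimal sketch under stated assumptions: the helper names fasta_names and unmatched_labels are hypothetical, label.tsv is assumed to be tab-separated without a header line, and the whole FASTA header line after ">" is treated as the sequence name.

```python
def fasta_names(fasta_text):
    """Collect sequence names from FASTA header lines (the text after '>')."""
    return {
        line[1:].strip()
        for line in fasta_text.splitlines()
        if line.startswith(">")
    }


def unmatched_labels(label_text, fasta_text):
    """Return oligo names from label.tsv that have no identical FASTA header."""
    known = fasta_names(fasta_text)
    missing = []
    for line in label_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name = line.split("\t")[0]  # first column is the oligo name
        if name not in known:
            missing.append(name)
    return missing


# Small inline example; real input would be read from label.tsv and the design file.
labels = "oligo_name_1\tlabel1\noligo_name_4\tlabel2\n"
design = ">oligo_name_1\nACGT\n>oligo_name_2\nTTGA\n"
print(unmatched_labels(labels, design))
```

Any name printed here appears in the label file but not in the design file.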
Set up the config file
The config file is the heart of MPRAsnakeflow. It is where the different runs are configured. We recommend using one config file per MPRA experiment or MPRA project, although in theory many different experiments can be configured in a single file. The file is divided into version (the MPRAsnakeflow version used), assignments (assignment workflow), and experiments (count workflow).
See Config File for more details about the config file. Here is an example running only the count experiments and using a provided assignment file.
---
version: "0.3"
assignments:
  exampleAssignment: # name of an example assignment (can be any string)
    bc_length: 15
    alignment_tool:
      split_number: 1 # number of files fastq should be split for parallelization
      tool: exact # bbmap, bwa or exact
      configs:
        sequence_length: 171 # sequence length of design excluding adapters.
        alignment_start: 1 # start of the alignment in the reference/design_file
    FW:
      - resources/assoc_basic/data/SRR10800986_1.fastq.gz
    BC:
      - resources/assoc_basic/data/SRR10800986_2.fastq.gz
    REV:
      - resources/assoc_basic/data/SRR10800986_3.fastq.gz
    design_file: resources/assoc_basic/design.fa
    configs:
      default: {} # name of an example filtering config
experiments:
  exampleCount:
    bc_length: 15
    umi_length: 10
    data_folder: resources/count_basic/data
    experiment_file: resources/count_basic/experiment.csv
    demultiplex: false
    assignments:
      fromFile:
        type: file
        assignment_file: resources/count_basic/SRR10800986_barcodes_to_coords.tsv.gz
    # label_file: resources/labels.tsv # optional
    configs:
      default: {}
Run MPRAsnakeflow
conda activate snakemake
snakemake --software-deployment-method conda --configfile config/example_config.yaml -p --cores 4
Note
This will run in local mode using 4 cores. Please submit this command to your cluster’s queue if you would like to run a highly parallelized version.
Be sure that the files experiment.csv and example_config.yaml are correct. All FASTQ files for the count/experiment part must be in the same folder, given by the data_folder option. Please specify your barcode length and UMI length (if available) with bc_length and umi_length.
The assignment files generated by the workflow are named assignment_barcodes.<config>.tsv.gz and can be found in the results/assignment/<assignment>/ folder.
The count files generated by the experiment workflow are named <condition>_<replicate>_merged_assigned_counts.tsv.gz and can be found in the results/experiments/<project>/assigned_counts/<assignment>/<config>/ folder.
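To locate outputs after a run, the naming patterns above can be filled in with the names used in the config file. The sketch below is illustrative only: the function names are hypothetical, paths are assumed to be relative to the directory the workflow was run from, and the mapping of <project>, <assignment>, and <config> to the keys of the example config (exampleCount, fromFile, default) is an assumption.

```python
from pathlib import PurePosixPath


def assignment_file(assignment, config):
    """Path of the assignment file for one assignment and filtering config."""
    return (
        PurePosixPath("results/assignment")
        / assignment
        / f"assignment_barcodes.{config}.tsv.gz"
    )


def count_file(project, assignment, config, condition, replicate):
    """Path of the assigned-counts file for one condition and replicate."""
    return (
        PurePosixPath("results/experiments")
        / project
        / "assigned_counts"
        / assignment
        / config
        / f"{condition}_{replicate}_merged_assigned_counts.tsv.gz"
    )


# Names taken from the example config and experiment.csv above (assumed mapping).
print(assignment_file("exampleAssignment", "default"))
print(count_file("exampleCount", "fromFile", "default", "HEPG2", 1))
```

PurePosixPath only builds the path strings; it does not check whether the files exist.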