Getting started

Create an experiment.csv in the format below, including the header. DNA_BC_F or RNA_BC_F is the name of the gzipped fastq file containing the forward reads of the DNA or RNA for the given condition and replicate. DNA_UMI or RNA_UMI is the corresponding index read containing the UMIs (excluding sample barcodes), and DNA_BC_R or RNA_BC_R is the reverse read.

Multiple fastq files can be given in each column by separating them with ;.

Currently, a UMI is required. If you want to use MPRAsnakeflow without a UMI, please switch to MPRAflow or contact us.

Here is an example of an experiment.csv file, which can also be downloaded (experiment.csv):

experiment.csv

Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz
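
Before starting a run, it can help to check the experiment file programmatically. The following is a minimal sketch, not part of MPRAsnakeflow: it assumes the file is comma-separated with the header described above, and parse_experiment_csv is a hypothetical helper name. It verifies the header, splits each fastq column on ; and rejects empty fastq columns.

```python
import csv
import io

# Column names taken from the experiment.csv header described above.
REQUIRED_COLUMNS = [
    "Condition", "Replicate",
    "DNA_BC_F", "DNA_UMI", "DNA_BC_R",
    "RNA_BC_F", "RNA_UMI", "RNA_BC_R",
]
FASTQ_COLUMNS = REQUIRED_COLUMNS[2:]


def parse_experiment_csv(handle):
    """Hypothetical helper: return one dict per row, with every fastq
    column split on ';' into a list of file names."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"experiment.csv is missing columns: {missing}")
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        parsed = {"Condition": row["Condition"], "Replicate": row["Replicate"]}
        for column in FASTQ_COLUMNS:
            value = row[column] or ""
            files = [f.strip() for f in value.split(";") if f.strip()]
            if not files:
                # A UMI read is currently required, so an empty fastq
                # column is treated as an error in this sketch.
                raise ValueError(f"line {line_no}: column {column} is empty")
            parsed[column] = files
        rows.append(parsed)
    return rows


example = io.StringIO(
    "Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R\n"
    "HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,"
    "SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,"
    "SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz\n"
)
rows = parse_experiment_csv(example)
print(rows[0]["Condition"], rows[0]["Replicate"], rows[0]["DNA_BC_F"])
```

For a real file, pass an open file handle instead of the in-memory example, for instance open("experiment.csv", newline="").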

If you would like each designed sequence to be colored by a user-specified category (such as positive control, negative control, shuffled control, or putative enhancer) to help assess the overall quality, you can create a label.tsv in the format below that maps each oligo name to its category:

oligo_name_1 label1
oligo_name_2 label1
oligo_name_3 label2
The oligo_name_X must exactly match the header in the design FASTA file.
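
Because the names must match exactly, a quick check before running can catch mismatches early. The sketch below is illustrative only: fasta_names and unmatched_labels are hypothetical helpers, and it assumes that label.tsv is tab-separated with the oligo name in the first column and that the full text after > in the FASTA header is the sequence name.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines (text after '>')."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def unmatched_labels(label_lines, design_names):
    """Hypothetical check: return the oligo names from label.tsv that are
    not present as headers in the design FASTA."""
    unmatched = []
    for line in label_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name = line.split("\t")[0]  # assumption: tab-separated columns
        if name not in design_names:
            unmatched.append(name)
    return unmatched


# Small in-memory example; with real files, iterate over the open
# design FASTA and label.tsv instead.
design = [">oligo_name_1", "ACGT", ">oligo_name_2", "TTGA"]
labels = ["oligo_name_1\tlabel1", "oligo_name_3\tlabel2"]
print(unmatched_labels(labels, fasta_names(design)))
```

An empty list means every labeled oligo was found in the design file.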

Set up the config file

The config file is the heart of MPRAsnakeflow. It is where different runs are configured. We recommend using one config file per MPRA experiment or MPRA project, although in theory many different experiments can be configured in a single file. The file is divided into global (general settings), assignments (assignment workflow), and experiments (count workflow including variants).

See Config File for more details about the config file. Here is an example running only the count experiments and using a provided assignment file.

---
global: # general configs affecting one or multiple parts
  assignments:
    split_number: 1 # number of files each fastq should be split into for parallelization
assignments:
  exampleAssignment: # name of an example assignment (can be any string)
    bc_length: 15
    alignment_tool:
      tool: exact # bbmap, bwa or exact
      configs:
        sequence_length: 170 # sequence length of design excluding adapters.
        alignment_start: 1 # start of the alignment in the reference/design_file
    FW:
      - resources/assoc_basic/data/SRR10800986_1.fastq.gz
    BC:
      - resources/assoc_basic/data/SRR10800986_2.fastq.gz
    REV:
      - resources/assoc_basic/data/SRR10800986_3.fastq.gz
    design_file: resources/assoc_basic/design.fa
    configs:
      default: {} # name of an example filtering config
experiments:
  exampleCount:
    bc_length: 15
    umi_length: 10
    data_folder: resources/count_basic/data
    experiment_file: resources/count_basic/experiment.csv
    demultiplex: false
    assignments:
      fromFile:
        type: file
        assignment_file: resources/count_basic/SRR10800986_barcodes_to_coords.tsv.gz
    # label_file: resources/labels.tsv # optional
    configs:
      default: {}

Run MPRAsnakeflow

conda activate snakemake
snakemake --software-deployment-method conda --configfile config/example_config.yaml -p --cores 4
Note

This will run in local mode using 4 cores. Please submit this command to your cluster’s queue if you would like to run a highly parallelized version.

Make sure that both experiment.csv and example_config.yaml are correct. All fastq files for the count/experiment part must be in the same folder, given by the data_folder option. Please specify your barcode length and UMI length (if available) with bc_length and umi_length.
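
The requirement that all fastq files sit in the folder given by data_folder can be checked with a few lines of Python. This is a rough sketch under the assumption that the file names in experiment.csv are relative to data_folder; missing_fastqs is a hypothetical helper, not a function provided by MPRAsnakeflow.

```python
from pathlib import Path


def missing_fastqs(data_folder, file_names):
    """Hypothetical check: return the names that do not exist as files
    directly inside data_folder."""
    folder = Path(data_folder)
    return [name for name in file_names if not (folder / name).is_file()]


# Example call using the data_folder from the example config; the file
# list would normally come from the fastq columns of experiment.csv.
missing = missing_fastqs(
    "resources/count_basic/data",
    ["SRR10800881_1.fastq.gz", "SRR10800881_2.fastq.gz", "SRR10800881_3.fastq.gz"],
)
if missing:
    print("Missing fastq files:", ", ".join(missing))
```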

The assignment files generated by the workflow are named assignment_barcodes.<config>.tsv.gz and can be found in the results/assignment/<assignment>/ folder.

The count files generated by the experiment workflow are named <condition>_<replicate>_merged_assigned_counts.tsv.gz and can be found in the results/experiments/<project>/assigned_counts/<assignment>/<config>/ folder.
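
To locate these outputs from a script, the naming patterns above can be filled in directly. The helpers below are a hypothetical sketch that only formats the documented patterns. How the placeholders map to config keys is an assumption here: <project> and <assignment> are taken to be the names under experiments and assignments, and <config> the name under configs.

```python
from pathlib import PurePosixPath


def assignment_path(assignment, config, results_dir="results"):
    """Fill in results/assignment/<assignment>/assignment_barcodes.<config>.tsv.gz."""
    return PurePosixPath(
        results_dir, "assignment", assignment,
        f"assignment_barcodes.{config}.tsv.gz",
    )


def count_path(project, assignment, config, condition, replicate,
               results_dir="results"):
    """Fill in the documented pattern for the merged assigned count file."""
    return PurePosixPath(
        results_dir, "experiments", project, "assigned_counts",
        assignment, config,
        f"{condition}_{replicate}_merged_assigned_counts.tsv.gz",
    )


# Names taken from the example config and experiment.csv.
print(assignment_path("exampleAssignment", "default"))
print(count_path("exampleCount", "fromFile", "default", "HEPG2", 1))
```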