Getting Started

We highly recommend starting with the MPRAsnakeflow Tutorial or the Basic assignment workflow and Basic Experiment Workflow examples. Below, we provide a quick overview of what you need to start the workflow.

MPRAsnakeflow consists of two subworkflows: Assignment and Experiment (Count). This quickstart shows the configuration for both. If you only want to run one of them, omit the configuration for the other.

  1. Experiment Workflow Only: Create an experiment.csv file in the format below, including the header.

    - DNA_BC_F or RNA_BC_F: The name of the gzipped FASTQ file of the forward read of the DNA or RNA from the defined condition and replicate.
    - DNA_UMI or RNA_UMI: The corresponding index read with UMIs (excluding sample barcodes).
    - DNA_BC_R or RNA_BC_R: The reverse read.

    Multiple FASTQ files can be used for each column by separating them with ;.

    Note: Currently, a UMI is required. If you want to use MPRAsnakeflow without a UMI, please switch to MPRAflow or contact us.

    Here is an example of an experiment.csv file, which can be downloaded here: experiment.csv.

    experiment.csv

    Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
    HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
    HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
    HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz
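
Before starting a run, it can help to check the experiment file programmatically. The following is a minimal sketch, not part of MPRAsnakeflow: the function names read_experiment and missing_fastqs are hypothetical, and it assumes that only Condition and Replicate are mandatory while the FASTQ columns are processed when present. It splits cells on ; as described above and reports FASTQ files that are absent from a given folder (the folder set via data_folder in the config).

```python
import csv
import io
from pathlib import Path

# FASTQ column names as listed in the experiment.csv description.
FASTQ_COLUMNS = ["DNA_BC_F", "DNA_UMI", "DNA_BC_R", "RNA_BC_F", "RNA_UMI", "RNA_BC_R"]


def read_experiment(handle):
    """Parse experiment.csv text into one dict per row, splitting ';'-separated files."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    for col in ("Condition", "Replicate"):
        if col not in header:
            raise ValueError(f"experiment.csv is missing the column {col}")
    fastq_cols = [c for c in FASTQ_COLUMNS if c in header]
    rows = []
    for row in reader:
        parsed = {"Condition": row["Condition"], "Replicate": row["Replicate"]}
        for col in fastq_cols:
            # Several FASTQ files may share one cell, separated by ';'.
            cell = row[col] or ""
            parsed[col] = [name.strip() for name in cell.split(";") if name.strip()]
        rows.append(parsed)
    return rows


def missing_fastqs(rows, data_folder):
    """Return the sorted FASTQ names that do not exist inside data_folder."""
    folder = Path(data_folder)
    names = {name for row in rows for col in FASTQ_COLUMNS for name in row.get(col, [])}
    return sorted(name for name in names if not (folder / name).is_file())


# First data row of the example file above.
example = (
    "Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R\n"
    "HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,"
    "SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz\n"
)
rows = read_experiment(io.StringIO(example))
print(rows[0]["DNA_UMI"])
```

For a real file, open it with open("experiment.csv", newline="") and pass the handle to read_experiment, then call missing_fastqs(rows, data_folder) with your own data folder.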

  2. Experiment Workflow Only: If you would like each designed sequence to be colored based on user-specified categories (e.g., positive control, negative control, shuffled control, putative enhancer), you can create a label.tsv file in the format below. This file maps each oligo name to its category and is used when assessing the overall quality:

    oligo_name_1 label1
    oligo_name_2 label1
    oligo_name_3 label2
    

    The oligo_name_X must exactly match the header in the design FASTA file.
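
Because the names must match the FASTA headers exactly, a quick consistency check can catch typos early. The sketch below is illustrative only: the helper names are hypothetical, it assumes the two columns of label.tsv are separated by a tab, and it treats the full header line after > as the sequence name. The example sequences are made up.

```python
import io


def fasta_names(handle):
    """Collect sequence names: the text after '>' on each FASTA header line."""
    return {line[1:].strip() for line in handle if line.startswith(">")}


def read_labels(handle):
    """Read 'name<TAB>label' lines into a dict, skipping blank lines."""
    labels = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        name, label = line.split("\t")
        labels[name] = label
    return labels


def unknown_labels(labels, names):
    """Return label names that have no matching header in the design file."""
    return sorted(set(labels) - names)


# Made-up design and label content for illustration.
design = io.StringIO(">oligo_name_1\nACGT\n>oligo_name_2\nTTGA\n")
labels = read_labels(io.StringIO("oligo_name_1\tlabel1\noligo_name_3\tlabel2\n"))
print(unknown_labels(labels, fasta_names(design)))
```

Any name printed here is present in the label file but absent from the design FASTA.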

  3. Set Up the Config File:

    The config file is the heart of MPRAsnakeflow. Here, different runs can be configured. We recommend using one config file per MPRA experiment or MPRA project. However, in theory, many different experiments can be configured in a single file. The config file is divided into:

    - version: Specifies the MPRAsnakeflow version used.
    - assignments: Configures the assignment workflow.
    - experiments: Configures the count workflow.

    See Config File for more details. Below is an example that configures an assignment (exampleAssignment) and a count experiment (exampleCount) that uses a provided assignment file:

    ---
    version: "0.5"
    assignments:
      exampleAssignment: # name of an example assignment (can be any string)
        bc_length: 15
        alignment_tool:
          split_number: 1 # number of files the FASTQ input is split into for parallelization
          tool: exact # bbmap, bwa or exact
          configs:
            sequence_length: 171 # sequence length of design excluding adapters.
            alignment_start: 1 # start of the alignment in the reference/design_file
        FW:
          - resources/assoc_basic/data/SRR10800986_1.fastq.gz
        BC:
          - resources/assoc_basic/data/SRR10800986_2.fastq.gz
        REV:
          - resources/assoc_basic/data/SRR10800986_3.fastq.gz
        design_file: resources/assoc_basic/design.fa
        configs:
          default: {} # name of an example filtering config
    experiments:
      exampleCount:
        bc_length: 15
        umi_length: 10
        data_folder: resources/count_basic/data
        experiment_file: resources/count_basic/experiment.csv
        demultiplex: false
        assignments:
          fromFile:
            type: file
            assignment_file: resources/count_basic/SRR10800986_barcodes_to_coords.tsv.gz
        # label_file: resources/labels.tsv # optional
        configs:
          default: {}
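
A few of the settings above can be sanity-checked before launching the workflow. The following sketch is a hypothetical helper, not MPRAsnakeflow's own validation. It works on the config as a plain dictionary (as you would get from any YAML loader) and assumes that bc_length, umi_length, data_folder and experiment_file should all be set for a count experiment.

```python
def check_experiment(name, exp):
    """Return a list of problems found in one entry under `experiments`."""
    problems = []
    for key in ("bc_length", "umi_length"):
        value = exp.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{name}: {key} should be a positive integer")
    for key in ("data_folder", "experiment_file"):
        if not exp.get(key):
            problems.append(f"{name}: {key} is missing")
    for a_name, assignment in exp.get("assignments", {}).items():
        if assignment.get("type") == "file" and not assignment.get("assignment_file"):
            problems.append(f"{name}: assignment {a_name} has type 'file' but no assignment_file")
    return problems


# The experiments part of the example config, written as a Python dict.
config = {
    "experiments": {
        "exampleCount": {
            "bc_length": 15,
            "umi_length": 10,
            "data_folder": "resources/count_basic/data",
            "experiment_file": "resources/count_basic/experiment.csv",
            "demultiplex": False,
            "assignments": {
                "fromFile": {
                    "type": "file",
                    "assignment_file": "resources/count_basic/SRR10800986_barcodes_to_coords.tsv.gz",
                }
            },
            "configs": {"default": {}},
        }
    }
}

for exp_name, exp_config in config["experiments"].items():
    for problem in check_experiment(exp_name, exp_config):
        print(problem)
```

With the example values nothing is printed; each printed line would point to one setting that needs attention.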
    
  4. Run MPRAsnakeflow:

    Use the following command to run the workflow:

    conda activate snakemake
    snakemake --software-deployment-method conda --configfile config/example_config.yaml -p --cores 4
    

    Note: This will run in local mode using 4 cores. Please submit this command to your cluster’s queue if you would like to run a highly parallelized version.

    Ensure that the files experiment.csv and example_config.yaml are correct. All FASTQ files for the count/experiment part must be in the same folder specified by the data_folder option. Please specify your barcode length and UMI length (if available) with bc_length and umi_length.
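
One way to check the setup before committing compute time is a dry run. The sketch below assembles the command shown above as an argument list and optionally appends -n, Snakemake's dry-run flag, which lists the planned jobs without executing them. The function name snakemake_command is hypothetical; the config path is the one from the example command.

```python
import shlex


def snakemake_command(configfile, cores=4, dry_run=False):
    """Build the argument list for the quickstart command."""
    cmd = [
        "snakemake",
        "--software-deployment-method", "conda",
        "--configfile", configfile,
        "-p",
        "--cores", str(cores),
    ]
    if dry_run:
        cmd.append("-n")  # dry run: show what would be executed
    return cmd


# Print the dry-run variant as a shell-ready string.
print(shlex.join(snakemake_command("config/example_config.yaml", dry_run=True)))
```

To execute the command from Python instead of printing it, pass the list to subprocess.run(cmd, check=True) inside the activated snakemake environment.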

    • The assignment files generated by the workflow are named assignment_barcodes.<config>.tsv.gz and can be found in the results/assignment/<assignment>/ folder.

    • The count files generated by the experiment workflow are named <condition>_<replicate>_merged_assigned_counts.tsv.gz and can be found in the results/experiments/<project>/assigned_counts/<assignment>/<config>/ folder.
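
The naming patterns above can be turned into concrete paths, for example to locate results in a downstream script. This is a sketch with hypothetical helper names. It assumes that <assignment>, <project> and <config> correspond to the names chosen in the config file (exampleAssignment, exampleCount, fromFile and default in the example).

```python
from pathlib import PurePosixPath


def assignment_file(assignment, config):
    """Path of the assignment file for one assignment and filtering config."""
    return PurePosixPath("results/assignment") / assignment / f"assignment_barcodes.{config}.tsv.gz"


def count_file(project, assignment, config, condition, replicate):
    """Path of the merged assigned counts for one condition and replicate."""
    return (
        PurePosixPath("results/experiments")
        / project
        / "assigned_counts"
        / assignment
        / config
        / f"{condition}_{replicate}_merged_assigned_counts.tsv.gz"
    )


print(assignment_file("exampleAssignment", "default"))
print(count_file("exampleCount", "fromFile", "default", "HEPG2", 1))
```

Under these assumptions, the two lines printed are the assignment file of exampleAssignment and the count file for HEPG2 replicate 1.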