Basic Experiment Workflow

This example runs the count workflow on 5’/5’ WT MPRA data in the HepG2 cell line from Klein J., Agarwal, V., Keith, A., et al. 2019.

Prerequisites

This example depends on the following data and software:

Installing MPRAsnakeflow

Please install conda, the MPRAsnakeflow environment, and clone the actual MPRAsnakeflow master branch. You will find more help under Installation.

Producing an Association (.tsv.gz) File

This workflow requires a Python dictionary of candidate regulatory sequences (CRS) mapped to their barcodes in a tab-separated (.tsv.gz) format. For this example, the file can be generated using Basic assignment workflow or found in the resources/count_basic folder in MPRAsnakeflow (file: SRR10800986_barcodes_to_coords.tsv.gz).

Alternatively, if the association file is in pickle (.pickle) format because you used MPRAflow, you can convert it to .tsv.gz format using the in-built function in MPRAsnakeflow with the following code:

conda activate mprasnakeflow
python assignment_pickle_to_tsv.py --input <assignment_file>.pickle | sort | uniq | gzip -c > <assignment_file>.tsv.gz

Design (.fa) File

The design file can be generated using the Basic assignment workflow or downloaded from the resources/count_basic folder in MPRAsnakeflow.

Reads

There is one condition (HepG2) with three technical replicates. Each replicate contains a forward (barcode-forward), reverse (barcode-reverse), and index (unique molecular identifier) read for DNA and RNA. These data must be downloaded. All data is publicly available on the Short Read Archive (SRA). We will use SRA-toolkit to obtain the data.

Note

You need 9 GB of disk space to download the data and upwards of 50 GB to process it!

conda install sra-tools
mkdir -p count_basic/data
cd count_basic/data
fastq-dump --gzip --split-files SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886
cd ..

For large files and unstable internet connections, we recommend the command prefetch from SRA tools before running fastq-dump. This command provides better warnings when something goes wrong.

conda install sra-tools
cd count_basic/data
prefetch SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886
fastq-dump --gzip --split-files SRR10800986
cd ..

Note

Please ensure that all files are downloaded completely without errors! Depending on your internet connection, this can take a while. If you just want some data to run MPRAsnakeflow, you can limit yourself to one condition and/or just one replicate.

The data folder view can be seen with the following command:

tree data

The folder should look like this:

data
├── design.fa
├── SRR10800881_1.fastq.gz
├── SRR10800881_2.fastq.gz
├── SRR10800881_3.fastq.gz
├── SRR10800882_1.fastq.gz
├── SRR10800882_2.fastq.gz
├── SRR10800882_3.fastq.gz
├── SRR10800883_1.fastq.gz
├── SRR10800883_2.fastq.gz
├── SRR10800883_3.fastq.gz
├── SRR10800884_1.fastq.gz
├── SRR10800884_2.fastq.gz
├── SRR10800884_3.fastq.gz
├── SRR10800885_1.fastq.gz
├── SRR10800885_2.fastq.gz
├── SRR10800885_3.fastq.gz
├── SRR10800886_1.fastq.gz
├── SRR10800886_2.fastq.gz
├── SRR10800886_3.fastq.gz
└── SRR10800986_barcodes_to_coords.tsv.gz

Here is an overview of the files:

HepG2 Data
Condition	GEO Accession	SRA Accession	SRA Runs
HepG2-DNA-1: HepG2 DNA replicate 1	GSM4237863	SRX7474781	SRR10800881
HepG2-RNA-1: HepG2 RNA replicate 1	GSM4237864	SRX7474782	SRR10800882
HepG2-DNA-2: HepG2 DNA replicate 2	GSM4237865	SRX7474783	SRR10800883
HepG2-RNA-2: HepG2 RNA replicate 2	GSM4237866	SRX7474784	SRR10800884
HepG2-DNA-3: HepG2 DNA replicate 3	GSM4237867	SRX7474785	SRR10800885
HepG2-RNA-3: HepG2 RNA replicate 3	GSM4237868	SRX7474786	SRR10800886

Run MPRAsnakeflow

Now we are ready to start MPRAsnakeflow and count the number of barcodes. First, we need to generate an experiment file (experiment.csv) to specify the conditions, replicates, and corresponding reads.

Creating experiment.csv

Our experiment file looks like this:

Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HepG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HepG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HepG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz

Save it into the count_basic/data folder as experiment.csv.

MPRAsnakeflow

Now we have everything ready to run the count workflow of MPRAsnakeflow. Run the pipeline directly in the count_basic folder. The MPRAsnakeflow workflow can be in a different directory. Let’s assume /home/user/MPRAsnakeflow.

First, configure the config file and save it to the count_basic folder. The config file is a simple text file with the following content:

---
version: "0.5"
experiments:
  exampleCount:
    bc_length: 15
    umi_length: 10
    data_folder: data
    experiment_file: experiment.csv
    demultiplex: false
    assignments:
      fromFile:
        type: file
        assignment_file: SRR10800986_barcodes_to_coords.tsv.gz
    design_file: design.fa
    label_file: labels.tsv
    configs:
      default: {}
      outlierZscore:
        filter:
          outlier_detection:
            method: rna_counts_zscore

Perform a dry-run using Snakemake’s -n option:

cd count_basic
conda activate mprasnakeflow
snakemake -c 1 --use-conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile /home/user/MPRAsnakeflow/resources/count_basic/config.yml -n -q

If the dry-run does not give any errors, run the workflow using 30 threads:

snakemake -c 30 --use-conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile /home/user/MPRAsnakeflow/resources/count_basic/config.yml

Note

Please modify your code when running in a cluster environment. We have an example SLURM config file here: config/sbatch.yml.

Results

All output files will be in the results/experiments/countBasic folder.

To generate a final report, use the following command:

snakemake --config config.yml --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --report report.html

This HTML report contains information about the Snakemake run and integrates statistics tables and plots.

Total file tree of the results folder:

results
└── experiments
    └── countBasic
        ├── assigned_counts
        │   └── fromWorkflow
        │       ├── default
        │       │   ├── HEPG2_1_merged_assigned_counts.tsv.gz
        │       │   ├── HEPG2_2_merged_assigned_counts.tsv.gz
        │       │   ├── HEPG2_3_merged_assigned_counts.tsv.gz
        │       │   ├── HEPG2_allreps_merged.tsv.gz
        │       │   └── HEPG2_allreps_minThreshold_merged.tsv.gz
        │       ├── HEPG2_1_DNA_final_counts.config.default.tsv.gz
        │       ├── HEPG2_1.merged.config.default.tsv.gz
        │       ├── HEPG2_1_RNA_final_counts.config.default.tsv.gz
        │       ├── HEPG2_2_DNA_final_counts.config.default.tsv.gz
        │       ├── HEPG2_2.merged.config.default.tsv.gz
        │       ├── HEPG2_2_RNA_final_counts.config.default.tsv.gz
        │       ├── HEPG2_3_DNA_final_counts.config.default.tsv.gz
        │       ├── HEPG2_3.merged.config.default.tsv.gz
        │       └── HEPG2_3_RNA_final_counts.config.default.tsv.gz
        ├── assignment
        │   └── fromWorkflow.tsv.gz
        ├── counts
        │   ├── HEPG2_1_DNA.bam
        │   ├── HEPG2_1_DNA_filtered_counts.tsv.gz
        │   ├── HEPG2_1_DNA_final_counts.tsv.gz
        │   ├── HEPG2_1_DNA_raw_counts.tsv.gz
        │   ├── HEPG2_1.merged.config.default.tsv.gz
        │   ├── HEPG2_1_RNA.bam
        │   ├── HEPG2_1_RNA_filtered_counts.tsv.gz
        │   ├── HEPG2_1_RNA_final_counts.tsv.gz
        │   ├── HEPG2_1_RNA_raw_counts.tsv.gz
        │   ├── HEPG2_2_DNA.bam
        │   ├── HEPG2_2_DNA_filtered_counts.tsv.gz
        │   ├── HEPG2_2_DNA_final_counts.tsv.gz
        │   ├── HEPG2_2_DNA_raw_counts.tsv.gz
        │   ├── HEPG2_2.merged.config.default.tsv.gz
        │   ├── HEPG2_2_RNA.bam
        │   ├── HEPG2_2_RNA_filtered_counts.tsv.gz
        │   ├── HEPG2_2_RNA_final_counts.tsv.gz
        │   ├── HEPG2_2_RNA_raw_counts.tsv.gz
        │   ├── HEPG2_3_DNA.bam
        │   ├── HEPG2_3_DNA_filtered_counts.tsv.gz
        │   ├── HEPG2_3_DNA_final_counts.tsv.gz
        │   ├── HEPG2_3_DNA_raw_counts.tsv.gz
        │   ├── HEPG2_3.merged.config.default.tsv.gz
        │   ├── HEPG2_3_RNA.bam
        │   ├── HEPG2_3_RNA_filtered_counts.tsv.gz
        │   ├── HEPG2_3_RNA_final_counts.tsv.gz
        │   └── HEPG2_3_RNA_raw_counts.tsv.gz
        └── statistic
            ├── assigned_counts
            │   └── fromWorkflow
            │       ├── default
            │       │   ├── combined
            │       │   │   └── HEPG2_merged_assigned_counts.statistic.tsv.gz
            │       │   ├── HEPG2_1_merged_assigned_counts.statistic.tsv.gz
            │       │   ├── HEPG2_2_merged_assigned_counts.statistic.tsv.gz
            │       │   ├── HEPG2_3_merged_assigned_counts.statistic.tsv.gz
            │       │   ├── HEPG2_average_allreps_merged.tsv.gz
            │       │   ├── HEPG2_barcodesPerInsert.png
            │       │   ├── HEPG2_correlation_minThreshold.tsv
            │       │   ├── HEPG2_correlation.tsv
            │       │   ├── HEPG2_DNA_pairwise_minThreshold.png
            │       │   ├── HEPG2_DNA_pairwise.png
            │       │   ├── HEPG2_group_barcodesPerInsert_box_minThreshold.png
            │       │   ├── HEPG2_group_barcodesPerInsert_box.png
            │       │   ├── HEPG2_Ratio_pairwise_minThreshold.png
            │       │   ├── HEPG2_Ratio_pairwise.png
            │       │   ├── HEPG2_RNA_pairwise_minThreshold.png
            │       │   └── HEPG2_RNA_pairwise.png
            │       ├── HEPG2_1_DNA_default.statistic.tsv.gz
            │       ├── HEPG2_1_RNA_default.statistic.tsv.gz
            │       ├── HEPG2_2_DNA_default.statistic.tsv.gz
            │       ├── HEPG2_2_RNA_default.statistic.tsv.gz
            │       ├── HEPG2_3_DNA_default.statistic.tsv.gz
            │       └── HEPG2_3_RNA_default.statistic.tsv.gz
            ├── barcode
            │   ├── assigned_counts
            │   │   └── fromWorkflow
            │   │       ├── HEPG2_default_barcode_correlation.tsv
            │   │       ├── HEPG2_default_barcode_DNA_pairwise.png
            │   │       ├── HEPG2_default_barcode_Ratio_pairwise.png
            │   │       ├── HEPG2_default_barcode_RNA_pairwise.png
            │   │       ├── HEPG2_default_DNA_perBarcode.png
            │   │       └── HEPG2_default_RNA_perBarcode.png
            │   └── counts
            │       ├── HEPG2_default_barcode_correlation.tsv
            │       ├── HEPG2_default_barcode_DNA_pairwise.png
            │       ├── HEPG2_default_barcode_Ratio_pairwise.png
            │       ├── HEPG2_default_barcode_RNA_pairwise.png
            │       ├── HEPG2_default_DNA_perBarcode.png
            │       └── HEPG2_default_RNA_perBarcode.png
            ├── bc_overlap.assigned_counts.default.fromWorkflow.tsv
            ├── bc_overlap.counts.default.tsv
            ├── counts
            │   ├── BCNucleotideComposition.HEPG2_1_DNA.tsv.gz
            │   ├── BCNucleotideComposition.HEPG2_1_RNA.tsv.gz
            │   ├── BCNucleotideComposition.HEPG2_2_DNA.tsv.gz
            │   ├── BCNucleotideComposition.HEPG2_2_RNA.tsv.gz
            │   ├── BCNucleotideComposition.HEPG2_3_DNA.tsv.gz
            │   └── BCNucleotideComposition.HEPG2_3_RNA.tsv.gz
            ├── counts.filtered.tsv
            ├── counts.freqUMIs.HEPG2_1_DNA.txt
            ├── counts.freqUMIs.HEPG2_1_RNA.txt
            ├── counts.freqUMIs.HEPG2_2_DNA.txt
            ├── counts.freqUMIs.HEPG2_2_RNA.txt
            ├── counts.freqUMIs.HEPG2_3_DNA.txt
            ├── counts.freqUMIs.HEPG2_3_RNA.txt
            ├── counts.raw.tsv
            ├── statistic_assigned_bc_correlation_merged_fromWorkflow_default.tsv
            ├── statistic_assigned_counts_merged_fromWorkflow_default.tsv
            ├── statistic_assigned_counts_single_fromWorkflow_default.tsv
            ├── statistic_bc_correlation_merged_default.tsv
            └── statistic_oligo_correlation_merged_fromWorkflow_default.tsv