Basic Experiment Workflow

This example runs the count workflow on 5’/5’ WT MPRA data in the HepG2 cell line from Klein J., Agarwal, V., Keith, A., et al. 2019.

Prerequisites

This example depends on the following data and software:

Installing MPRAsnakeflow

Please install conda, the MPRAsnakeflow environment, and clone the actual MPRAsnakeflow master branch. You will find more help under Installation.

Producing an Association (.tsv.gz) File

This workflow requires a Python dictionary of candidate regulatory sequences (CRS) mapped to their barcodes in a tab-separated (.tsv.gz) format. For this example, the file can be generated using Basic assignment workflow or found in the resources/count_basic folder in MPRAsnakeflow (file: SRR10800986_barcodes_to_coords.tsv.gz).

Alternatively, if the association file is in pickle (.pickle) format because you used MPRAflow, you can convert it to .tsv.gz format using the in-built function in MPRAsnakeflow with the following code:

conda activate mprasnakeflow
python assignment_pickle_to_tsv.py --input <assignment_file>.pickle | sort | uniq | gzip -c > <assignment_file>.tsv.gz

Design (.fa) File

The design file can be generated using the Basic assignment workflow or downloaded from the resources/count_basic folder in MPRAsnakeflow.

Reads

There is one condition (HepG2) with three technical replicates. Each replicate contains a forward (barcode-forward), reverse (barcode-reverse), and index (unique molecular identifier) read for DNA and RNA. These data must be downloaded. All data is publicly available on the Short Read Archive (SRA). We will use SRA-toolkit to obtain the data.

Note

You need 9 GB of disk space to download the data and upwards of 50 GB to process it!

conda install sra-tools
mkdir -p count_basic/data
cd count_basic/data
fastq-dump --gzip --split-files SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886
cd ..

For large files and unstable internet connections, we recommend the command prefetch from SRA tools before running fastq-dump. This command provides better warnings when something goes wrong.

conda install sra-tools
cd count_basic/data
prefetch SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886
fastq-dump --gzip --split-files SRR10800986
cd ..

Note

Please ensure that all files are downloaded completely without errors! Depending on your internet connection, this can take a while. If you just want some data to run MPRAsnakeflow, you can limit yourself to one condition and/or just one replicate.

The data folder view can be seen with the following command:

tree data

The folder should look like this:

data
├── design.fa
├── SRR10800881_1.fastq.gz
├── SRR10800881_2.fastq.gz
├── SRR10800881_3.fastq.gz
├── SRR10800882_1.fastq.gz
├── SRR10800882_2.fastq.gz
├── SRR10800882_3.fastq.gz
├── SRR10800883_1.fastq.gz
├── SRR10800883_2.fastq.gz
├── SRR10800883_3.fastq.gz
├── SRR10800884_1.fastq.gz
├── SRR10800884_2.fastq.gz
├── SRR10800884_3.fastq.gz
├── SRR10800885_1.fastq.gz
├── SRR10800885_2.fastq.gz
├── SRR10800885_3.fastq.gz
├── SRR10800886_1.fastq.gz
├── SRR10800886_2.fastq.gz
├── SRR10800886_3.fastq.gz
└── SRR10800986_barcodes_to_coords.tsv.gz

Here is an overview of the files:

HepG2 Data

Condition

GEO Accession

SRA Accession

SRA Runs

HepG2-DNA-1: HepG2 DNA replicate 1

GSM4237863

SRX7474781

SRR10800881

HepG2-RNA-1: HepG2 RNA replicate 1

GSM4237864

SRX7474782

SRR10800882

HepG2-DNA-2: HepG2 DNA replicate 2

GSM4237865

SRX7474783

SRR10800883

HepG2-RNA-2: HepG2 RNA replicate 2

GSM4237866

SRX7474784

SRR10800884

HepG2-DNA-3: HepG2 DNA replicate 3

GSM4237867

SRX7474785

SRR10800885

HepG2-RNA-3: HepG2 RNA replicate 3

GSM4237868

SRX7474786

SRR10800886

Run MPRAsnakeflow

Now we are ready to start MPRAsnakeflow and count the number of barcodes. First, we need to generate an experiment file (experiment.csv) to specify the conditions, replicates, and corresponding reads.

Creating experiment.csv

Our experiment file looks like this:

Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HepG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HepG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HepG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz

Save it into the count_basic/data folder as experiment.csv.

MPRAsnakeflow

Now we have everything ready to run the count workflow of MPRAsnakeflow. Run the pipeline directly in the count_basic folder. The MPRAsnakeflow workflow can be in a different directory. Let’s assume /home/user/MPRAsnakeflow.

First, configure the config file and save it to the count_basic folder. The config file is a simple text file with the following content:

---
version: "0.5"
experiments:
  exampleCount:
    bc_length: 15
    umi_length: 10
    data_folder: data
    experiment_file: experiment.csv
    demultiplex: false
    assignments:
      fromFile:
        type: file
        assignment_file: SRR10800986_barcodes_to_coords.tsv.gz
    design_file: design.fa
    label_file: labels.tsv
    configs:
      default: {}
      outlierZscore:
        filter:
          outlier_detection:
            method: rna_counts_zscore

Perform a dry-run using Snakemake’s -n option:

cd count_basic
conda activate mprasnakeflow
snakemake -c 1 --use-conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile /home/user/MPRAsnakeflow/resources/count_basic/config.yml -n -q

If the dry-run does not give any errors, run the workflow using 30 threads:

snakemake -c 30 --use-conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile /home/user/MPRAsnakeflow/resources/count_basic/config.yml

Note

Please modify your code when running in a cluster environment. We have an example SLURM config file here: config/sbatch.yml.

Results

All output files will be in the results/experiments/countBasic folder.

To generate a final report, use the following command:

snakemake --config config.yml --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --report report.html

This HTML report contains information about the Snakemake run and integrates statistics tables and plots.

Total file tree of the results folder:

results
└── experiments
    └── countBasic
        ├── assigned_counts
        │   └── fromWorkflow
        │       ├── default
        │       │   ├── HEPG2_1_merged_assigned_counts.tsv.gz
        │       │   ├── HEPG2_2_merged_assigned_counts.tsv.gz
        │       │   ├── HEPG2_3_merged_assigned_counts.tsv.gz
        │       │   ├── HEPG2_allreps_merged.tsv.gz
        │       │   └── HEPG2_allreps_minThreshold_merged.tsv.gz
        │       ├── HEPG2_1_DNA_final_counts.config.default.tsv.gz
        │       ├── HEPG2_1.merged.config.default.tsv.gz
        │       ├── HEPG2_1_RNA_final_counts.config.default.tsv.gz
        │       ├── HEPG2_2_DNA_final_counts.config.default.tsv.gz
        │       ├── HEPG2_2.merged.config.default.tsv.gz
        │       ├── HEPG2_2_RNA_final_counts.config.default.tsv.gz
        │       ├── HEPG2_3_DNA_final_counts.config.default.tsv.gz
        │       ├── HEPG2_3.merged.config.default.tsv.gz
        │       └── HEPG2_3_RNA_final_counts.config.default.tsv.gz
        ├── assignment
        │   └── fromWorkflow.tsv.gz
        ├── counts
        │   ├── HEPG2_1_DNA.bam
        │   ├── HEPG2_1_DNA_filtered_counts.tsv.gz
        │   ├── HEPG2_1_DNA_final_counts.tsv.gz
        │   ├── HEPG2_1_DNA_raw_counts.tsv.gz
        │   ├── HEPG2_1.merged.config.default.tsv.gz
        │   ├── HEPG2_1_RNA.bam
        │   ├── HEPG2_1_RNA_filtered_counts.tsv.gz
        │   ├── HEPG2_1_RNA_final_counts.tsv.gz
        │   ├── HEPG2_1_RNA_raw_counts.tsv.gz
        │   ├── HEPG2_2_DNA.bam
        │   ├── HEPG2_2_DNA_filtered_counts.tsv.gz
        │   ├── HEPG2_2_DNA_final_counts.tsv.gz
        │   ├── HEPG2_2_DNA_raw_counts.tsv.gz
        │   ├── HEPG2_2.merged.config.default.tsv.gz
        │   ├── HEPG2_2_RNA.bam
        │   ├── HEPG2_2_RNA_filtered_counts.tsv.gz
        │   ├── HEPG2_2_RNA_final_counts.tsv.gz
        │   ├── HEPG2_2_RNA_raw_counts.tsv.gz
        │   ├── HEPG2_3_DNA.bam
        │   ├── HEPG2_3_DNA_filtered_counts.tsv.gz
        │   ├── HEPG2_3_DNA_final_counts.tsv.gz
        │   ├── HEPG2_3_DNA_raw_counts.tsv.gz
        │   ├── HEPG2_3.merged.config.default.tsv.gz
        │   ├── HEPG2_3_RNA.bam
        │   ├── HEPG2_3_RNA_filtered_counts.tsv.gz
        │   ├── HEPG2_3_RNA_final_counts.tsv.gz
        │   └── HEPG2_3_RNA_raw_counts.tsv.gz
        └── statistic
            ├── assigned_counts
            │   └── fromWorkflow
            │       ├── default
            │       │   ├── combined
            │       │   │   └── HEPG2_merged_assigned_counts.statistic.tsv.gz
            │       │   ├── HEPG2_1_merged_assigned_counts.statistic.tsv.gz
            │       │   ├── HEPG2_2_merged_assigned_counts.statistic.tsv.gz
            │       │   ├── HEPG2_3_merged_assigned_counts.statistic.tsv.gz
            │       │   ├── HEPG2_average_allreps_merged.tsv.gz
            │       │   ├── HEPG2_barcodesPerInsert.png
            │       │   ├── HEPG2_correlation_minThreshold.tsv
            │       │   ├── HEPG2_correlation.tsv
            │       │   ├── HEPG2_DNA_pairwise_minThreshold.png
            │       │   ├── HEPG2_DNA_pairwise.png
            │       │   ├── HEPG2_group_barcodesPerInsert_box_minThreshold.png
            │       │   ├── HEPG2_group_barcodesPerInsert_box.png
            │       │   ├── HEPG2_Ratio_pairwise_minThreshold.png
            │       │   ├── HEPG2_Ratio_pairwise.png
            │       │   ├── HEPG2_RNA_pairwise_minThreshold.png
            │       │   └── HEPG2_RNA_pairwise.png
            │       ├── HEPG2_1_DNA_default.statistic.tsv.gz
            │       ├── HEPG2_1_RNA_default.statistic.tsv.gz
            │       ├── HEPG2_2_DNA_default.statistic.tsv.gz
            │       ├── HEPG2_2_RNA_default.statistic.tsv.gz
            │       ├── HEPG2_3_DNA_default.statistic.tsv.gz
            │       └── HEPG2_3_RNA_default.statistic.tsv.gz
            ├── barcode
            │   ├── assigned_counts
            │   │   └── fromWorkflow
            │   │       ├── HEPG2_default_barcode_correlation.tsv
            │   │       ├── HEPG2_default_barcode_DNA_pairwise.png
            │   │       ├── HEPG2_default_barcode_Ratio_pairwise.png
            │   │       ├── HEPG2_default_barcode_RNA_pairwise.png
            │   │       ├── HEPG2_default_DNA_perBarcode.png
            │   │       └── HEPG2_default_RNA_perBarcode.png
            │   └── counts
            │       ├── HEPG2_default_barcode_correlation.tsv
            │       ├── HEPG2_default_barcode_DNA_pairwise.png
            │       ├── HEPG2_default_barcode_Ratio_pairwise.png
            │       ├── HEPG2_default_barcode_RNA_pairwise.png
            │       ├── HEPG2_default_DNA_perBarcode.png
            │       └── HEPG2_default_RNA_perBarcode.png
            ├── bc_overlap.assigned_counts.default.fromWorkflow.tsv
            ├── bc_overlap.counts.default.tsv
            ├── counts
            │   ├── BCNucleotideComposition.HEPG2_1_DNA.tsv.gz
            │   ├── BCNucleotideComposition.HEPG2_1_RNA.tsv.gz
            │   ├── BCNucleotideComposition.HEPG2_2_DNA.tsv.gz
            │   ├── BCNucleotideComposition.HEPG2_2_RNA.tsv.gz
            │   ├── BCNucleotideComposition.HEPG2_3_DNA.tsv.gz
            │   └── BCNucleotideComposition.HEPG2_3_RNA.tsv.gz
            ├── counts.filtered.tsv
            ├── counts.freqUMIs.HEPG2_1_DNA.txt
            ├── counts.freqUMIs.HEPG2_1_RNA.txt
            ├── counts.freqUMIs.HEPG2_2_DNA.txt
            ├── counts.freqUMIs.HEPG2_2_RNA.txt
            ├── counts.freqUMIs.HEPG2_3_DNA.txt
            ├── counts.freqUMIs.HEPG2_3_RNA.txt
            ├── counts.raw.tsv
            ├── statistic_assigned_bc_correlation_merged_fromWorkflow_default.tsv
            ├── statistic_assigned_counts_merged_fromWorkflow_default.tsv
            ├── statistic_assigned_counts_single_fromWorkflow_default.tsv
            ├── statistic_bc_correlation_merged_default.tsv
            └── statistic_oligo_correlation_merged_fromWorkflow_default.tsv