Basic Experiment workflow
This example runs the count workflow on 5’/5’ WT MPRA data in the HepG2 cell line from Klein J., Agarwal, V., Keith, A., et al. 2019.
Prerequirements
This example depends on the following data and software:
Installing MPRAsnakeflow
Please install conda, the MPRAsnakeflow environment, and clone the actual MPRAsnakeflow
master branch. You will find more help under Installation.
Producing an association (.tsv.gz) file
This workflow requires a python dictionary of candidate regulatory sequence (CRS) mapped to their barcodes in a tab separated (.tsv) format. For this example the file can be generated using Basic assignment workflow or it can be found in resources/count_basic
folder in MPRAsnakelfow <https://github.com/kircherlab/MPRAsnakeflow/>`_(file :code:`SRR10800986_barcodes_to_coords.tsv.gz).
Alternatively, if the association file is in pickle (.pickle) format because you used MPRAflow, you can convert the same file to .tsv.gz format with the in-built function in MPRsnakeflow with the following code:
conda activate mprasnakeflow
python assignment_pickle_to_tsv.py --input <assignment_file>.pickle | sort | uniq | gzip -c > <assignment_file>.tsv.gz
Design (.fa) file
File can be generated using the Basic assignment workflow or downloaded from the
resources/count_basic
folder in MPRAsnakelfow.
Reads
There is one condition (HEPG2) with three technical replicates. Each replicate contains a forward (barcode-forward), reverse (barcode-reverse), and index (unique molecular identifier) read for DNA and RNA. These data must be downloaded. All data is publically available on the short read archive (SRA). We will use SRA-toolkit to obtain the data.
Note
You need 9 GB disk space to download the data and upwards of 50 GB to proccess it!
conda install sra-tools
mkdir -p count_basic/data
cd count_basic/data
fastq-dump --gzip --split-files SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886
cd ..
For large files and unstable internet connection we reccommend the comand prefetch
from SRA tools before running fastq-dump
. This command is much smarter in warnings when something went wrong.
conda install sra-tools
cd count_basic/data
prefetch SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886
fastq-dump --gzip --split-files SRR10800986
cd ..
Note
Please be sure that all files are downloaded completely without errors! Depending on your internet connection this can take a while. If you just want some data to run MPRAsnakeflow, you can just limit yourself to one condition and/or just one replicate.
The data folder view can be seen with the following command:
tree data
The folder should look like this:
data
├── design.fa
├── SRR10800881_1.fastq.gz
├── SRR10800881_2.fastq.gz
├── SRR10800881_3.fastq.gz
├── SRR10800882_1.fastq.gz
├── SRR10800882_2.fastq.gz
├── SRR10800882_3.fastq.gz
├── SRR10800883_1.fastq.gz
├── SRR10800883_2.fastq.gz
├── SRR10800883_3.fastq.gz
├── SRR10800884_1.fastq.gz
├── SRR10800884_2.fastq.gz
├── SRR10800884_3.fastq.gz
├── SRR10800885_1.fastq.gz
├── SRR10800885_2.fastq.gz
├── SRR10800885_3.fastq.gz
├── SRR10800886_1.fastq.gz
├── SRR10800886_2.fastq.gz
├── SRR10800886_3.fastq.gz
└── SRR10800986_barcodes_to_coords.tsv.gz
Here is an overview of the files:
Condition |
GEO Accession |
SRA Accession |
SRA Runs |
---|---|---|---|
HEPG2-DNA-1: HEPG2 DNA replicate 1 |
GSM4237863 |
SRX7474781 |
SRR10800881 |
HEPG2-RNA-1: HEPG2 RNA replicate 1 |
GSM4237864 |
SRX7474782 |
SRR10800882 |
HEPG2-DNA-2: HEPG2 DNA replicate 2 |
GSM4237865 |
SRX7474783 |
SRR10800883 |
HEPG2-RNA-2: HEPG2 RNA replicate 2 |
GSM4237866 |
SRX7474784 |
SRR10800884 |
HEPG2-DNA-3: HEPG2 DNA replicate 3 |
GSM4237867 |
SRX7474785 |
SRR10800885 |
HEPG2-RNA-3: HEPG2 RNA replicate 3 |
GSM4237868 |
SRX7474786 |
SRR10800886 |
Run MPRAsnakeflow
Now we are close to starting MPRAsnakeflow and count the number of barcodes. But before we need to generate an environment (.csv) file to tell snakemake the conditions, replicates and the corresponding reads.
Creating experiment.csv
Our experiment file looks exactly like this:
Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz
Save it into the count_basic/data
folder under experiment.csv
.
MPRAsnakeflow
Now we have everything at hand to run the count MPRAsnakeflow pipeline. We will run the pipeline directly in the count_basic
folder. The MPRAsnakeflow workflow can be in a different directory. Let’s assume /home/user/MPRAsnakeflow
. The MPRAsnakeflow count command is:
First we have to configure the config file and save it to the count_basic
folder. The config file is a simple text file with the following content:
---
experiments:
exampleCount:
bc_length: 15
umi_length: 10
data_folder: data
experiment_file: data/experiment.csv
demultiplex: false
assignments:
fromFile:
type: file
assignment_file: data/SRR10800986_barcodes_to_coords.tsv.gz
design_file: data/design.fa
configs:
exampleConfig:
filter:
bc_threshold: 10
DNA:
min_counts: 1
RNA:
min_counts: 1
First we do a try run using snakemake -n
option. The MPRAsnakeflow command is:
cd count_basic
conda activate mprasnakeflow
snakemake -c 1 --use-conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yml -n
You should see a list of rules that will be executed. This is the summary:
Job stats:
job count min threads max threads
------------------------------------------------------------ ------- ------------- -------------
all 1 1 1
assigned_counts_assignBarcodes 6 1 1
assigned_counts_dna_rna_merge 3 1 1
assigned_counts_filterAssignment 1 1 1
assigned_counts_make_master_tables 1 1 1
counts_create_BAM_umi 6 1 1
counts_dna_rna_merge_counts 6 1 1
counts_filter_counts 6 1 1
counts_final_counts_umi 6 1 1
counts_raw_counts_umi 6 1 1
statistic_assigned_counts_combine_BC_assignment_stats 1 1 1
statistic_assigned_counts_combine_BC_assignment_stats_helper 1 1 1
statistic_assigned_counts_combine_stats_dna_rna_merge 1 1 1
statistic_assigned_counts_combine_stats_dna_rna_merge_all 1 1 1
statistic_bc_overlap_combine_assigned_counts 1 1 1
statistic_bc_overlap_combine_counts 1 1 1
statistic_bc_overlap_run 4 1 1
statistic_correlation_bc_counts 2 1 1
statistic_correlation_calculate 1 1 1
statistic_correlation_combine_bc_assigned 1 1 1
statistic_correlation_combine_bc_raw 1 1 1
statistic_correlation_combine_oligo 1 1 1
statistic_counts_BC_in_RNA_DNA 6 1 1
statistic_counts_BC_in_RNA_DNA_merge 2 1 1
statistic_counts_barcode_base_composition 6 1 1
statistic_counts_final 2 1 1
statistic_counts_frequent_umis 6 1 1
statistic_counts_stats_merge 2 1 1
statistic_counts_table 12 1 1
total 94 1 1
When dry-drun does not give any errors we will run the workflow. We use a machine with 30 threads/cores to run the workflow. The MPRAsnakeflow command is:
snakemake -c 30 --use-conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yml
Note
Please modify your code when running in a cluster environment. We have an example SLURM config file here config/sbatch.yml
.
If everything works fine the 29 rules showed above will run. Everything starting with counts_
beolngs to raw count rules, with assigned_counts_
to counts assigned to the assignment and statistic_
to statistics. Here is a brief description of the rules.
- all
The overall all rule. Here is defined what final output files are expected.
- counts_create_BAM_umi
Create a BAM file from FASTQ input, merge FW and REV read and save UMI in XI flag.
- counts_raw_counts_umi
Counting BCsxUMIs from the BAM files.
- counts_filter_counts
Filter the counts to BCs only of the correct length (defined in the config file).
- counts_final_counts_umi
Discarding PCR duplicates (taking BCxUMI only one time). Final result of counts can be found here:
results/experiments/exampleCount/counts/HepG2_<1,2,3>_<DNA/RNA>_filtered_counts.tsv.gz
.- counts_dna_rna_merge_counts
Merge DNA and RNA counts together. This is done in two ways. First no not allow zeros in DNA or RNA BCs (when
min_counts
is not zero for DNA and RNA). Second with zeros, so a BC can be defined only in the DNA or RNA (whenmin_counts
is zero for DNA or RNA)- assigned_counts_filterAssignment
Use only unique assignments.
- assigned_counts_assignBarcodes
Assign RNA and DNA barcodes seperately to make the statistic for assigned.
- assigned_counts_dna_rna_merge
Assign merged RNA/DNA barcodes. Filter BC depending on the min_counts option. Output for each replicate is here:
results/experiments/exampleCount/assigned_counts/fromFile/exampleConfig/HepG2_<1,2,3>_merged_assigned_counts.tsv.gz
.- assigned_counts_make_master_tables
Final master table with all replicates combined. Output is here:
results/experiments/exampleCount/assigned_counts/fromFile/exampleConfig/HepG2_allreps_merged.tsv.gz
and using thebc-threshold
hereresults/experiments/exampleCount/assigned_counts/fromFile/exampleConfig/HepG2_allreps_minThreshold_merged.tsv.gz
.- statistic_assigned_counts_combine_BC_assignment_stats
TODO
- statistic_assigned_counts_combine_BC_assignment_stats_helper
TODO
- statistic_assigned_counts_combine_stats_dna_rna_merge
TODO
- statistic_assigned_counts_combine_stats_dna_rna_merge_all
TODO
- statistic_bc_overlap_combine_assigned_counts
TODO
- statistic_bc_overlap_combine_counts
TODO
- statistic_bc_overlap_run
TODO
- statistic_correlation_bc_counts
TODO
- statistic_correlation_calculate
TODO
- statistic_correlation_combine_bc_assigned
TODO
- statistic_correlation_combine_bc_raw
TODO
- statistic_correlation_combine_oligo
TODO
- statistic_counts_BC_in_RNA_DNA
TODO
- statistic_counts_BC_in_RNA_DNA_merge
TODO
- statistic_counts_barcode_base_composition
TODO
- statistic_counts_final
TODO
- statistic_counts_frequent_umis
TODO
- statistic_counts_stats_merge
TODO
- statistic_counts_table
TODO
Todo
Rules not correct in example experiment workflow
Results
All needed output files will be in the :code:`results/assignment/countBasic`folder.
To generate a final report, the following code can be used
snakemake --config config.yml --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --report report.html
This html contains als information about the snakemake run and integrates statistic tables and plots.
Total file tree of the results folder: