GSE316891 (Yan et al. L1a1 MPRA)
This example demonstrates running the assignment and experiment workflows on the L1a1 MPRA dataset from Yan et al. Data is available on GEO:GSE316891. No preprint or publication is available at the moment, but the data is already public and can be used for testing MPRAsnakeflow.
Prerequisites
This example depends on the following data and software:
Installation of MPRAsnakeflow
Please install conda, set up the MPRAsnakeflow environment, and clone the current MPRAsnakeflow master branch. See Installation for details.
Assignment Workflow
Download and Prepare Data
mkdir -p data/assignment
cd data/assignment
curl https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9462nnn/GSM9462019/suppl/GSM9462019%5FL1a1%2Efasta%2Egz | zcat > GSM9462019_L1a1.fa
cat GSM9462019_L1a1.fa | sed 's/\[/|/g' | sed 's/\]/|/g' > GSM9462019_L1a1.fix.fa
prefetch --max-size 20GB SRR36893758
fastq-dump --gzip --split-files SRR36893758
The folder should look like this:
data/assignment/
├── GSM9462019_L1a1.fa
├── GSM9462019_L1a1.fix.fa
├── SRR36893758_1.fastq.gz
├── SRR36893758_2.fastq.gz
└── SRR36893758_3.fastq.gz
Configure Assignment Workflow
Create a config file (e.g. config_assignment.yaml) with the following content:
version: "0.7.0"
assignments:
GSE316891L1a1:
bc_length: 15
alignment_tool:
split_number: 30
tool: bbmap
FWD:
- data/assignment/SRR36893758_1.fastq.gz
REV:
- data/assignment/SRR36893758_2.fastq.gz
BC:
- data/assignment/SRR36893758_3.fastq.gz
design_check:
sequence_start: 1
sequence_length: 270
design_file: data/assignment/GSM9462019_L1a1.fix.fa
configs:
default: {}
Run Assignment Workflow
snakemake -c all --sdm apptainer conda \
--snakefile /home/user/MPRAsnakeflow/workflow/Snakefile \
--configfile config_assignment.yaml
Experiment Workflow
Download and Prepare Data
mkdir -p data/experiment
cd data/
wget -O GSM9462019_L1a1_oligos_to_barcodes.txt.gz https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9462nnn/GSM9462019/suppl/GSM9462019%5FL1a1%5Foligos%5Fto%5Fbarcodes%2Etxt%2Egz
zcat GSM9462019_L1a1_oligos_to_barcodes.txt.gz | tail -n +2 | awk -F',' -v "OFS=\t" '{print $2,$1}' | sed 's/\[/|/g' | sed 's/\]/|/g' | sort | gzip -c > GSM9462019_L1a1_oligos_to_barcodes.tsv.gz
cd experiment
prefetch --max-size 30GB SRR36893764 SRR36893763 SRR36893762 SRR36893761 SRR36893760 SRR36893759
fastq-dump --gzip --split-files SRR36893764
fastq-dump --gzip --split-files SRR36893763
fastq-dump --gzip --split-files SRR36893762
fastq-dump --gzip --split-files SRR36893761
fastq-dump --gzip --split-files SRR36893760
fastq-dump --gzip --split-files SRR36893759
The folder should look like this:
data/experiment/
├── SRR36893759_1.fastq.gz
├── SRR36893759_2.fastq.gz
├── SRR36893759_3.fastq.gz
├── SRR36893760_1.fastq.gz
├── SRR36893760_2.fastq.gz
├── SRR36893760_3.fastq.gz
├── SRR36893761_1.fastq.gz
├── SRR36893761_2.fastq.gz
├── SRR36893761_3.fastq.gz
├── SRR36893762_1.fastq.gz
├── SRR36893762_2.fastq.gz
├── SRR36893762_3.fastq.gz
├── SRR36893763_1.fastq.gz
├── SRR36893763_2.fastq.gz
├── SRR36893763_3.fastq.gz
├── SRR36893764_1.fastq.gz
├── SRR36893764_2.fastq.gz
└── SRR36893764_3.fastq.gz
Create experiment file (experiments.csv):
Condition,Replicate,DNA_BC_F,DNA_BC_R,DNA_UMI,RNA_BC_F,RNA_BC_R,RNA_UMI
L1a1,1,SRR36893764_1.fastq.gz,SRR36893764_2.fastq.gz,SRR36893764_3.fastq.gz,SRR36893763_1.fastq.gz,SRR36893763_2.fastq.gz,SRR36893763_3.fastq.gz
L1a1,2,SRR36893762_1.fastq.gz,SRR36893762_2.fastq.gz,SRR36893762_3.fastq.gz,SRR36893761_1.fastq.gz,SRR36893761_2.fastq.gz,SRR36893761_3.fastq.gz
L1a1,3,SRR36893760_1.fastq.gz,SRR36893760_2.fastq.gz,SRR36893760_3.fastq.gz,SRR36893759_1.fastq.gz,SRR36893759_2.fastq.gz,SRR36893759_3.fastq.gz
Configure Experiment Workflow
Create a config file (e.g. config_experiment.yaml):
version: "0.7"
experiments:
GSE316891L1a1:
bc_length: 15
umi_length: 16
data_folder: data/experiment
experiment_file: experiments.csv
assignments:
L1a1Assignment:
type: file
assignment_file: results/assignment/GSE316891L1a1/assignment_barcodes.default.tsv.gz
GEOAssignment:
type: file
assignment_file: data/GSM9462019_L1a1_oligos_to_barcodes.tsv.gz
configs:
default: {}
Run Experiment Workflow
snakemake -c all --sdm apptainer conda \
--snakefile /home/user/MPRAsnakeflow/workflow/Snakefile \
--configfile config_experiment.yaml
Results
Assignment output files will be in results/assignment/GSE316891L1a1. The final assignment is results/assignment/GSE316891L1a1/assignment_barcodes.default.tsv.gz and the QC report is results/assignment/GSE316891L1a1/qc_report.default.html. We also provide the Assignment QC report.
Inspecting their assignment file, they retrieve 77795 out of 79812 oligos with 113 mean and 93 median barcodes per oligo. Our default assignment retrieves 77208 oligos with 87 mean and 71 median barcodes per oligo. We are very strict with the majority vote (75% of the reads need to support the same oligo), which leads to a lower number of assigned oligos, but also a higher confidence in the assignment. You can adjust the config to be less strict and retrieve more oligos, but we recommend inspecting the QC report and assignment statistics before doing so.
Experiment output files will be in results/experiments/GSE316891L1a1. The main count files are results/experiments/GSE316891L1a1/reporter_experiment.barcode.L1a1.GEOAssignment.default.all.tsv.gz and results/experiments/GSE316891L1a1/reporter_experiment.barcode.L1a1.L1a1Assignment.default.all.tsv.gz. QC reports are in the same folder and can be inspected for our L1a1Assignment here and for their GEOAssignment here. QC reports are very similar for both assignments with slightly more barcodes assigned in the GEOAssignment due to the richer assignment file. The final count files are very similar between the two assignments. Here we report the correlations of the normalized DNA, normalized RNA, and the log2 fold change between RNA and DNA of the two assignments, separated by replicate:
replicate |
number of shared oligos |
pearson log2FC |
spearman log2FC |
pearson RNA |
spearman RNA |
pearson DNA |
spearman DNA |
|---|---|---|---|---|---|---|---|
1 |
75388 |
0.88 |
0.89 |
0.89 |
0.92 |
0.82 |
0.83 |
2 |
75462 |
0.88 |
0.9 |
0.91 |
0.93 |
0.81 |
0.83 |
3 |
75151 |
0.87 |
0.88 |
0.9 |
0.91 |
0.82 |
0.84 |