GSE293036 (Granitto et al. GWAS variant MPRA)
This example runs the assignment and experiment workflows on data from Granitto et al. Genome-wide discovery of multiple sclerosis genetic risk variant allelic regulatory activity. G3 Genes|Genomes|Genetics (2025). The data were published in GEO:GSE293036.
The authors used a massively parallel reporter assay (MPRA) to identify multiple sclerosis (MS) genetic risk variants with allele-specific regulatory activity, testing oligos in three conditions: GM12878, MS-1, and MS-2.
Note
The authors shared their experimental design and data through GEO. For the barcode library, they aligned reads to a custom reference including both forward and reverse complement versions of the oligo sequences. For simplicity, we just included the forward versions in the design file, achieving similar oligo coverage (number of barcodes per oligo).
Prerequisites
This example depends on the following data and software:
Installation of MPRAsnakeflow
Please install conda, set up the MPRAsnakeflow environment, and clone the current MPRAsnakeflow master branch. You can find more help under Installation.
In addition, install sra-toolkit and openpyxl to download and process the data (for example via conda):
conda install -c bioconda sra-tools
conda install -c conda-forge openpyxl
Design file
The oligo library sequences are provided in Supplementary Table 3 of the paper (Supplementary_Table_3_G3-2025-406100.xlsx, columns seq_id and twist_seq). Download the supplementary data zip, extract the table, and convert it to a gzipped FASTA, filtering out the reverse-complement (_RC) entries:
mkdir -p data
cd data
wget -O jkaf192_supplementary_data.zip \
https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/g3journal/15/11/10.1093_g3journal_jkaf192/1/jkaf192_supplementary_data.zip
unzip -j jkaf192_supplementary_data.zip Supplementary_Table_3_G3-2025-406100.xlsx
python3 - << 'PYEOF'
import openpyxl, gzip
wb = openpyxl.load_workbook("Supplementary_Table_3_G3-2025-406100.xlsx", read_only=True)
ws = wb.active
with gzip.open("design.fa.gz", "wt") as fh:
for row in ws.iter_rows(min_row=2, values_only=True):
seq_id, twist_seq = row[0], row[1]
if seq_id and not str(seq_id).endswith("_RC"):
fh.write(f">{seq_id}\n{twist_seq}\n")
PYEOF
cd ..
Reads assignment data
The barcode library sequencing (GEO:GSM8874361) is available under SRA accession SRR32870551. This paired-end run has the barcode in read 1 and the oligo sequence in read 2.
mkdir -p data
cd data
prefetch SRR32870551
fastq-dump --gzip --split-files SRR32870551 && \
mv SRR32870551_1.fastq.gz SRR32870551_Barcode_1.fastq.gz && \
mv SRR32870551_2.fastq.gz SRR32870551_Barcode_2.fastq.gz
cd ..
Note
Please be sure that all files are downloaded completely without errors!
With
tree data
The folder should look like this:
data
├── SRR32870551_Barcode_1.fastq.gz
├── SRR32870551_Barcode_2.fastq.gz
└── design.fa.gz
Reads count data
In this example we run just one sequencing lane for two replicates of the GM12878 condition, using the corresponding control replicates as input.
Condition |
Replicate |
GSM |
SRR |
|---|---|---|---|
Control |
2 |
GSM8874363 |
SRR32870549 |
Control |
3 |
GSM8874364 |
SRR32870548 |
GM12878 |
2 |
GSM8874366 |
SRR32870540 |
GM12878 |
3 |
GSM8874367 |
SRR32870536 |
Note
Each GSM has multiple sequencing runs in SRA. This example uses only the first run (r1) of each replicate. For a complete analysis, download and include all runs per replicate.
mkdir -p data
cd data
prefetch SRR32870549 SRR32870548 SRR32870540 SRR32870536
fastq-dump --gzip --split-files SRR32870549 && mv SRR32870549_1.fastq.gz SRR32870549_Control.fastq.gz && \
fastq-dump --gzip --split-files SRR32870548 && mv SRR32870548_1.fastq.gz SRR32870548_Control.fastq.gz && \
fastq-dump --gzip --split-files SRR32870540 && mv SRR32870540_1.fastq.gz SRR32870540_GM12878.fastq.gz && \
fastq-dump --gzip --split-files SRR32870536 && mv SRR32870536_1.fastq.gz SRR32870536_GM12878.fastq.gz
cd ..
With
tree data
The folder should look like this:
data
├── SRR32870536_GM12878.fastq.gz
├── SRR32870540_GM12878.fastq.gz
├── SRR32870548_Control.fastq.gz
├── SRR32870549_Control.fastq.gz
├── SRR32870551_Barcode_1.fastq.gz
├── SRR32870551_Barcode_2.fastq.gz
└── design.fa.gz
MPRAsnakeflow
We run assignment and count workflows together. It is also possible to run them separately using different config files. In that case, use assignment fromFile instead of fromConfig.
First, define the config file and the experiment CSV file to map DNA/RNA counts to the correct replicates.
Important details for the config file: oligos are 200 bp long and barcodes are 20 bp. The linker sequence separating the barcode (read 1) from the oligo is TCTAGAGGTTCGTCGACGCGATCGCAGGAGCCGCAGTG.
Create config files
cat << 'EOF' > config.yaml
---
version: "0.7"
assignments:
GSE293036Assignment:
bc_length: 20
BC_rev_comp: false
linker: TCTAGAGGTTCGTCGACGCGATCGCAGGAGCCGCAGTG
alignment_tool:
split_number: 16
tool: bbmap
configs:
min_mapping_quality: 30
design_check:
sequence_start: 1
sequence_length: 200
FWD:
- data/SRR32870551_Barcode_1.fastq.gz
REV:
- data/SRR32870551_Barcode_2.fastq.gz
design_file: data/design.fa.gz
configs:
default: {}
experiments:
GSE293036Experiment:
bc_length: 20
split_number: 16
data_folder: data
experiment_file: experiments.csv
assignments:
GSE293036Assignment:
type: config
assignment_name: GSE293036Assignment
assignment_config: default
configs:
default:
filter:
bc_threshold: 30
EOF
Create the experiments.csv file to map DNA/RNA counts to replicates. The control samples serve as gDNA input (DNA_BC_F) for all conditions.
cat << 'EOF' > experiments.csv
Condition,Replicate,DNA_BC_F,RNA_BC_F
GM12878,3,SRR32870548_Control.fastq.gz,SRR32870536_GM12878.fastq.gz
GM12878,2,SRR32870549_Control.fastq.gz,SRR32870540_GM12878.fastq.gz
EOF
Run snakemake
Now we are ready to run MPRAsnakeflow. We run this example on one node with 60 GB memory and 16 cores.
We run the pipeline directly in the working folder. The MPRAsnakeflow workflow can be located in a different directory. Here we assume /work/${USER}/MPRAsnakeflow.
First, do a dry run with snakemake using -n:
conda activate mprasnakeflow
snakemake -c 1 --sdm apptainer conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -n --quiet rules
If the dry run finishes without errors, run the workflow. We use a machine with 16 threads/cores and 60 GB memory. Therefore, split_number is set to 16 for parallelization. We also use 8 threads for mapping (bbmap), and snakemake ensures that no more than 16 threads are used in total.
snakemake -c 16 --sdm apptainer conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -q --set-threads assignment_mapping_bbmap=8 --resources mem_mb=60000
Note
Please adapt your setup when running in a cluster environment. An example SLURM profile is available in MPRAsnakeflow under profiles/default/config.yaml. You can use it with snakemake via --workflow-profile $PIPELINE/profiles/default, but adjust it first, especially the slurm_partition setting.
Results
For assignment, all output files are written to results/assignment/GSE293036Assignment. The final assignment file is results/assignment/GSE293036Assignment/assignment_barcodes.default.tsv.gz. You should also inspect the QC report: results/assignment/GSE293036Assignment/qc_report.default_GSE293036.html.
An example assignment QC report is available here: Example assignment QC report.
For experiments, all output files are written to results/experiments/GSE293036Experiment.
Final count file:
results/experiments/GSE293036Experiment/reporter_experiment.barcode.GM12878.GSE293036Assignment.default.all.tsv.gz
You should also inspect the QC report: results/experiments/GSE293036Experiment/qc_report.GM12878.GSE293036Assignment.default.html.
An example QC report is available here: GM12878 QC report.