GSE306816 (Zhang et al. dpSTR MPRA)
This example runs the assignment and experiment workflows on data from the preprint by Zhang et al. Systematic Evaluation of the Impact of Promoter Proximal Short Tandem Repeats on Expression. bioRxiv (2025).. The data were published in GEO:GSE306816.
The authors used a custom processing pipeline, available on GitHub, but here we run the analysis in MPRAsnakeflow. We only use their deep perturbation STR data (dpSTR). They also generated two additional MPRA datasets, h1STR and h2STR.
Note
This work measures perturbations of short tandem repeats, which are difficult to sequence because of homopolymer stretches and repeats. Zhang et al. used a specific stutter-correction method during assignment. That method is highly specific to their data and cannot be generalized. Even without stutter correction, we obtain assignment results comparable to those reported in their manuscript. The counting/experiment workflow is very similar, and we can also use their published assignment files.
Prerequisites
This example depends on the following data and software:
Installation of MPRAsnakeflow
Please install conda, set up the MPRAsnakeflow environment, and clone the current MPRAsnakeflow master branch. You can find more help under Installation.
In addition, install sra-toolkit to download the data from GEO (for example via conda):
conda install -c bioconda sra-tools
Design file
We need the design file and must modify it by trimming non-sequenced sequence parts and deduplicating the oligos used for assignment.
mkdir -p data/assignment
wget -O data/assignment/uber_seq.fa https://raw.githubusercontent.com/gymreklab/str-mpra/refs/heads/master/design-uber/uber_seq.fa
# Use only the first 135 bp of each oligo and remove duplicates.
# The full oligo design is 231 bp, but only 135 bp are sequenced.
# Duplicates can exist because one oligo can map to multiple barcodes.
awk -v N=135 '
function flush( s,key) {
if (hdr == "") return
gsub(/[ \t\r\n]/, "", seq)
s = substr(seq, 1, N)
if (s == "") return
if (!(s in seen)) {
seen[s] = 1
order[++n] = s
headers[s] = hdr
} else {
headers[s] = headers[s] "|" hdr
}
seq = ""; hdr = ""
}
/^>/ { flush(); hdr = substr($0,2); next }
{ seq = seq $0 }
END { flush();
for (i=1; i<=n; i++) {
key = order[i]
print ">" headers[key]
print key
}
}
' data/assignment/uber_seq.fa > data/assignment/uber_seq_135pb_dedup.fa
Reads assignment data
There are two sequencing sets: one generated on an Illumina sequencer and one on an Element sequencer. Here we process both together in parallel (SRA IDs: SRR35184122 and SRR35184121).
mkdir -p data/assignment
cd data/assignment
prefetch --max-size 130GB SRR35184122 SRR35184121
fastq-dump --gzip --split-files SRR35184122
fastq-dump --gzip --split-files SRR35184121
cd ../../
Note
Please be sure that all files are downloaded completely without errors!
With
tree data
The folder should look like this:
data
└── assignment
├── SRR35184121_1.fastq.gz
├── SRR35184121_2.fastq.gz
├── SRR35184122_1.fastq.gz
├── SRR35184122_2.fastq.gz
├── uber_seq.fa
└── uber_seq_135pb_dedup.fa
Reads count data
We have three replicates of DNA and RNA counts each. These data must be downloaded.
Replicate - gDNA - cDNA |
|---|
1 - SRR35184099 - SRR35184102 |
2 - SRR35184098 - SRR35184101 |
3 - SRR35184097 - SRR35184100 |
mkdir -p data/experiment
cd data/experiment
# download data
prefetch --max-size 30GB SRR35184102 SRR35184101 SRR35184100 SRR35184099 SRR35184098 SRR35184097;
fastq-dump --gzip --split-files SRR35184102 && \
fastq-dump --gzip --split-files SRR35184101 && \
fastq-dump --gzip --split-files SRR35184100 && \
fastq-dump --gzip --split-files SRR35184099 && \
fastq-dump --gzip --split-files SRR35184098 && \
fastq-dump --gzip --split-files SRR35184097
cd ../../
With
tree data
The folder should look like this:
data
├── assignment
│ ├── SRR35184121_1.fastq.gz
│ ├── SRR35184121_2.fastq.gz
│ ├── SRR35184122_1.fastq.gz
│ ├── SRR35184122_2.fastq.gz
│ ├── uber_seq.fa
│ └── uber_seq_135pb_dedup.fa
└── experiment
├── SRR35184097_1.fastq.gz
├── SRR35184098_1.fastq.gz
├── SRR35184099_1.fastq.gz
├── SRR35184100_1.fastq.gz
├── SRR35184101_1.fastq.gz
└── SRR35184102_1.fastq.gz
Their assignment data
From the GEO dataset, we can also download the assignment files they used for counting (Element and Illumina separately). We use these files for the experiment workflow. In addition, we run the assignment workflow on the raw FASTQ files to show that we obtain comparable results.
mkdir -p data
cd data
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9209nnn/GSM9209852/suppl/GSM9209852%5FdpSTR%5Fassociation%5FIllumina%5FBC%5FSTRs%2Etsv%2Egz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9209nnn/GSM9209853/suppl/GSM9209853%5FdpSTR%5Fassociation%5FElement%5FBC%5FSTRs%2Etsv%2Egz
zcat GSM9209852_dpSTR_association_Illumina_BC_STRs.tsv.gz | tail -n+4 | gzip -c > GSM9209852_dpSTR_association_Illumina_BC_STRs_raw.tsv.gz
zcat GSM9209853_dpSTR_association_Element_BC_STRs.tsv.gz | tail -n+4 | gzip -c > GSM9209853_dpSTR_association_Element_BC_STRs_raw.tsv.gz
cd ../
MPRAsnakeflow
We run assignment and count workflows together. It is also possible to run them separately using different config files. In that case, use assignment fromFile instead of fromConfig.
First, define the config file and the experiment CSV file to map DNA/RNA counts to the correct replicates.
Important details for the config file: the oligo design is 135 bp and barcodes are 20 bp long. Remove the 5-bp sequence (TCTAG) at the 3-prime end of the barcode read.
For counting, reads are longer than the 20-bp barcode. There is also a 3-prime adapter sequence that we remove using cutadapt.
Create config files
cat << 'EOF' > config.yaml
---
version: "0.7"
assignments:
dpSTR:
bc_length: 20
adapters:
BC:
three_prime:
- TCTAG
alignment_tool:
split_number: 30
tool: bbmap
configs:
min_mapping_quality: 30
design_check:
sequence_start: 1
sequence_length: 135
FWD:
- data/assignment/SRR35184121_2.fastq.gz
- data/assignment/SRR35184122_2.fastq.gz
BC:
- data/assignment/SRR35184121_1.fastq.gz
- data/assignment/SRR35184122_1.fastq.gz
design_file: data/assignment/uber_seq_135pb_dedup.fa
configs:
default: {}
experiments:
dpSTR:
bc_length: 20
adapters:
FWD:
three_prime:
- TCTAGAGGTTCGTCGACGCGATTATTATCATTACTTGTACAGCTCGTCCATGCCGAGAGTGATCCCGGCGGCGGTCACGAA
data_folder: data/experiment
experiment_file: experiments.csv
assignments:
dpSTRAssignment:
type: config
assignment_name: dpSTR
assignment_config: default
GSM9209852Illumina:
type: file
assignment_file: data/GSM9209852_dpSTR_association_Illumina_BC_STRs_raw.tsv.gz
GSM9209853Element:
type: file
assignment_file: data/GSM9209853_dpSTR_association_Element_BC_STRs_raw.tsv.gz
configs:
default: {}
EOF
Create the experiments.csv file to map DNA/RNA counts to replicates. The experiment file is a simple CSV file with the following content:
cat << 'EOF' > experiments.csv
Condition,Replicate,DNA_BC_F,RNA_BC_F
dpSTRHelaRNHWT,1,SRR35184099_1.fastq.gz,SRR35184102_1.fastq.gz
dpSTRHelaRNHWT,2,SRR35184098_1.fastq.gz,SRR35184101_1.fastq.gz
dpSTRHelaRNHWT,3,SRR35184097_1.fastq.gz,SRR35184100_1.fastq.gz
EOF
Run snakemake
Now we are ready to run MPRAsnakeflow. We run this example on one node with 60 GB memory and 30 cores.
We run the pipeline directly in the working folder. The MPRAsnakeflow workflow can be located in a different directory. Here we assume /home/user/MPRAsnakeflow.
First, do a dry run with snakemake using -n:
conda activate mprasnakeflow
snakemake -c 1 --sdm apptainer conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -n --quiet rules
You should see a list of rules that will be executed. Example summary:
Job stats:
job count
----------------------------------------------------------------------- -------
assignment_check_design 1
assignment_preprocessing_adapter_remove 1
experiment_assigned_counts_filterAssignment 3
experiment_preprocessing_trim_reads 6
assignment_fastq_split 2
experiment_counts_onlyFWD_raw_counts 6
assignment_attach_idx 30
experiment_counts_filter_counts 6
experiment_statistic_counts_BC_in_RNA_DNA 6
experiment_statistic_counts_table 12
assignment_mapping_bbmap 30
experiment_counts_final_counts 6
experiment_statistic_counts_BC_in_RNA_DNA_merge 2
experiment_statistic_counts_frequent_umis 6
experiment_statistic_counts_stats_merge 2
assignment_collect 1
assignment_mapping_bbmap_getBCs 30
experiment_assigned_counts_assignBarcodes 18
experiment_counts_dna_rna_merge_counts 12
experiment_statistic_bc_overlap_run 8
experiment_statistic_counts_barcode_base_composition 6
experiment_statistic_counts_final 2
assignment_collectBCs 1
assignment_idx_bam 1
experiment_assigned_counts_dna_rna_merge 9
experiment_statistic_assigned_counts_combine_BC_assignment_stats_helper 3
experiment_statistic_bc_overlap_combine_counts 1
experiment_statistic_correlation_bc_counts 4
experiment_statistic_correlation_bc_counts_hist 4
assignment_filter 1
assignment_flagstat 1
assignment_statistic_totalCounts 1
experiment_assigned_counts_combine_replicates_barcode_output 3
experiment_assigned_counts_make_master_tables 3
experiment_statistic_assigned_counts_combine_BC_assignment_stats 3
experiment_statistic_assigned_counts_combine_stats_dna_rna_merge 3
experiment_statistic_bc_overlap_combine_assigned_counts 3
experiment_statistic_correlation_calculate 3
experiment_statistic_correlation_combine_bc_raw 1
experiment_statistic_correlation_hist_box_plots 3
assignment_statistic_assignedCounts 1
assignment_statistic_assignment 1
assignment_statistic_quality_metric 1
experiment_assigned_counts_combine_replicates 6
experiment_assigned_counts_copy_final_all_files 3
experiment_assigned_counts_copy_final_thresh_files 3
experiment_statistic_assigned_counts_combine_stats_dna_rna_merge_all 3
experiment_statistic_correlation_combine_bc_assigned 3
experiment_statistic_correlation_combine_oligo 3
experiment_statistic_quality_metric 3
qc_report_assoc 1
qc_report_count 3
all 1
total 276
If the dry run finishes without errors, run the workflow. We use a machine with 30 threads/cores and 60 GB memory. Therefore, split_number is set to 30 for parallelization. We also use 10 threads for mapping (bbmap), and snakemake ensures that no more than 30 threads are used in total.
snakemake -c 30 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000
Note
Please adapt your setup when running in a cluster environment. An example SLURM profile is available in MPRAsnakeflow under profiles/default/config.yaml. You can use it with snakemake via --workflow-profile $PIPELINE/profiles/default, but adjust it first, especially the slurm_partition setting.
Results
For assignment, all output files are written to results/assignment/dpSTR. The final assignment file is results/assignment/dpSTR/assignment_barcodes.default.tsv.gz. You should also inspect the QC report: results/assignment/dpSTR/qc_report.default.html.
An example assignment QC report is available here: Example assignment QC report.
For experiments, all output files are written to results/experiments/dpSTR.
Final count files:
results/experiments/dpSTR/reporter_experiment.barcode.dpSTRHelaRNHWT.dpSTRAssignment.default.all.tsv.gz(assignment generated in this workflow)results/experiments/dpSTR/reporter_experiment.barcode.dpSTRHelaRNHWT.GSM9209852Illumina.default.all.tsv.gz(published Illumina assignment)results/experiments/dpSTR/reporter_experiment.barcode.dpSTRHelaRNHWT.GSM9209853Element.default.all.tsv.gz(published Element assignment)
You should also inspect the QC reports, for example results/experiments/dpSTR/qc_report.dpSTRHelaRNHWT.GSM9209852Illumina.default.html.
Example QC reports are available here: Experiment QC report, Illumina QC report, and Element QC report.