GSE306816 (Zhang et al. dpSTR MPRA)

This example runs the assignment and experiment workflows on data from the preprint by Zhang et al. Systematic Evaluation of the Impact of Promoter Proximal Short Tandem Repeats on Expression. bioRxiv (2025).. The data were published in GEO:GSE306816.

The authors used a custom processing pipeline, available on GitHub, but here we run the analysis in MPRAsnakeflow. We only use their deep perturbation STR data (dpSTR). They also generated two additional MPRA datasets, h1STR and h2STR.

Note

This work measures perturbations of short tandem repeats, which are difficult to sequence because of homopolymer stretches and repeats. Zhang et al. used a specific stutter-correction method during assignment. That method is highly specific to their data and cannot be generalized. Even without stutter correction, we obtain assignment results comparable to those reported in their manuscript. The counting/experiment workflow is very similar, and we can also use their published assignment files.

Prerequisites

This example depends on the following data and software:

Installation of MPRAsnakeflow

Please install conda, set up the MPRAsnakeflow environment, and clone the current MPRAsnakeflow master branch. You can find more help under Installation.

In addition, install sra-toolkit to download the data from GEO (for example via conda):

conda install -c bioconda sra-tools

Design file

We need the design file and must modify it by trimming non-sequenced sequence parts and deduplicating the oligos used for assignment.

mkdir -p data/assignment

wget -O data/assignment/uber_seq.fa https://raw.githubusercontent.com/gymreklab/str-mpra/refs/heads/master/design-uber/uber_seq.fa

# Use only the first 135 bp of each oligo and remove duplicates.
# The full oligo design is 231 bp, but only 135 bp are sequenced.
# Duplicates can exist because one oligo can map to multiple barcodes.
awk -v N=135 '
function flush(   s,key) {
    if (hdr == "") return
    gsub(/[ \t\r\n]/, "", seq)
    s = substr(seq, 1, N)
    if (s == "") return

    if (!(s in seen)) {
    seen[s] = 1
    order[++n] = s
    headers[s] = hdr
    } else {
    headers[s] = headers[s] "|" hdr
    }
    seq = ""; hdr = ""
}

/^>/ { flush(); hdr = substr($0,2); next }
{ seq = seq $0 }
END { flush();
        for (i=1; i<=n; i++) {
        key = order[i]
        print ">" headers[key]
        print key
        }
    }
' data/assignment/uber_seq.fa > data/assignment/uber_seq_135pb_dedup.fa

Reads assignment data

There are two sequencing sets: one generated on an Illumina sequencer and one on an Element sequencer. Here we process both together in parallel (SRA IDs: SRR35184122 and SRR35184121).

mkdir -p data/assignment
cd data/assignment
prefetch --max-size 130GB SRR35184122 SRR35184121
fastq-dump --gzip --split-files SRR35184122
fastq-dump --gzip --split-files SRR35184121

cd ../../

Note

Please be sure that all files are downloaded completely without errors!

With

tree data

The folder should look like this:

data
└── assignment
   ├── SRR35184121_1.fastq.gz
   ├── SRR35184121_2.fastq.gz
   ├── SRR35184122_1.fastq.gz
   ├── SRR35184122_2.fastq.gz
   ├── uber_seq.fa
   └── uber_seq_135pb_dedup.fa

Reads count data

We have three replicates of DNA and RNA counts each. These data must be downloaded.

Replicate - gDNA - cDNA

1 - SRR35184099 - SRR35184102

2 - SRR35184098 - SRR35184101

3 - SRR35184097 - SRR35184100

mkdir -p data/experiment
cd data/experiment
# download data
prefetch --max-size 30GB SRR35184102 SRR35184101 SRR35184100 SRR35184099 SRR35184098 SRR35184097;
fastq-dump --gzip --split-files SRR35184102 && \
fastq-dump --gzip --split-files SRR35184101 && \
fastq-dump --gzip --split-files SRR35184100 && \
fastq-dump --gzip --split-files SRR35184099 && \
fastq-dump --gzip --split-files SRR35184098 && \
fastq-dump --gzip --split-files SRR35184097

cd ../../

With

tree data

The folder should look like this:

data
├── assignment
│   ├── SRR35184121_1.fastq.gz
│   ├── SRR35184121_2.fastq.gz
│   ├── SRR35184122_1.fastq.gz
│   ├── SRR35184122_2.fastq.gz
│   ├── uber_seq.fa
│   └── uber_seq_135pb_dedup.fa
└── experiment
    ├── SRR35184097_1.fastq.gz
    ├── SRR35184098_1.fastq.gz
    ├── SRR35184099_1.fastq.gz
    ├── SRR35184100_1.fastq.gz
    ├── SRR35184101_1.fastq.gz
    └── SRR35184102_1.fastq.gz

Their assignment data

From the GEO dataset, we can also download the assignment files they used for counting (Element and Illumina separately). We use these files for the experiment workflow. In addition, we run the assignment workflow on the raw FASTQ files to show that we obtain comparable results.

mkdir -p data
cd data

wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9209nnn/GSM9209852/suppl/GSM9209852%5FdpSTR%5Fassociation%5FIllumina%5FBC%5FSTRs%2Etsv%2Egz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9209nnn/GSM9209853/suppl/GSM9209853%5FdpSTR%5Fassociation%5FElement%5FBC%5FSTRs%2Etsv%2Egz

zcat GSM9209852_dpSTR_association_Illumina_BC_STRs.tsv.gz  | tail -n+4 | gzip -c > GSM9209852_dpSTR_association_Illumina_BC_STRs_raw.tsv.gz
zcat GSM9209853_dpSTR_association_Element_BC_STRs.tsv.gz | tail -n+4 | gzip -c > GSM9209853_dpSTR_association_Element_BC_STRs_raw.tsv.gz

cd ../

MPRAsnakeflow

We run assignment and count workflows together. It is also possible to run them separately using different config files. In that case, use assignment fromFile instead of fromConfig.

First, define the config file and the experiment CSV file to map DNA/RNA counts to the correct replicates.

Important details for the config file: the oligo design is 135 bp and barcodes are 20 bp long. Remove the 5-bp sequence (TCTAG) at the 3-prime end of the barcode read.

For counting, reads are longer than the 20-bp barcode. There is also a 3-prime adapter sequence that we remove using cutadapt.

Create config files

cat << 'EOF' >  config.yaml
---
version: "0.7"
assignments:
    dpSTR:
        bc_length: 20
        adapters:
            BC:
                three_prime:
                    - TCTAG
        alignment_tool:
            split_number: 30
            tool: bbmap
            configs:
                min_mapping_quality: 30
        design_check:
            sequence_start: 1
            sequence_length: 135
        FWD:
            - data/assignment/SRR35184121_2.fastq.gz
            - data/assignment/SRR35184122_2.fastq.gz
        BC:
            - data/assignment/SRR35184121_1.fastq.gz
            - data/assignment/SRR35184122_1.fastq.gz
        design_file: data/assignment/uber_seq_135pb_dedup.fa
        configs:
            default: {}
experiments:
    dpSTR:
        bc_length: 20
        adapters:
            FWD:
                three_prime:
                    - TCTAGAGGTTCGTCGACGCGATTATTATCATTACTTGTACAGCTCGTCCATGCCGAGAGTGATCCCGGCGGCGGTCACGAA
        data_folder: data/experiment
        experiment_file: experiments.csv
        assignments:
            dpSTRAssignment:
                type: config
                assignment_name: dpSTR
                assignment_config: default
            GSM9209852Illumina:
                type: file
                assignment_file: data/GSM9209852_dpSTR_association_Illumina_BC_STRs_raw.tsv.gz
            GSM9209853Element:
                type: file
                assignment_file: data/GSM9209853_dpSTR_association_Element_BC_STRs_raw.tsv.gz
        configs:
            default: {}
EOF

Create the experiments.csv file to map DNA/RNA counts to replicates. The experiment file is a simple CSV file with the following content:

cat << 'EOF' >  experiments.csv
Condition,Replicate,DNA_BC_F,RNA_BC_F
dpSTRHelaRNHWT,1,SRR35184099_1.fastq.gz,SRR35184102_1.fastq.gz
dpSTRHelaRNHWT,2,SRR35184098_1.fastq.gz,SRR35184101_1.fastq.gz
dpSTRHelaRNHWT,3,SRR35184097_1.fastq.gz,SRR35184100_1.fastq.gz
EOF

Run snakemake

Now we are ready to run MPRAsnakeflow. We run this example on one node with 60 GB memory and 30 cores.

We run the pipeline directly in the working folder. The MPRAsnakeflow workflow can be located in a different directory. Here we assume /home/user/MPRAsnakeflow.

First, do a dry run with snakemake using -n:

conda activate mprasnakeflow
snakemake -c 1 --sdm apptainer conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -n --quiet rules

You should see a list of rules that will be executed. Example summary:

Job stats:
job                                                                        count
-----------------------------------------------------------------------  -------
assignment_check_design                                                        1
assignment_preprocessing_adapter_remove                                        1
experiment_assigned_counts_filterAssignment                                    3
experiment_preprocessing_trim_reads                                            6
assignment_fastq_split                                                         2
experiment_counts_onlyFWD_raw_counts                                           6
assignment_attach_idx                                                         30
experiment_counts_filter_counts                                                6
experiment_statistic_counts_BC_in_RNA_DNA                                      6
experiment_statistic_counts_table                                             12
assignment_mapping_bbmap                                                      30
experiment_counts_final_counts                                                 6
experiment_statistic_counts_BC_in_RNA_DNA_merge                                2
experiment_statistic_counts_frequent_umis                                      6
experiment_statistic_counts_stats_merge                                        2
assignment_collect                                                             1
assignment_mapping_bbmap_getBCs                                               30
experiment_assigned_counts_assignBarcodes                                     18
experiment_counts_dna_rna_merge_counts                                        12
experiment_statistic_bc_overlap_run                                            8
experiment_statistic_counts_barcode_base_composition                           6
experiment_statistic_counts_final                                              2
assignment_collectBCs                                                          1
assignment_idx_bam                                                             1
experiment_assigned_counts_dna_rna_merge                                       9
experiment_statistic_assigned_counts_combine_BC_assignment_stats_helper        3
experiment_statistic_bc_overlap_combine_counts                                 1
experiment_statistic_correlation_bc_counts                                     4
experiment_statistic_correlation_bc_counts_hist                                4
assignment_filter                                                              1
assignment_flagstat                                                            1
assignment_statistic_totalCounts                                               1
experiment_assigned_counts_combine_replicates_barcode_output                   3
experiment_assigned_counts_make_master_tables                                  3
experiment_statistic_assigned_counts_combine_BC_assignment_stats               3
experiment_statistic_assigned_counts_combine_stats_dna_rna_merge               3
experiment_statistic_bc_overlap_combine_assigned_counts                        3
experiment_statistic_correlation_calculate                                     3
experiment_statistic_correlation_combine_bc_raw                                1
experiment_statistic_correlation_hist_box_plots                                3
assignment_statistic_assignedCounts                                            1
assignment_statistic_assignment                                                1
assignment_statistic_quality_metric                                            1
experiment_assigned_counts_combine_replicates                                  6
experiment_assigned_counts_copy_final_all_files                                3
experiment_assigned_counts_copy_final_thresh_files                             3
experiment_statistic_assigned_counts_combine_stats_dna_rna_merge_all           3
experiment_statistic_correlation_combine_bc_assigned                           3
experiment_statistic_correlation_combine_oligo                                 3
experiment_statistic_quality_metric                                            3
qc_report_assoc                                                                1
qc_report_count                                                                3
all                                                                            1
total                                                                        276

If the dry run finishes without errors, run the workflow. We use a machine with 30 threads/cores and 60 GB memory. Therefore, split_number is set to 30 for parallelization. We also use 10 threads for mapping (bbmap), and snakemake ensures that no more than 30 threads are used in total.

snakemake -c 30 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000

Note

Please adapt your setup when running in a cluster environment. An example SLURM profile is available in MPRAsnakeflow under profiles/default/config.yaml. You can use it with snakemake via --workflow-profile $PIPELINE/profiles/default, but adjust it first, especially the slurm_partition setting.

Results

For assignment, all output files are written to results/assignment/dpSTR. The final assignment file is results/assignment/dpSTR/assignment_barcodes.default.tsv.gz. You should also inspect the QC report: results/assignment/dpSTR/qc_report.default.html.

An example assignment QC report is available here: Example assignment QC report.

For experiments, all output files are written to results/experiments/dpSTR.

Final count files:

  • results/experiments/dpSTR/reporter_experiment.barcode.dpSTRHelaRNHWT.dpSTRAssignment.default.all.tsv.gz (assignment generated in this workflow)

  • results/experiments/dpSTR/reporter_experiment.barcode.dpSTRHelaRNHWT.GSM9209852Illumina.default.all.tsv.gz (published Illumina assignment)

  • results/experiments/dpSTR/reporter_experiment.barcode.dpSTRHelaRNHWT.GSM9209853Element.default.all.tsv.gz (published Element assignment)

You should also inspect the QC reports, for example results/experiments/dpSTR/qc_report.dpSTRHelaRNHWT.GSM9209852Illumina.default.html.

Example QC reports are available here: Experiment QC report, Illumina QC report, and Element QC report.