.. _GSE306816 example: .. role:: bash(code) :language: bash =================================== GSE306816 (Zhang et al. dpSTR MPRA) =================================== This example runs the assignment and experiment workflows on data from the preprint by `Zhang et al. Systematic Evaluation of the Impact of Promoter Proximal Short Tandem Repeats on Expression. bioRxiv (2025). `_. The data were published in `GEO:GSE306816 `_. The authors used a custom processing pipeline, available on `GitHub `_, but here we run the analysis in MPRAsnakeflow. We only use their deep perturbation STR data (dpSTR). They also generated two additional MPRA datasets, h1STR and h2STR. .. note:: This work measures perturbations of short tandem repeats, which are difficult to sequence because of homopolymer stretches and repeats. Zhang et al. used a specific stutter-correction method during assignment. That method is highly specific to their data and cannot be generalized. Even without stutter correction, we obtain assignment results comparable to those reported in their manuscript. The counting/experiment workflow is very similar, and we can also use their published assignment files. Prerequisites ============= This example depends on the following data and software: Installation of MPRAsnakeflow ------------------------------ Please install conda, set up the MPRAsnakeflow environment, and clone the current MPRAsnakeflow master branch. You can find more help under :ref:`Installation`. In addition, install `sra-toolkit` to download the data from GEO (for example via conda): .. code-block:: bash conda install -c bioconda sra-tools Design file ----------- We need the design file and must modify it by trimming non-sequenced sequence parts and deduplicating the oligos used for assignment. .. code-block:: bash mkdir -p data/assignment wget -O data/assignment/uber_seq.fa https://raw.githubusercontent.com/gymreklab/str-mpra/refs/heads/master/design-uber/uber_seq.fa # Use only the first 135 bp of each oligo and remove duplicates. # The full oligo design is 231 bp, but only 135 bp are sequenced. # Duplicates can exist because one oligo can map to multiple barcodes. awk -v N=135 ' function flush( s,key) { if (hdr == "") return gsub(/[ \t\r\n]/, "", seq) s = substr(seq, 1, N) if (s == "") return if (!(s in seen)) { seen[s] = 1 order[++n] = s headers[s] = hdr } else { headers[s] = headers[s] "|" hdr } seq = ""; hdr = "" } /^>/ { flush(); hdr = substr($0,2); next } { seq = seq $0 } END { flush(); for (i=1; i<=n; i++) { key = order[i] print ">" headers[key] print key } } ' data/assignment/uber_seq.fa > data/assignment/uber_seq_135pb_dedup.fa Reads assignment data ---------------------- There are two sequencing sets: one generated on an Illumina sequencer and one on an Element sequencer. Here we process both together in parallel (SRA IDs: `SRR35184122` and `SRR35184121`). .. code-block:: bash mkdir -p data/assignment cd data/assignment prefetch --max-size 130GB SRR35184122 SRR35184121 fastq-dump --gzip --split-files SRR35184122 fastq-dump --gzip --split-files SRR35184121 cd ../../ .. note:: Please be sure that all files are downloaded completely without errors! With .. code-block:: bash tree data The folder should look like this: .. code-block:: text data └── assignment ├── SRR35184121_1.fastq.gz ├── SRR35184121_2.fastq.gz ├── SRR35184122_1.fastq.gz ├── SRR35184122_2.fastq.gz ├── uber_seq.fa └── uber_seq_135pb_dedup.fa Reads count data ---------------- We have three replicates of DNA and RNA counts each. These data must be downloaded. .. list-table:: :header-rows: 1 * - Replicate - gDNA - cDNA * - 1 - SRR35184099 - SRR35184102 * - 2 - SRR35184098 - SRR35184101 * - 3 - SRR35184097 - SRR35184100 .. code-block:: bash mkdir -p data/experiment cd data/experiment # download data prefetch --max-size 30GB SRR35184102 SRR35184101 SRR35184100 SRR35184099 SRR35184098 SRR35184097; fastq-dump --gzip --split-files SRR35184102 && \ fastq-dump --gzip --split-files SRR35184101 && \ fastq-dump --gzip --split-files SRR35184100 && \ fastq-dump --gzip --split-files SRR35184099 && \ fastq-dump --gzip --split-files SRR35184098 && \ fastq-dump --gzip --split-files SRR35184097 cd ../../ With .. code-block:: bash tree data The folder should look like this: .. code-block:: text data ├── assignment │ ├── SRR35184121_1.fastq.gz │ ├── SRR35184121_2.fastq.gz │ ├── SRR35184122_1.fastq.gz │ ├── SRR35184122_2.fastq.gz │ ├── uber_seq.fa │ └── uber_seq_135pb_dedup.fa └── experiment ├── SRR35184097_1.fastq.gz ├── SRR35184098_1.fastq.gz ├── SRR35184099_1.fastq.gz ├── SRR35184100_1.fastq.gz ├── SRR35184101_1.fastq.gz └── SRR35184102_1.fastq.gz Their assignment data --------------------- From the GEO dataset, we can also download the assignment files they used for counting (Element and Illumina separately). We use these files for the experiment workflow. In addition, we run the assignment workflow on the raw FASTQ files to show that we obtain comparable results. .. code-block:: bash mkdir -p data cd data wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9209nnn/GSM9209852/suppl/GSM9209852%5FdpSTR%5Fassociation%5FIllumina%5FBC%5FSTRs%2Etsv%2Egz wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9209nnn/GSM9209853/suppl/GSM9209853%5FdpSTR%5Fassociation%5FElement%5FBC%5FSTRs%2Etsv%2Egz zcat GSM9209852_dpSTR_association_Illumina_BC_STRs.tsv.gz | tail -n+4 | gzip -c > GSM9209852_dpSTR_association_Illumina_BC_STRs_raw.tsv.gz zcat GSM9209853_dpSTR_association_Element_BC_STRs.tsv.gz | tail -n+4 | gzip -c > GSM9209853_dpSTR_association_Element_BC_STRs_raw.tsv.gz cd ../ MPRAsnakeflow ============= We run assignment and count workflows together. It is also possible to run them separately using different config files. In that case, use assignment ``fromFile`` instead of ``fromConfig``. First, define the config file and the experiment CSV file to map DNA/RNA counts to the correct replicates. Important details for the config file: the oligo design is 135 bp and barcodes are 20 bp long. Remove the 5-bp sequence (``TCTAG``) at the 3-prime end of the barcode read. For counting, reads are longer than the 20-bp barcode. There is also a 3-prime adapter sequence that we remove using cutadapt. Create config files ------------------- .. code-block:: bash cat << 'EOF' > config.yaml --- version: "0.7" assignments: dpSTR: bc_length: 20 adapters: BC: three_prime: - TCTAG alignment_tool: split_number: 30 tool: bbmap configs: min_mapping_quality: 30 design_check: sequence_start: 1 sequence_length: 135 FWD: - data/assignment/SRR35184121_2.fastq.gz - data/assignment/SRR35184122_2.fastq.gz BC: - data/assignment/SRR35184121_1.fastq.gz - data/assignment/SRR35184122_1.fastq.gz design_file: data/assignment/uber_seq_135pb_dedup.fa configs: default: {} experiments: dpSTR: bc_length: 20 adapters: FWD: three_prime: - TCTAGAGGTTCGTCGACGCGATTATTATCATTACTTGTACAGCTCGTCCATGCCGAGAGTGATCCCGGCGGCGGTCACGAA data_folder: data/experiment experiment_file: experiments.csv assignments: dpSTRAssignment: type: config assignment_name: dpSTR assignment_config: default GSM9209852Illumina: type: file assignment_file: data/GSM9209852_dpSTR_association_Illumina_BC_STRs_raw.tsv.gz GSM9209853Element: type: file assignment_file: data/GSM9209853_dpSTR_association_Element_BC_STRs_raw.tsv.gz configs: default: {} EOF Create the :code:`experiments.csv` file to map DNA/RNA counts to replicates. The experiment file is a simple CSV file with the following content: .. code-block:: bash cat << 'EOF' > experiments.csv Condition,Replicate,DNA_BC_F,RNA_BC_F dpSTRHelaRNHWT,1,SRR35184099_1.fastq.gz,SRR35184102_1.fastq.gz dpSTRHelaRNHWT,2,SRR35184098_1.fastq.gz,SRR35184101_1.fastq.gz dpSTRHelaRNHWT,3,SRR35184097_1.fastq.gz,SRR35184100_1.fastq.gz EOF Run snakemake ------------- Now we are ready to run MPRAsnakeflow. We run this example on one node with 60 GB memory and 30 cores. We run the pipeline directly in the working folder. The MPRAsnakeflow workflow can be located in a different directory. Here we assume :code:`/home/user/MPRAsnakeflow`. First, do a dry run with snakemake using :code:`-n`: .. code-block:: bash conda activate mprasnakeflow snakemake -c 1 --sdm apptainer conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -n --quiet rules You should see a list of rules that will be executed. Example summary: .. code-block:: text Job stats: job count ----------------------------------------------------------------------- ------- assignment_check_design 1 assignment_preprocessing_adapter_remove 1 experiment_assigned_counts_filterAssignment 3 experiment_preprocessing_trim_reads 6 assignment_fastq_split 2 experiment_counts_onlyFWD_raw_counts 6 assignment_attach_idx 30 experiment_counts_filter_counts 6 experiment_statistic_counts_BC_in_RNA_DNA 6 experiment_statistic_counts_table 12 assignment_mapping_bbmap 30 experiment_counts_final_counts 6 experiment_statistic_counts_BC_in_RNA_DNA_merge 2 experiment_statistic_counts_frequent_umis 6 experiment_statistic_counts_stats_merge 2 assignment_collect 1 assignment_mapping_bbmap_getBCs 30 experiment_assigned_counts_assignBarcodes 18 experiment_counts_dna_rna_merge_counts 12 experiment_statistic_bc_overlap_run 8 experiment_statistic_counts_barcode_base_composition 6 experiment_statistic_counts_final 2 assignment_collectBCs 1 assignment_idx_bam 1 experiment_assigned_counts_dna_rna_merge 9 experiment_statistic_assigned_counts_combine_BC_assignment_stats_helper 3 experiment_statistic_bc_overlap_combine_counts 1 experiment_statistic_correlation_bc_counts 4 experiment_statistic_correlation_bc_counts_hist 4 assignment_filter 1 assignment_flagstat 1 assignment_statistic_totalCounts 1 experiment_assigned_counts_combine_replicates_barcode_output 3 experiment_assigned_counts_make_master_tables 3 experiment_statistic_assigned_counts_combine_BC_assignment_stats 3 experiment_statistic_assigned_counts_combine_stats_dna_rna_merge 3 experiment_statistic_bc_overlap_combine_assigned_counts 3 experiment_statistic_correlation_calculate 3 experiment_statistic_correlation_combine_bc_raw 1 experiment_statistic_correlation_hist_box_plots 3 assignment_statistic_assignedCounts 1 assignment_statistic_assignment 1 assignment_statistic_quality_metric 1 experiment_assigned_counts_combine_replicates 6 experiment_assigned_counts_copy_final_all_files 3 experiment_assigned_counts_copy_final_thresh_files 3 experiment_statistic_assigned_counts_combine_stats_dna_rna_merge_all 3 experiment_statistic_correlation_combine_bc_assigned 3 experiment_statistic_correlation_combine_oligo 3 experiment_statistic_quality_metric 3 qc_report_assoc 1 qc_report_count 3 all 1 total 276 If the dry run finishes without errors, run the workflow. We use a machine with 30 threads/cores and 60 GB memory. Therefore, :code:`split_number` is set to 30 for parallelization. We also use 10 threads for mapping (bbmap), and snakemake ensures that no more than 30 threads are used in total. .. code-block:: bash snakemake -c 30 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000 .. note:: Please adapt your setup when running in a cluster environment. An example SLURM profile is available in MPRAsnakeflow under :code:`profiles/default/config.yaml`. You can use it with snakemake via :code:`--workflow-profile $PIPELINE/profiles/default`, but adjust it first, especially the :code:`slurm_partition` setting. Results ------- For assignment, all output files are written to :code:`results/assignment/dpSTR`. The final assignment file is :code:`results/assignment/dpSTR/assignment_barcodes.default.tsv.gz`. You should also inspect the QC report: :code:`results/assignment/dpSTR/qc_report.default.html`. An example assignment QC report is available here: `Example assignment QC report `_. For experiments, all output files are written to :code:`results/experiments/dpSTR`. Final count files: * :code:`results/experiments/dpSTR/reporter_experiment.barcode.dpSTRHelaRNHWT.dpSTRAssignment.default.all.tsv.gz` (assignment generated in this workflow) * :code:`results/experiments/dpSTR/reporter_experiment.barcode.dpSTRHelaRNHWT.GSM9209852Illumina.default.all.tsv.gz` (published Illumina assignment) * :code:`results/experiments/dpSTR/reporter_experiment.barcode.dpSTRHelaRNHWT.GSM9209853Element.default.all.tsv.gz` (published Element assignment) You should also inspect the QC reports, for example :code:`results/experiments/dpSTR/qc_report.dpSTRHelaRNHWT.GSM9209852Illumina.default.html`. Example QC reports are available here: `Experiment QC report `_, `Illumina QC report `_, and `Element QC report `_.