.. _GSE293036 example: .. role:: bash(code) :language: bash ============================================== GSE293036 (Granitto et al. GWAS variant MPRA) ============================================== This example runs the assignment and experiment workflows on data from `Granitto et al. Genome-wide discovery of multiple sclerosis genetic risk variant allelic regulatory activity. G3 Genes|Genomes|Genetics (2025) `_. The data were published in `GEO:GSE293036 `_. The authors used a massively parallel reporter assay (MPRA) to identify multiple sclerosis (MS) genetic risk variants with allele-specific regulatory activity, testing oligos in three conditions: GM12878, MS-1, and MS-2. .. note:: The authors shared their experimental design and data through GEO. For the barcode library, they aligned reads to a custom reference including both forward and reverse complement versions of the oligo sequences. For simplicity, we just included the forward versions in the design file, achieving similar oligo coverage (number of barcodes per oligo). Prerequisites ============= This example depends on the following data and software: Installation of MPRAsnakeflow ------------------------------ Please install conda, set up the MPRAsnakeflow environment, and clone the current MPRAsnakeflow master branch. You can find more help under :ref:`Installation`. In addition, install `sra-toolkit` and `openpyxl` to download and process the data (for example via conda): .. code-block:: bash conda install -c bioconda sra-tools conda install -c conda-forge openpyxl Design file ----------- The oligo library sequences are provided in Supplementary Table 3 of the paper (``Supplementary_Table_3_G3-2025-406100.xlsx``, columns ``seq_id`` and ``twist_seq``). Download the supplementary data zip, extract the table, and convert it to a gzipped FASTA, filtering out the reverse-complement (``_RC``) entries: .. code-block:: bash mkdir -p data cd data wget -O jkaf192_supplementary_data.zip \ https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/g3journal/15/11/10.1093_g3journal_jkaf192/1/jkaf192_supplementary_data.zip unzip -j jkaf192_supplementary_data.zip Supplementary_Table_3_G3-2025-406100.xlsx python3 - << 'PYEOF' import openpyxl, gzip wb = openpyxl.load_workbook("Supplementary_Table_3_G3-2025-406100.xlsx", read_only=True) ws = wb.active with gzip.open("design.fa.gz", "wt") as fh: for row in ws.iter_rows(min_row=2, values_only=True): seq_id, twist_seq = row[0], row[1] if seq_id and not str(seq_id).endswith("_RC"): fh.write(f">{seq_id}\n{twist_seq}\n") PYEOF cd .. Reads assignment data ---------------------- The barcode library sequencing (`GEO:GSM8874361 `_) is available under SRA accession ``SRR32870551``. This paired-end run has the barcode in read 1 and the oligo sequence in read 2. .. code-block:: bash mkdir -p data cd data prefetch SRR32870551 fastq-dump --gzip --split-files SRR32870551 && \ mv SRR32870551_1.fastq.gz SRR32870551_Barcode_1.fastq.gz && \ mv SRR32870551_2.fastq.gz SRR32870551_Barcode_2.fastq.gz cd .. .. note:: Please be sure that all files are downloaded completely without errors! With .. code-block:: bash tree data The folder should look like this: .. code-block:: text data ├── SRR32870551_Barcode_1.fastq.gz ├── SRR32870551_Barcode_2.fastq.gz └── design.fa.gz Reads count data ---------------- In this example we run just one sequencing lane for two replicates of the GM12878 condition, using the corresponding control replicates as input. .. list-table:: :header-rows: 1 * - Condition - Replicate - GSM - SRR * - Control - 2 - GSM8874363 - SRR32870549 * - Control - 3 - GSM8874364 - SRR32870548 * - GM12878 - 2 - GSM8874366 - SRR32870540 * - GM12878 - 3 - GSM8874367 - SRR32870536 .. note:: Each GSM has multiple sequencing runs in SRA. This example uses only the first run (r1) of each replicate. For a complete analysis, download and include all runs per replicate. .. code-block:: bash mkdir -p data cd data prefetch SRR32870549 SRR32870548 SRR32870540 SRR32870536 fastq-dump --gzip --split-files SRR32870549 && mv SRR32870549_1.fastq.gz SRR32870549_Control.fastq.gz && \ fastq-dump --gzip --split-files SRR32870548 && mv SRR32870548_1.fastq.gz SRR32870548_Control.fastq.gz && \ fastq-dump --gzip --split-files SRR32870540 && mv SRR32870540_1.fastq.gz SRR32870540_GM12878.fastq.gz && \ fastq-dump --gzip --split-files SRR32870536 && mv SRR32870536_1.fastq.gz SRR32870536_GM12878.fastq.gz cd .. With .. code-block:: bash tree data The folder should look like this: .. code-block:: text data ├── SRR32870536_GM12878.fastq.gz ├── SRR32870540_GM12878.fastq.gz ├── SRR32870548_Control.fastq.gz ├── SRR32870549_Control.fastq.gz ├── SRR32870551_Barcode_1.fastq.gz ├── SRR32870551_Barcode_2.fastq.gz └── design.fa.gz MPRAsnakeflow ============= We run assignment and count workflows together. It is also possible to run them separately using different config files. In that case, use assignment ``fromFile`` instead of ``fromConfig``. First, define the config file and the experiment CSV file to map DNA/RNA counts to the correct replicates. Important details for the config file: oligos are 200 bp long and barcodes are 20 bp. The linker sequence separating the barcode (read 1) from the oligo is :code:`TCTAGAGGTTCGTCGACGCGATCGCAGGAGCCGCAGTG`. Create config files ------------------- .. code-block:: bash cat << 'EOF' > config.yaml --- version: "0.7" assignments: GSE293036Assignment: bc_length: 20 BC_rev_comp: false linker: TCTAGAGGTTCGTCGACGCGATCGCAGGAGCCGCAGTG alignment_tool: split_number: 16 tool: bbmap configs: min_mapping_quality: 30 design_check: sequence_start: 1 sequence_length: 200 FWD: - data/SRR32870551_Barcode_1.fastq.gz REV: - data/SRR32870551_Barcode_2.fastq.gz design_file: data/design.fa.gz configs: default: {} experiments: GSE293036Experiment: bc_length: 20 split_number: 16 data_folder: data experiment_file: experiments.csv assignments: GSE293036Assignment: type: config assignment_name: GSE293036Assignment assignment_config: default configs: default: filter: bc_threshold: 30 EOF Create the :code:`experiments.csv` file to map DNA/RNA counts to replicates. The control samples serve as gDNA input (DNA_BC_F) for all conditions. .. code-block:: bash cat << 'EOF' > experiments.csv Condition,Replicate,DNA_BC_F,RNA_BC_F GM12878,3,SRR32870548_Control.fastq.gz,SRR32870536_GM12878.fastq.gz GM12878,2,SRR32870549_Control.fastq.gz,SRR32870540_GM12878.fastq.gz EOF Run snakemake ------------- Now we are ready to run MPRAsnakeflow. We run this example on one node with 60 GB memory and 16 cores. We run the pipeline directly in the working folder. The MPRAsnakeflow workflow can be located in a different directory. Here we assume :code:`/work/${USER}/MPRAsnakeflow`. First, do a dry run with snakemake using :code:`-n`: .. code-block:: bash conda activate mprasnakeflow snakemake -c 1 --sdm apptainer conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -n --quiet rules If the dry run finishes without errors, run the workflow. We use a machine with 16 threads/cores and 60 GB memory. Therefore, :code:`split_number` is set to 16 for parallelization. We also use 8 threads for mapping (bbmap), and snakemake ensures that no more than 16 threads are used in total. .. code-block:: bash snakemake -c 16 --sdm apptainer conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -q --set-threads assignment_mapping_bbmap=8 --resources mem_mb=60000 .. note:: Please adapt your setup when running in a cluster environment. An example SLURM profile is available in MPRAsnakeflow under :code:`profiles/default/config.yaml`. You can use it with snakemake via :code:`--workflow-profile $PIPELINE/profiles/default`, but adjust it first, especially the :code:`slurm_partition` setting. Results ------- For assignment, all output files are written to :code:`results/assignment/GSE293036Assignment`. The final assignment file is :code:`results/assignment/GSE293036Assignment/assignment_barcodes.default.tsv.gz`. You should also inspect the QC report: :code:`results/assignment/GSE293036Assignment/qc_report.default_GSE293036.html`. An example assignment QC report is available here: `Example assignment QC report `_. For experiments, all output files are written to :code:`results/experiments/GSE293036Experiment`. Final count file: * :code:`results/experiments/GSE293036Experiment/reporter_experiment.barcode.GM12878.GSE293036Assignment.default.all.tsv.gz` You should also inspect the QC report: :code:`results/experiments/GSE293036Experiment/qc_report.GM12878.GSE293036Assignment.default.html`. An example QC report is available here: `GM12878 QC report `_.