.. _GSE316891_example: .. role:: bash(code) :language: bash ================================ GSE316891 (Yan et al. L1a1 MPRA) ================================ This example demonstrates running the assignment and experiment workflows on the L1a1 MPRA dataset from Yan et al. Data is available on `GEO:GSE316891 `_. No preprint or publication is available at the moment, but the data is already public and can be used for testing MPRAsnakeflow. Prerequisites ============= This example depends on the following data and software: Installation of MPRAsnakeflow ----------------------------- Please install conda, set up the MPRAsnakeflow environment, and clone the current MPRAsnakeflow master branch. See :ref:`Installation` for details. Assignment Workflow =================== Download and Prepare Data ------------------------- .. code-block:: bash mkdir -p data/assignment cd data/assignment curl https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9462nnn/GSM9462019/suppl/GSM9462019%5FL1a1%2Efasta%2Egz | zcat > GSM9462019_L1a1.fa cat GSM9462019_L1a1.fa | sed 's/\[/|/g' | sed 's/\]/|/g' > GSM9462019_L1a1.fix.fa prefetch --max-size 20GB SRR36893758 fastq-dump --gzip --split-files SRR36893758 The folder should look like this: .. code-block:: text data/assignment/ ├── GSM9462019_L1a1.fa ├── GSM9462019_L1a1.fix.fa ├── SRR36893758_1.fastq.gz ├── SRR36893758_2.fastq.gz └── SRR36893758_3.fastq.gz Configure Assignment Workflow ----------------------------- Create a config file (e.g. ``config_assignment.yaml``) with the following content: .. code-block:: yaml version: "0.7.0" assignments: GSE316891L1a1: bc_length: 15 alignment_tool: split_number: 30 tool: bbmap FWD: - data/assignment/SRR36893758_1.fastq.gz REV: - data/assignment/SRR36893758_2.fastq.gz BC: - data/assignment/SRR36893758_3.fastq.gz design_check: sequence_start: 1 sequence_length: 270 design_file: data/assignment/GSM9462019_L1a1.fix.fa configs: default: {} Run Assignment Workflow ----------------------- .. code-block:: bash snakemake -c all --sdm apptainer conda \ --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile \ --configfile config_assignment.yaml Experiment Workflow =================== Download and Prepare Data ------------------------- .. code-block:: bash mkdir -p data/experiment cd data/ wget -O GSM9462019_L1a1_oligos_to_barcodes.txt.gz https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9462nnn/GSM9462019/suppl/GSM9462019%5FL1a1%5Foligos%5Fto%5Fbarcodes%2Etxt%2Egz zcat GSM9462019_L1a1_oligos_to_barcodes.txt.gz | tail -n +2 | awk -F',' -v "OFS=\t" '{print $2,$1}' | sed 's/\[/|/g' | sed 's/\]/|/g' | sort | gzip -c > GSM9462019_L1a1_oligos_to_barcodes.tsv.gz cd experiment prefetch --max-size 30GB SRR36893764 SRR36893763 SRR36893762 SRR36893761 SRR36893760 SRR36893759 fastq-dump --gzip --split-files SRR36893764 fastq-dump --gzip --split-files SRR36893763 fastq-dump --gzip --split-files SRR36893762 fastq-dump --gzip --split-files SRR36893761 fastq-dump --gzip --split-files SRR36893760 fastq-dump --gzip --split-files SRR36893759 The folder should look like this: .. code-block:: text data/experiment/ ├── SRR36893759_1.fastq.gz ├── SRR36893759_2.fastq.gz ├── SRR36893759_3.fastq.gz ├── SRR36893760_1.fastq.gz ├── SRR36893760_2.fastq.gz ├── SRR36893760_3.fastq.gz ├── SRR36893761_1.fastq.gz ├── SRR36893761_2.fastq.gz ├── SRR36893761_3.fastq.gz ├── SRR36893762_1.fastq.gz ├── SRR36893762_2.fastq.gz ├── SRR36893762_3.fastq.gz ├── SRR36893763_1.fastq.gz ├── SRR36893763_2.fastq.gz ├── SRR36893763_3.fastq.gz ├── SRR36893764_1.fastq.gz ├── SRR36893764_2.fastq.gz └── SRR36893764_3.fastq.gz Create experiment file (``experiments.csv``): .. code-block:: text Condition,Replicate,DNA_BC_F,DNA_BC_R,DNA_UMI,RNA_BC_F,RNA_BC_R,RNA_UMI L1a1,1,SRR36893764_1.fastq.gz,SRR36893764_2.fastq.gz,SRR36893764_3.fastq.gz,SRR36893763_1.fastq.gz,SRR36893763_2.fastq.gz,SRR36893763_3.fastq.gz L1a1,2,SRR36893762_1.fastq.gz,SRR36893762_2.fastq.gz,SRR36893762_3.fastq.gz,SRR36893761_1.fastq.gz,SRR36893761_2.fastq.gz,SRR36893761_3.fastq.gz L1a1,3,SRR36893760_1.fastq.gz,SRR36893760_2.fastq.gz,SRR36893760_3.fastq.gz,SRR36893759_1.fastq.gz,SRR36893759_2.fastq.gz,SRR36893759_3.fastq.gz Configure Experiment Workflow ----------------------------- Create a config file (e.g. ``config_experiment.yaml``): .. code-block:: yaml version: "0.7" experiments: GSE316891L1a1: bc_length: 15 umi_length: 16 data_folder: data/experiment experiment_file: experiments.csv assignments: L1a1Assignment: type: file assignment_file: results/assignment/GSE316891L1a1/assignment_barcodes.default.tsv.gz GEOAssignment: type: file assignment_file: data/GSM9462019_L1a1_oligos_to_barcodes.tsv.gz configs: default: {} Run Experiment Workflow ----------------------- .. code-block:: bash snakemake -c all --sdm apptainer conda \ --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile \ --configfile config_experiment.yaml Results ======= Assignment output files will be in :code:`results/assignment/GSE316891L1a1`. The final assignment is :code:`results/assignment/GSE316891L1a1/assignment_barcodes.default.tsv.gz` and the QC report is :code:`results/assignment/GSE316891L1a1/qc_report.default.html`. We also provide the `Assignment QC report `_. Inspecting their assignment file, they retrieve 77795 out of 79812 oligos with 113 mean and 93 median barcodes per oligo. Our default assignment retrieves 77208 oligos with 87 mean and 71 median barcodes per oligo. We are very strict with the majority vote (75% of the reads need to support the same oligo), which leads to a lower number of assigned oligos, but also a higher confidence in the assignment. You can adjust the config to be less strict and retrieve more oligos, but we recommend inspecting the QC report and assignment statistics before doing so. Experiment output files will be in :code:`results/experiments/GSE316891L1a1`. The main count files are :code:`results/experiments/GSE316891L1a1/reporter_experiment.barcode.L1a1.GEOAssignment.default.all.tsv.gz` and :code:`results/experiments/GSE316891L1a1/reporter_experiment.barcode.L1a1.L1a1Assignment.default.all.tsv.gz`. QC reports are in the same folder and can be inspected for our `L1a1Assignment here `_ and for their `GEOAssignment here `_. QC reports are very similar for both assignments with slightly more barcodes assigned in the GEOAssignment due to the richer assignment file. The final count files are very similar between the two assignments. Here we report the correlations of the normalized DNA, normalized RNA, and the log2 fold change between RNA and DNA of the two assignments, separated by replicate: .. list-table:: Correlations Between L1a1Assignment and GEOAssignment :header-rows: 1 * - replicate - number of shared oligos - pearson log2FC - spearman log2FC - pearson RNA - spearman RNA - pearson DNA - spearman DNA * - 1 - 75388 - 0.88 - 0.89 - 0.89 - 0.92 - 0.82 - 0.83 * - 2 - 75462 - 0.88 - 0.9 - 0.91 - 0.93 - 0.81 - 0.83 * - 3 - 75151 - 0.87 - 0.88 - 0.9 - 0.91 - 0.82 - 0.84