.. _GSE325670 example: .. role:: bash(code) :language: bash ========================================= GSE325670 (Hauser et al. Variant MPRA) ========================================= This example documents the `GSE325256 `_ variation and saturation mutagenesis MPRA from `Hauser et al. `_, which is one of the three MPRA datasets published under GEO accession `GSE325670 `_. The other two datasets are `GSE324616 `_ and `GSE325585 `_. Prerequisites ============= This example depends on the following data and software: Installation of MPRAsnakeflow ----------------------------- Please install conda, set up the MPRAsnakeflow environment, and clone the current MPRAsnakeflow master branch. See :ref:`Installation` for details. Assignment Workflow =================== Download and Prepare Data ------------------------- .. code-block:: bash mkdir -p data/assignment cd data/assignment # download design file curl https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9640nnn/GSM9640449/suppl/GSM9640449%5FVariantMPRAdesignFile%2Efa%2Egz | zcat > GSM9640449_VariantMPRAdesignFile.fa # Keep only the first entry for duplicate ID + sequence records in the multi-FASTA. awk 'BEGIN { RS=">"; ORS="" } NR==1 { next } { n = split($0, lines, "\n") id = lines[1] seq = "" body = "" for (i = 2; i <= n; i++) { if (lines[i] == "") continue seq = seq lines[i] body = (body == "" ? lines[i] : body "\n" lines[i]) } key = id "\t" seq if (!(key in seen)) { seen[key] = 1 print ">" id "\n" body "\n" } }' GSM9640449_VariantMPRAdesignFile.fa > GSM9640449_VariantMPRAdesignFile.dedup.fa # Merge identical sequences and trim 15 nt from each end. awk 'BEGIN { RS=">"; ORS="" } NR==1 { next } { n = split($0, lines, "\n") id = lines[1] seq = "" for (i = 2; i <= n; i++) { if (lines[i] == "") continue seq = seq lines[i] } if (seq == "") next if (!(seq in seen)) { seen[seq] = 1 order[++m] = seq headers[seq] = id } else { headers[seq] = headers[seq] "-" id } } END { for (i = 1; i <= m; i++) { s = order[i] if (length(s) <= 30) continue s_trim = substr(s, 16, length(s) - 30) print ">" headers[s] "\n" s_trim "\n" } }' GSM9640449_VariantMPRAdesignFile.dedup.fa > GSM9640449_VariantMPRAdesignFile.seq_merged.fa # download sequencing data wget -O SRR37895574_1.fastq.gz https://sra-pub-src-2.s3.amazonaws.com/SRR37895574/RMHex03_bcassoc_R1_trimmed.fastq.gz.1 wget -O SRR37895574_2.fastq.gz https://sra-pub-src-2.s3.amazonaws.com/SRR37895574/RMHex03_bcassoc_R2_trimmed.fastq.gz.1 wget -O SRR37895574_3.fastq.gz https://sra-pub-src-2.s3.amazonaws.com/SRR37895574/RMHex03_bcassoc_R3_trimmed.fastq.gz.1 cd ../../ The folder should look like this: .. code-block:: text assignment ├── GSM9640449_VariantMPRAdesignFile.dedup.fa ├── GSM9640449_VariantMPRAdesignFile.fa ├── GSM9640449_VariantMPRAdesignFile.seq_merged.fa ├── SRR37895574_1.fastq.gz ├── SRR37895574_2.fastq.gz └── SRR37895574_3.fastq.gz Configure Assignment Workflow ----------------------------- Create a config file (e.g. ``config_assignment.yaml``) with the following content: .. code-block:: yaml --- version: "0.7.0" assignments: GSE325256: bc_length: 15 alignment_tool: split_number: 30 tool: bbmap FWD: - data/assignment/SRR37895574_1.fastq.gz BC: - data/assignment/SRR37895574_2.fastq.gz REV: - data/assignment/SRR37895574_3.fastq.gz design_file: data/assignment/GSM9640449_VariantMPRAdesignFile.seq_merged.fa design_check: sequence_collisions: false configs: default: {} lowFraction: fraction: 0.51 GSE325256BWA: bc_length: 15 alignment_tool: split_number: 30 tool: bwa-additional-filtering configs: min_mapping_quality: 1 sequence_length: 220 alignment_start: 1 FWD: - data/assignment/SRR37895574_1.fastq.gz BC: - data/assignment/SRR37895574_2.fastq.gz REV: - data/assignment/SRR37895574_3.fastq.gz design_file: data/assignment/GSM9640449_VariantMPRAdesignFile.seq_merged.fa design_check: sequence_collisions: false configs: default: {} lowFraction: fraction: 0.51 GSE325256CIGAR: bc_length: 15 alignment_tool: split_number: 30 tool: bbmap configs: min_mapping_quality: 0 cigar_filter_regex: 208M|210M|212M|214M|216M|218M|219M FWD: - data/assignment/SRR37895574_1.fastq.gz BC: - data/assignment/SRR37895574_2.fastq.gz REV: - data/assignment/SRR37895574_3.fastq.gz design_file: data/assignment/GSM9640449_VariantMPRAdesignFile.seq_merged.fa design_check: sequence_collisions: false configs: default: {} lowFraction: fraction: 0.51 GSE325256StrandSensitive: bc_length: 15 strand_sensitive: enable: true alignment_tool: split_number: 30 tool: bbmap FWD: - data/assignment/SRR37895574_1.fastq.gz BC: - data/assignment/SRR37895574_2.fastq.gz REV: - data/assignment/SRR37895574_3.fastq.gz design_file: data/assignment/GSM9640449_VariantMPRAdesignFile.seq_merged.fa design_check: sequence_collisions: false configs: default: {} lowFraction: fraction: 0.51 Run Assignment Workflow ----------------------- .. code-block:: bash conda activate mprasnakeflow snakemake -c 1 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config_assignment.yaml -n -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000 snakemake -c 30 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config_assignment.yaml -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000 Experiment Workflow =================== Download and Prepare Data ------------------------- .. list-table:: Reads count data :header-rows: 1 * - Replicate - DNA SRA ID - RNA SRA ID * - 1 - SRR37425536 - SRR37425544 * - 2 - SRR37425537 - SRR37425545 * - 3 - SRR37425531 - SRR37425546 * - 4 - SRR37425532 - SRR37425533 .. code-block:: bash mkdir -p data/experiment cd data/experiment # download data prefetch --max-size 30GB SRR37425536 SRR37425544 SRR37425537 SRR37425545 SRR37425531 SRR37425546 SRR37425532 SRR37425533; fastq-dump --gzip --split-files SRR37425536 && \ fastq-dump --gzip --split-files SRR37425544 && \ fastq-dump --gzip --split-files SRR37425537 && \ fastq-dump --gzip --split-files SRR37425545 && \ fastq-dump --gzip --split-files SRR37425531 && \ fastq-dump --gzip --split-files SRR37425546 && \ fastq-dump --gzip --split-files SRR37425532 && \ fastq-dump --gzip --split-files SRR37425533 cd ../ wget -O GSM9640449_VariantMPRAbarcodeDictionary.tsv.gz https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9640nnn/GSM9640449/suppl/GSM9640449%5FVariantMPRAbarcodeDictionary%2Etsv%2Egz Create experiment file (``experiments.csv``): .. code-block:: text Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R neurons,1,SRR37425536_1.fastq.gz,SRR37425536_2.fastq.gz,SRR37425536_3.fastq.gz,SRR37425544_1.fastq.gz,SRR37425544_2.fastq.gz,SRR37425544_3.fastq.gz neurons,2,SRR37425537_1.fastq.gz,SRR37425537_2.fastq.gz,SRR37425537_3.fastq.gz,SRR37425545_1.fastq.gz,SRR37425545_2.fastq.gz,SRR37425545_3.fastq.gz neurons,3,SRR37425531_1.fastq.gz,SRR37425531_2.fastq.gz,SRR37425531_3.fastq.gz,SRR37425546_1.fastq.gz,SRR37425546_2.fastq.gz,SRR37425546_3.fastq.gz neurons,4,SRR37425532_1.fastq.gz,SRR37425532_2.fastq.gz,SRR37425532_3.fastq.gz,SRR37425533_1.fastq.gz,SRR37425533_2.fastq.gz,SRR37425533_3.fastq.gz Configure Experiment Workflow ----------------------------- Create a config file (e.g. ``config_experiment.yaml``): .. code-block:: yaml --- version: "0.7.0" experiments: GSE325256: split_number: 30 # increases speed due to large input data bc_length: 15 umi_length: 16 data_folder: data/experiment experiment_file: experiments.csv assignments: GSE325256BWAAssignment: type: file assignment_file: results/assignment/GSE325256BWA/assignment_barcodes.default.tsv.gz GSE325256CIGARAssignment: type: file assignment_file: results/assignment/GSE325256CIGAR/assignment_barcodes.default.tsv.gz GSE325256Assignment: type: file assignment_file: results/assignment/GSE325256/assignment_barcodes.default.tsv.gz GEOAssignment: type: file assignment_file: data/GSM9640449_VariantMPRAbarcodeDictionary.tsv.gz configs: default: {} Run Experiment Workflow ----------------------- .. code-block:: bash conda activate mprasnakeflow snakemake -c 1 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config_experiment.yaml -n -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000 snakemake -c 30 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config_experiment.yaml -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000 Results ======= Assignment results are written to :code:`results/assignment/GSE325256`, :code:`results/assignment/GSE325256BWA`, :code:`results/assignment/GSE325256CIGAR`, and :code:`results/assignment/GSE325256StrandSensitive`. The final assignment files and QC reports are placed in those directories. The library is very deeply sequenced and we get a median of over 1000 barcodes per oligo. Due to the challenging design (e.g. contains one base deletion but not within but at the end of the designed oligo) we also do not get the whole library assigned. Only around 70%. We can increase the assignment allowing a lower fraction percentage of reads assigned to the top hit (e.g. 0.51 instead of 0.75) but that also increases the risk of false positive assignments. Also the customized MPRAsnakeflow mapping with ``bwa`` disabling ``MAPQ`` with ``min_mapping_quality: 0`` but allowing only specific CIGAR strings ``cigar_filter_regex: 208M|210M|212M|214M|216M|218M|219M`` works slightly better. To reduce false positives, we also added the strand-sensitive mapping option (``GSE325256StrandSensitive`` using the default ``bbmap`` approach). We assign on average half of the barcodes (~500 per oligo). But it can be that we reduce the false positive rate. The Experiment QC report later can help to make an informed decision on the best assignment parameters for this dataset. You can have a look at the assignment QC reports here: .. list-table:: Assignment QC reports for GSE325256 :header-rows: 1 * - Assignment mode - Fraction - QC report * - bbmap - 0.75 - `bbmap 0.75 QC report `_ * - bbmap - 0.51 - `bbmap 0.51 QC report `_ * - bwa-additional-filtering - 0.75 - `bwa-additional-filtering 0.75 QC report `_ * - bwa-additional-filtering - 0.51 - `bwa-additional-filtering 0.51 QC report `_ * - bwa with CIGAR filter - 0.75 - `bwa CIGAR 0.75 QC report `_ * - bwa with CIGAR filter - 0.51 - `bwa CIGAR 0.51 QC report `_ * - bbmap strand-sensitive mode - 0.75 - `bbmap strand-sensitive 0.75 QC report `_ * - bbmap strand-sensitive mode - 0.51 - `bbmap strand-sensitive 0.51 QC report `_ Experiment results are written to :code:`results/experiments/GSE325256`. The count outputs use the published GEO assignment as well as the assignments generated in this workflow. We only used the mapping with the default fraction (0.75) for the experiment results. The results show that in comparison to their assignment file we reached a similar Pearson correlation (around 0.67) for activity. Interestingly the strand-sensitive mode increased the correlation to 0.72 suggesting that a high number of barcodes are false positives and the data might benefit from a strict mapping strategy. .. list-table:: Experiment QC reports for GSE325256 :header-rows: 1 * - Assignment mode - QC report * - GEO assignment - `GEO assignment QC report `_ * - bbmap - `bbmap experiment QC report `_ * - bwa-additional-filtering - `bwa-additional-filtering experiment QC report `_ * - bwa with CIGAR filter - `bwa CIGAR experiment QC report `_ * - bwa strand-sensitive mode - `strand-sensitive experiment QC report `_