GSE325670 (Hauser et al. Variant MPRA)

This example documents the GSE325256 variation and saturation mutagenesis MPRA from Hauser et al., which is one of the three MPRA datasets published under GEO accession GSE325670. The other two datasets are GSE324616 and GSE325585.

Prerequisites

This example depends on the following data and software:

Installation of MPRAsnakeflow

Please install conda, set up the MPRAsnakeflow environment, and clone the current MPRAsnakeflow master branch. See Installation for details.

Assignment Workflow

Download and Prepare Data

mkdir -p data/assignment
cd data/assignment

# download design file
curl https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9640nnn/GSM9640449/suppl/GSM9640449%5FVariantMPRAdesignFile%2Efa%2Egz | zcat > GSM9640449_VariantMPRAdesignFile.fa

# Keep only the first entry for duplicate ID + sequence records in the multi-FASTA.
awk 'BEGIN { RS=">"; ORS="" }
NR==1 { next }
{
    n = split($0, lines, "\n")
    id = lines[1]
    seq = ""
    body = ""
    for (i = 2; i <= n; i++) {
    if (lines[i] == "") continue
    seq = seq lines[i]
    body = (body == "" ? lines[i] : body "\n" lines[i])
    }
    key = id "\t" seq
    if (!(key in seen)) {
    seen[key] = 1
    print ">" id "\n" body "\n"
    }
}' GSM9640449_VariantMPRAdesignFile.fa > GSM9640449_VariantMPRAdesignFile.dedup.fa

# Merge identical sequences and trim 15 nt from each end.
awk 'BEGIN { RS=">"; ORS="" }
NR==1 { next }
{
    n = split($0, lines, "\n")
    id = lines[1]
    seq = ""
    for (i = 2; i <= n; i++) {
    if (lines[i] == "") continue
    seq = seq lines[i]
    }
    if (seq == "") next

    if (!(seq in seen)) {
    seen[seq] = 1
    order[++m] = seq
    headers[seq] = id
    } else {
    headers[seq] = headers[seq] "-" id
    }
}
END {
    for (i = 1; i <= m; i++) {
    s = order[i]
    if (length(s) <= 30) continue
    s_trim = substr(s, 16, length(s) - 30)
    print ">" headers[s] "\n" s_trim "\n"
    }
}' GSM9640449_VariantMPRAdesignFile.dedup.fa > GSM9640449_VariantMPRAdesignFile.seq_merged.fa


# download sequencing data
wget -O SRR37895574_1.fastq.gz https://sra-pub-src-2.s3.amazonaws.com/SRR37895574/RMHex03_bcassoc_R1_trimmed.fastq.gz.1
wget -O SRR37895574_2.fastq.gz https://sra-pub-src-2.s3.amazonaws.com/SRR37895574/RMHex03_bcassoc_R2_trimmed.fastq.gz.1
wget -O SRR37895574_3.fastq.gz https://sra-pub-src-2.s3.amazonaws.com/SRR37895574/RMHex03_bcassoc_R3_trimmed.fastq.gz.1

cd ../../

The folder should look like this:

assignment
├── GSM9640449_VariantMPRAdesignFile.dedup.fa
├── GSM9640449_VariantMPRAdesignFile.fa
├── GSM9640449_VariantMPRAdesignFile.seq_merged.fa
├── SRR37895574_1.fastq.gz
├── SRR37895574_2.fastq.gz
└── SRR37895574_3.fastq.gz

Configure Assignment Workflow

Create a config file (e.g. config_assignment.yaml) with the following content:

---
version: "0.7.0"
assignments:
  GSE325256:
    bc_length: 15
    alignment_tool:
      split_number: 30
      tool: bbmap
    FWD:
      - data/assignment/SRR37895574_1.fastq.gz
    BC:
      - data/assignment/SRR37895574_2.fastq.gz
    REV:
      - data/assignment/SRR37895574_3.fastq.gz
    design_file: data/assignment/GSM9640449_VariantMPRAdesignFile.seq_merged.fa
    design_check:
      sequence_collisions: false
    configs:
      default: {}
      lowFraction:
        fraction: 0.51
  GSE325256BWA:
    bc_length: 15
    alignment_tool:
      split_number: 30
      tool: bwa-additional-filtering
      configs:
        min_mapping_quality: 1
        sequence_length: 220
        alignment_start: 1
    FWD:
      - data/assignment/SRR37895574_1.fastq.gz
    BC:
      - data/assignment/SRR37895574_2.fastq.gz
    REV:
      - data/assignment/SRR37895574_3.fastq.gz
    design_file: data/assignment/GSM9640449_VariantMPRAdesignFile.seq_merged.fa
    design_check:
      sequence_collisions: false
    configs:
      default: {}
      lowFraction:
        fraction: 0.51
  GSE325256CIGAR:
    bc_length: 15
    alignment_tool:
      split_number: 30
      tool: bbmap
      configs:
        min_mapping_quality: 0
        cigar_filter_regex: 208M|210M|212M|214M|216M|218M|219M
    FWD:
      - data/assignment/SRR37895574_1.fastq.gz
    BC:
      - data/assignment/SRR37895574_2.fastq.gz
    REV:
      - data/assignment/SRR37895574_3.fastq.gz
    design_file: data/assignment/GSM9640449_VariantMPRAdesignFile.seq_merged.fa
    design_check:
      sequence_collisions: false
    configs:
      default: {}
      lowFraction:
        fraction: 0.51
  GSE325256StrandSensitive:
    bc_length: 15
    strand_sensitive:
      enable: true
    alignment_tool:
      split_number: 30
      tool: bbmap
    FWD:
      - data/assignment/SRR37895574_1.fastq.gz
    BC:
      - data/assignment/SRR37895574_2.fastq.gz
    REV:
      - data/assignment/SRR37895574_3.fastq.gz
    design_file: data/assignment/GSM9640449_VariantMPRAdesignFile.seq_merged.fa
    design_check:
      sequence_collisions: false
    configs:
      default: {}
      lowFraction:
        fraction: 0.51

Run Assignment Workflow

conda activate mprasnakeflow
snakemake -c 1 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config_assignment.yaml -n -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000
snakemake -c 30 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config_assignment.yaml -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000

Experiment Workflow

Download and Prepare Data

Reads count data
Replicate	DNA SRA ID	RNA SRA ID
1	SRR37425536	SRR37425544
2	SRR37425537	SRR37425545
3	SRR37425531	SRR37425546
4	SRR37425532	SRR37425533

mkdir -p data/experiment
cd data/experiment
# download data
prefetch --max-size 30GB SRR37425536 SRR37425544 SRR37425537 SRR37425545 SRR37425531 SRR37425546 SRR37425532 SRR37425533;
fastq-dump --gzip --split-files SRR37425536 && \
fastq-dump --gzip --split-files SRR37425544 && \
fastq-dump --gzip --split-files SRR37425537 && \
fastq-dump --gzip --split-files SRR37425545 && \
fastq-dump --gzip --split-files SRR37425531 && \
fastq-dump --gzip --split-files SRR37425546 && \
fastq-dump --gzip --split-files SRR37425532 && \
fastq-dump --gzip --split-files SRR37425533

cd ../
wget -O GSM9640449_VariantMPRAbarcodeDictionary.tsv.gz https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM9640nnn/GSM9640449/suppl/GSM9640449%5FVariantMPRAbarcodeDictionary%2Etsv%2Egz

Create experiment file (experiments.csv):

Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
neurons,1,SRR37425536_1.fastq.gz,SRR37425536_2.fastq.gz,SRR37425536_3.fastq.gz,SRR37425544_1.fastq.gz,SRR37425544_2.fastq.gz,SRR37425544_3.fastq.gz
neurons,2,SRR37425537_1.fastq.gz,SRR37425537_2.fastq.gz,SRR37425537_3.fastq.gz,SRR37425545_1.fastq.gz,SRR37425545_2.fastq.gz,SRR37425545_3.fastq.gz
neurons,3,SRR37425531_1.fastq.gz,SRR37425531_2.fastq.gz,SRR37425531_3.fastq.gz,SRR37425546_1.fastq.gz,SRR37425546_2.fastq.gz,SRR37425546_3.fastq.gz
neurons,4,SRR37425532_1.fastq.gz,SRR37425532_2.fastq.gz,SRR37425532_3.fastq.gz,SRR37425533_1.fastq.gz,SRR37425533_2.fastq.gz,SRR37425533_3.fastq.gz

Configure Experiment Workflow

Create a config file (e.g. config_experiment.yaml):

---
version: "0.7.0"
experiments:
  GSE325256:
    split_number: 30 # increases speed due to large input data
    bc_length: 15
    umi_length: 16
    data_folder: data/experiment
    experiment_file: experiments.csv
    assignments:
      GSE325256BWAAssignment:
        type: file
        assignment_file: results/assignment/GSE325256BWA/assignment_barcodes.default.tsv.gz
      GSE325256CIGARAssignment:
        type: file
        assignment_file: results/assignment/GSE325256CIGAR/assignment_barcodes.default.tsv.gz
      GSE325256Assignment:
        type: file
        assignment_file: results/assignment/GSE325256/assignment_barcodes.default.tsv.gz
      GEOAssignment:
        type: file
        assignment_file: data/GSM9640449_VariantMPRAbarcodeDictionary.tsv.gz
    configs:
      default: {}

Run Experiment Workflow

conda activate mprasnakeflow
snakemake -c 1 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config_experiment.yaml -n -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000
snakemake -c 30 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config_experiment.yaml -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000

Results

Assignment results are written to results/assignment/GSE325256, results/assignment/GSE325256BWA, results/assignment/GSE325256CIGAR, and results/assignment/GSE325256StrandSensitive. The final assignment files and QC reports are placed in those directories. The library is very deeply sequenced and we get a median of over 1000 barcodes per oligo. Due to the challenging design (e.g. contains one base deletion but not within but at the end of the designed oligo) we also do not get the whole library assigned. Only around 70%. We can increase the assignment allowing a lower fraction percentage of reads assigned to the top hit (e.g. 0.51 instead of 0.75) but that also increases the risk of false positive assignments. Also the customized MPRAsnakeflow mapping with bwa disabling MAPQ with min_mapping_quality: 0 but allowing only specific CIGAR strings cigar_filter_regex: 208M|210M|212M|214M|216M|218M|219M works slightly better.

To reduce false positives, we also added the strand-sensitive mapping option (GSE325256StrandSensitive using the default bbmap approach). We assign on average half of the barcodes (~500 per oligo). But it can be that we reduce the false positive rate.

The Experiment QC report later can help to make an informed decision on the best assignment parameters for this dataset.

You can have a look at the assignment QC reports here:

Assignment QC reports for GSE325256
Assignment mode	Fraction	QC report
bbmap	0.75	bbmap 0.75 QC report
bbmap	0.51	bbmap 0.51 QC report
bwa-additional-filtering	0.75	bwa-additional-filtering 0.75 QC report
bwa-additional-filtering	0.51	bwa-additional-filtering 0.51 QC report
bwa with CIGAR filter	0.75	bwa CIGAR 0.75 QC report
bwa with CIGAR filter	0.51	bwa CIGAR 0.51 QC report
bbmap strand-sensitive mode	0.75	bbmap strand-sensitive 0.75 QC report
bbmap strand-sensitive mode	0.51	bbmap strand-sensitive 0.51 QC report

Experiment results are written to results/experiments/GSE325256. The count outputs use the published GEO assignment as well as the assignments generated in this workflow. We only used the mapping with the default fraction (0.75) for the experiment results.

The results show that in comparison to their assignment file we reached a similar Pearson correlation (around 0.67) for activity. Interestingly the strand-sensitive mode increased the correlation to 0.72 suggesting that a high number of barcodes are false positives and the data might benefit from a strict mapping strategy.

Experiment QC reports for GSE325256
Assignment mode	QC report
GEO assignment	GEO assignment QC report
bbmap	bbmap experiment QC report
bwa-additional-filtering	bwa-additional-filtering experiment QC report
bwa with CIGAR filter	bwa CIGAR experiment QC report
bwa strand-sensitive mode	strand-sensitive experiment QC report