Assignment

_images/MPRAsnakeflow_assignment.png

Input files

Fastq Files

  • 2-3 Fastq files from library association sequencing

  • Candidate regulatory sequence (CRS) sequencing, forward and reverse read (paired end)

  • (optional) Index read with barcode. BC can also be present at the beginning of the in the forward read followed by a linker.

Design File

Fasta file of of CRS sequences with unique headers describing each tested sequence

Example file:

>CRS1
GACGGGAACGTTTGAGCGAGATCGAGGATAGGAGGAGCGGA
>CRS2
GGGCTCTCTTATATTAAGGGGGTGTGTGAACGCTCGCGATT
>CRS3
GGCGCGCTTTTTCGAAGAAACCCGCCGGAGAATATAAGGGA
>CRS4
TTAGACCGCCCTTTACCCCGAGAAAACTCAGCTACACACTC

snakemake

Options

With --help or -h you can see the help message.

Mandatory arguments:
--cores:

Use at most N CPU cores/jobs in parallel. If N is omitted or ‘all’, the limit is set to the number of available CPU cores. In case of cluster/cloud execution, this argument sets the number of total cores used over all jobs (made available to rules via workflow.cores).(default: None)

--configfile:

Specify or overwrite the config file of the workflow (see the docs). Values specified in JSON or YAML format are available in the global config dictionary inside the workflow. Multiple files overwrite each other in the given order. Thereby missing keys in previous config files are extended by following configfiles. Note that this order also includes a config file defined in the workflow definition itself (which will come first). (default: None)

--use-conda:

Required to run MPRAsnakeflow. If defined in the rule, run job in a conda environment. If this flag is not set, the conda directive is ignored. (default: False)

Recommended arguments:
--snakefile:

You should not need to specify this. By default, Snakemake will search for ‘Snakefile’, ‘snakefile’, ‘workflow/Snakefile’,’workflow/snakefile’ beneath the current working directory, in this order. Only if you definitely want a different layout, you need to use this parameter. This is very usefull when you want to have the results in a different folder than MPRAsnakeflow is in. (default: None)

Usefull arguments:
-n:

Do not execute anything, and display what would be done. If you have a very large workflow, use –dry-run –quiet to just print a summary of the DAG of jobs. (default: False)

--touch, -t:

Touch output files (mark them up to date without really changing them) instead of running their commands. This is used to pretend that the rules were executed, in order to fool future invocations of snakemake. Fails if a file does not yet exist. Note that this will only touch files that would otherwise be recreated by Snakemake (e.g. because their input files are newer). For enforcing a touch, combine this with –force, –forceall, or –forcerun. Note however that you loose the provenance information when the files have been created in realitiy. Hence, this should be used only as a last resort. (default: False)

Rules

Rules run by snakemake in the assignment utility.

all

The overall all rule. Here is defined what final output files are expected.

assignment_attach_idx

Extract the index sequence and add it to the header.

assignment_bwa_ref

Create mapping reference for BWA from design file.

assignment_collect

Collect mapped reads into one BAM.

assignment_collectBCs

Get the barcodes.

assignment_fastq_split

Split the fastq files into n files for parallelisation. N is given by split_read in the configuration file.

assignment_getInputs

Concat the input fastq files per R1,R2,R3. If only single fastq file is provided a symbolic link is created.

assignment_hybridFWRead_get_reads_by_length

Get the barcode and read from the FW read using fixed length (when no index BC read is present).

assignmemt_hybridFWRead_get_reads_by_cutadapt

Get the barcode and read from the FW read using cutadapt (when no index BC read is present). Uses the paired end mode of cutadapt to write the FW and BC read.

assignment_merge

Merge the FW,REV and BC fastq files into one. Extract the index sequence from the middle and end of an Illumina run. Separates reads for Paired End runs. Merge/Adapter trim reads stored in BAM.

assignment_mapping

Map the reads to the reference.

assignment_idx_bam

Index the BAM file

assignment_flagstat

Run samtools flagstat. Results are in results/assignment/<assignment_name>/statistic/assignment/bam_stats.txt

assignment_getBCs

Get the barcodes (not filtered). Results are in results/assignment/<assignment_name>/barcodes_incl_other.sorted.tsv.gz

assignment_statistic_totalCounts

Statistic of the total (unfiltered counts). Results are in results/assignment/<assignment_name>/statistic/total_counts.tsv.gz

assignment_filter

Filter the barcodes file based on the config given in the config-file. Results for this run are here results/assignment/<assignment_name>/assignment_barcodes.<config_name>.sorted.tsv.gz.

assignment_statistic_assignedCounts

Statistic of filtered the assigned counts. Result is here results/assignment/<assignment_name>/statistic/assigned_counts.<config_name>.tsv.gz.

assignment_statistic_assignment

Statistic of the filtered assignment. Result is here results/assignment/<assignment_name>/statistic/assignment.<config_name>.tsv.gz and a plot here results/assignment/<assignment_name>/statistic/assignment.<config_name>.png.

Output

The output can be found in the folder defined by the option results/assignment/. It is structured in folders of the condition as

Files

File tree of the result folder (names in < > can be specified in the config file.)

├── assignment
│   └── <assignment_name>
│       ├── aligned_merged_reads.bam
│       ├── aligned_merged_reads.bam.bai
│       ├── assignment_barcodes.<config_name>.sorted.tsv.gz
│       ├── barcodes_incl_other.sorted.tsv.gz
│       ├── reference
│       │   ├── reference.fa
│       │   ├── reference.fa.amb
│       │   ├── reference.fa.ann
│       │   ├── reference.fa.bwt
│       │   ├── reference.fa.dict
│       │   ├── reference.fa.fai
│       │   ├── reference.fa.pac
│       │   └── reference.fa.sa
│       └── statistic
│           ├── assigned_counts.<config_name>.tsv.gz
│           ├── assignment
│           │   └── bam_stats.txt
│           ├── assignment.<config_name>.png
│           ├── assignment.<config_name>.tsv.gz
│           └── total_counts.tsv.gz
total_counts.tsv.gz

Statistic of BCs mapped to oligos.

assigned_counts.<config_name>.tsv.gz

Statistic of BCs mapped to oligos after fitering defined by config.

assignment.<config_name>.tsv.gz

Average/median support of BC per oligo. Oligos with >= 15 BCs.

reference.fa

Design file.

aligned_merged_reads.bam

Sorted bamfile for oligo alignment

barcodes_incl_other.sorted.tsv.gz

Complete list of all barcodes found in mapping file (ambigous and unambigous) with mappings (if possible)

assignment_barcodes.<config_name>.sorted.tsv.gz

Mapping file of barcodes to sequence.

assignment.<config_name>.png

Visualization of number of barcodes mapping to oligo.

bam_stats.txt

samtools bamstat output.