Config File
The config file is a YAML file that contains the configuration of the workflow. Different runs can be configured in it. We recommend using one config file per MPRA experiment or MPRA project, although in theory many different experiments can be configured in a single file. The file is divided into global (general settings), assignments (assignment workflow), and experiments (count workflow, including variants). A full example file with all possible configurations is available at config/example_config.yaml:
---
global: # general configs affecting one or multiple parts
  threads: 1
  assignments:
    split_number: 1 # number of files fastq should be split for parallelization
assignments:
  exampleAssignment: # name of an example assignment (can be any string)
    bc_length: 15
    sequence_length: # sequence length of design excluding adapters.
      min: 195
      max: 205
    alignment_start: # start of an alignment in the reference. Here using 15 bp adapters. Can be different when using adapter free approaches
      min: 15 # integer
      max: 17 # integer
    min_mapping_quality: 1 # integer >=0 Please use 1 when you have oligos that differ by 1 base in your reference/design file
    FW:
      - resources/Assignment_BasiC/R1.fastq.gz
    BC:
      - resources/Assignment_BasiC/R2.fastq.gz
    REV:
      - resources/Assignment_BasiC/R3.fastq.gz
    reference: resources/design.fa
    configs:
      exampleAssignmentConfig: # name of an example filtering config
        min_support: 3
        fraction: 0.7
experiments:
  exampleCount:
    bc_length: 15
    umi_length: 10
    data_folder: resources/Count_Basic/data
    experiment_file: resources/example_experiment.csv
    demultiplex: false
    assignments:
      fromFile:
        type: file
        assignment_file: resources/SRR10800986_filtered_coords_to_barcodes.tsv.gz
      fromWorkflow:
        type: config
        assignment_name: exampleAssignment
        assignment_config: exampleAssignmentConfig
    design_file: resources/design.fa
    label_file: resources/labels.tsv # optional
    configs:
      exampleConfig:
        filter:
          bc_threshold: 10
          DNA:
            min_counts: 1
          RNA:
            min_counts: 1
        sampling: # optional, just for benchmarking
          DNA:
            total: 30000000
            threshold: 300
          RNA:
            total: 50000000
            threshold: 300
    variants: # optional
      map: resources/variant_map.tsv
      min_barcodes: [5, 10] # min BC for ref and alt sequence
Note that the config file is controlled by a JSON schema. This means that the config file is validated against the schema. If the config file is not valid, the program will exit with an error message. The schema is located in workflow/schemas/config.schema.yaml.
General settings
The general settings are located in the global section. The following settings are possible:
global:
  type: object
  default:
    threads: 1
    assignments:
      split_number: 1
  properties:
    assignments:
      type: object
      properties:
        split_number:
          type: integer
          default: 1
      additionalProperties: false
    threads:
      type: integer
      default: 1
  additionalProperties: false
- threads:
  Number of threads that are available to run a rule. Right now this is used for bwa mem in the assignment workflow. Be sure to set the snakemake option -c correctly when using a larger number of threads. Default is set to 1.
- assignments:
  Global parameters that hold for the assignment workflow.
  - split_number:
    To parallelize the mapping for the assignment, the reads are split into split_number files. E.g. setting it to 300 means that the reads are split into 300 files and each file is mapped in parallel. This is only useful when running on a cluster. When running the workflow on a single machine, the default value should be used. Default is set to 1.
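As a minimal sketch of the cluster scenario described above, the global section could look as follows. The value 300 is taken from the example in the text. The thread count of 4 is purely an illustrative assumption, not a recommendation.

```yaml
# Illustrative global section for a cluster run (values are assumptions)
global:
  threads: 4 # threads available to a rule (currently used for bwa mem)
  assignments:
    split_number: 300 # reads are split into 300 files that are mapped in parallel
```

With more than one thread configured, the snakemake option -c should be set to match, as noted for threads.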
Assignment workflow
The assignment workflow is configured in the assignments section. The following settings are possible:
assignments:
  description: Assignments to run with configurations
  type: object
  patternProperties:
    description: name of the assignment
    ^([^_\.]+)$:
      type: object
      patternProperties:
        ^((sequence_length)|(alignment_start))$:
          type: object
          properties:
            min:
              type: integer
            max:
              type: integer
          additionalProperties: false
          required:
            - min
            - max
      properties:
        bc_length:
          type: integer
        BC_rev_comp:
          type: boolean
          default: false
        linker_length:
          type: integer
        linker:
          type: string
          pattern: ^[ATCGNatcgn]+$
        FW:
          type: array
          items:
            type: string
          minItems: 1
          uniqueItems: true
        BC:
          type: array
          items:
            type: string
          minItems: 1
          uniqueItems: true
        REV:
          type: array
          items:
            type: string
          minItems: 1
          uniqueItems: true
        min_mapping_quality:
          type: integer
          default: 1
          minimum: 0
        NGmerge:
          type: object
          properties:
            min_overlap:
              type: integer
              default: 20
            frac_mismatches_allowed:
              type: number
              default: 0.1
            min_dovetailed_overlap:
              type: integer
              default: 50
          required:
            - min_overlap
            - frac_mismatches_allowed
            - min_dovetailed_overlap
          default: {}
          additionalProperties: false
        reference:
          type: string
        configs:
          type: object
          patternProperties:
            ^([^_\.]+)$:
              type: object
              properties:
                min_support:
                  type: integer
                  minimum: 1
                  default: 3
                fraction:
                  type: number
                  exclusiveMinimum: 0.5
                  maximum: 1
                  default: 0.7
                unknown_other:
                  type: boolean
                  default: false
                ambiguous:
                  type: boolean
                  default: false
              required:
                - min_support
                - fraction
              additionalProperties: false
          additionalProperties: false
          minProperties: 1
      oneOf:
        - required:
            - linker_length
        - required:
            - linker
        - required:
            - BC
      required:
        - FW
        - REV
        - bc_length
        - reference
        - configs
        - alignment_start
        - sequence_length
        - min_mapping_quality
        - NGmerge
      additionalProperties: false
  additionalProperties: false
  minProperties: 1
Each assignment you want to process must be given a name, e.g. exampleAssignment. The name is used to name the output files. As the schema pattern above shows, the name must not contain _ or . characters.
- sequence_length:
  Defines the min and max of the sequence length. sequence_length is basically the length of a sequence alignment to an oligo in the reference file. Because there can be insertions and deletions, we recommend varying it a bit around the exact length (e.g. +-5). In theory this option enables designs with multiple sequence lengths.
- alignment_start:
  Defines the min and max of the start of the alignment in an oligo. When using adapters, this is basically the length of the adapter. Otherwise 1 will be the choice for most cases. We also recommend varying this value a bit, e.g. by +-1, because the start might not be exact after the adapter.
- min_mapping_quality:
  (Optional) Defines the minimum mapping quality (MAPQ) of the alignment to an oligo. When using oligos with only 1 bp difference it is recommended to set it to 0. Otherwise the default value of 1 is recommended.
- bc_length:
  Length of the barcode. Must match the length of R2.
- BC_rev_comp:
  (Optional) If set to true the barcode is reverse complemented. Default is false.
- linker_length:
  (Optional) Length of the linker. Only needed if you don’t have a barcode read and the barcode is in the FW read with the structure: BC+Linker+Insert. The fixed length is used for the linker after a fixed length of BC. The recommended option is linker, which defines the exact linker sequence and uses cutadapt for trimming.
- linker:
  (Optional) Sequence of the linker. Only needed if you don’t have a barcode read and the barcode is in the FW read with the structure: BC+Linker+Insert. Uses cutadapt to trim the linker to get the barcode as well as the start of the insert.
- FW:
  List of forward read files in gzipped fastq format. The full or relative path to the files should be used. It is important that the files are listed in the same order for R1 (FW), R2 (BC), and R3 (REV).
- REV:
  List of reverse read files in gzipped fastq format. The full or relative path to the files should be used. It is important that the files are listed in the same order for R1 (FW), R2 (BC), and R3 (REV).
- BC:
  List of index read files in gzipped fastq format. The full or relative path to the files should be used. It is important that the files are listed in the same order for R1 (FW), R2 (BC), and R3 (REV).
- NGmerge:
  (Optional) Options for NGmerge. NGmerge is used to merge FW and REV reads. The following options are possible (we recommend using the default values):
  - min_overlap:
    (Optional) Minimum overlap of the reads. Default is set to 20.
  - frac_mismatches_allowed:
    (Optional) Fraction of mismatches allowed in the overlap. Default is set to 0.1.
  - min_dovetailed_overlap:
    (Optional) Minimum dovetailed overlap. Default is set to 50.
- reference:
  Design file (full or relative path) in fasta format. The header of each entry should contain the oligo name and must be unique. The sequence should be the sequence of the oligo and must also be unique. When multiple oligo names have the same sequence, please merge them into one fasta entry. The oligo name is later used to link barcodes to oligos. The sequence is used to map the reads to the oligos. Adapters can be included in the sequence, in which case alignment_start has to be adjusted.
- configs:
  After mapping the reads to the design file and extracting the barcodes per oligo, the configurations (using different names) can be used to generate multiple filtering settings for the final map of oligos to barcodes. Each configuration is a dictionary with the following keys:
  - min_support:
    Minimum number of times the same BC maps to the same oligo. A larger value gives more evidence that the assignment is correct, but can remove lots of BCs (depending on the complexity, sequencing depth, and quality of sequencing). Recommended option is 3.
  - fraction:
    Minimum fraction of the occurrences of a BC that map to the same oligo. E.g. 0.7 means that at least 70% of the occurrences of a BC map to the same oligo. The value must be larger than 0.5 and at most 1. A larger value gives more evidence that the assignment is correct, but can remove lots of BCs (depending on the complexity, sequencing depth, and quality of sequencing). Recommended option is 0.7.
  - unknown_other:
    (Optional) Shows BCs that were not mapped in the final output map. Not recommended as a mapping file for the experiment workflow, but can be useful for debugging. Default is false.
  - ambiguous:
    (Optional) Shows ambiguous BCs in the final output map. Not recommended as a mapping file for the experiment workflow, but can be useful for debugging. Default is false.
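The options above can be combined as in the following sketch of an assignment without a separate barcode read, where the barcode is in the FW read (BC+Linker+Insert). The assignment name, the linker sequence, the fastq paths, and the config names are hypothetical. The NGmerge values are the schema defaults written out explicitly. The remaining values mirror the example file.

```yaml
# Hypothetical assignment using a linker sequence instead of a BC read
assignments:
  linkerAssignment: # hypothetical name, must not contain "_" or "."
    bc_length: 15
    BC_rev_comp: false
    linker: TCTAGAGGTACC # hypothetical linker sequence, trimmed with cutadapt
    sequence_length:
      min: 195
      max: 205
    alignment_start:
      min: 15
      max: 17
    min_mapping_quality: 1
    FW:
      - resources/hypothetical/R1.fastq.gz
    REV:
      - resources/hypothetical/R2.fastq.gz
    NGmerge: # schema defaults, shown only for illustration
      min_overlap: 20
      frac_mismatches_allowed: 0.1
      min_dovetailed_overlap: 50
    reference: resources/design.fa
    configs:
      standard: # recommended filter values
        min_support: 3
        fraction: 0.7
      strict: # stricter filtering, fewer BCs are kept
        min_support: 5
        fraction: 0.9
      debug: # keeps unmapped and ambiguous BCs, for debugging only
        min_support: 3
        fraction: 0.7
        unknown_other: true
        ambiguous: true
```

Because the schema uses oneOf for linker_length, linker, and BC, exactly one of these three keys can be set per assignment.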
Experiment workflow (including counts)
The experiment workflow is configured in the experiments section. Each experiment run contains one experiment file with all replicates of an experiment. The following settings are possible:
experiments:
  description: MPRA experiments to run with configurations
  type: object
  patternProperties:
    description: name of the experiment
    ^([^_\.]+)$:
      type: object
      properties:
        bc_length:
          type: integer
          minimum: 1
        umi_length:
          type: integer
          minimum: 1
        adapter:
          type: string
          pattern: ^[ATCGNatcgn]+$
        data_folder:
          type: string
        experiment_file:
          type: string
        demultiplex:
          type: boolean
          default: false
        design_file:
          type: string
        label_file:
          type: string
        assignments:
          type: object
          patternProperties:
            ^([^_\.]+)$:
              type: object
              properties:
                type:
                  type: string
                  enum:
                    - file
                    - config
                assignment_file:
                  type: string
                assignment_name:
                  type: string
                assignment_config:
                  type: string
                sampling:
                  type: object
                  properties:
                    prop:
                      type: number
                      exclusiveMinimum: 0
                      maximum: 1
                    total:
                      type: integer
                      minimum: 1
              required:
                - type
              additionalProperties: false
              allOf:
                - if:
                    properties:
                      type:
                        const: config
                    required:
                      - type
                  then:
                    required:
                      - assignment_name
                      - assignment_config
                - if:
                    properties:
                      type:
                        const: file
                    required:
                      - type
                  then:
                    required:
                      - assignment_file
          additionalProperties: false
        configs:
          type: object
          patternProperties:
            ^([^_\.]+)$:
              type: object
              properties:
                filter:
                  type: object
                  properties:
                    bc_threshold:
                      type: integer
                      minimum: 1
                      default: 10
                  patternProperties:
                    ^((DNA)|(RNA))$:
                      type: object
                      properties:
                        min_counts:
                          type: integer
                          minimum: 0
                          default: 1
                      additionalProperties: false
                      required:
                        - min_counts
                  default:
                    bc_threshold: 10
                    DNA:
                      min_counts: 1
                    RNA:
                      min_counts: 1
                  required:
                    - bc_threshold
                    - DNA
                    - RNA
                  additionalProperties: false
                sampling:
                  type: object
                  patternProperties:
                    ^((DNA)|(RNA))$:
                      type: object
                      properties:
                        threshold:
                          type: integer
                          minimum: 1
                        prop:
                          type: number
                          exclusiveMinimum: 0
                          maximum: 1
                        total:
                          type: number
                          minimum: 1
                        seed:
                          type: integer
                      additionalProperties: false
                  additionalProperties: false
              additionalProperties: false
              required:
                - filter
          additionalProperties: false
        variants:
          type: object
          properties:
            map:
              type: string
            min_barcodes:
              type: array
              items:
                type: integer
                minimum: 1
          required:
            - map
            - min_barcodes
      # entries that have to be in the config file for successful validation
      required:
        - bc_length
        - data_folder
        - experiment_file
        - demultiplex
        - design_file
        - assignments
        - configs
      additionalProperties: false
- bc_length:
  Length of the barcode. This is used to extract the barcode from the index read. The barcode is extracted from the first bc_length bases of the index read. When no reverse read is given and adapter is not set, the exact length is used to extract the DNA BC from the FW read.
- umi_length:
  (Optional) Length of the UMI. This is used to extract the UMI from the index read. The UMI is extracted from the last umi_length bases of the index read. Please provide it if you use UMIs.
- adapter:
  (Optional) Adapter sequence in the FW read when no reverse read is given. This is used to trim the sequence and retrieve the BC using cutadapt.
- data_folder:
  Folder where the fastq files are located. The files are defined in the experiment_file. The full or relative path to the folder should be used.
- experiment_file:
  Path to the experiment file. The full or relative path to the file should be used. The experiment file is a comma separated file and is described in the Experiment file section.
- demultiplex:
  (Optional) If set to true the reads are demultiplexed. This means that the reads are split into different files for each barcode. This is useful for further analysis. Default is false.
- design_file:
  Design file (full or relative path) in fasta format. The header of each entry should contain the oligo name and must be unique. The sequence should be the sequence of the oligo and must also be unique. When multiple oligo names have the same sequence, please merge them into one fasta entry. Should be the same as reference in the Assignment workflow.
- label_file:
  (Optional) Path to the label file. The full or relative path to the file should be used. The label file is a tab separated file and contains the oligo name and its label. The oligo name should be the same as in the design file. The label is used to group the oligos in the final output, e.g. for plotting.
  insert1_name	label1
  insert2_name	label1
  insert3_name	label2
- assignments:
  Multiple assignments can be defined per experiment (naming them differently). Every assignment name contains the following configurations:
  - type:
    Can be file or config. file means that you use a mapping file which is tab separated and gzipped. It contains the barcode in the first column and the oligo name in the second column. This file can be generated by the Assignment workflow. config means that you are referring to an assignment that is specified in this config file.
  - assignment_file:
    When using file, insert the path to the assignment file (tsv.gz). When using config, set assignment_name and assignment_config instead.
  - assignment_name:
    When using config, insert the name of the assignment specified in the config file.
  - assignment_config:
    When using config, insert the name of the config of assignment_name that you want to use.
  - sampling:
    (Optional) Options for randomly removing barcodes in the assignment. Just for debugging purposes.
    - prop:
      Sample down the BCs in the assignment file to this proportion.
    - total:
      Sample down the BCs in the assignment file to this number.
- configs:
  Each experiment run can have multiple configurations including filter and sampling options.
  - filter:
    (Optional) Filter options. The following options are available:
    - bc_threshold:
      Minimum number of different BCs required per oligo. A higher value normally increases the correlation between replicates but also reduces the number of final oligos. Default option is 10.
    - DNA:
      Settings for DNA.
      - min_counts:
        Minimum number of DNA counts per barcode. When set to 0 a pseudo count is added. Default option is 1.
    - RNA:
      Settings for RNA.
      - min_counts:
        Minimum number of RNA counts per barcode. When set to 0 a pseudo count is added. Default option is 1.
  - sampling:
    (Optional) Options for sampling counts and barcodes. Just for debugging purposes.
    - DNA:
      Settings for sampling DNA counts.
      - threshold:
        Maximum threshold for DNA counts assigned to a BC.
      - prop:
        Sample down the DNA counts to this proportion.
      - total:
        Sample down the DNA counts to this number.
      - seed:
        Seed for the random DNA sampling.
    - RNA:
      Settings for sampling RNA counts.
      - threshold:
        Maximum threshold for RNA counts assigned to a BC.
      - prop:
        Sample down the RNA counts to this proportion.
      - total:
        Sample down the RNA counts to this number.
      - seed:
        Seed for the random RNA sampling.
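To illustrate how these options fit together, the following sketch configures an experiment with only a forward read, where the BC is retrieved via an adapter sequence, and with two alternative configs. The experiment name, the adapter sequence, all paths under resources/hypothetical, the config names, and the sampling values are assumptions made for illustration only.

```yaml
# Hypothetical experiment entry (names, paths, and values are assumptions)
experiments:
  fwOnlyCount: # hypothetical name, must not contain "_" or "."
    bc_length: 15
    adapter: AGCTTGAATTC # hypothetical adapter in the FW read, trimmed with cutadapt
    data_folder: resources/hypothetical/data
    experiment_file: resources/hypothetical/experiment.csv
    demultiplex: false
    assignments:
      sampledAssignment:
        type: file # type file requires assignment_file
        assignment_file: resources/hypothetical/assignment.tsv.gz
        sampling:
          prop: 0.5 # sample the BCs down to this proportion (debugging only)
    design_file: resources/design.fa
    configs:
      strict: # requires more BCs per oligo than the default of 10
        filter:
          bc_threshold: 20
          DNA:
            min_counts: 1
          RNA:
            min_counts: 1
      downsampled: # default filter plus count sampling with a fixed seed
        filter:
          bc_threshold: 10
          DNA:
            min_counts: 1
          RNA:
            min_counts: 1
        sampling:
          DNA:
            prop: 0.5
            seed: 1
          RNA:
            prop: 0.5
            seed: 1
```

Defining several configs side by side like this makes it possible to compare filter settings on the same experiment without duplicating the experiment entry.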
Experiment file
There are four different options:
Forward, reverse, and UMI read
The experiment file has a header with Condition, Replicate, DNA_BC_F, DNA_UMI, DNA_BC_R, RNA_BC_F, RNA_UMI, and RNA_BC_R. Condition together with Replicate has to form a unique name. Neither field may contain the characters _ or . in its entries. Multiple file names are allowed, separated by ;. An example experiment file can be found here: resources/example_experiment.csv.
Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz
Forward and reverse read
The experiment file has a header with Condition, Replicate, DNA_BC_F, DNA_BC_R, RNA_BC_F, and RNA_BC_R. Condition together with Replicate has to form a unique name. Neither field may contain the characters _ or . in its entries. Multiple file names are allowed, separated by ;.
Only forward read
The experiment file has a header with Condition, Replicate, DNA_BC_F, and RNA_BC_F. Condition together with Replicate has to form a unique name. Neither field may contain the characters _ or . in its entries. Multiple file names are allowed, separated by ;.
Forward, reverse, and UMI read using demultiplex option
The experiment file has a header with Condition, Replicate, BC_DNA, BC_RNA, BC_F, BC_R, UMI, and INDEX. Condition together with Replicate has to form a unique name. Neither field may contain the characters _ or . in its entries. Multiple file names are allowed, separated by ;.