Config File
The config file is a YAML file that contains the configuration of the workflow. We recommend using one config file per MPRA experiment or MPRA project, although, in principle, many different experiments can be configured in a single file. The file is divided into version (the version of MPRAsnakeflow used), assignments (the assignment workflow), and experiments (the count workflow). A full example file with default configurations is available at config/example_config.yaml:
---
version: "0.7"
assignments:
  exampleAssignment: # name of an example assignment (can be any string)
    bc_length: 15
    alignment_tool:
      split_number: 1 # number of files fastq should be split for parallelization
      tool: exact # bbmap, bwa or exact
      configs:
        sequence_length: 171 # sequence length of design excluding adapters.
        alignment_start: 1 # start of the alignment in the reference/design_file
    FWD:
      - resources/assoc_basic/data/SRR10800986_1.fastq.gz
    BC:
      - resources/assoc_basic/data/SRR10800986_2.fastq.gz
    REV:
      - resources/assoc_basic/data/SRR10800986_3.fastq.gz
    design_file: resources/assoc_basic/design.fa
    configs:
      default: {} # name of an example filtering config
experiments:
  exampleCount:
    bc_length: 15
    umi_length: 10
    data_folder: resources/count_basic/data
    experiment_file: resources/count_basic/experiment.csv
    demultiplex: false
    assignments:
      fromFile:
        type: file
        assignment_file: resources/count_basic/SRR10800986_barcodes_to_coords.tsv.gz
    # label_file: resources/labels.tsv # optional
    configs:
      default: {}
Note that the config file is validated against a JSON schema. If the config file is not valid, the workflow exits with an error message. The schema is located in workflow/schemas/config.schema.yaml.
Version Settings
Set the version of MPRAsnakeflow that this configuration was written for. This is important for future updates: the version is used to check whether the config file is compatible with the current version of the workflow. If the versions are not compatible, the workflow exits with an error message.
version:
  description: Version of MPRAsnakeflow
  type: string
  pattern: ^(\d+(\.\d+)?(\.\d+)?)|(0\.\d+(\.\d+)?)$
skip_version_check:
  description: Skip version check
  type: boolean
  default: false
- version:
  A string like “0.2.0” or “1.2”. When the major version is “0”, the minor version must match that of MPRAsnakeflow, e.g., “0.2.0” is compatible with MPRAsnakeflow 0.2.0, 0.2.1, or 0.2.2. When the major version is greater than 0, only the major version must match that of MPRAsnakeflow. For example, a config of “1.2.1” is compatible with MPRAsnakeflow 1.7 or 1.0.
- skip_version_check:
  (Optional) If set to true, the version check is skipped. Default is false.
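As a minimal sketch of the two top-level keys in the schema above: the value 0.7 is taken from the example file at the top of this page, and it is quoted because the schema expects a string (an unquoted 0.7 would be read by YAML as a number). This is an illustration, not a recommended setting.

```yaml
# Config written for the 0.7 series of MPRAsnakeflow (value from the example file).
version: "0.7"
# Optional. false is the schema default, so this line could be left out.
skip_version_check: false
```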
Assignment Workflow
The assignment workflow is configured in the assignments section. The following settings are possible:
assignments:
  description: Assignments to run with configurations
  type: object
  patternProperties:
    description: name of the assignment
    ^([^_\.]+)$:
      type: object
      properties:
        alignment_tool:
          type: object
          properties:
            split_number:
              type: integer
              default: 1
            tool:
              type: string
              enum:
                - exact
                - bwa
                - bbmap
                - bwa-additional-filtering
                - pbmm2
              default: bbmap
          allOf:
            - if:
                properties:
                  tool:
                    const: bwa
              then:
                properties:
                  configs:
                    type: object
                    properties:
                      min_mapping_quality:
                        type: integer
                        minimum: 0
                        default: 1
                      sequence_length:
                        type: object
                        properties:
                          min:
                            type: integer
                          max:
                            type: integer
                      alignment_start:
                        type: object
                        properties:
                          min:
                            type: integer
                          max:
                            type: integer
                      M:
                        type: boolean
                        description: mark shorter split hits as secondary
                        default: true
                      L:
                        type: array
                        description: penalty for 5'- and 3'-end clipping 80
                        items:
                          type: integer
                        minItems: 1
                        maxItems: 2
                        default: [80]
                      cigar_filter_regex:
                        type: string
                        description: Optional regex for full CIGAR matching (e.g. 200M or 200M|210M)
                    required:
                      - min_mapping_quality
                required:
                  - configs
            - if:
                properties:
                  tool:
                    const: bwa-additional-filtering
              then:
                properties:
                  configs:
                    type: object
                    properties:
                      sequence_length:
                        type: integer
                        minimum: 1
                      min_mapping_quality:
                        type: integer
                        minimum: 0
                        default: 1
                      M:
                        type: boolean
                        description: mark shorter split hits as secondary
                        default: true
                      L:
                        type: array
                        description: penalty for 5'- and 3'-end clipping 80
                        items:
                          type: integer
                        minItems: 1
                        maxItems: 2
                        default: [80]
                      identity_threshold:
                        description: Identity threshold is used to choose which alignments are worth trying to rescue.
                        type: number
                        default: 0.98
                      mismatches_threshold:
                        description: Threshold of mismatches we investigate if we should try to rescue.
                        type: integer
                        default: 3
                      verbose:
                        description: print which alignments were rescued and which could not be rescued
                        type: boolean
                        default: false
                    required:
                      - sequence_length
                      - min_mapping_quality
                required:
                  - configs
            - if:
                properties:
                  tool:
                    const: bbmap
              then:
                properties:
                  configs:
                    type: object
                    properties:
                      min_mapping_quality:
                        type: integer
                        minimum: 0
                        default: 30
                      cigar_filter_regex:
                        type: string
                        description: Optional regex for full CIGAR matching (e.g. 200M or 200M|210M)
                    required:
                      - min_mapping_quality
                required:
                  - configs
            - if:
                properties:
                  tool:
                    const: exact
              then:
                properties:
                  configs:
                    type: object
                    properties:
                      sequence_length:
                        type: integer
                        minimum: 1
                      alignment_start:
                        type: integer
                        minimum: 1
                    required:
                      - sequence_length
                      - alignment_start
                required:
                  - configs
            - if:
                properties:
                  tool:
                    const: pbmm2
              then:
                properties:
                  configs:
                    type: object
                    properties:
                      preset:
                        type: string
                        enum:
                          - SUBREAD
                          - CCS
                          - ISOSEQ
                          - HIFI
                          - UNROLLED
                        default: SUBREAD
                      min_concordance:
                        type: number
                        minimum: 0
                        maximum: 1
                        default: 0.9
                    required:
                      - preset
                      - min_concordance
                required:
                  - configs
          required:
            - tool
        bc_length:
          type: integer
        BC_rev_comp:
          type: boolean
          default: false
        linker_length:
          type: integer
        linker:
          type: string
          pattern: ^[ATCGNatcgn]+$
        FWD:
          type: array
          items:
            type: string
          minItems: 1
          uniqueItems: true
        BC:
          type: array
          items:
            type: string
          minItems: 1
          uniqueItems: true
        REV:
          type: array
          items:
            type: string
          minItems: 1
          uniqueItems: true
        adapters:
          type: object
          properties:
            BC:
              oneOf:
                - type: object
                  properties:
                    five_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                    three_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                - type: array
                  items:
                    type: integer
                  minItems: 1
                  uniqueItems: true
            FWD:
              oneOf:
                - type: object
                  properties:
                    five_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                    three_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                - type: array
                  items:
                    type: integer
                  minItems: 1
                  uniqueItems: true
            REV:
              oneOf:
                - type: object
                  properties:
                    five_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                    three_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                - type: array
                  items:
                    type: integer
                  minItems: 1
                  uniqueItems: true
        merge_tool:
          type: string
          enum:
            - NGmerge
            - fastq-join
          default: NGmerge
        NGmerge:
          type: object
          properties:
            min_overlap:
              type: integer
              default: 20
            frac_mismatches_allowed:
              type: number
              default: 0.1
            min_dovetailed_overlap:
              type: integer
              default: 50
          required:
            - min_overlap
            - frac_mismatches_allowed
            - min_dovetailed_overlap
          default: {}
        fastq-join:
          type: object
          properties:
            min_overlap:
              type: integer
              default: 6
            max_pct_mismatch:
              type: number
              default: 8
          required:
            - min_overlap
            - max_pct_mismatch
          default: {}
        design_file:
          type: string
        design_check:
          type: object
          properties:
            fast:
              type: boolean
              default: true
            sequence_collisions:
              type: boolean
              default: true
            sequence_start:
              type: integer
              minimum: 1
            sequence_length:
              type: integer
              minimum: 1
          default: {}
          required:
            - fast
            - sequence_collisions
          allOf:
            - if:
                properties:
                  sequence_collisions:
                    const: true
                required:
                  - sequence_collisions
              then:
                required:
                  - sequence_start
                  - sequence_length
        strand_sensitive:
          type: object
          default: {}
          properties:
            enable:
              type: boolean
              default: false
            forward_adapter:
              type: string
              pattern: ^[ATCGN]+$
              default: AGGACCGGATCAACT
            reverse_adapter:
              type: string
              pattern: ^[ATCGN]+$
              default: TCGGTTCACGCAATG
          required:
            - enable
        configs:
          type: object
          patternProperties:
            ^([^_\.]+)$:
              type: object
              properties:
                min_support:
                  type: integer
                  minimum: 1
                  default: 3
                fraction:
                  type: number
                  exclusiveMinimum: 0.5
                  maximum: 1
                  default: 0.75
              required:
                - min_support
                - fraction
          minProperties: 1
      oneOf:
        - required:
            - FWD
            - linker_length
        - required:
            - FWD
            - linker
        - required:
            - FWD
            - BC
        - required:
            - long_read_input
            - linker
      required:
        - strand_sensitive
        - bc_length
        - design_file
        - configs
        - alignment_tool
  minProperties: 1
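For each assignment you want to process, you must give it a name like exampleAssignment. The name must not contain an underscore (_) or a dot (.), as enforced by the name pattern in the schema above. The name is used to name the output files.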
- alignment_tool:
  Alignment tool configuration that is used to map the reads to the oligos.
  - split_number:
    To parallelize mapping for the assignment, the reads are split into split_number files. For example, setting it to 300 means that the reads are split into 300 files, and each file is mapped in parallel. This is only useful when using a cluster. When running the workflow on a single machine, the default value should be used. Default is 1. (For technical reasons, when multiple assignments are defined, all will be set to the maximum defined in the config.)
  - tool:
    Alignment tool that is used. Currently, bbmap, bwa, bwa-additional-filtering, exact, and pbmm2 are supported. Default is bbmap.
  - configs:
    Configuration of the selected alignment tool.
    - sequence_length (exact, bwa-additional-filtering):
      Defines the sequence_length, which is the length of a sequence alignment to an oligo in the design file. Only designs with a single sequence length are supported.
    - alignment_start (exact):
      Defines the start of the alignment in an oligo. When using adapters, you must set the length of the adapter. Otherwise, 1 will be the choice for most cases.
    - sequence_length (bwa):
      (Optional) Defines the min and max of a sequence_length specification. sequence_length is the length of a sequence alignment to an oligo in the design file. Because there can be insertions and deletions, we recommend varying it slightly around the exact length (e.g., ±5). This option enables designs with multiple sequence lengths.
    - alignment_start (bwa):
      (Optional) Defines the min and max of the start of the alignment in an oligo. When using adapters, you must set the length of the adapter. Otherwise, 1 will be the choice for most cases. We also recommend varying this value slightly because the start might not be exact after the adapter (e.g., ±1).
    - min_mapping_quality (bwa, bwa-additional-filtering, bbmap):
      (Optional) Defines the minimum mapping quality (MAPQ) of the alignment to an oligo. MAPQ values differ between bbmap and bwa. For bwa, when oligos differ by only 1 bp, it is recommended to set it to 1. bbmap handles this case better, and values such as 30 or 35 can be used. For regions with larger edit distances, 30 or 40 might be a good choice. Default is 1 for bwa and bwa-additional-filtering and 30 for bbmap.
    - cigar_filter_regex (bwa, bbmap):
      (Optional) Regular expression to filter alignments by CIGAR string before barcode assignment. The full CIGAR string must match. Example values are 200M or 200M|210M. If not set, no CIGAR-based filtering is applied.
    - M (bwa, bwa-additional-filtering):
      (Optional) BWA option -M: Mark shorter split hits as secondary. Default is true.
    - L (bwa, bwa-additional-filtering):
      (Optional) BWA option -L: Array with one value (used for both ends) or two values for the penalty of 5’- and 3’-end clipping. Default in MPRAsnakeflow is [80]. Default in BWA mem is [5, 5].
    - identity_threshold (bwa-additional-filtering):
      (Optional) Identity threshold used to choose which alignments are worth trying to rescue. Default is 0.98.
    - mismatches_threshold (bwa-additional-filtering):
      (Optional) Threshold of mismatches we investigate if we should try to rescue. Default is 3.
    - verbose (bwa-additional-filtering):
      (Optional) Print which alignments were rescued and which could not be rescued. Default is false.
    - preset (pbmm2):
      (Optional) Preset for pbmm2 alignment. Default is SUBREAD.
    - min_concordance (pbmm2):
      (Optional) Minimum concordance for pbmm2 alignment. Default is 0.9.
- bc_length:
  Length of the barcode. Must match the length of BC.
- BC_rev_comp:
  (Optional) If set to true, the barcode is reverse complemented. Default is false.
- linker_length:
  (Optional) Length of the linker. Only needed if you don’t have a barcode read and the barcode is in the forward read with the structure: BC+Linker+Insert. The fixed length is used for the linker after a fixed length of BC. The recommended option is linker, which defines the exact linker sequence and uses cutadapt for trimming.
- linker:
  (Required for long read, otherwise optional) The exact linker between BC and oligo. Short read data: Only needed if you don’t have a barcode read and the barcode is in the forward read with the structure: BC+Linker+Insert. Uses cutadapt to trim the linker to get the barcode as well as the start of the insert. Long read data: Required! BC will be taken after the linker.
- FWD:
  List of forward-read files in gzipped fastq format. The full or relative path to the files should be used. The same order in FWD, BC, and REV is important.
- REV:
  (Optional) List of reverse-read files in gzipped fastq format. Files have to overlap the FWD read by at least 10 bp (see NGmerge and min_dovetailed_overlap). The full or relative path to the files should be used. The same order in FWD, BC, and REV is important.
- BC:
  (Optional) List of index-read files in gzipped fastq format. The full or relative path to the files should be used. The same order in FWD, BC, and REV is important. If not set, the BC must be in the FWD read, and the linker or fixed-length option has to be used to extract the BC.
- adapters:
  (Optional) List of adapter sequences or fixed lengths to trim from reads before running the workflow. Can be configured for all read inputs (FWD, REV, BC). See Adapter trimming for a detailed overview.
- merge_tool:
  (Optional) Tool to merge the FWD and REV reads into one read. Currently, NGmerge and fastq-join are supported. Default is NGmerge.
- NGmerge:
  (Optional) Options for NGmerge. NGmerge is used to merge FWD and REV reads. The following options are possible (we recommend using the default values):
  - min_overlap:
    (Optional) Minimum overlap of the reads. Default is 20.
  - frac_mismatches_allowed:
    (Optional) Fraction of mismatches allowed in the overlap. Default is 0.1.
  - min_dovetailed_overlap:
    (Optional) Minimum dovetailed overlap. Default is 50.
- fastq-join:
  (Optional) Options for fastq-join. Fastq-join is used to merge FWD and REV reads. The following options are possible (we recommend using the default values):
  - min_overlap:
    (Optional) N-minimum overlap. fastq-join option -m. Default is 6.
  - max_pct_mismatch:
    (Optional) N-percent maximum difference. fastq-join option -p. Default is 8.
- design_file:
  Design file (full or relative path) containing the oligos in fasta format. The header should contain the oligo name and should be unique. The sequence should be the sequence of the oligo and must also be unique. When multiple oligo names share the same sequence, please merge them into one fasta entry. The oligo name is later used to link the barcode to the oligo. The sequence is used to map the reads to the oligos. Adapters can be part of the sequence; in that case, alignment_start has to be adjusted.
- design_check:
  (Optional) Options for checking your design fasta file. The design file must not contain [ or ] or duplicated headers, and for best performance, sequences should not be identical.
  - fast:
    (Optional) Use a simple dictionary to find identical sequences. This is faster but compares only the whole sequence (or its center part, depending on start/length) of each entry in the design file. It cannot find sequences that are substrings of another sequence. Set to false for a more correct, but slower, search. Default is true.
  - sequence_collisions:
    (Optional) Check if there are identical sequences in the design file. Default is true.
  - sequence_start:
    (Conditionally required) 1-based start position used for sequence collision checking. Required only when sequence_collisions is set to true and no alignment_start is defined via the mapping tool config (bwa or exact).
  - sequence_length:
    (Conditionally required) Number of bases used for sequence collision checking. Required only when sequence_collisions is set to true and no sequence_length is defined via the mapping tool config (bwa, bwa-additional-filtering, or exact).
- strand_sensitive:
  (Optional) If enabled, the reads are mapped to the oligos in a strand-sensitive way by adding unique adapters to both ends of the oligo reference as well as the FASTQ files. By default, this option is not enabled.
  - enable:
    (Optional) If set to true, the strand-sensitive mapping is enabled. Default is false.
  - forward_adapter:
    (Optional) Adapter sequence added 5’ of the oligo. Default is AGGACCGGATCAACT.
  - reverse_adapter:
    (Optional) Adapter sequence added 3’ of the oligo. Default is TCGGTTCACGCAATG.
- configs:
  After mapping the reads to the design file and extracting the barcodes per oligo, one or more named configurations can be defined, each producing a differently filtered final barcode-to-oligo mapping. Use <your_config_name>: {} to use the default values for the keys. Each configuration is a dictionary with the following keys:
  - min_support:
    Minimum number of times the same BC must map to the same oligo. A larger value gives more evidence that the assignment is correct, but can remove many BCs (depending on the complexity, sequencing depth, and quality of sequencing). Recommended option is 3, which is also the default.
  - fraction:
    Minimum fraction of occurrences of the same BC that map to the same oligo. E.g. 0.7 means that at least 70% of the occurrences of a BC map to the same oligo. A larger value gives more evidence that the assignment is correct, but can remove many BCs (depending on the complexity, sequencing depth, and quality of sequencing). Must be greater than 0.5 and at most 1. Recommended option is 0.7. Default is 0.75.
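The settings above can be combined as in the following sketch of an assignment that uses bwa instead of exact matching. The read and design paths are taken from the example file at the top of this page; the names exampleBwa and strict, the min/max windows, and the stricter filter values are hypothetical and only illustrate the structure. The fragment has not been run through the workflow.

```yaml
assignments:
  exampleBwa: # hypothetical name; must not contain "_" or "."
    bc_length: 15
    alignment_tool:
      split_number: 1 # keep the default on a single machine
      tool: bwa
      configs:
        min_mapping_quality: 1 # suggested for oligos that differ by 1 bp
        sequence_length: # optional; window around a 171 bp design (assumed values)
          min: 166
          max: 176
        alignment_start: # optional; small window at the oligo start (assumed values)
          min: 1
          max: 2
    FWD:
      - resources/assoc_basic/data/SRR10800986_1.fastq.gz
    BC:
      - resources/assoc_basic/data/SRR10800986_2.fastq.gz
    REV:
      - resources/assoc_basic/data/SRR10800986_3.fastq.gz
    design_file: resources/assoc_basic/design.fa
    configs:
      default: {} # schema defaults: min_support 3, fraction 0.75
      strict: # hypothetical second filter config
        min_support: 5
        fraction: 0.9
```

Defining two entries under configs yields two filtered barcode-to-oligo mappings from the same alignment, so the effect of stricter thresholds can be compared without mapping twice.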
Experiment workflow (including counts)
The experiment workflow is configured in the experiments section. Each experiment run contains one experiment file with all replicates of an experiment. The following settings are possible:
experiments:
  description: MPRA experiments to run with configurations
  type: object
  patternProperties:
    description: name of the experiment
    ^([^_\.]+)$:
      type: object
      properties:
        split_number:
          type: integer
          default: 1
        bc_length:
          type: integer
          minimum: 1
        bc_extraction:
          type: string
          enum:
            - start
            - end
          default: start
        umi_length:
          type: integer
          minimum: 1
        umi_extraction:
          type: string
          enum:
            - start
            - end
          default: start
        adapters:
          type: object
          properties:
            UMI:
              oneOf:
                - type: object
                  properties:
                    five_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                    three_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                - type: array
                  items:
                    type: integer
                  minItems: 1
                  uniqueItems: true
            FWD:
              oneOf:
                - type: object
                  properties:
                    five_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                    three_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                - type: array
                  items:
                    type: integer
                  minItems: 1
                  uniqueItems: true
            REV:
              oneOf:
                - type: object
                  properties:
                    five_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                    three_prime:
                      type: array
                      items:
                        type: string
                        pattern: ^[ATCGNatcgn]+$
                      minItems: 1
                      uniqueItems: true
                - type: array
                  items:
                    type: integer
                  minItems: 1
                  uniqueItems: true
        data_folder:
          type: string
        experiment_file:
          type: string
        demultiplex:
          type: boolean
          default: false
        merge_tool:
          type: string
          enum:
            - custom
            - NGmerge
          default: custom
        NGmerge:
          type: object
          properties:
            min_overlap:
              type: integer
              default: 11
            frac_mismatches_allowed:
              type: number
              default: 0.1
          required:
            - min_overlap
            - frac_mismatches_allowed
          default: {}
        label_file:
          type: string
        assignments:
          type: object
          patternProperties:
            ^([^_\.]+)$:
              type: object
              properties:
                type:
                  type: string
                  enum:
                    - file
                    - config
                assignment_file:
                  type: string
                assignment_name:
                  type: string
                assignment_config:
                  type: string
                sampling:
                  type: object
                  properties:
                    prop:
                      type: number
                      exclusiveMinimum: 0
                      maximum: 1
                    total:
                      type: integer
                      minimum: 1
              required:
                - type
              allOf:
                - if:
                    properties:
                      type:
                        const: config
                    required:
                      - type
                  then:
                    required:
                      - assignment_name
                      - assignment_config
                - if:
                    properties:
                      type:
                        const: file
                    required:
                      - type
                  then:
                    required:
                      - assignment_file
        configs:
          type: object
          patternProperties:
            ^([^_\.]+)$:
              type: object
              properties:
                filter:
                  type: object
                  default: {}
                  properties:
                    bc_threshold:
                      type: integer
                      minimum: 1
                      default: 10
                    outlier_detection:
                      type: object
                      properties:
                        method:
                          type: string
                          enum:
                            - rna_counts_zscore
                        times_zscore:
                          type: number
                          exclusiveMinimum: 0
                          default: 3
                      required:
                        - times_zscore
                      default: {}
                    min_dna_counts:
                      type: integer
                      minimum: 0
                      default: 1
                    min_rna_counts:
                      type: integer
                      minimum: 0
                      default: 1
                  required:
                    - bc_threshold
                    - min_rna_counts
                    - min_dna_counts
                sampling:
                  type: object
                  patternProperties:
                    ^((DNA)|(RNA))$:
                      type: object
                      properties:
                        threshold:
                          type: integer
                          minimum: 1
                        prop:
                          type: number
                          exclusiveMinimum: 0
                          maximum: 1
                        total:
                          type: number
                          minimum: 1
                        seed:
                          type: integer
              required:
                - filter
        variants:
          type: object
          properties:
            map:
              type: string
            min_barcodes:
              type: array
              items:
                type: integer
                minimum: 1
          required:
            - map
            - min_barcodes
      # entries that have to be in the config file for successful validation
      required:
        - split_number
        - bc_length
        - data_folder
        - experiment_file
        - demultiplex
        - assignments
        - configs
- bc_length:
  Length of the barcode. This is used to extract the barcode from the index read. The barcode is extracted from the first bc_length bases of the index read. When no reverse read is given and adapters is not set, the exact length is used to extract the DNA BC from the FWD read.
- umi_length:
  (Optional) Length of the UMI. This is used to extract the UMI from the index read. The UMI is extracted from the last umi_length bases of the index read. Please provide it if you use UMIs.
- split_number:
  (Optional) To parallelize merging of forward and reverse reads, they can be split into split_number files. For example, setting it to 30 means that the reads are split into 30 files, and each file is trimmed (if set) and merged in parallel. This is only useful when using a cluster to speed up the slower merging step. When running the workflow on a single machine, the default value should be used. Default is 1. (For technical reasons, when multiple experiments are defined, all will be set to the maximum defined in the config.)
- adapters:
  (Optional) List of adapter sequences or fixed lengths to trim from reads before running the workflow. Can be configured for all read inputs (FWD, REV, UMI). See Adapter trimming for a detailed overview.
- data_folder:
  Folder where the fastq files are located. Files are defined in the experiment_file. The full or relative path to the folder should be used.
- experiment_file:
  Path to the experiment file. The full or relative path to the file should be used. The experiment file is a comma-separated file and is described in the Experiment file section.
- demultiplex:
  (Optional) If set to true, the reads are demultiplexed. This means that the reads are split into different files for each barcode. This is useful for further analysis. Default is false.
- merge_tool:
  (Optional) Select the read-merging/counting backend for paired-read experiments.
  - custom:
    Keep the legacy BAM-based path (FastQ2doubleIndexBAM + MergeTrimReadsBAM). Slower, but usually gives better results.
  - NGmerge:
    Use NGmerge-based merging for paired reads. This is supported for both no-UMI and UMI experiments. Usually faster than custom, but the correlation across replicates might be lower.
    no-UMI: FWD/REV are merged via NGmerge and barcode counts are extracted from merged reads.
    UMI: UMI reads are first attached to read headers, then FWD/REV are merged via NGmerge; BCxUMI counts are extracted from merged read headers and sequences.
  Default is custom.
- NGmerge:
  (Optional) NGmerge options for experiment counts when merge_tool is set to NGmerge.
  - min_overlap:
    (Optional) Minimum overlap of the reads. NGmerge option -m. Default is 11.
  - frac_mismatches_allowed:
    (Optional) Fraction of mismatches allowed in the overlap. NGmerge option -p. Default is 0.1.
- label_file:
  (Optional) Path to the label file. The full or relative path to the file should be used. The label file is a tab-separated file and contains the oligo name and its label. The oligo name should be the same as in the design file. The label is used to group the oligos in the final output, e.g. for plotting.
  insert1_name    label1
  insert2_name    label1
  insert3_name    label2
- assignments:
  Multiple assignments can be defined per experiment (using different names). Every named assignment contains the following settings:
  - type:
    Can be file or config. file means that you use a mapping file which is tab separated and gzipped. It contains the barcode in the first column and the oligo name in the second column. This file can be generated by the Assignment workflow. config means that you are referring to an assignment that is specified in this config file.
  - assignment_file:
    When using file, please insert the path to the assignment file (tsv.gz). When using config, set assignment_name and assignment_config instead.
  - assignment_name:
    When using config, please insert the name of the assignment specified in the config file.
  - assignment_config:
    When using config, please insert the name of the filtering config of assignment_name that you want to use.
  - sampling:
    (Optional) Options for randomly removing barcodes from the assignment. For debugging purposes only.
    - prop:
      Sample down the BCs in the assignment file to this proportion.
    - total:
      Sample down the BCs in the assignment file to this number.
- configs:
  Each experiment run can have multiple configurations including filter and sampling options.
  - filter:
    (Optional) Filter options. These options are available:
    - bc_threshold:
      Minimum number of different BCs required per oligo. A higher value normally increases the correlation between replicates but also reduces the number of final oligos. Default option is 10.
    - min_dna_counts:
      Minimum number of DNA counts per barcode. When set to 0, a pseudo count is added. Default option is 1.
    - min_rna_counts:
      Minimum number of RNA counts per barcode. When set to 0, a pseudo count is added. Default option is 1.
    - outlier_detection:
      (Optional) Outlier detection. Methods and strategies to remove outlier barcodes in the final counts. The following options are possible:
      - method:
        Method to remove outliers. Currently rna_counts_zscore, ratio_mad or none (no outlier detection) are supported. Default option is rna_counts_zscore.
      - mad_bins:
        (Optional) For method ratio_mad: Number of bins for the median absolute deviation (MAD) method. Default option is 20.
      - times_mad:
        (Optional) For method ratio_mad: Times the MAD to remove outliers. Default option is 5.
      - times_zscore:
        (Optional) For method rna_counts_zscore: Times the zscore to remove outliers. Default option is 3.
  - sampling:
    (Optional) Options for sampling counts and barcodes. For debugging purposes only.
    - DNA:
      Settings for sampling DNA counts.
      - threshold:
        Maximum threshold for DNA counts assigned to a BC.
      - prop:
        Sample down the DNA counts to this proportion.
      - total:
        Sample down the DNA counts to this number.
      - seed:
        Seed for the random DNA sampling.
    - RNA:
      Settings for sampling RNA counts.
      - threshold:
        Maximum threshold for RNA counts assigned to a BC.
      - prop:
        Sample down the RNA counts to this proportion.
      - total:
        Sample down the RNA counts to this number.
      - seed:
        Seed for the random RNA sampling.
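To illustrate how the assignments and configs settings of an experiment fit together, here is a sketch of an experiment that reuses an assignment defined in the same config file (type: config) and adds a stricter filter next to the defaults. The data paths and the assignment name exampleAssignment come from the example file at the top of this page; the names fromWorkflow and strict and the bc_threshold of 20 are hypothetical.

```yaml
experiments:
  exampleCount:
    bc_length: 15
    umi_length: 10
    data_folder: resources/count_basic/data
    experiment_file: resources/count_basic/experiment.csv
    demultiplex: false
    assignments:
      fromWorkflow: # hypothetical name; must not contain "_" or "."
        type: config
        assignment_name: exampleAssignment # key under the top-level assignments section
        assignment_config: default # key under that assignment's configs
    configs:
      default: {} # schema defaults for all filter options
      strict: # hypothetical second configuration
        filter:
          bc_threshold: 20 # assumed value; the schema default is 10
          min_dna_counts: 1
          min_rna_counts: 1
          outlier_detection:
            method: rna_counts_zscore
            times_zscore: 3
```

With type: config, assignment_file is not needed; the schema instead requires assignment_name and assignment_config.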
Experiment file
The experiment file can take one of four forms, depending on the available reads:
Forward, reverse, and UMI read
The experiment file has a header with Condition, Replicate, DNA_BC_F, DNA_UMI, DNA_BC_R, RNA_BC_F, RNA_UMI, and RNA_BC_R. The combination of Condition and Replicate must be unique, and neither field may contain an underscore (_) or a dot (.). Multiple file names are allowed when separated by a semicolon (;). An example experiment file can be found here: resources/example_experiment.csv.
Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz
Forward and reverse read
The experiment file has a header with Condition, Replicate, DNA_BC_F, DNA_BC_R, RNA_BC_F, and RNA_BC_R. The combination of Condition and Replicate must be unique, and neither field may contain an underscore (_) or a dot (.). Multiple file names are allowed when separated by a semicolon (;).
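For such a paired-read experiment without UMIs, the matching config entry could look like the following sketch. It omits umi_length and selects NGmerge as merge_tool, which the merge_tool description lists as supported for no-UMI experiments. The name examplePaired and the two path/to placeholders are hypothetical; the assignment file path is the one from the example file at the top of this page.

```yaml
experiments:
  examplePaired: # hypothetical name
    bc_length: 15
    # no umi_length: the experiment file has no DNA_UMI/RNA_UMI columns
    data_folder: path/to/data # placeholder
    experiment_file: path/to/experiment.csv # placeholder; forward and reverse columns only
    demultiplex: false
    merge_tool: NGmerge # schema default is custom
    NGmerge:
      min_overlap: 11 # schema default
      frac_mismatches_allowed: 0.1 # schema default
    assignments:
      fromFile:
        type: file
        assignment_file: resources/count_basic/SRR10800986_barcodes_to_coords.tsv.gz
    configs:
      default: {}
```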
Only forward read
The experiment file has a header with Condition, Replicate, DNA_BC_F, and RNA_BC_F. The combination of Condition and Replicate must be unique, and neither field may contain an underscore (_) or a dot (.). Multiple file names are allowed when separated by a semicolon (;).
Forward, reverse, and UMI read using demultiplex option
The experiment file has a header with Condition, Replicate, BC_DNA, BC_RNA, BC_F, BC_R, UMI, and INDEX. The combination of Condition and Replicate must be unique, and neither field may contain an underscore (_) or a dot (.). Multiple file names are allowed when separated by a semicolon (;).
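This last form is meant to be used together with the demultiplex setting of the experiment. A minimal sketch of the corresponding config entry, assuming an experiment file with the header above: the name exampleDemultiplex and the two path/to placeholders are hypothetical, while the barcode and UMI lengths and the assignment file path are copied from the example file at the top of this page.

```yaml
experiments:
  exampleDemultiplex: # hypothetical name
    bc_length: 15
    umi_length: 10
    data_folder: path/to/data # placeholder
    experiment_file: path/to/demultiplex_experiment.csv # placeholder; uses the BC_DNA/BC_RNA/... header
    demultiplex: true # schema default is false
    assignments:
      fromFile:
        type: file
        assignment_file: resources/count_basic/SRR10800986_barcodes_to_coords.tsv.gz
    configs:
      default: {}
```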