Config File

The config file is a YAML file that configures the workflow runs. We recommend using one config file per MPRA experiment or MPRA project, although in principle many different experiments can be configured in a single file. The file is divided into version (version of MPRAsnakeflow used), assignments (assignment workflow), and experiments (count workflow). A full example file with default configurations is available at config/example_config.yaml:

---
version: "0.3"
assignments:
  exampleAssignment: # name of an example assignment (can be any string)
    bc_length: 15
    alignment_tool:
      split_number: 1 # number of files fastq should be split for parallelization
      tool: exact # bbmap, bwa or exact
      configs:
        sequence_length: 171 # sequence length of design excluding adapters.
        alignment_start: 1 # start of the alignment in the reference/design_file
    FW:
      - resources/assoc_basic/data/SRR10800986_1.fastq.gz
    BC:
      - resources/assoc_basic/data/SRR10800986_2.fastq.gz
    REV:
      - resources/assoc_basic/data/SRR10800986_3.fastq.gz
    design_file: resources/assoc_basic/design.fa
    configs:
      default: {} # name of an example filtering config
experiments:
  exampleCount:
    bc_length: 15
    umi_length: 10
    data_folder: resources/count_basic/data
    experiment_file: resources/count_basic/experiment.csv
    demultiplex: false
    assignments:
      fromFile:
        type: file
        assignment_file: resources/count_basic/SRR10800986_barcodes_to_coords.tsv.gz
    # label_file: resources/labels.tsv # optional
    configs:
      default: {}

Note that the config file is validated against a JSON schema. If the config file is not valid, the workflow exits with an error message. The schema is located in workflow/schemas/config.schema.yaml.

Version settings

Set the version of MPRAsnakeflow that this configuration was written for. This is important for future updates: the version is used to check whether the config file is compatible with the current version of the workflow. If the versions are not compatible, the workflow exits with an error message.

  version:
    description: Version of MPRAsnakeflow
    type: string
    pattern: ^(\d+(\.\d+)?(\.\d+)?)|(0\.\d+(\.\d+)?)$
  skip_version_check:
    description: Skip version check
    type: boolean
    default: false

version:: A string like “0.2.0” or “1.2”. When the major version is “0”, the minor version has to match that of MPRAsnakeflow, e.g. a config with “0.2.0” is compatible with MPRAsnakeflow 0.2.0 as well as 0.2.1 or 0.2.2. When the major version is greater than 0, only the major version has to match, e.g. a config with “1.2.1” also fits MPRAsnakeflow 1.7 or 1.0.
skip_version_check:: (Optional) If set to true, the version check is skipped. Default is false.
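
As a small illustration, the top of a config file written for a 0.3.x workflow might look like the fragment below. The value of skip_version_check is simply the schema default and can be omitted.

```yaml
# Top-level version settings (sketch; values are illustrative)
version: "0.3" # quoted so that YAML reads it as a string, as the schema requires
skip_version_check: false # schema default; may be left out
```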

Assignment workflow

The assignment workflow is configured in the assignments section. The following settings are possible:

  assignments:
    description: Assignments to run with configurations
    type: object
    patternProperties:
      description: name of the assignment
      ^([^_\.]+)$:
        type: object
        properties:
          alignment_tool:
            type: object
            properties:
              split_number:
                type: integer
                default: 1
              tool:
                type: string
                enum:
                  - exact
                  - bwa
                  - bbmap
                default: bbmap
            allOf:
              - if:
                  properties:
                    tool:
                      const: bwa
                then:
                  properties:
                    configs:
                      type: object
                      properties:
                        min_mapping_quality:
                          type: integer
                          minimum: 0
                          default: 1
                        sequence_length:
                          type: object
                          properties:
                            min:
                              type: integer
                            max:
                              type: integer
                          required:
                            - min
                            - max
                        alignment_start:
                          type: object
                          properties:
                            min:
                              type: integer
                            max:
                              type: integer
                          required:
                            - min
                            - max
                      required:
                        - sequence_length
                        - alignment_start
                        - min_mapping_quality
                  required:
                    - configs
              - if:
                  properties:
                    tool:
                      const: bbmap
                then:
                  properties:
                    configs:
                      type: object
                      properties:
                        min_mapping_quality:
                          type: integer
                          minimum: 0
                          default: 30
                        sequence_length:
                          type: integer
                          minimum: 1
                        alignment_start:
                          type: integer
                          minimum: 1
                      required:
                        - min_mapping_quality
                        - sequence_length
                        - alignment_start
                  required:
                    - configs
              - if:
                  properties:
                    tool:
                      const: exact
                then:
                  properties:
                    configs:
                      type: object
                      properties:
                        sequence_length:
                          type: integer
                          minimum: 1
                        alignment_start:
                          type: integer
                          minimum: 1
                      required:
                        - sequence_length
                        - alignment_start
                  required:
                    - configs
            required:
              - tool
          bc_length:
            type: integer
          BC_rev_comp:
            type: boolean
            default: false
          linker_length:
            type: integer
          linker:
            type: string
            pattern: ^[ATCGNatcgn]+$
          FW:
            type: array
            items:
              type: string
            minItems: 1
            uniqueItems: true
          BC:
            type: array
            items:
              type: string
            minItems: 1
            uniqueItems: true
          REV:
            type: array
            items:
              type: string
            minItems: 1
            uniqueItems: true
          NGmerge:
            type: object
            properties:
              min_overlap:
                type: integer
                default: 20
              frac_mismatches_allowed:
                type: number
                default: 0.1
              min_dovetailed_overlap:
                type: integer
                default: 50
            required:
              - min_overlap
              - frac_mismatches_allowed
              - min_dovetailed_overlap
            default: {}
          design_file:
            type: string
          design_check:
            type: object
            properties:
              fast:
                type: boolean
                default: true
              sequence_collitions:
                type: boolean
                default: true
            default: {}
            required:
              - fast
              - sequence_collitions
          strand_sensitive:
            type: object
            default: {}
            properties:
              enable:
                type: boolean
                default: false
              forward_adapter:
                type: string
                pattern: ^[ATCGN]+$
                default: AGGACCGGATCAACT
              reverse_adapter:
                type: string
                pattern: ^[ATCGN]+$
                default: TCGGTTCACGCAATG
          configs:
            type: object
            patternProperties:
              ^([^_\.]+)$:
                type: object
                properties:
                  min_support:
                    type: integer
                    minimum: 1
                    default: 3
                  fraction:
                    type: number
                    exclusiveMinimum: 0.5
                    maximum: 1
                    default: 0.75
                required:
                  - min_support
                  - fraction
            minProperties: 1
        oneOf:
          - required:
              - linker_length
          - required:
              - linker
          - required:
              - BC
        required:
          - FW
          - REV
          - bc_length
          - design_file
          - configs
          - alignment_tool
          - NGmerge
    minProperties: 1

Each assignment you want to process needs a name, such as exampleAssignment. The name must not contain underscores (_) or dots (.), as required by the schema pattern above. The name is used to name the output files.

alignment_tool:

Alignment tool configuration that is used to map the reads to the oligos.

split_number:

To parallelize mapping for the assignment, the reads are split into split_number files. E.g. setting it to 300 means that the reads are split into 300 files and each file is mapped in parallel. This is only useful when running on a cluster. When running the workflow on a single machine, the default value should be used. The default is 1. (For technical reasons, when multiple assignments are defined, all of them are set to the maximum split_number defined in the config.)

tool:

Alignment tool that is used. Currently bbmap, bwa, and exact are supported. Default is bbmap.

configs:

Configurations of the alignment tool selected.

sequence_length (bwa):: Defines the min and max of the sequence length. sequence_length is the length of a sequence alignment to an oligo in the design file. Because there can be insertions and deletions, we recommend varying it a bit around the exact length (e.g. +-5). In theory, this option enables designs with multiple sequence lengths.
alignment_start (bwa):: Defines the min and max of the start of the alignment in an oligo. When using adapters, this is essentially the length of the adapter. Otherwise, 1 will be the choice for most cases. We also recommend varying this value a bit (e.g. by +-1) because the start might not be exact after the adapter.
min_mapping_quality (bwa, bbmap):: (Optional) Defines the minimum mapping quality (MAPQ) of the alignment to an oligo. MAPQs differ between bbmap and bwa. For bwa, when using oligos with only 1 bp difference, it is recommended to set it to 1. BBMap handles this case better, so for example 30 or 35 can be used. For regions with only larger edit distances, 30 or 40 might be a good choice. Default is 1 for bwa and 30 for bbmap.
sequence_length (exact, bbmap):: Defines the sequence_length, which is the length of a sequence alignment to an oligo in the design file. Only designs with a single length are supported.
alignment_start (exact, bbmap):: Defines the start of the alignment in an oligo. When using adapters, this is essentially the length of the adapter. Otherwise, 1 will be the choice for most cases.
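
To illustrate how the tool-specific configs differ, the following sketch shows an alignment_tool block for bwa (min/max objects) and one for bbmap (single integers). The assignment names are hypothetical, and the numbers assume a 171 bp design without adapters, as in the example file above. The other required assignment keys are omitted.

```yaml
assignments:
  bwaExample: # hypothetical assignment name
    alignment_tool:
      split_number: 1
      tool: bwa
      configs:
        min_mapping_quality: 1 # schema default for bwa
        sequence_length: # 171 +-5 to tolerate insertions and deletions
          min: 166
          max: 176
        alignment_start: # small window around the expected start of 1
          min: 1
          max: 2
    # remaining required keys (bc_length, FW, REV, design_file, configs) omitted
  bbmapExample: # hypothetical assignment name
    alignment_tool:
      tool: bbmap
      configs:
        min_mapping_quality: 30 # schema default for bbmap
        sequence_length: 171
        alignment_start: 1
    # remaining required keys omitted
```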

bc_length:

Length of the barcode. Must match with the length of BC.

BC_rev_comp:

(Optional) If set to true the barcode is reverse complemented. Default is false.

linker_length:

(Optional) Length of the linker. Only needed if you don’t have a barcode read and the barcode is in the FW read with the structure BC+Linker+Insert. A linker of this fixed length is assumed directly after the fixed-length BC. The recommended alternative is the linker option, which defines the exact linker sequence and uses cutadapt for trimming.

linker:

(Optional) Sequence of the linker. Only needed if you don’t have a barcode read and the barcode is in the FW read with the structure BC+Linker+Insert. Uses cutadapt to trim the linker to get the barcode as well as the start of the insert.
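
As a sketch of the case without a separate barcode read, the fragment below takes the barcode from the FW read and trims the linker by sequence. The assignment name, the file paths, and the linker sequence are placeholders, not values shipped with the workflow. Per the oneOf rule in the schema above, exactly one of BC, linker, or linker_length is set.

```yaml
assignments:
  linkerExample: # hypothetical assignment name
    bc_length: 15
    linker: TCTAGAGGTACC # placeholder linker sequence
    FW:
      - path/to/forward.fastq.gz # placeholder path
    REV:
      - path/to/reverse.fastq.gz # placeholder path
    design_file: path/to/design.fa # placeholder path
    alignment_tool:
      tool: exact
      configs:
        sequence_length: 171
        alignment_start: 1
    configs:
      default: {}
```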

FW:

List of forward read files in gzipped fastq format. The full or relative path to the files should be used. The files must be listed in the same order in FW, BC, and REV.

REV:

List of reverse read files in gzipped fastq format. The full or relative path to the files should be used. The files must be listed in the same order in FW, BC, and REV.

BC:

List of index read files in gzipped fastq format. The full or relative path to the files should be used. The files must be listed in the same order in FW, BC, and REV.

NGmerge:

(Optional) Options for NGmerge, which is used to merge FW and REV reads. The following options are possible (we recommend using the default values):

min_overlap:: (Optional) Minimum overlap of the reads. Default 20.
frac_mismatches_allowed:: (Optional) Fraction of mismatches allowed in the overlap. Default 0.1.
min_dovetailed_overlap:: (Optional) Minimum dovetailed overlap. Default 50.

design_file:

Design file (full or relative path) in fasta format, containing the oligos. Each header should contain the oligo name and must be unique. The sequence is the sequence of the oligo and must also be unique. When multiple oligo names share the same sequence, please merge them into one fasta entry. The oligo name is later used to link barcodes to oligos. The sequence is used to map the reads to the oligos. Adapters can be part of the sequence, in which case alignment_start has to be adjusted.

design_check:

(Optional) Options for checking your design fasta file. The design file must not contain [ or ] and must not have duplicated headers. For best performance, sequences should not be identical.

fast:: (Optional) Using a simple dictionary to find identical sequences. This is faster but uses only the whole (or center part depending on start/length) of the design file. Cannot find substrings as part of any sequence. Set to false for more correct, but slower, search. Default true.
sequence_collitions:: (Optional) Check if there are identical sequences in the design file. Default true.

strand_sensitive:

(Optional) If enabled, the reads are mapped to the oligos in a strand-sensitive way by adding unique adapters to both ends of the oligo reference as well as the FASTQ files. MPRAsnakeflow is then able to distinguish between sense and antisense. By default this option is not enabled.

enable:: (Optional) If set to true the strand-sensitive mapping is enabled. Default is false.
forward_adapter:: (Optional) Adapter sequence added 5’ of the oligo. Default is AGGACCGGATCAACT.
reverse_adapter:: (Optional) Adapter sequence added 3’ of the oligo. Default is TCGGTTCACGCAATG.
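
A minimal sketch of enabling strand-sensitive mapping for an assignment is shown below. The adapter values are the schema defaults listed above and could be left out; the other assignment keys are omitted.

```yaml
assignments:
  exampleAssignment:
    # other assignment keys omitted
    strand_sensitive:
      enable: true
      forward_adapter: AGGACCGGATCAACT # schema default, added 5' of the oligo
      reverse_adapter: TCGGTTCACGCAATG # schema default, added 3' of the oligo
```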

configs:

After mapping the reads to the design file and extracting the barcodes per oligo, one or more named configurations can be defined. Each of them applies its own filtering settings and produces its own final oligo-to-barcode mapping. Use <your_config_name>: {} to use the default values for all keys. Each configuration is a dictionary with the following keys:

min_support:: Minimum number of observations of the same BC that map to the same oligo. A larger value gives more evidence that the assignment is correct, but can remove lots of BCs (depending on the complexity, sequencing depth and quality of sequencing). Recommended option is 3.
fraction:: Minimum fraction of observations of the same BC that map to the same oligo. E.g. 0.7 means that at least 70% of the reads of a BC map to the same oligo. A larger value gives more evidence that the assignment is correct, but can remove lots of BCs (depending on the complexity, sequencing depth and quality of sequencing). Recommended option is 0.7; the schema default is 0.75.
unknown_other:: (Optional) Shows unmapped BCs in the final output map. Not recommended for a mapping file used in the experiment workflow, but can be useful for debugging. Default is false.
ambigous:: (Optional) Shows ambiguous BCs in the final output map. Not recommended for a mapping file used in the experiment workflow, but can be useful for debugging. Default is false.
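
For illustration, the fragment below defines two filtering configurations for one assignment, one with the defaults and one stricter. The name strict and its values are hypothetical choices. Config names follow the same pattern as assignment names, so they must not contain _ or . characters.

```yaml
assignments:
  exampleAssignment:
    # other assignment keys omitted
    configs:
      default: {} # uses the schema defaults (min_support 3, fraction 0.75)
      strict: # hypothetical config name
        min_support: 5 # keep a BC only if seen at least 5 times on the same oligo
        fraction: 0.9 # and at least 90% of its reads map to that oligo
```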

Experiment workflow (including counts)

The experiment workflow is configured in the experiments section. Each experiment run contains one experiment file with all replicates of an experiment. The following settings are possible:

  experiments:
    description: MPRA experiments to run with configurations
    type: object
    patternProperties:
      description: name of the experiment
      ^([^_\.]+)$:
        type: object
        properties:
          bc_length:
            type: integer
            minimum: 1
          umi_length:
            type: integer
            minimum: 1
          adapter:
            type: string
            pattern: ^[ATCGNatcgn]+$
          data_folder:
            type: string
          experiment_file:
            type: string
          demultiplex:
            type: boolean
            default: false
          label_file:
            type: string
          assignments:
            type: object
            patternProperties:
              ^([^_\.]+)$:
                type: object
                properties:
                  type:
                    type: string
                    enum:
                      - file
                      - config
                  assignment_file:
                    type: string
                  assignment_name:
                    type: string
                  assignment_config:
                    type: string
                  sampling:
                    type: object
                    properties:
                      prop:
                        type: number
                        exclusiveMinimum: 0
                        maximum: 1
                      total:
                        type: integer
                        minimum: 1
                required:
                  - type
                allOf:
                  - if:
                      properties:
                        type:
                          const: config
                      required:
                        - type
                    then:
                      required:
                        - assignment_name
                        - assignment_config
                  - if:
                      properties:
                        type:
                          const: file
                      required:
                        - type
                    then:
                      required:
                        - assignment_file
          configs:
            type: object
            patternProperties:
              ^([^_\.]+)$:
                type: object
                properties:
                  filter:
                    type: object
                    default: {}
                    properties:
                      bc_threshold:
                        type: integer
                        minimum: 1
                        default: 10
                      outlier_detection:
                        type: object
                        properties:
                          method:
                            type: string
                            enum:
                              - rna_counts_zscore
                          times_zscore:
                            type: number
                            exclusiveMinimum: 0
                            default: 3
                        required:
                          - times_zscore
                        default: {}
                      min_dna_counts:
                        type: integer
                        minimum: 0
                        default: 1
                      min_rna_counts:
                        type: integer
                        minimum: 0
                        default: 1
                    required:
                      - bc_threshold
                      - min_rna_counts
                      - min_dna_counts
                  sampling:
                    type: object
                    patternProperties:
                      ^((DNA)|(RNA))$:
                        type: object
                        properties:
                          threshold:
                            type: integer
                            minimum: 1
                          prop:
                            type: number
                            exclusiveMinimum: 0
                            maximum: 1
                          total:
                            type: number
                            minimum: 1
                          seed:
                            type: integer
                required:
                  - filter
          variants:
            type: object
            properties:
              map:
                type: string
              min_barcodes:
                type: array
                items:
                  type: integer
                  minimum: 1
            required:
              - map
              - min_barcodes
        # entries that have to be in the config file for successful validation
        required:
          - bc_length
          - data_folder
          - experiment_file
          - demultiplex
          - assignments
          - configs

bc_length:

Length of the barcode. This is used to extract the barcode from the index read. The barcode is extracted from the first bc_length bases of the index read. When no reverse read is given and adapter is not set, the exact length is used to extract the DNA BC from the FW read.

umi_length:

(Optional) Length of the UMI. This is used to extract the UMI from the index read. The UMI is extracted from the last umi_length bases of the index read. Please provide if you use UMIs.

adapter:

(Optional) Adapter sequence in the FW read when no reverse read is given. This is used to trim the sequence and retrieve the BC using cutadapt.
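
As a rough sketch, an experiment that only has forward reads and uses adapter for trimming could be configured as follows. The experiment name, the paths, and the adapter sequence are placeholders chosen for illustration.

```yaml
experiments:
  fwOnlyExample: # hypothetical experiment name
    bc_length: 15
    adapter: TCTAGAGGTACC # placeholder adapter sequence in the FW read
    data_folder: path/to/data # placeholder path
    experiment_file: path/to/experiment.csv # placeholder path
    demultiplex: false
    assignments:
      fromFile:
        type: file
        assignment_file: path/to/barcodes.tsv.gz # placeholder path
    configs:
      default: {}
```

Leaving out adapter in this setup would mean that the first bc_length bases of the FW read are taken as the barcode, as described under bc_length.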

data_folder:

Folder where the fastq files are located. Files are defined in the experiment_file. The full or relative path to the folder should be used.

experiment_file:

Path to the experiment file. The full or relative path to the file should be used. The experiment file is a comma-separated file and is described in the Experiment file section.

demultiplex:

(Optional) If set to true, the reads are demultiplexed. This means that the reads are split into different files for each barcode. This is useful for further analysis. Default is false.

label_file:

(Optional) Path to the label file. The full or relative path to the file should be used. The label file is a tab-separated file that contains the oligo name and its label. The oligo name should be the same as in the design file. The label is used to group the oligos in the final output, e.g. for plotting.

insert1_name label1
insert2_name label1
insert3_name label2

assignments:

Per experiment, multiple assignments can be defined (naming them differently). Every assignment name contains the following configurations:

type:

Can be file or config. file means that you use a mapping file which is tab-separated and gzipped. It contains the barcode in the first column and the oligo name in the second column. This file can be generated by the Assignment workflow. config means that you are referring to an assignment that is specified in this config file.

assignment_file:

When using file, please insert the path to the assignment file (tsv.gz). When using config, this key is not needed; set assignment_name and assignment_config instead.

assignment_name:

When using config please insert the name of the assignment specified in the config file.

assignment_config:

When using config, please insert the name of the filtering config (defined under configs of the assignment given in assignment_name) that you want to use.

sampling:

(Optional) Options for randomly removing barcodes from the assignment. Only intended for debugging.

prop:: Sample down the BCs in the assignment file to this proportion.
total:: Sample down the BCs in the assignment file to this number.
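
The two assignment types can be sketched side by side as follows. The entry fromConfig is a hypothetical name; it refers to the exampleAssignment assignment and its default filtering config from the example file at the top of this page. The sampling block is optional and only shown for completeness.

```yaml
experiments:
  exampleCount:
    # other experiment keys omitted
    assignments:
      fromFile:
        type: file
        assignment_file: resources/count_basic/SRR10800986_barcodes_to_coords.tsv.gz
      fromConfig: # hypothetical name
        type: config
        assignment_name: exampleAssignment # key under the assignments section
        assignment_config: default # key under configs of that assignment
        sampling: # optional, for debugging only
          prop: 0.5 # keep roughly half of the barcodes
```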

configs:

Each experiment run can have multiple configurations including filter and sampling options.

filter:

(Optional) Filter options. The following options are available:

bc_threshold:

Minimum number of different BCs required per oligo. A higher value normally increases the correlation between replicates but also reduces the number of final oligos. Default option is 10.

min_dna_counts:

Minimum number of DNA counts per barcode. When set to 0, a pseudo count is added. Default option is 1.

min_rna_counts:

Minimum number of RNA counts per barcode. When set to 0, a pseudo count is added. Default option is 1.

outlier_detection:

(Optional) Outlier detection. Methods and strategies to remove outlier barcodes in the final counts. The following options are possible:

method:: Method to remove outliers. Currently rna_counts_zscore, ratio_mad or none (no outlier detection) are supported. Default option is rna_counts_zscore.
mad_bins:: (Optional) For method ratio_mad: Number of bins for the median absolute deviation (MAD) method. Default option is 20.
times_mad:: (Optional) For method ratio_mad: Times the MAD to remove outliers. Default option is 5.
times_zscore:: (Optional) For method rna_counts_zscore: Times the zscore to remove outliers. Default option is 3.
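
As an illustration of the filter options, the fragment below defines a default configuration and a stricter one. The name strictFilter and its bc_threshold are hypothetical choices; the remaining values repeat the schema defaults. Only the rna_counts_zscore method is used here because it is the one listed in the schema above.

```yaml
experiments:
  exampleCount:
    # other experiment keys omitted
    configs:
      default: {} # schema defaults for all filter options
      strictFilter: # hypothetical config name
        filter:
          bc_threshold: 20 # require at least 20 different BCs per oligo
          min_dna_counts: 1
          min_rna_counts: 1
          outlier_detection:
            method: rna_counts_zscore
            times_zscore: 3
```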

sampling:

(Optional) Options for sampling counts and barcodes. Just for debug reasons.

DNA:

Settings for sampling DNA counts.

threshold:: Maximum threshold for DNA counts assigned to a BC.
prop:: Sample down the DNA counts to this proportion.
total:: Sample down the DNA counts to this number.
seed:: Seed for the random DNA sampling.

RNA:

Settings for sampling RNA counts.

threshold:: Maximum threshold for RNA counts assigned to a BC.
prop:: Sample down the RNA counts to this proportion.
total:: Sample down the RNA counts to this number.
seed:: Seed for the random RNA sampling.
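
A debugging configuration that downsamples counts might look like the sketch below. The config name and all numbers are arbitrary illustrative values. The keys DNA and RNA must be spelled exactly like this to match the schema pattern.

```yaml
experiments:
  exampleCount:
    # other experiment keys omitted
    configs:
      downsampled: # hypothetical config name
        filter: {} # schema defaults
        sampling:
          DNA:
            prop: 0.5 # keep half of the DNA counts
            seed: 42
          RNA:
            total: 1000000 # sample RNA counts down to one million
            threshold: 100 # maximum RNA counts assigned to a BC
            seed: 42
```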

Experiment file

Here we have 4 different options:

Forward, reverse, and UMI read

Experiment file has a header with Condition, Replicate, DNA_BC_F, DNA_UMI, DNA_BC_R, RNA_BC_F, RNA_UMI, and RNA_BC_R. Condition together with Replicate has to form a unique name. Neither field may contain _ or . characters. Multiple file names are allowed, separated by ;. An example experiment file can be found here: resources/example_experiment.csv.

Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz

Forward and reverse read

Experiment file has a header with Condition, Replicate, DNA_BC_F, DNA_BC_R, RNA_BC_F, and RNA_BC_R. Condition together with Replicate has to form a unique name. Neither field may contain _ or . characters. Multiple file names are allowed, separated by ;.

Only forward read

Experiment file has a header with Condition, Replicate, DNA_BC_F, and RNA_BC_F. Condition together with Replicate has to form a unique name. Neither field may contain _ or . characters. Multiple file names are allowed, separated by ;.

Forward, reverse, and UMI read using demultiplex option

Experiment file has a header with Condition, Replicate, BC_DNA, BC_RNA, BC_F, BC_R, UMI, and INDEX. Condition together with Replicate has to form a unique name. Neither field may contain _ or . characters. Multiple file names are allowed, separated by ;.