.. _Plasmid example: .. role:: bash(code) :language: bash ======================================== ENCODE data (Gosai et al. Plasmid based) ======================================== This example runs the assignment and experiment workflow on data from Ryan Tewhey's lab, available via the ENCODE portal. We use the A549 experiment data published in the article `Gosai SJ, Castro RI, Fuentes N et al. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature. 2024. `_. The main differences from other examples are: - For the assignment part, the barcode is attached to the forward read, so we must split them first based on a linker or, in this case, a specific sequence length after the barcode. - For the experiment/count workflow, as is often done in plasmid-based assays, only one DNA sequencing is performed before electroporation. We need to define the DNA count sequencing fastq file for each replicate. MPRAsnakeflow will recognize that these files are identical, count DNA only once, and use the result for all replicates. Prerequisites ============= This example depends on the following data and software: Installation of MPRAsnakeflow ----------------------------- Please install conda, the MPRAsnakeflow environment and clone the actual MPRAsnakeflow master branch. You will find more help under :ref:`Installation`. Design file ------------- We need the design file. Unfortunately the headers are not compatible with our pipeline. Therefore, we will fix them first. .. code-block:: bash mkdir -p ENCFF074MMO/data/fasta cd ENCFF074MMO/data/fasta wget https://www.encodeproject.org/files/ENCFF074MMO/@@download/ENCFF074MMO.fasta.gz zcat ENCFF074MMO.fasta.gz | sed 's/[\(\)]//g' > ENCFF074MMO.fix.fa cd ../../ Reads association data ----------------------- There are four sets of association sequencing for this data, which contains a barcode and oligo-forward (FWD read) and oligo-reverse (REV read) sequence. These data must be downloaded. All data is publicly available on ENCODE. .. note:: You need 23 GB disk space to the download association data! .. code-block:: bash mkdir -p data/oligo_barcode cd data/oligo_barcode wget https://www.encodeproject.org/files/ENCFF215NWC/@@download/ENCFF215NWC.fastq.gz wget https://www.encodeproject.org/files/ENCFF235LUE/@@download/ENCFF235LUE.fastq.gz wget https://www.encodeproject.org/files/ENCFF317FNS/@@download/ENCFF317FNS.fastq.gz wget https://www.encodeproject.org/files/ENCFF381TUK/@@download/ENCFF381TUK.fastq.gz wget https://www.encodeproject.org/files/ENCFF420UFL/@@download/ENCFF420UFL.fastq.gz wget https://www.encodeproject.org/files/ENCFF504ZPY/@@download/ENCFF504ZPY.fastq.gz wget https://www.encodeproject.org/files/ENCFF708MPZ/@@download/ENCFF708MPZ.fastq.gz wget https://www.encodeproject.org/files/ENCFF847HXK/@@download/ENCFF847HXK.fastq.gz cd ../../ .. note:: Please be sure that all files are downloaded completely without errors! Please check the md5sum of files (see the ENCODE portal). With .. code-block:: bash tree data The folder should look like this: .. code-block:: text data ├── fasta │ ├── ENCFF074MMO.fasta.gz │ └── ENCFF074MMO.fix.fa └── oligo_barcode ├── ENCFF215NWC.fastq.gz ├── ENCFF235LUE.fastq.gz ├── ENCFF317FNS.fastq.gz ├── ENCFF381TUK.fastq.gz ├── ENCFF420UFL.fastq.gz ├── ENCFF504ZPY.fastq.gz ├── ENCFF708MPZ.fastq.gz └── ENCFF847HXK.fastq.gz Reads count data ----------------- Here we have ten sequencing runs for the DNA count data (before electroporation). And then we have five sequencing runs for RNA. For each replicate one run. So five replicates in total. .. note:: You need 14 GB disk space to the download the count data! .. code-block:: bash mkdir -p data/rna_dna_counts cd data/rna_dna_counts wget https://www.encodeproject.org/files/ENCFF019RUN/@@download/ENCFF019RUN.fastq.gz wget https://www.encodeproject.org/files/ENCFF168OJL/@@download/ENCFF168OJL.fastq.gz wget https://www.encodeproject.org/files/ENCFF448RQK/@@download/ENCFF448RQK.fastq.gz wget https://www.encodeproject.org/files/ENCFF491IXU/@@download/ENCFF491IXU.fastq.gz wget https://www.encodeproject.org/files/ENCFF564JPU/@@download/ENCFF564JPU.fastq.gz wget https://www.encodeproject.org/files/ENCFF696HJK/@@download/ENCFF696HJK.fastq.gz wget https://www.encodeproject.org/files/ENCFF850RIY/@@download/ENCFF850RIY.fastq.gz wget https://www.encodeproject.org/files/ENCFF891CIZ/@@download/ENCFF891CIZ.fastq.gz wget https://www.encodeproject.org/files/ENCFF944CEQ/@@download/ENCFF944CEQ.fastq.gz wget https://www.encodeproject.org/files/ENCFF966RRE/@@download/ENCFF966RRE.fastq.gz wget https://www.encodeproject.org/files/ENCFF061UCM/@@download/ENCFF061UCM.fastq.gz wget https://www.encodeproject.org/files/ENCFF311KDS/@@download/ENCFF311KDS.fastq.gz wget https://www.encodeproject.org/files/ENCFF371LCK/@@download/ENCFF371LCK.fastq.gz wget https://www.encodeproject.org/files/ENCFF501AHK/@@download/ENCFF501AHK.fastq.gz wget https://www.encodeproject.org/files/ENCFF554OMB/@@download/ENCFF554OMB.fastq.gz cd ../../ .. note:: Please be sure that all files are downloaded completely without errors! Please check the md5sum of files (see the ENCODE portal). With .. code-block:: bash tree data The folder should look like this: .. code-block:: text data ├── fasta │ └── ENCFF074MMO.fasta.gz ├── oligo_barcode │ ├── ENCFF215NWC.fastq.gz │ ├── ENCFF235LUE.fastq.gz │ ├── ENCFF317FNS.fastq.gz │ ├── ENCFF381TUK.fastq.gz │ ├── ENCFF420UFL.fastq.gz │ ├── ENCFF504ZPY.fastq.gz │ ├── ENCFF708MPZ.fastq.gz │ └── ENCFF847HXK.fastq.gz └── rna_dna_counts ├── ENCFF019RUN.fastq.gz ├── ENCFF061UCM.fastq.gz ├── ENCFF168OJL.fastq.gz ├── ENCFF311KDS.fastq.gz ├── ENCFF371LCK.fastq.gz ├── ENCFF448RQK.fastq.gz ├── ENCFF491IXU.fastq.gz ├── ENCFF501AHK.fastq.gz ├── ENCFF554OMB.fastq.gz ├── ENCFF564JPU.fastq.gz ├── ENCFF696HJK.fastq.gz ├── ENCFF850RIY.fastq.gz ├── ENCFF891CIZ.fastq.gz ├── ENCFF944CEQ.fastq.gz └── ENCFF966RRE.fastq.gz MPRAsnakeflow ================================= We will run assignment and count workflow together. But it is of course possible to run them separately using different config files. Then you have to use the assignment `fromFile` not `fromConfig`. But first we need to define the config file and the experiment CSV file to map DNA/RNA counts to the correct replicate. Create config files ------------------- .. code-block:: bash cat << 'EOF' > config.yaml --- version: "0.6.0" assignments: ENCFF074MMOAssignment: bc_length: 20 BC_rev_comp: false linker: TCTAGAGGTTCGTCGACGCGATCGCAGGAGCCGCAGTG adapters: REV: five_prime: - AACTGGCCGCTTGACG alignment_tool: split_number: 30 tool: bbmap configs: sequence_length: 200 alignment_start: 1 FWD: - data/oligo_barcode/ENCFF235LUE.fastq.gz - data/oligo_barcode/ENCFF708MPZ.fastq.gz - data/oligo_barcode/ENCFF381TUK.fastq.gz - data/oligo_barcode/ENCFF847HXK.fastq.gz REV: - data/oligo_barcode/ENCFF215NWC.fastq.gz - data/oligo_barcode/ENCFF317FNS.fastq.gz - data/oligo_barcode/ENCFF504ZPY.fastq.gz - data/oligo_barcode/ENCFF420UFL.fastq.gz design_file: data/fasta/ENCFF074MMO.fix.fa configs: default: {} experiments: ENCFF074MMOExperiment: bc_length: 20 data_folder: data/rna_dna_counts experiment_file: experiment.csv assignments: ENCFF074MMOAssignment: type: config assignment_name: ENCFF074MMOAssignment assignment_config: default configs: default: {} EOF And the :code:`experiment.csv` file to map the DNA/RNA counts to the correct replicate. The experiment file is a simple CSV file with the following content: .. code-block:: bash cat << 'EOF' > experiment.csv Condition,Replicate,DNA_BC_F,RNA_BC_F A549,1,ENCFF448RQK.fastq.gz;ENCFF019RUN.fastq.gz;ENCFF850RIY.fastq.gz;ENCFF966RRE.fastq.gz;ENCFF168OJL.fastq.gz;ENCFF891CIZ.fastq.gz;ENCFF696HJK.fastq.gz;ENCFF491IXU.fastq.gz;ENCFF944CEQ.fastq.gz;ENCFF564JPU.fastq.gz,ENCFF311KDS.fastq.gz A549,2,ENCFF448RQK.fastq.gz;ENCFF019RUN.fastq.gz;ENCFF850RIY.fastq.gz;ENCFF966RRE.fastq.gz;ENCFF168OJL.fastq.gz;ENCFF891CIZ.fastq.gz;ENCFF696HJK.fastq.gz;ENCFF491IXU.fastq.gz;ENCFF944CEQ.fastq.gz;ENCFF564JPU.fastq.gz,ENCFF554OMB.fastq.gz A549,3,ENCFF448RQK.fastq.gz;ENCFF019RUN.fastq.gz;ENCFF850RIY.fastq.gz;ENCFF966RRE.fastq.gz;ENCFF168OJL.fastq.gz;ENCFF891CIZ.fastq.gz;ENCFF696HJK.fastq.gz;ENCFF491IXU.fastq.gz;ENCFF944CEQ.fastq.gz;ENCFF564JPU.fastq.gz,ENCFF501AHK.fastq.gz A549,4,ENCFF448RQK.fastq.gz;ENCFF019RUN.fastq.gz;ENCFF850RIY.fastq.gz;ENCFF966RRE.fastq.gz;ENCFF168OJL.fastq.gz;ENCFF891CIZ.fastq.gz;ENCFF696HJK.fastq.gz;ENCFF491IXU.fastq.gz;ENCFF944CEQ.fastq.gz;ENCFF564JPU.fastq.gz,ENCFF371LCK.fastq.gz A549,5,ENCFF448RQK.fastq.gz;ENCFF019RUN.fastq.gz;ENCFF850RIY.fastq.gz;ENCFF966RRE.fastq.gz;ENCFF168OJL.fastq.gz;ENCFF891CIZ.fastq.gz;ENCFF696HJK.fastq.gz;ENCFF491IXU.fastq.gz;ENCFF944CEQ.fastq.gz;ENCFF564JPU.fastq.gz,ENCFF061UCM.fastq.gz EOF Run snakemake ------------- Now we are ready to run MPRAsnakeflow. We will do it on one node with 50GB memory and 30 cores. We will run the pipeline directly in the :code:`ENCFF074MMO` folder. The MPRAsnakeflow workflow can be in a different directory. Let's assume :code:`/home/user/MPRAsnakeflow`. First we do a try run using snakemake :code:`-n` option. The MPRAsnakeflow command is: .. code-block:: bash conda activate mprasnakeflow snakemake -c 1 --sdm apptainer conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -n --quiet rules You should see a list of rules that will be executed. Here is the summary: .. code-block:: text Job stats: job count ----------------------------------------------------------------------- ------- all 1 assignment_attach_idx 60 assignment_check_design 1 assignment_collect 1 assignment_collectBCs 1 assignment_fastq_split 3 assignment_filter 1 assignment_flagstat 1 assignment_hybridFWDRead_get_reads_by_cutadapt 1 assignment_idx_bam 1 assignment_mapping_bbmap 30 assignment_mapping_bbmap_getBCs 30 assignment_merge_NGmerge 30 assignment_preprocessing_adapter_remove 1 assignment_statistic_assignedCounts 1 assignment_statistic_assignment 1 assignment_statistic_quality_metric 1 assignment_statistic_totalCounts 1 experiment_assigned_counts_assignBarcodes 10 experiment_assigned_counts_combine_replicates 2 experiment_assigned_counts_combine_replicates_barcode_output 1 experiment_assigned_counts_copy_final_all_files 1 experiment_assigned_counts_copy_final_thresh_files 1 experiment_assigned_counts_dna_rna_merge 5 experiment_assigned_counts_filterAssignment 1 experiment_assigned_counts_make_master_tables 1 experiment_counts_dna_rna_merge_counts 10 experiment_counts_filter_counts 6 experiment_counts_final_counts 6 experiment_counts_onlyFWD_raw_counts 6 experiment_statistic_assigned_counts_combine_BC_assignment_stats 1 experiment_statistic_assigned_counts_combine_BC_assignment_stats_helper 1 experiment_statistic_assigned_counts_combine_stats_dna_rna_merge 1 experiment_statistic_assigned_counts_combine_stats_dna_rna_merge_all 1 experiment_statistic_bc_overlap_combine_assigned_counts 1 experiment_statistic_bc_overlap_combine_counts 1 experiment_statistic_bc_overlap_run 4 experiment_statistic_correlation_bc_counts 2 experiment_statistic_correlation_bc_counts_hist 2 experiment_statistic_correlation_calculate 1 experiment_statistic_correlation_combine_bc_assigned 1 experiment_statistic_correlation_combine_bc_raw 1 experiment_statistic_correlation_combine_oligo 1 experiment_statistic_correlation_hist_box_plots 1 experiment_statistic_counts_BC_in_RNA_DNA 10 experiment_statistic_counts_BC_in_RNA_DNA_merge 2 experiment_statistic_counts_barcode_base_composition 6 experiment_statistic_counts_final 2 experiment_statistic_counts_frequent_umis 6 experiment_statistic_counts_stats_merge 2 experiment_statistic_counts_table 12 experiment_statistic_quality_metric 1 qc_report_assoc 1 qc_report_count 1 total 278 When dry-run does not give any errors we will run the workflow. We use a machine with 30 threads/cores to run the workflow and 60GB memory. Therefore :code:`split_number` is set to 30 to parallelize the workflow. Also we are using 10 threads for mapping (bbmap). But snakemake takes care that no more than 30 threads are used. .. code-block:: bash snakemake -c 30 --sdm conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile config.yaml -q --set-threads assignment_mapping_bbmap=10 --resources mem_mb=60000 .. note:: Please modify your code when running in a cluster environment. We have an example SLURM profile within the MPRAsnakeflow repository under :code:`profiles/default/config.yaml`. You can use it in snakemake with :code:`--workflow-profile $PIPELINE/profiles/default`. But adapt your before the :code:`slurm_partition` Results ----------------- For the assignment all output files will be in the :code:`results/assignment/ENCFF074MMOAssignment` folder. The final assignment is in :code:`results/assignment/ENCFF074MMOAssignment/assignment_barcodes.default.tsv.gz`. Also you should have a look at the qc report: :code:`results/assignment/ENCFF074MMOAssignment/qc_report.default.html`. You can find an example qc report here: `Example assignment QC report `_. For the experiment all output files will be in the :code:`results/experiments/ENCFF074MMOExperiment` folder. The final count files is :code:`results/experiments/ENCFF074MMOExperiment/reporter_experiment.barcode.A549.ENCFF074MMOAssignment.default.all.tsv.gz` for the barcode file and :code:`results/experiments/ENCFF074MMOExperiment/reporter_experiment.oligo.A549.ENCFF074MMOAssignment.default.all.tsv.gz` for the aggregated oligo files. Also you should have a look at the qc report: :code:`results/experiments/ENCFF074MMOExperiment/qc_report.A549.ENCFF074MMOAssignment.default.html`. You can find an example qc report here: `Example experiment QC report `_.