.. _Count example: .. role:: bash(code) :language: bash ========================= Basic Experiment Workflow ========================= This example runs the count workflow on 5'/5' WT MPRA data in the HepG2 cell line from `Klein J., Agarwal, V., Keith, A., et al. 2019 `_. Prerequisites ============= This example depends on the following data and software: Installing MPRAsnakeflow ------------------------ Please install conda, the MPRAsnakeflow environment, and clone the actual ``MPRAsnakeflow`` master branch. You will find more help under :ref:`Installation`. Producing an Association (.tsv.gz) File --------------------------------------- This workflow requires a Python dictionary of candidate regulatory sequences (CRS) mapped to their barcodes in a tab-separated (.tsv.gz) format. For this example, the file can be generated using :ref:`Assignment example` or found in the :code:`resources/count_basic` folder in `MPRAsnakeflow `_ (file: :code:`SRR10800986_barcodes_to_coords.tsv.gz`). Alternatively, if the association file is in pickle (.pickle) format because you used MPRAflow, you can convert it to .tsv.gz format using the in-built function in MPRAsnakeflow with the following code: .. code-block:: bash conda activate mprasnakeflow python assignment_pickle_to_tsv.py --input .pickle | sort | uniq | gzip -c > .tsv.gz Design (.fa) File ----------------- The design file can be generated using the :ref:`Assignment example` or downloaded from the :code:`resources/count_basic` folder in `MPRAsnakeflow `_. Reads ----- There is one condition (HepG2) with three technical replicates. Each replicate contains a forward (barcode-forward), reverse (barcode-reverse), and index (unique molecular identifier) read for DNA and RNA. These data must be downloaded. All data is publicly available on the Short Read Archive (SRA). We will use SRA-toolkit to obtain the data. .. note:: You need 9 GB of disk space to download the data and upwards of 50 GB to process it! .. code-block:: bash conda install sra-tools mkdir -p count_basic/data cd count_basic/data fastq-dump --gzip --split-files SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886 cd .. For large files and unstable internet connections, we recommend the command ``prefetch`` from SRA tools before running ``fastq-dump``. This command provides better warnings when something goes wrong. .. code-block:: bash conda install sra-tools cd count_basic/data prefetch SRR10800881 SRR10800882 SRR10800883 SRR10800884 SRR10800885 SRR10800886 fastq-dump --gzip --split-files SRR10800986 cd .. .. note:: Please ensure that all files are downloaded completely without errors! Depending on your internet connection, this can take a while. If you just want some data to run MPRAsnakeflow, you can limit yourself to one condition and/or just one replicate. The data folder view can be seen with the following command: .. code-block:: bash tree data The folder should look like this: .. code-block:: text data ├── design.fa ├── SRR10800881_1.fastq.gz ├── SRR10800881_2.fastq.gz ├── SRR10800881_3.fastq.gz ├── SRR10800882_1.fastq.gz ├── SRR10800882_2.fastq.gz ├── SRR10800882_3.fastq.gz ├── SRR10800883_1.fastq.gz ├── SRR10800883_2.fastq.gz ├── SRR10800883_3.fastq.gz ├── SRR10800884_1.fastq.gz ├── SRR10800884_2.fastq.gz ├── SRR10800884_3.fastq.gz ├── SRR10800885_1.fastq.gz ├── SRR10800885_2.fastq.gz ├── SRR10800885_3.fastq.gz ├── SRR10800886_1.fastq.gz ├── SRR10800886_2.fastq.gz ├── SRR10800886_3.fastq.gz └── SRR10800986_barcodes_to_coords.tsv.gz Here is an overview of the files: .. csv-table:: HepG2 Data :header: "Condition", "GEO Accession", "SRA Accession", "SRA Runs" :widths: 40, 10, 10, 20 "HepG2-DNA-1: HepG2 DNA replicate 1", GSM4237863, SRX7474781, "SRR10800881" "HepG2-RNA-1: HepG2 RNA replicate 1", GSM4237864, SRX7474782, "SRR10800882" "HepG2-DNA-2: HepG2 DNA replicate 2", GSM4237865, SRX7474783, "SRR10800883" "HepG2-RNA-2: HepG2 RNA replicate 2", GSM4237866, SRX7474784, "SRR10800884" "HepG2-DNA-3: HepG2 DNA replicate 3", GSM4237867, SRX7474785, "SRR10800885" "HepG2-RNA-3: HepG2 RNA replicate 3", GSM4237868, SRX7474786, "SRR10800886" Run MPRAsnakeflow ================= Now we are ready to start MPRAsnakeflow and count the number of barcodes. First, we need to generate an experiment file (`experiment.csv`) to specify the conditions, replicates, and corresponding reads. Creating experiment.csv ----------------------- Our experiment file looks like this: .. code-block:: text Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R HepG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz HepG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz HepG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz Save it into the :code:`count_basic/data` folder as :code:`experiment.csv`. MPRAsnakeflow ============= Now we have everything ready to run the count workflow of MPRAsnakeflow. Run the pipeline directly in the :code:`count_basic` folder. The MPRAsnakeflow workflow can be in a different directory. Let's assume :code:`/home/user/MPRAsnakeflow`. First, configure the config file and save it to the :code:`count_basic` folder. The config file is a simple text file with the following content: .. literalinclude:: ../../resources/count_basic/config.yml :language: yaml Perform a dry-run using Snakemake's :code:`-n` option: .. code-block:: bash cd count_basic conda activate mprasnakeflow snakemake -c 1 --use-conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile /home/user/MPRAsnakeflow/resources/count_basic/config.yml -n -q If the dry-run does not give any errors, run the workflow using 30 threads: .. code-block:: bash snakemake -c 30 --use-conda --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --configfile /home/user/MPRAsnakeflow/resources/count_basic/config.yml .. note:: Please modify your code when running in a cluster environment. We have an example SLURM config file here: :code:`config/sbatch.yml`. Results ------- All output files will be in the :code:`results/experiments/countBasic` folder. To generate a final report, use the following command: .. code-block:: bash snakemake --config config.yml --snakefile /home/user/MPRAsnakeflow/workflow/Snakefile --report report.html This HTML report contains information about the Snakemake run and integrates statistics tables and plots. Total file tree of the results folder: .. code-block:: text results └── experiments └── countBasic ├── assigned_counts │   └── fromWorkflow │   ├── default │   │   ├── HEPG2_1_merged_assigned_counts.tsv.gz │   │   ├── HEPG2_2_merged_assigned_counts.tsv.gz │   │   ├── HEPG2_3_merged_assigned_counts.tsv.gz │   │   ├── HEPG2_allreps_merged.tsv.gz │   │   └── HEPG2_allreps_minThreshold_merged.tsv.gz │   ├── HEPG2_1_DNA_final_counts.config.default.tsv.gz │   ├── HEPG2_1.merged.config.default.tsv.gz │   ├── HEPG2_1_RNA_final_counts.config.default.tsv.gz │   ├── HEPG2_2_DNA_final_counts.config.default.tsv.gz │   ├── HEPG2_2.merged.config.default.tsv.gz │   ├── HEPG2_2_RNA_final_counts.config.default.tsv.gz │   ├── HEPG2_3_DNA_final_counts.config.default.tsv.gz │   ├── HEPG2_3.merged.config.default.tsv.gz │   └── HEPG2_3_RNA_final_counts.config.default.tsv.gz ├── assignment │   └── fromWorkflow.tsv.gz ├── counts │   ├── HEPG2_1_DNA.bam │   ├── HEPG2_1_DNA_filtered_counts.tsv.gz │   ├── HEPG2_1_DNA_final_counts.tsv.gz │   ├── HEPG2_1_DNA_raw_counts.tsv.gz │   ├── HEPG2_1.merged.config.default.tsv.gz │   ├── HEPG2_1_RNA.bam │   ├── HEPG2_1_RNA_filtered_counts.tsv.gz │   ├── HEPG2_1_RNA_final_counts.tsv.gz │   ├── HEPG2_1_RNA_raw_counts.tsv.gz │   ├── HEPG2_2_DNA.bam │   ├── HEPG2_2_DNA_filtered_counts.tsv.gz │   ├── HEPG2_2_DNA_final_counts.tsv.gz │   ├── HEPG2_2_DNA_raw_counts.tsv.gz │   ├── HEPG2_2.merged.config.default.tsv.gz │   ├── HEPG2_2_RNA.bam │   ├── HEPG2_2_RNA_filtered_counts.tsv.gz │   ├── HEPG2_2_RNA_final_counts.tsv.gz │   ├── HEPG2_2_RNA_raw_counts.tsv.gz │   ├── HEPG2_3_DNA.bam │   ├── HEPG2_3_DNA_filtered_counts.tsv.gz │   ├── HEPG2_3_DNA_final_counts.tsv.gz │   ├── HEPG2_3_DNA_raw_counts.tsv.gz │   ├── HEPG2_3.merged.config.default.tsv.gz │   ├── HEPG2_3_RNA.bam │   ├── HEPG2_3_RNA_filtered_counts.tsv.gz │   ├── HEPG2_3_RNA_final_counts.tsv.gz │   └── HEPG2_3_RNA_raw_counts.tsv.gz └── statistic ├── assigned_counts │   └── fromWorkflow │   ├── default │   │   ├── combined │   │   │   └── HEPG2_merged_assigned_counts.statistic.tsv.gz │   │   ├── HEPG2_1_merged_assigned_counts.statistic.tsv.gz │   │   ├── HEPG2_2_merged_assigned_counts.statistic.tsv.gz │   │   ├── HEPG2_3_merged_assigned_counts.statistic.tsv.gz │   │   ├── HEPG2_average_allreps_merged.tsv.gz │   │   ├── HEPG2_barcodesPerInsert.png │   │   ├── HEPG2_correlation_minThreshold.tsv │   │   ├── HEPG2_correlation.tsv │   │   ├── HEPG2_DNA_pairwise_minThreshold.png │   │   ├── HEPG2_DNA_pairwise.png │   │   ├── HEPG2_group_barcodesPerInsert_box_minThreshold.png │   │   ├── HEPG2_group_barcodesPerInsert_box.png │   │   ├── HEPG2_Ratio_pairwise_minThreshold.png │   │   ├── HEPG2_Ratio_pairwise.png │   │   ├── HEPG2_RNA_pairwise_minThreshold.png │   │   └── HEPG2_RNA_pairwise.png │   ├── HEPG2_1_DNA_default.statistic.tsv.gz │   ├── HEPG2_1_RNA_default.statistic.tsv.gz │   ├── HEPG2_2_DNA_default.statistic.tsv.gz │   ├── HEPG2_2_RNA_default.statistic.tsv.gz │   ├── HEPG2_3_DNA_default.statistic.tsv.gz │   └── HEPG2_3_RNA_default.statistic.tsv.gz ├── barcode │   ├── assigned_counts │   │   └── fromWorkflow │   │   ├── HEPG2_default_barcode_correlation.tsv │   │   ├── HEPG2_default_barcode_DNA_pairwise.png │   │   ├── HEPG2_default_barcode_Ratio_pairwise.png │   │   ├── HEPG2_default_barcode_RNA_pairwise.png │   │   ├── HEPG2_default_DNA_perBarcode.png │   │   └── HEPG2_default_RNA_perBarcode.png │   └── counts │   ├── HEPG2_default_barcode_correlation.tsv │   ├── HEPG2_default_barcode_DNA_pairwise.png │   ├── HEPG2_default_barcode_Ratio_pairwise.png │   ├── HEPG2_default_barcode_RNA_pairwise.png │   ├── HEPG2_default_DNA_perBarcode.png │   └── HEPG2_default_RNA_perBarcode.png ├── bc_overlap.assigned_counts.default.fromWorkflow.tsv ├── bc_overlap.counts.default.tsv ├── counts │   ├── BCNucleotideComposition.HEPG2_1_DNA.tsv.gz │   ├── BCNucleotideComposition.HEPG2_1_RNA.tsv.gz │   ├── BCNucleotideComposition.HEPG2_2_DNA.tsv.gz │   ├── BCNucleotideComposition.HEPG2_2_RNA.tsv.gz │   ├── BCNucleotideComposition.HEPG2_3_DNA.tsv.gz │   └── BCNucleotideComposition.HEPG2_3_RNA.tsv.gz ├── counts.filtered.tsv ├── counts.freqUMIs.HEPG2_1_DNA.txt ├── counts.freqUMIs.HEPG2_1_RNA.txt ├── counts.freqUMIs.HEPG2_2_DNA.txt ├── counts.freqUMIs.HEPG2_2_RNA.txt ├── counts.freqUMIs.HEPG2_3_DNA.txt ├── counts.freqUMIs.HEPG2_3_RNA.txt ├── counts.raw.tsv ├── statistic_assigned_bc_correlation_merged_fromWorkflow_default.tsv ├── statistic_assigned_counts_merged_fromWorkflow_default.tsv ├── statistic_assigned_counts_single_fromWorkflow_default.tsv ├── statistic_bc_correlation_merged_default.tsv └── statistic_oligo_correlation_merged_fromWorkflow_default.tsv