From c93fc143e1d38d9453e120d675a8ce56a3bbb70c Mon Sep 17 00:00:00 2001
From: EPPIcenter <eppicenter@pop-os.localdomain>
Date: Tue, 5 Sep 2023 17:30:34 -0700
Subject: [PATCH] add pseudocigar documentation

---
 docs/modules/analysis/index.md              |  2 +-
 docs/modules/analysis/resistance-markers.md | 63 +++++++++++----
 docs/modules/core-pipeline/index.md         |  2 +-
 docs/pipeline-outputs.md                    | 88 +++++++++++++++++++++
 4 files changed, 137 insertions(+), 18 deletions(-)
 create mode 100644 docs/pipeline-outputs.md

diff --git a/docs/modules/analysis/index.md b/docs/modules/analysis/index.md
index 4ccbbde..c934a57 100644
--- a/docs/modules/analysis/index.md
+++ b/docs/modules/analysis/index.md
@@ -1,7 +1,7 @@
 ---
 layout: default
 title: Analysis Modules
-nav_order: 6
+nav_order: 7
 has_children: true
 toc: true
 ---
diff --git a/docs/modules/analysis/resistance-markers.md b/docs/modules/analysis/resistance-markers.md
index 2f14037..b66cc3b 100644
--- a/docs/modules/analysis/resistance-markers.md
+++ b/docs/modules/analysis/resistance-markers.md
@@ -7,30 +7,61 @@ parent: Analysis Modules
 
 # Resistance markers
 
-This module will identify variants that are known to provide antibiotic resistance to *Plasmodium falciparum* by using the previously generated `allele_data.txt` and `Mapping/*mpileup.txt` files, along with a codon table (`codontable.txt`) and resistance marker genomic coordinates (`resistance_markers_amplicon_v4.txt`).
+This module identifies variants known to confer antimalarial drug resistance in *Plasmodium falciparum* by using the **PseudoCIGAR** string found in `allele_data.txt`. It identifies mutations found within the genomic coordinates of interest (`resistance_markers_amplicon_v4.txt`), and also reports any other indels or SNPs that occurred.
 
 ## File Outputs
 
-A table with resistance markers as well as a table with natural haplotypes (resistance markers found in the same amplicon), both including read counts inside brackets.
+This module outputs three files:
 
-Whenever multiple variants for a given resistance marker are found in the same sample, they are separated by `_` and the reference variant is presented first followed by the alternate variants. This also applies to the read counts. Cells remain blank if no resistance marker was found.
+### resmarker_table.txt
 
-Nomenclature of resistance markers can be broken down into name of the gene and amino acid position of the marker. For instance, for `dhfr_16`, `dhfr` is the gene and `16` is the amino acid position.
+This file contains all codons that are found in the ASV as specified by the genomic coordinates in the provided resistance marker table. The file summarizes what the 3-base sequence was in the sample, what was expected, and whether there was a synonymous or non-synonymous amino acid change.
 
-### resmarkers_summary.txt
+|Column|Description|
+|:--:|:--:|
+|SampleID|The sample being reported|
+|GeneID|A numeric identifier for the *P. falciparum* gene and gene position|
+|Gene|The name of the gene|
+|CodonID|The codon number|
+|RefCodon|The codon in the reference|
+|Codon|The codon in the ASV|
+|CodonStart|The codon start position|
+|CodonRefAlt|Can be 'REF' or 'ALT', depending on whether the ASV codon matches the reference codon ('ALT' if they do not match)|
+|RefAA|The amino acid coded by the `RefCodon`|
+|AA|The amino acid coded by the `Codon`|
+|AARefAlt|Can be 'REF' or 'ALT', depending on whether the ASV amino acid matches the reference amino acid ('ALT' if they do not match)|
+|Reads|The number of reads that contain this `Codon`|
 
-|SampleName|dhfr_16|dhfr_51|dhfr_59|dhfr_108|dhfr_164|...|
-|---|---|---|---|---|---|---|
-|sample1|A [350]|I [350]|R [350]||I [314]|...|
-|sample2|A [1541]|I_N [49_1492]|R [1541]|N [1460]|I [1460]|...|
-|sample3|A [226]|I [226]|R [226]|N [249]|I [249]|...|
 
-### resmarkers_haplotype_summary.txt 
+### resmarker_microhap_table.txt
 
-|SampleName|dhfr_16/dhfr_51/dhfr_59|dhfr_108/dhfr_164|mdr1_1034/mdr1_1042|crt_72/crt_73/crt_74/crt_75/crt_76|...|
-|---|---|---|---|---|---|
-|Sample_A|A/I/R [949]|N/I [798]|S/N [1449]|C/V/M/N/K [859]|...|
-|Sample_B|A/I/R [348]|N/I [257]|S/N [399]|C/V/M/N/K [390]|...|
-|Sample_C||N/I [94]|S/N [269]|C/V/M/N/K [183]|...|
+This file provides the same information as the resistance marker table, but in less granular form and joined by haplotype.   
+
+|Column|Description|
+|:--:|:--:|
+|SampleID|The sample being reported|
+|GeneID|A numeric identifier for the *P. falciparum* gene and gene position|
+|Gene|The name of the gene|
+|MicrohapIndex|A collapsed version of the 'CodonID' (see 'resmarker_table.txt'). This will contain all codon IDs that are included in the microhaplotype.|
+|RefMicrohap|A collapsed version of the 'RefAA' column (see 'resmarker_table.txt') that reports all reference amino acids in order by `CodonID`|
+|Microhaplotype|A collapsed version of the 'AA' column (see 'resmarker_table.txt') that reports all ASV amino acids in order by `CodonID`|
+|MicrohapRefAlt|Can be 'ALT' or 'REF', depending on whether the ASV microhaplotype matches the reference microhaplotype ('ALT' if they do not match).|
+|Reads|Number of reads that have the `Microhaplotype`.|
+
+### resmarker_new_mutations.txt
+
+This file contains DNA mutations that were not within the specified genomic coordinates. These indels and SNPs are listed so that end users can see the other mutations that were found. Interpret them with caution: they are simply reported and are not filtered in any way.
+
+
+|Column|Description|
+|:--:|:--:|
+|SampleID|The sample being reported|
+|GeneID|A numeric identifier for the *P. falciparum* gene and gene position|
+|Gene|The name of the gene|
+|CodonID|The codon number|
+|Position|The position in the reference sequence where the indel or SNP was found|
+|Alt|The DNA base found at the position in the ASV|
+|Ref|The DNA base found at the position in the reference sequence|
+|Reads|The number of reads that support the `Alt` base|
 
 [jekyll-organization]: https://github.com/EPPIcenter
diff --git a/docs/modules/core-pipeline/index.md b/docs/modules/core-pipeline/index.md
index 90618c5..94c7e99 100644
--- a/docs/modules/core-pipeline/index.md
+++ b/docs/modules/core-pipeline/index.md
@@ -1,7 +1,7 @@
 ---
 layout: default
 title: Core Pipeline Modules
-nav_order: 5
+nav_order: 6
 has_children: true
 ---
 
diff --git a/docs/pipeline-outputs.md b/docs/pipeline-outputs.md
new file mode 100644
index 0000000..877fd92
--- /dev/null
+++ b/docs/pipeline-outputs.md
@@ -0,0 +1,88 @@
+---
+layout: default
+title: Pipeline Outputs
+nav_order: 5
+has_children: false
+---
+
+# Core Pipeline Outputs
+
+Below is a description of the files that you will find in your output directory. These are files that are part of the core pipeline.
+
+## Amplicon and Sample Coverage
+
+There are two coverage files:
+
+1. sample_coverage.txt
+2. amplicon_coverage.txt
+
+These provide sample-level and amplicon-level coverage statistics for your sequencing data. The following metrics can be included for each (further broken down by amplicon in the amplicon_coverage.txt file):
+
+|Metric|Description|
+|:--:|:--:|
+|Input|This is the starting number of reads that were found for the sample. In the amplicon_coverage.txt file, the starting number of reads for the amplicon for the sample is reported|
+|No Dimers|Illumina adapter dimers are removed from the sequencing data first. This is the number of reads that remain after that filtering step.|
+|OutputDada2|This is the number of denoised sequences. DADA2 is the denoising algorithm that the pipeline uses. In this step, reads that do not meet quality thresholds will be filtered out, reducing the number of sequences output from the module|
+|OutputPostprocessing|This is the number of sequences that remain after filtering out reads that did not pass the specified alignment threshold when aligned to the provided reference sequence. At this step, off-target sequences are filtered out of the final dataset.|
+
+## Allele Data
+
+The allele_data.txt file contains all of the amplicon sequence variants (ASVs) found in your sequencing dataset. There are 6 columns in this file, which are defined below.
+
+### ASV Identification
+
+There are 5 columns that identify the ASV reported.
+
+|Column|Description|
+|:--:|:--:|
+|sampleID|The reported sample|
+|locus|The reported locus for the sample|
+|asv|The denoised ASV|
+|reads|The number of reads that support this ASV|
+|allele|A unique identifier for the allele that is formed using the locus and an incrementing integer|
+
+### ASV Annotations (PseudoCIGAR)
+
+The `PseudoCIGAR` column provides a pseudocigar string that describes the ASV using *reference* coordinates and keys. The string is a succinct representation of all:
+
+* Indels and SNPs that were identified
+* Locations that were masked by either user-provided masking data, or by built-in homopolymer and tandem repeat masking
+
+#### **Mutations (Indels and SNPs)**
+
+##### Indels
+
+The following syntax is used to report insertions:
+
+`{position}I=[ACTG]`
+
+Deletions are reported the same way but with `D=`:
+
+`{position}D=[ACTG]`
+
+In both cases:
+ * `position` is where the insertion or deletion occurred along the *reference* sequence
+ * `[ACTG]` is the base that was inserted in the ASV (does not exist in the reference at that position), or the base that was deleted in the ASV (does exist in the reference at that position, but not in the ASV).
+
+##### SNPs
+
+SNPs use a slightly different syntax:
+
+`{position}[ACTG]`
+
+Where:
+ * `position` is where the substitution occurred along the *reference* sequence.
+ * `[ACTG]` is the new base at the position in the ASV sequence.
+
+#### **Masks**
+
+If you are masking low-complexity regions, you may see masking annotations in your PseudoCIGAR string. The following syntax is used for masks:
+
+`{start_position}+{mask_length}N`
+
+Where:
+
+* `start_position` is where the mask begins
+* `mask_length` is the length of the mask
+
+Any mutations that were identified in this masking region will be superseded by the mask. In other words, a substitution or indel will not be reported in the PseudoCIGAR string if the position is within a masked range.
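
Reviewer note (not part of the patch): the PseudoCIGAR syntax documented above can be illustrated with a small parser. This is a minimal sketch, not the pipeline's own code. It assumes that the annotations are concatenated into one string with no separators, that an indel may carry one or more bases, and that anything else is an error. The names `parse_pseudocigar` and `TOKEN` are hypothetical.

```python
import re

# One annotation = a reference position followed by one of:
#   I=<bases> / D=<bases>   insertion / deletion
#   +<length>N              mask
#   <base>                  SNP
# Assumption: tokens are concatenated without separators.
TOKEN = re.compile(
    r"(?P<pos>\d+)"
    r"(?:(?P<indel>[ID])=(?P<indel_bases>[ACGT]+)"
    r"|\+(?P<mask_len>\d+)N"
    r"|(?P<snp_base>[ACGT]))"
)


def parse_pseudocigar(pseudocigar):
    """Split a PseudoCIGAR string into (kind, position, value) tuples."""
    events = []
    idx = 0
    while idx < len(pseudocigar):
        match = TOKEN.match(pseudocigar, idx)
        if match is None:
            raise ValueError(
                f"unrecognised annotation at offset {idx}: {pseudocigar[idx:]!r}"
            )
        pos = int(match.group("pos"))
        if match.group("indel"):
            kind = "insertion" if match.group("indel") == "I" else "deletion"
            events.append((kind, pos, match.group("indel_bases")))
        elif match.group("mask_len"):
            events.append(("mask", pos, int(match.group("mask_len"))))
        else:
            events.append(("snp", pos, match.group("snp_base")))
        idx = match.end()
    return events


if __name__ == "__main__":
    # Hypothetical string: insertion of A at 5, SNP to T at 12, 5-base mask from 20.
    for event in parse_pseudocigar("5I=A12T20+5N"):
        print(event)
```

For mask events the third tuple element is the mask length; for the other kinds it is the base or bases reported in the string.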