Commit

add pseudocigar documentation
EPPIcenter committed Sep 6, 2023
1 parent 9aa2d43 commit c93fc14
Showing 4 changed files with 137 additions and 18 deletions.
2 changes: 1 addition & 1 deletion docs/modules/analysis/index.md
@@ -1,7 +1,7 @@
---
layout: default
title: Analysis Modules
nav_order: 6
nav_order: 7
has_children: true
toc: true
---
63 changes: 47 additions & 16 deletions docs/modules/analysis/resistance-markers.md
@@ -7,30 +7,61 @@ parent: Analysis Modules

# Resistance markers

This module will identify variants that are known to provide antibiotic resistance to *Plasmodium falciparum* by using the previously generated `allele_data.txt` and `Mapping/*mpileup.txt` files, along with a codon table (`codontable.txt`) and resistance marker genomic coordinates (`resistance_markers_amplicon_v4.txt`).
This module identifies variants that are known to confer drug resistance in *Plasmodium falciparum* by using the **PseudoCIGAR** string found in `allele_data.txt`. It identifies mutations found within the genomic coordinates of interest (`resistance_markers_amplicon_v4.txt`), and reports any new indels or SNPs that occurred.

## File Outputs

A table with resistance markers as well as a table with natural haplotypes (resistance markers found in the same amplicon), both including read counts inside brackets.
This module outputs three files:

Whenever multiple variants for a given resistance marker are found in the same sample, they are separated by `_` and the reference variant is presented first followed by the alternate variants. This also applies to the read counts. Cells remain blank if no resistance marker was found.
### resmarker_table.txt

Nomenclature of resistance markers can be broken down into name of the gene and amino acid position of the marker. For instance, for `dhfr_16`, `dhfr` is the gene and `16` is the amino acid position.
This file contains all codons that are found in the ASV as specified by the genomic coordinates in the provided resistance marker table. The file summarizes what the 3-base sequence was in the sample, what was expected, and whether there was a synonymous or non-synonymous amino acid change.

### resmarkers_summary.txt
|Column|Description|
|:--:|:--:|
|SampleID|The sample being reported|
|GeneID|A numeric identifier for the *P. falciparum* gene and gene position|
|Gene|The name of the gene|
|CodonID|The codon number|
|RefCodon|The codon in the reference|
|Codon|The codon in the ASV|
|CodonStart|The codon start position|
|CodonRefAlt|Can be 'REF' or 'ALT', depending on whether the ASV codon matches the reference codon ('ALT' if they do not match)|
|RefAA|The amino acid coded by the `RefCodon`|
|AA|The amino acid coded by the `Codon`|
|AARefAlt|Can be 'REF' or 'ALT', depending on whether the ASV amino acid matches the reference amino acid ('ALT' if they do not match)|
|Reads|The number of reads that contain this `Codon`|
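
The relationship between the `CodonRefAlt` and `AARefAlt` columns can be sketched in a few lines of Python. This is only an illustration of the column definitions above, not the module's implementation: the function names are hypothetical, and `CODON_TO_AA` is a three-entry excerpt of the standard genetic code chosen for the example.

```python
# Three codons from the standard genetic code, enough for this example.
CODON_TO_AA = {"AAT": "N", "AAC": "N", "ATT": "I"}


def ref_or_alt(observed, reference):
    """Return 'REF' when the observed value matches the reference, else 'ALT'."""
    return "REF" if observed == reference else "ALT"


def annotate_codon(ref_codon, codon):
    """Build the codon and amino acid columns for one ASV codon (hypothetical helper)."""
    ref_aa = CODON_TO_AA[ref_codon]
    aa = CODON_TO_AA[codon]
    return {
        "RefCodon": ref_codon,
        "Codon": codon,
        "CodonRefAlt": ref_or_alt(codon, ref_codon),
        "RefAA": ref_aa,
        "AA": aa,
        "AARefAlt": ref_or_alt(aa, ref_aa),
    }


synonymous = annotate_codon("AAT", "AAC")      # codon differs, amino acid does not
non_synonymous = annotate_codon("AAT", "ATT")  # codon and amino acid both differ
```

In this sketch a synonymous change yields `CodonRefAlt = 'ALT'` with `AARefAlt = 'REF'`, while a non-synonymous change yields 'ALT' in both columns.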

|SampleName|dhfr_16|dhfr_51|dhfr_59|dhfr_108|dhfr_164|...|
|---|---|---|---|---|---|---|
|sample1|A [350]|I [350]|R [350]||I [314]|...|
|sample2|A [1541]|I_N [49_1492]|R [1541]|N [1460]|I [1460]|...|
|sample3|A [226]|I [226]|R [226]|N [249]|I [249]|...|

### resmarkers_haplotype_summary.txt
### resmarker_microhap_table.txt

|SampleName|dhfr_16/dhfr_51/dhfr_59|dhfr_108/dhfr_164|mdr1_1034/mdr1_1042|crt_72/crt_73/crt_74/crt_75/crt_76|...|
|---|---|---|---|---|---|
|Sample_A|A/I/R [949]|N/I [798]|S/N [1449]|C/V/M/N/K [859]|...|
|Sample_B|A/I/R [348]|N/I [257]|S/N [399]|C/V/M/N/K [390]|...|
|Sample_C||N/I [94]|S/N [269]|C/V/M/N/K [183]|...|
This file provides the same information as the resistance marker table, but in less granular form and joined by haplotype.

|Column|Description|
|:--:|:--:|
|SampleID|The sample being reported|
|GeneID|A numeric identifier for the *P. falciparum* gene and gene position|
|Gene|The name of the gene|
|MicrohapIndex|A collapsed version of the 'CodonID' (see 'resmarker_table.txt'). This will contain all codon IDs that are included in the microhaplotype.|
|RefMicrohap|A collapsed version of the 'RefAA' column (see 'resmarker_table.txt') that reports all reference amino acids in order by `CodonID`|
|Microhaplotype|A collapsed version of the 'AA' column (see 'resmarker_table.txt') that reports all ASV amino acids in order by `CodonID`|
|MicrohapRefAlt|Can be 'ALT' or 'REF', depending on whether the ASV microhaplotype matches the reference microhaplotype ('ALT' if they do not match).|
|Reads|Number of reads that have the `Microhaplotype`.|
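
As a rough illustration of how codon-level rows might collapse into one microhaplotype row, consider the sketch below. It is not the module's code: the helper name is hypothetical, the `/` separator is an assumption carried over from the older haplotype summary format, and the amino acid values are illustrative.

```python
def collapse_microhaplotype(codon_rows, sep="/"):
    """Collapse codon rows from one ASV into a microhaplotype row.

    `codon_rows` is a list of dicts with 'CodonID', 'RefAA' and 'AA' keys.
    The separator is an assumption; check your own output files.
    """
    ordered = sorted(codon_rows, key=lambda row: row["CodonID"])
    ref = sep.join(row["RefAA"] for row in ordered)
    hap = sep.join(row["AA"] for row in ordered)
    return {
        "MicrohapIndex": sep.join(str(row["CodonID"]) for row in ordered),
        "RefMicrohap": ref,
        "Microhaplotype": hap,
        "MicrohapRefAlt": "REF" if hap == ref else "ALT",
    }


# Illustrative dhfr codons, deliberately given out of order.
rows = [
    {"CodonID": 51, "RefAA": "N", "AA": "I"},
    {"CodonID": 59, "RefAA": "C", "AA": "R"},
    {"CodonID": 16, "RefAA": "A", "AA": "A"},
]
microhap = collapse_microhaplotype(rows)
```

Sorting by `CodonID` before joining keeps the three collapsed columns aligned position by position.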

### resmarker_new_mutations.txt

This file contains DNA mutations that were not within the specified genomic coordinates. These indels and SNPs are listed so that end users can see the other mutations that were found. Interpret them with caution: they are reported as found and are not filtered in any way.


|Column|Description|
|:--:|:--:|
|SampleID|The sample being reported|
|GeneID|A numeric identifier for the *P. falciparum* gene and gene position|
|Gene|The name of the gene|
|CodonID|The codon number|
|Position|The position in the reference sequence where the indel or SNP was found|
|Alt|The DNA base found at the position in the ASV|
|Ref|The DNA base found at the position in the reference sequence|
|Reads|The number of reads that support the `Alt` base|

[jekyll-organization]: https://github.com/EPPIcenter
2 changes: 1 addition & 1 deletion docs/modules/core-pipeline/index.md
@@ -1,7 +1,7 @@
---
layout: default
title: Core Pipeline Modules
nav_order: 5
nav_order: 6
has_children: true
---

88 changes: 88 additions & 0 deletions docs/pipeline-outputs.md
@@ -0,0 +1,88 @@
---
layout: default
title: Pipeline Outputs
nav_order: 5
has_children: false
---

# Core Pipeline Outputs

Below is a description of the files that you will find in your output directory. These are files that are part of the core pipeline.

## Amplicon and Sample Coverage

There are two coverage files:

1. sample_coverage.txt
2. amplicon_coverage.txt

These provide sample-level and amplicon-level coverage statistics for your sequencing data. The following metrics can be included in each file (further broken down by amplicon in the amplicon_coverage.txt file):

|Metric|Description|
|:--:|:--:|
|Input|The starting number of reads found for the sample. In the amplicon_coverage.txt file, this is the starting number of reads for the amplicon within the sample.|
|No Dimers|Illumina adapter dimers are removed from the sequencing data first. This is the number of reads that remain after that filtering.|
|OutputDada2|The number of denoised sequences. DADA2 is the denoising algorithm that the pipeline uses. In this step, reads that do not meet quality thresholds are filtered out, reducing the number of sequences output from the module.|
|OutputPostprocessing|The number of sequences that remain after filtering out reads that did not pass the specified alignment threshold when aligned to the provided reference sequence. At this step, off-target sequences are filtered out of the final dataset.|
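
One way to read these metrics is as the fraction of input reads that survive each step. The sketch below assumes a row has already been loaded into a dict keyed by the metric names above; the function name and the read counts are hypothetical.

```python
STAGES = ["Input", "No Dimers", "OutputDada2", "OutputPostprocessing"]


def retention(row):
    """Return the fraction of 'Input' reads remaining at each stage."""
    start = row["Input"]
    # Guard against samples with no input reads.
    return {stage: (row[stage] / start if start else 0.0) for stage in STAGES}


# Made-up counts for a single sample.
example_row = {
    "Input": 1000,
    "No Dimers": 900,
    "OutputDada2": 800,
    "OutputPostprocessing": 750,
}
fractions = retention(example_row)
```

A sharp drop between two consecutive stages points to the step (dimer removal, denoising, or alignment filtering) where most reads were lost.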

## Allele Data

The allele_data.txt file contains all of the amplicon sequence variants (ASVs) found in your sequencing dataset. The file has 6 columns, which are defined below.

### ASV Identification

There are 5 columns that identify the ASV reported.

|Column|Description|
|:--:|:--:|
|sampleID|The reported sample|
|locus|The reported locus for the sample|
|asv|The denoised ASV|
|reads|The number of reads that support this ASV|
|allele|A unique identifier for the allele that is formed using the locus and an incrementing integer|

### ASV Annotations (PseudoCIGAR)

The `PseudoCIGAR` column provides a PseudoCIGAR string that describes the ASV using *reference* coordinates and keys. The string is a succinct representation of all:

* Indels and SNPs that were identified
* Locations that were masked by either user-provided masking data, or by built-in homopolymer and tandem repeat masking

#### **Mutations (Indels and SNPs)**

##### Indels

The following syntax is used to report insertions:

`{position}I=[ACTG]`

Deletions are reported the same way but with `D=`:

`{position}D=[ACTG]`

In both cases:
* `position` is where the insertion or deletion occurred along the *reference* sequence
* `[ACTG]` is the base that was inserted in the ASV (does not exist in the reference at that position), or the base that was deleted in the ASV (does exist in the reference at that position, but not in the ASV).
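
A minimal sketch of pulling indel tokens out of a PseudoCIGAR string with a regular expression is shown below. It assumes tokens are written back to back in one string and allows one or more bases per token; both are assumptions, and `find_indels` is a hypothetical helper, not part of the pipeline.

```python
import re

# Matches {position}I=[ACTG] and {position}D=[ACTG] tokens.
INDEL_TOKEN = re.compile(r"(\d+)([ID])=([ACGT]+)")
KINDS = {"I": "insertion", "D": "deletion"}


def find_indels(pseudocigar):
    """Return (position, kind, bases) for every indel token in the string."""
    return [
        (int(position), KINDS[op], bases)
        for position, op, bases in INDEL_TOKEN.findall(pseudocigar)
    ]


# Made-up string: an insertion of A at 12 and a deletion of T at 45.
indels = find_indels("12I=A45D=T")
# indels == [(12, 'insertion', 'A'), (45, 'deletion', 'T')]
```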

##### SNPs

SNPs use a slightly different syntax:

`{position}[ACTG]`

Where:
* `position` is where the substitution occurred along the *reference* sequence.
* `[ACTG]` is the new base at the position in the ASV sequence.
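
SNP tokens can be extracted in a similar way. The sketch below assumes tokens are concatenated in a single string; `find_snps` is a hypothetical helper. Because the pattern requires a base letter directly after the digits, indel tokens (digits followed by `I=` or `D=`) are not picked up.

```python
import re

# Matches {position}[ACTG]: digits immediately followed by one base.
SNP_TOKEN = re.compile(r"(\d+)([ACGT])")


def find_snps(pseudocigar):
    """Return (position, new_base) for every SNP token in the string."""
    return [(int(position), base) for position, base in SNP_TOKEN.findall(pseudocigar)]


# Made-up string: SNPs at 7 and 30, with an insertion at 12 in between.
snps = find_snps("7G12I=A30T")
# snps == [(7, 'G'), (30, 'T')]
```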

#### **Masks**

If you are masking low-complexity regions, you may see mask annotations in your PseudoCIGAR string. The following syntax is used for masks:

`{start_position}+{mask_length}N`

Where:

* `start_position` is where the mask begins
* `mask_length` is the length of the mask

Any mutations that were identified in this masking region will be superseded by the mask. In other words, a substitution or indel will not be reported in the PseudoCIGAR string if the position is within a masked range.
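
The mask syntax and the superseding rule can be sketched as follows. The helpers are hypothetical, and the sketch assumes a mask covers `start_position` through `start_position + mask_length - 1` inclusive, which this page does not state explicitly.

```python
import re

# Matches {start_position}+{mask_length}N.
MASK_TOKEN = re.compile(r"(\d+)\+(\d+)N")


def find_masks(pseudocigar):
    """Return (start_position, mask_length) for every mask token."""
    return [(int(start), int(length)) for start, length in MASK_TOKEN.findall(pseudocigar)]


def is_masked(position, masks):
    """True if the position falls inside any mask (assumed inclusive start)."""
    return any(start <= position < start + length for start, length in masks)


# Made-up string: SNPs at 7 and 30, and a 5-base mask starting at 20.
masks = find_masks("7G20+5N30T")
# masks == [(20, 5)]
```

Under that assumption, a mutation at position 22 would fall inside the mask and would not be reported, while the SNPs at 7 and 30 lie outside it.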
