Daniel J Giguere, Alexander T Bacheli
Raw reads are available under project number PRJEB36155 from the European Nucleotide Archive
This repository is intended to accompany the pre-print by providing the code used, and an example for the circos plot visualizations. The main pipeline can be found in validation-work-flow-final.sh. Fasta files of the genomes are available here
We applied the following strategy to orient the genomes:
-
Annotate the genome using prokka.
-
Identify the start of the dnaA genes using prokka (using a grep "dnaA" command). If multiple dnaA genes exist, extract the first dnaA identified by prokka.
-
Using a python script (orient.py), re-orient the genome so that the start of the dnaA gene correspond with the start of the genome.
After filtering Nanopore reads by > 90% query coverage, apparent read coverage at the beginning and end of the reference genome will appear to decrease because minimap2
will map a read only to the start or end of a fasta file, which causes apparent query coverage to be artifically low. Therefore we applied the following strategy:
-
Using the oriented genome, approximate the mid-point of the genome based on its length.
-
Make a re-oriented copy of the genome that begins at the halfway point of the oriented genome using
modify-genome.py
. -
Map long reads to both the original and re-oriented genomes, separately.
-
To determine the long read coverage for the first and last quarters of the original genome, calculate the coverage in windows of 1000 bp of the second and third quarters of the re-oriented genome. Because the second and third quarters of the re-oriented genome are relatively far from the origin of the re-oriented genome, reads will not be mapped off the "edge" of the fasta file (assuming the read length is relatively small compared to genome size). Coverage of filtered reads can therefore be correctly calculated at the start of end of a fasta file.
-
To determine the long read coverage for the second and third quarters of the complete genome calculate the coverage of every 1000 bp of the second and third quarters of the original genome.
-
Concatenate the coverage from each quarter of the genome together to get the complete genome coverage.
-
The final calculated coverage segment of the genome is assumed to be 1000 bp in length. Because most genomes are not an exact factor of 1000, a given genome likely includes the coverage of part of the start of the genome (refer to image below). To accruately determine the coverage of this end segment of the genome, trim the coverage to the correct length of the genome.
The R package circlize
was used to generate each of the genome figure. An example script is available as circos.md to reproduce the plots for one genome that requires the following as input:
- table of GC content, skew and culmulative skew calculated from
circlize_gc_information.R
- unfiltered Illumina coverage
- filtered Illumina coverage
- unfiltered nanopore coverage
- filtered nanopore coverage
- coding sequences (positive and negative strand in separate files)
- location of tRNA and rRNA genes
- cytoband file (required for length and genome name)
validation-work-flow.sh
requires a number of scripts to run. These scripts are:
- bed-file-orientation.py: orienting the prokka annotations according to the start of the genome
- cigar-parse.py: filtering reads aligned to a genome
- dot-plot.py for creating dot plots of segments of a genome and the flye re-assembled genome against the polished genome (also requires dot-plot.R)
- extract-circularized-fasta.py: extracting circularized genomes above a certain minimum size for anaylsis
- modify-genome.py: re-orienting a genome so that the start of the genome begins half way into the fasta sequence
- modify-nanopore-bed.py: calculating genome coverage from both the "forward-oriented" and "reverse-oriented" genomes
- orient.py: orienting the genome to start at the dnaA gene
- table1.py: creating the first summary table
- table2.py: creating the second summary table