The scripts contained here Implement clustering algorithms on genetic data and find optimal parameters through the performance of predictive growth models. They contain code to run on sequence alignments and/or phylogenetic trees, allowing users to choose between multiple implemented cluster-building algorithms. These algorithms can be further augmented through the selection of parameters, such as a required similarity for cluster formation, or a required level of certainty. The package also takes in meta-data associated with sequences (such as a known collection date or subtype/variant classification) by parsing headers. These can also allow users to identify cluster-level characteristics, such as the range of collection dates or the most common subtype/variant within a cluster.
If a subset of sequences are specified as “New”, then these scripts simulate cluster growth by building clusters in two stages: first clusters are built from sequences which are not specified as new, then the new sequences are added to clusters. Depending on the clustering method used, this second step may include compromises to insure that new sequences do not retroactively change the membership of clusters. For example, if a single new sequence forms a cluster with two, previously separate clusters, then those two clusters would have ambiguous growth. Pairing cluster-level meta-data, with the growth of clusters is a common goal in research and these scripts contain some functions to help test predictive models based on cluster data. Furthermore, they facilitate the assignment of multiple cluster sets from the same data using different methods and parameters. Pairing these with the effectiveness of growth models can be useful in method/parameter selection.
- pplacer - includes
guppy
Pplacer proposes the most likely places for new sequences to join a fixed maximum likelihood tree, simulating growth prospectively This step assumes that we already have a newick-formatted maximum likelihood tree with bootstrap support values saved as node labels. In this example, data/na.nwk contains such a tree, produced using data from a study of HIV in northern alberta by Vrancken, et al (2017). The end result from pplacer (and a related program guppy) is a list of trees, each containing the placements of 1 new sequence
source("pplacer_utils.R")
input.tree <- "data/na.nwk" # A tree containing only older sequences (ie. excluding new sequences that could represent growth)
tree.log <- "data/na.log" # A log file from the tree building process
full.alignment <- "data/na.fasta" # The full alignment (part of this was used to build the input tree)
out.ref <- "data/na.refpackage" # Name of the output file (this will be a reference package)
program <- "IQ-TREE" # Name of the program used (either "IQ-TREE" or "FastTree")
# Creates a reference package that pplacer can run on
# Follows this by running both pplacer and guppy
taxitCreate(input.tree, tree.log, full.alignment, out.ref, program)
pplacer_guppy(out.ref, "data/")
After pplacer has proposed the growth placements, the cluster analysis described above can be run
source("subT_Lib.R") # load import.tree, growthSim and multiSTClu functions
date.format <- "%Y-%m-%d" # Date format for time info
var.indices <- c(1,2,3) # The indices of the unique sequence identifier, the time info, and any additional variables (in that order - there may be more than one additional variable)
sep <- '_' # A splitter to pull variable info from headers
extra.vars = c("Subtype") # The names of any additional variables beyond
growth.info <- "data/na_growth.tree" # End result of pplacer and guppy runs. A list of trees that include new sequence placements.
# Extends the tree with information necessary for clustering and prospective growth measurement
extended.tree <- import.tree(input.tree, var.indices, date.format, sep, extra.vars)
extended.tree$g <- growthSim(extended.tree, growth.info)
distance.thresholds <- seq(0, 0.04, 0.001) # Distance thresholds for clustering
bootstrap.requirements <- 0.95 # Minimum bootstrap requirements for clustering
cores <- 4 # Number of cores for parallel processing
clustering.function <- STClu # Type of clustering algorithm (subtree step clustering for this example)
# Builds a list of different cluster sets made from different parameters
clus <- multiSTClu(extended.tree, distance.thresholds, bootstrap.requirements, cores, clustering.function)
proposed.var <- "Time" # Proposed additional variable for proposed model
runID <- 0 # Unique identifier for run
proposed.formula <- New~Old+Time # Formula for proposed model. This compares to a null model (New~Old)
var.translation <- list(function(x){mean(x)}) # List of transformation functions for proposed variables so that we can obtain 1 value per cluster (ie. mean time within cluster)
# List of Resulting data
# This includes a profile of delta AIC values and cluster stats
res <- GAICRun(clus, runID, cores, proposed.formula, proposed.var, var.translation)
To display the delta-AIC (GAIC) profile:
plot(res$MaxD, res$GAIC, type='l', xlab="Distance threshold", ylab="GAIC")
- Chato C, Kalish ML, Poon AF. Public health in genetic spaces: a statistical framework to optimize cluster-based outbreak detection. Virus evolution. 2020 Jan;6(1):veaa011.
- Chato C, Feng Y, Ruan Y, Xing H, Herbeck J, Kalish M, Poon A. Optimized phylogenetic clustering of HIV-1 sequence data for public health applications. bioRxiv. 2022 Jan 1.
- Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC bioinformatics. 2010 Dec;11(1):1-6.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank. Nucleic acids research. 2000 Jan 1;28(1):15-8.
- Vrancken B, Adachi D, Benedet M, Singh A, Read R, Shafran S, Taylor GD, Simmonds K, Sikora C, Lemey P, Charlton CL. The multi-faceted dynamics of HIV-1 transmission in Northern Alberta: A combined analysis of virus genetic and public health data. Infection, Genetics and Evolution. 2017 Aug 1;52:100-5.