Scripts and output files associated with the manuscript "Prospects for a sequence-based taxonomy of influenza A virus subtypes":
https://www.biorxiv.org/content/10.1101/2023.07.06.548035v2
Input data files can be obtained from Zenodo under a Creative Commons license:
-
chainsaw-plot.R
- R script to visualize the number of subtrees produced by the edgewise clustering ("chainsaw") method as a function of the internal branch length cutoff.
This script is applied to results obtained for protein sequence phylogenies reconstructed for all eight influenza A virus (IAV) genome segments, with an emphasis on hemagglutinin (HA) and neuraminidase (NA) proteins. -
coldates.R
- A simple R script used to generate Figure S1 (a bar plot of the number of IAV sequences deposited in the GISAID database per year). -
chainsaw.py
- Python script implementing the edgewise clustering method. Requires Biopython. Running the script without any arguments prints a histogram summary of branch lengths to the console. Specifying a branch length cutoff with `--cutoff` prints a summary of the resulting subtrees (the default output format, `-f summary`). Setting the `-f` option to `labels` writes a detailed CSV output listing subtree assignments for all tips. The script also calculates the normalized mutual information between the subtree partition and subtype labels. -
compress-seqs.py
- This Python script looks for exact matches among the unaligned sequences of the input FASTA file, and writes each unique sequence to an output FASTA file under the first label encountered. The labels of all duplicates are written to a CSV file that links them to that first label. This script also filters out sequences with an excessive number of ambiguous amino acids (`X`). -
concat-genes.py
- This Python script concatenates the non-overlapping amino acid sequences for M1/M2 or NS1/NS2 records from the same isolate. The input is assumed to be a CDS FASTA file generated by the NCBI Genbank interface. -
filter-prot.py
- This Python script applies an initial filter on the CDS FASTA files downloaded from Genbank. It uses regular expressions to remove records that do not correspond to the query protein. -
get-metadata.py
- The default sequence names for Genbank CDS downloads are not very informative, so this script is used to retrieve more useful metadata such as the strain name and collection date from the database based on the accession number. It takes either a FASTA or NWK file as input. The results are written to a CSV file. -
midpoint.R
- This small R script simply calls the `midpoint` rooting function of the `phangorn` package on the input tree. -
plot-trees.R
- This R script requires the R package `ggfree`. It generates plots of the large HA and NA phylogenies, colouring branches based on subtype labels on the tips. -
relabel-fasta.py
- This Python script uses the CSV generated by `get-metadata.py` to replace the sequence names in the user-specified FASTA input file. -
subtree-grid.R
- This R script was used to generate the supplementary figure summarizing the results of nodewise clustering of the HA phylogeny. -
subtyping.py
- This Python script implements the nodewise clustering method, calculating a number of summary statistics for every internal node of the input tree.
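
The edgewise clustering idea behind `chainsaw.py` can be illustrated with a minimal, dependency-free sketch. This is not the repository's code: it assumes that a "cut" removes every internal branch longer than the cutoff and leaves terminal branches intact, and the `Node` class and `cut_tree` function are hypothetical names introduced only for this example (the actual script works on Biopython tree objects).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal tree node; `length` is the branch above the node."""
    name: str = ""
    length: float = 0.0
    children: list = field(default_factory=list)

    def is_tip(self):
        return not self.children


def cut_tree(root, cutoff):
    """Return one list of tip names per subtree that remains after removing
    every internal branch longer than `cutoff` (assumed behaviour)."""
    subtrees = []

    def collect(node):
        # Gather tips reachable from `node` without crossing a cut branch.
        if node.is_tip():
            return [node.name]
        tips = []
        for child in node.children:
            child_tips = collect(child)
            if not child.is_tip() and child.length > cutoff:
                # The branch above this internal node is cut, so the tips
                # below it form their own subtree.
                subtrees.append(child_tips)
            else:
                tips.extend(child_tips)
        return tips

    remaining = collect(root)
    if remaining:
        subtrees.append(remaining)
    return subtrees


# Toy tree: one long internal branch (0.5), one short internal branch (0.05),
# and a tip on a long terminal branch (0.9), which is never cut.
toy = Node(children=[
    Node(length=0.5, children=[Node("a1", 0.1), Node("a2", 0.1)]),
    Node(length=0.05, children=[Node("b1", 0.1), Node("b2", 0.1)]),
    Node("c", 0.9),
])

print(cut_tree(toy, cutoff=0.18))  # [['a1', 'a2'], ['b1', 'b2', 'c']]
```

Lowering the cutoff can only increase the number of subtrees, which is the relationship that `chainsaw-plot.R` visualizes. The recursion here is adequate for a toy tree; a very deep phylogeny would need an iterative traversal.
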
-
HA.mindiv0_08.maxpat1_2.subtypes.csv
- This CSV file was generated for the HA sequence alignment using the `subtyping.py` script, which implements a nodewise clustering method, with minimum divergence (`mindiv`) set to 0.08 and maximum mean patristic distance (`maxpat`) set to 1.2. These data were used to generate Supplementary Figure S3D. -
chainsaw-HA-0.18.labels.csv
- This CSV file was generated using the `chainsaw.py` script for the input tree `HA.nwk` with the options `-f labels` and `--cutoff 0.18`. The results were used to generate the matrix plot comparing subtrees to NA subtype labels (Figure 2B). -
chainsaw-NA-0.41.labels.csv
- This CSV file was generated using the `chainsaw.py` script for the input tree `NA.nwk` with the options `-f labels` and `--cutoff 0.41`. The results were used to generate the matrix plots comparing subtrees to NA subtype labels (Figure 3B). -
chainsaw-nsubtrees-na.csv
- This CSV was generated by running `chainsaw.py` for the input tree `NA.nwk` under varying settings of `--cutoff`, and recording the number of subtrees listed in the summary outputs (Figure 3A). -
chainsaw-nsubtrees-others.csv
- This CSV was generated by running `chainsaw.py` for all trees except for `HA.nwk` and `NA.nwk` under varying settings of `--cutoff`, and recording the number of subtrees listed in the summary outputs (Supplementary Figure S4). -
chainsaw-nsubtrees.csv
- This CSV was generated by running `chainsaw.py` for the input tree `HA.nwk` under varying settings of `--cutoff`, and recording the number of subtrees listed in the summary outputs (Figure 2A). -
edge-index.RData
- This file stores some intermediate outputs of the `plot-trees.R` script. -
subtree-grid.csv
- This CSV file was generated by the `subtree-grid.py` script for the HA phylogeny with no `--minlen` or `--maxlen` option specified. It is used by the script `subtree-grid.R` to generate Supplementary Figure S3 (A to C). -
treeplots.RData
- This file stores some intermediate outputs of the `plot-trees.R` script.
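
The `labels` CSV files above pair each tip with a subtree assignment, and `chainsaw.py` reports the normalized mutual information (NMI) between that partition and the subtype labels. The sketch below shows one common way to compute NMI from two parallel label lists. The normalization used here (mutual information divided by the arithmetic mean of the two entropies) is an assumption, since several variants exist and this README does not say which one the script uses; the function names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information (in nats) between two parallel label lists."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts
    return sum((c / n) * log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def nmi(x, y):
    """NMI normalized by the arithmetic mean of the entropies (assumed variant)."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0 and hy == 0:
        return 1.0  # both partitions are a single cluster
    return 2 * mutual_information(x, y) / (hx + hy)


# Made-up example: subtree assignments versus subtype labels for six tips.
subtree = [0, 0, 0, 1, 1, 2]
subtype = ["H1", "H1", "H1", "H3", "H3", "H3"]
print(round(nmi(subtree, subtype), 3))
```

An NMI of 1 means the subtrees reproduce the subtype labels exactly (up to renaming), while a value near 0 means the two partitions are essentially unrelated.
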