Project aimed at determining the rates of insertions and deletions in the five variable regions (V1-V5) of the HIV-1 gp120 surface envelope glycoprotein
Overview:
- parsed over 26,000 HIV-1 gp120 sequences from the Los Alamos National Laboratory (LANL) HIV Database and sorted them into their respective group M subtypes and circulating recombinant forms (CRFs)
- filtered sequences to ensure sufficient coverage of gp120 (>1,400 nt) and availability of collection dates
- performed pairwise alignments between each sequence and the HXB2 reference genome to locate and extract the five variable and five conserved regions of gp120
- performed multiple sequence alignments (MSAs) among concatenated conserved regions within each group M clade
- reconstructed phylogenetic trees from these MSAs, and rescaled the trees in time using sequence collection dates
- extracted cherries of the phylogenetic trees and checked for length differences in their variable regions to detect indels
- applied a binomial-Poisson model to these data to determine indel rates for each variable loop within each group M clade
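
The cherry-extraction and length-comparison steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the nested-tuple tree representation, the function names `find_cherries` and `indel_flags`, and the tip names and loop lengths in the example are all assumptions made up for demonstration. A cherry is taken here to be a pair of tips that are the only two children of the same internal node.

```python
from typing import Dict, List, Tuple, Union

# Assumed representation: a tree is either a tip name (str)
# or a tuple of child subtrees. Branch lengths are omitted for brevity.
Tree = Union[str, Tuple["Tree", ...]]
Cherry = Tuple[str, str]


def find_cherries(tree: Tree) -> List[Cherry]:
    """Return every pair of tips whose parent has exactly those two tips as children."""
    if isinstance(tree, str):
        return []  # a single tip cannot contain a cherry
    if len(tree) == 2 and all(isinstance(child, str) for child in tree):
        return [(tree[0], tree[1])]
    cherries: List[Cherry] = []
    for child in tree:
        cherries.extend(find_cherries(child))
    return cherries


def indel_flags(
    cherries: List[Cherry],
    loop_lengths: Dict[str, Dict[str, int]],
) -> Dict[Cherry, Dict[str, bool]]:
    """For each cherry, flag the variable loops whose lengths differ between the two tips."""
    flags: Dict[Cherry, Dict[str, bool]] = {}
    for a, b in cherries:
        flags[(a, b)] = {
            loop: loop_lengths[a][loop] != loop_lengths[b][loop]
            for loop in loop_lengths[a]
        }
    return flags


# Hypothetical five-tip tree and invented loop lengths (in nucleotides).
example_tree: Tree = ((("A", "B"), "C"), ("D", "E"))
example_lengths = {
    "A": {"V1": 78, "V2": 120},
    "B": {"V1": 84, "V2": 120},
    "D": {"V1": 90, "V2": 117},
    "E": {"V1": 90, "V2": 117},
}

example_cherries = find_cherries(example_tree)
example_flags = indel_flags(example_cherries, example_lengths)
```

In this toy input, the pair (A, B) differs in V1 length and would be counted as carrying an indel in V1, while (D, E) shows no length difference in either loop. A length comparison of this kind only detects net length changes, so insertions and deletions that cancel out within a cherry would go unnoticed.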