Skip to content

Retrieve HIVdb algorithm as XML and apply locally to HIV sequences

License

Notifications You must be signed in to change notification settings

PoonLab/sierra-local

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PyPI DOI

sierra-local

sierra-local is a Python 3 implementation of the Stanford University HIV Drug Resistance Database (HIVdb) Sierra web service for generating drug resistance predictions from HIV-1 sequence data. This Python package enables laboratories to run this prediction algorithm without needing to transmit patient data over the network, and confers full control over data provenance and security.

Rationale

The Stanford HIVdb algortihm is a widely used method for predicting the drug resistance phenotype of an HIV-1 infection based on its genetic sequence, specifically the complete or partial sequence of the genomic regions encoding the primary targets of modern antiretroviral therapy. Prediction of HIV-1 drug resistance is an important component in the routine clinical management of HIV-1 infection, being faster and more cost-effective than the direct measurement of drug resistance by culturing virus isolates in the laboratory. The HIVdb algorithm is essentially rules-based classifier that is actively maintained and released to the public domain in the ASI (Algorithm Specification Interface) exchange format, demonstrating a laudable commitment by the HIVdb developers to open-source research and clinical practice.

The HIVdb algorithm is usually accessed through a web service hosted at Stanford University (Sierra). While this is a convenient format for many clinical laboratories, it requires a network connection and the transmission of potentially sensitive patient-derived data to a remote server. Transmitting sequence data over the web may present a bottleneck for laboratories located at sites that are geographically distant from the host server, or where network traffic is prone to service disruptions. Furthermore, the use of HIV-1 sequence data in criminal cases raises significant issues around data privacy.

Our objective was to build a lightweight, open-source Python implementation of the HIVdb algorithm for processing data on a local computer without sending any data over the network. During the development of sierra-local, the maintainers of Sierra released the source code for their web service under a permissive free software license (GPL v3.0). We were thrilled that the HIVdb developers elected to release their server code, but we remained committed to complete sierra-local so the HIV research and clinical communities can process their own data without needing to install and maintain an Apache server, build an SQL database, or to install a sizeable number of software dependencies.

Dependencies

We tried to minimize dependencies:

Post-Align is the new alignment program and requires the following dependencies:

Installation

Setting up Post-Align

Post-Align is the new alignment program used by sierrapy, which we've incorporated into sierra-local. After cython is installed, run:

pip install https://github.com/hivdb/post-align/archive/8e2ee118261987208c17add6cef5c1270e325a4c.zip

which is adapted from Post-Align's docker script

Setting up Sierra-Local

On a Linux system, you can install sierra-local as follows:

git clone http://github.com/PoonLab/sierra-local
cd sierra-local
sudo python3 setup.py install

Note that you need super-user privileges to install the package by this method. For more detailed instructions, please refer to the document INSTALL.md that should be located in the root directory of this Python package.

Using sierra-local

Command-line interface (CLI)

To run a quick example, use the following sequence of commands:

art@Jesry:~/git/sierra-local$ python3 scripts/retrieve_hivdb_data.py RT RT.fa
art@Jesry:~/git/sierra-local$ sierralocal RT.fa
searching path /root/miniconda3/envs/py395/lib/python3.10/site-packages/sierralocal/data/HIVDB*.xml
searching path /root/miniconda3/envs/py395/lib/python3.10/site-packages/sierralocal/data/apobec*.json
HIVdb version 9.4
Aligning using post-align
Aligned RT.fa
100 sequences found in file RT.fa.
Writing JSON to file RT_results.json
Time elapsed: 19.796 seconds (5.1555 it/s)

To swap between running Post-Align (default) and NucAmnio, you can specify using -alignment, where inputting nuc will result in NucAmino being called

will@Jesry:~/sierra-local# sierralocal RT.fa -alignment nuc
searching path /root/miniconda3/envs/py395/lib/python3.10/site-packages/sierralocal/data/HIVDB*.xml
searching path /root/miniconda3/envs/py395/lib/python3.10/site-packages/sierralocal/data/apobec*.json
HIVdb version 9.4
Found NucAmino binary /root/miniconda3/envs/py395/lib/python3.10/site-packages/sierralocal/bin/nucamino-linux-amd64
Aligned RT.fa
100 sequences found in file RT.fa.
Writing JSON to file RT_results.json
Time elapsed: 4.1417 seconds (25.45 it/s)

retrieve_hivdb_data.py is a Python script that we provided to download small samples of HIV-1 sequence data from the Stanford HIVdb database. In this case, we have retrieved 100 reverse transcriptase (RT) sequences and processsed them with the sierra-local pipeline. By default, the results are written to the file [FASTA basename]_results.json:

art@Jesry:~/git/sierra-local$ head RT_results.json 
[
  {
    "inputSequence": {
      "header": "U54771.CM240.CRF01_AE.0"
    },
    "subtypeText": "CRF01_AE",
    "validationResults": [],
    "alignedGeneSequences": [
      {
        "firstAA": 1,

We can also specify a different ASI (XML) file representing an earlier version of the HIVdb algorithm to reprocess the same data, and save the output to another file:

art@Jesry:~/git/sierra-local$ sierralocal -xml sierralocal/data/HIVDB_8.5.d926dfff.xml RT.fa -o RT-v8.5.json
/usr/local/lib/python3.6/dist-packages/sierralocal/data/apobec.tsv
HIVdb version 8.5
Found NucAmino binary /usr/local/lib/python3.6/dist-packages/sierralocal/bin/nucamino-linux-amd64
Aligned RT.fa
100 sequences found in file RT.fa.
Writing JSON to file RT-v8.5.json
Time elapsed: 5.5014 seconds (18.337 it/s)

We find that switching versions of the algorithm from 8.5 to 8.7 results in substantial changes in resistance scores for these data with the introduction of a new drug doravirine (DOR). In addition, two of 100 cases were scored differently:

art@Jesry:~/git/sierra-local$ python3 scripts/json2csv.py RT_results.json RT-v8.7.json.csv
art@Jesry:~/git/sierra-local$ python3 scripts/json2csv.py RT-v8.5.json RT-v8.5.json.csv
art@Jesry:~/git/sierra-local$ R
> v5 <- read.csv('RT-v8.5.json.csv')
> v7 <- read.csv('RT-v8.7.json.csv')
> v7 <- v7[,-which(names(v7)=='DOR')]
> temp <- sapply(1:nrow(v5), function(i) any(v5[i,] != v7[i,]))
> which(temp)
[1] 23 63
> v5[23,]
                name subtype ABC AZT D4T DDI FTC LMV TDF EFV ETR NVP RPV
23 Y14503.BCF13.O.22 Group O  15 -10 -10  10  60  60 -10  50  45  95  65
> v7[23,]
                name subtype ABC AZT D4T DDI FTC LMV TDF EFV ETR NVP RPV
23 Y14503.BCF13.O.22 Group O  15 -10 -10  10  60  60 -10  50  55 105  75
> v5[63,]
                name subtype ABC AZT D4T DDI FTC LMV TDF EFV ETR NVP RPV
63 AF102332.A11.B.62       B  90 115 115  90  80  80  60   0   0   0   0
> v7[63,]
                name subtype ABC AZT D4T DDI FTC LMV TDF EFV ETR NVP RPV
63 AF102332.A11.B.62       B  90 115 115  90  80  80  60   0  10  10  10

To specify your own JSON file for APOBEC DRMS, you can call -json followed by your file. In the example below, EXTERNAL-APOBEC.json is located in the same directory as I called sierralocal:

root@LAPTOP-4FGEVBR0:~/sierra-local# sierralocal RT.fa -json EXTERNAL-APOBEC.json

As a Python module

If you have downloaded the package source to your computer, you can also run sierra-local as a Python module from the root directory of the package. In the following example, we are calling the main function of sierra-local from an interactive Python session:

art@Jesry:~/git/sierra-local$ git clone http://github.com/PoonLab/sierra-local
art@Jesry:~/git/sierra-local$ cd sierra-local
art@Jesry:~/git/sierra-local$ python3
Python 3.6.6 (default, Sep 12 2018, 18:26:19) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from sierralocal.main import sierralocal
>>> sierralocal('RT.fa', 'RT.json')
searching path /home/art/git/sierra-local/sierralocal/data/HIVDB*.xml
searching path /home/art/git/sierra-local/sierralocal/data/apobec*.tsv
HIVdb version 8.8
Found NucAmino binary /home/art/git/sierra-local/sierralocal/bin/nucamino-linux-amd64
Aligned RT.fa
100 sequences found in file RT.fa.
Writing JSON to file RT.json
(100, 0.04676532745361328)

Note that this doesn't require any sudo privileges.

Updating the algorithm

The Stanford HIVdb database regularly updates its resistance genotyping algorithm and publishes the associated ASI2 XML file on their github, hivfacts. In previous versions of sierra-local, we used Python to automatically query this website and download the newest version if it was not already present on the user's computer. Subsequent changes to the Stanford HIVdb website, however, meant that users would have to install several additional dependencies in order for Python to locate the required files. As a result, we decided to make the updater.py script an optional step of the pipeline.

Manually running the script enabled me to grab the most recent versions of the ASI2 and APOBEC files from the HIVdb webserver:

art@orolo:~/git/sierra-local$ python3 sierralocal/updater.py 
Downloading the latest HIVDB XML File
Updated HIVDB XML into sierralocal/data/HIVDB_9.4.xml
Downloading the latest APOBEC DRMS File
Updated APOBEC DRMs into sierralocal/data/apobec_drms.json

Now of course, it would be much simpler to manually download these files yourself in hivfacts, but in some applications there may be a benefit to automating this step.

About Us

This project was developed at the Poon lab within the Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario. Development of sierra-local was supported in part by a grant from the Canadian Institutes of Health Research (PJT-156178).

If you use sierra-local for your work, please cite the following paper:

  • sierra-local: A lightweight standalone application for drug resistance prediction. Jasper C Ho, Garway T Ng, Mathias Renaud, Art FY Poon, (2019). Journal of Open Source Software, 4(33), 1186, https://doi.org/10.21105/joss.01186

If you want to reference the validation of sierra-local on HIV-1 pol data, please cite the following preprint:

  • sierra-local: A lightweight standalone application for secure HIV-1 drug resistance prediction. Jasper C Ho, Garway T Ng, Mathias Renaud, Art FY Poon. bioRxiv 393207; doi: https://doi.org/10.1101/393207