Climate change is contributing to a sizable increase in antimicrobial-resistant bacteria, fungi, viruses, and parasites. Bacteriophages are an alternative treatment for resistant bacteria and have been used to treat infections since the early 1900s. However, there are too many bacteriophages to determine each one's host experimentally. Sequencing phages, on the other hand, is cheap and fast. Here, we present a new dataset for developing computational algorithms that match phages to hosts. The dataset contains 4,827 phage-host interaction pairs with complete genomes, gene annotations, and protein sequences for both the phages and the hosts. We review historical algorithms that have shown success and outline strategies for developing new ones. Using features we extract from the dataset, a random forest classifier achieves 94% test accuracy in predicting whether a phage infects a given host.
To process the data, run the notebook jupyter_notebooks/Processing.ipynb with the data in the "newprotein" directory at this link: https://drive.google.com/drive/folders/16ofXFoms7HcS5vhn4yjexLRz_zRzJ2US?usp=sharing. Before running it, unzip the 4 zip files in that directory, then set the corresponding directory paths in the notebook cells to the locations where you extracted the files.
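The unzip step can also be scripted. The following is a minimal sketch, not part of the repository: it assumes you have already downloaded the "newprotein" folder to a local directory, and the function name extract_archives and the output directory name are hypothetical. It extracts every zip file it finds into its own subfolder and returns the resulting paths, which you can then copy into the notebook cells.

```python
import zipfile
from pathlib import Path


def extract_archives(download_dir, output_dir):
    """Unzip every .zip file in download_dir into its own subfolder of output_dir.

    Returns a dict mapping each archive's base name to the absolute path of
    the folder it was extracted into.
    """
    download_dir = Path(download_dir)
    output_dir = Path(output_dir)
    if not download_dir.is_dir():
        raise FileNotFoundError(f"No such directory: {download_dir}")

    extracted = {}
    for archive in sorted(download_dir.glob("*.zip")):
        # One subfolder per archive, named after the zip file without ".zip".
        target = output_dir / archive.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted[archive.stem] = target.resolve()
    return extracted


# Example call (both paths are assumptions; adjust to your download location):
# paths = extract_archives("newprotein", "newprotein_extracted")
# for name, path in paths.items():
#     print(f"{name}: {path}")
```

With the download from the link above, the returned dict should contain 4 entries, one per zip file. The names of the archives and of the notebook variables that hold the paths are not stated here, so match them by inspecting the notebook cells.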