Name/Link | Prokaryotic Portion | Viral Portion | Prophage-masked? | Taxonomy for Prokaryotic Portion | Comments |
---|---|---|---|---|---|
Default database | HumGut | MGV + RefSeq viral | N | NCBI | Default database (as described in our manuscript) |
Masked version of default database | HumGut | MGV + RefSeq viral | Y | NCBI | Prophage-masked version of default database (as described in our manuscript) |
Default database - GTDB | HumGut | MGV + RefSeq viral | N | GTDB | Default database with GTDB taxonomy for prokaryotic portion |
UHGGV2 + MGV | UHGGV2 | MGV + RefSeq viral | N | GTDB | Default database with UHGGv2 replacing HumGut. UHGGv2 includes low-prevalence prokaryotes filtered by HumGut |
HumGut + UHGV "MQ+" | HumGut | UHGV ( | N | NCBI | Same as default database but replacing the viral portion with new viral genome catalog UHGV. Here we included UHGV genomes |
HumGut + UHGV "HQ+" | HumGut | UHGV ( | N | NCBI | Same as previous line but using only |
UHGGv2 + UHGV "MQ+" | UHGGV2 | UHGV ( | N | GTDB | UHGGv2 for prokaryotic portion; UHGV for viral portion ( |
Kraken2 database
- hash.k2d
- taxo.k2d
- opts.k2d
- seqid2taxid.map
Bracken databases (built for use with various read lengths N):
- databaseNmers.kmer_distrib
Additional files required for pipeline to run:
- inspect.out
- taxonomy/nodes.dmp
- taxonomy/names.dmp
- library/species_genome_size.txt
For use with post-processing scripts:
- host_prediction_to_genus.tsv
- species_name_to_vir_score.txt
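As a quick sanity check, the file listing above can be turned into a small script that reports what is missing from a database folder. This is only a sketch: the function name `check_database_folder` is hypothetical, and it assumes the two post-processing files sit at the top level of the database folder.

```python
from pathlib import Path

# File names taken from the listing above.
REQUIRED_FILES = [
    "hash.k2d",
    "taxo.k2d",
    "opts.k2d",
    "seqid2taxid.map",
    "inspect.out",
    "taxonomy/nodes.dmp",
    "taxonomy/names.dmp",
    "library/species_genome_size.txt",
]
# Only needed for the post-processing scripts.
# Assumption: these live at the top level of the database folder.
OPTIONAL_FILES = [
    "host_prediction_to_genus.tsv",
    "species_name_to_vir_score.txt",
]


def check_database_folder(db_dir):
    """Return (missing_required, missing_optional, bracken_read_lengths)."""
    db = Path(db_dir)
    missing_required = [f for f in REQUIRED_FILES if not (db / f).is_file()]
    missing_optional = [f for f in OPTIONAL_FILES if not (db / f).is_file()]
    # Bracken files are named database<N>mers.kmer_distrib, one per read length N.
    read_lengths = []
    for path in db.glob("database*mers.kmer_distrib"):
        n = path.name[len("database"):-len("mers.kmer_distrib")]
        if n.isdigit():
            read_lengths.append(int(n))
    return missing_required, missing_optional, sorted(read_lengths)
```

Calling `check_database_folder("/path/to/database/folder")` returns two lists of missing files and the read lengths for which a Bracken database was found.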
Note: Phanta was developed with human gut metagenomes in mind, and its default database was built from human-gut viral and bacterial genomes. If you wish to apply Phanta to non-human-gut metagenomes, you will probably need to supply a custom database. In such cases, please open a new discussion so we can work out the best way to help or collaborate.
The tar.gz file should be about 20-25 GB in total, depending on the exact version.
Phanta is based on Kraken2/Bracken. As a result, as you can see above, the main components of a Phanta database are a Kraken2 database and Bracken database(s). After you have these, you’re almost there! More details below.
You can either follow the recommendations of the Kraken2 developers (see the custom database section of the Kraken2 manual), or the recommendations below.
For every genome in the database, you will need to have both a taxonomic ID and the name. Additionally, you will need this information for all higher taxonomic ranks in the lineage of each genome.
Then, you will use this information to make the following two files:
- names.dmp
- nodes.dmp
The names.dmp file specifies the taxid and name of each taxon in the database. To enter lines in our names.dmp file, we use the following Python code (generally as part of a “for loop”):
names_file.write(str(taxid) + "\t|\t" + name + "\t|\t-\t|\tscientific name\t|\n")
Each line of the nodes.dmp file specifies a parent-child taxonomic relationship. To enter lines in our nodes.dmp file, we use the following Python code (generally as part of a “for loop”):
nodes_file.write(str(taxid) + "\t|\t" + str(parent_taxid) + "\t|\t" + rank + "\t|\t-\t|\n")
If you would like some suggestions about how to designate taxonomic relationships between viral genomes, please see “Suggestions for viral taxonomy” below.
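To show how the two write calls above fit into a loop, here is a minimal sketch. The helper name `write_taxonomy`, the tuple layout, and every taxid and name in `example_taxa` are hypothetical. The only convention borrowed from NCBI is that the root (taxid 1) is listed as its own parent.

```python
def write_taxonomy(taxa, names_path, nodes_path):
    """Write names.dmp and nodes.dmp.

    taxa: iterable of (taxid, parent_taxid, rank, name) tuples,
    covering every genome and every higher rank in its lineage.
    """
    with open(names_path, "w") as names_file, open(nodes_path, "w") as nodes_file:
        for taxid, parent_taxid, rank, name in taxa:
            # Same line formats as the two snippets above.
            names_file.write(str(taxid) + "\t|\t" + name + "\t|\t-\t|\tscientific name\t|\n")
            nodes_file.write(str(taxid) + "\t|\t" + str(parent_taxid) + "\t|\t" + rank + "\t|\t-\t|\n")


# Hypothetical lineage: all taxids and names below are made up for illustration.
example_taxa = [
    (1, 1, "no rank", "root"),
    (9000001, 1, "superkingdom", "example_superkingdom"),
    (9000002, 9000001, "genus", "example_genus"),
    (9000003, 9000002, "species", "example_species"),
]

# Example call (paths are placeholders):
# write_taxonomy(example_taxa, "taxonomy/names.dmp", "taxonomy/nodes.dmp")
```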
Now, create a new empty folder for your database. Put names.dmp and nodes.dmp into a subfolder called taxonomy.
Create a multifasta file with all the genomes that you would like to add to your database. Then assign each genome a unique taxonomic ID. Put this unique taxonomic ID in the header line for each genome, in the following format:
>genome_name|kraken:taxid|XXXXX
# example
>MGYG000000001_1|kraken:taxid|3012254
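Then formally “add” them to the database and build it using the following commands (adjusting the threads as needed). The first command adds the genomes to the database's library; the second builds the database: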
kraken2-build --add-to-library path_to_fasta_file --db path_to_database_folder --threads 8
kraken2-build --build --db path_to_database_folder --threads 8
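The header rewriting described above has to be done before the genomes are added to the library, and it can be scripted. The sketch below assumes you already have a mapping from genome name to taxid. The function name `add_kraken_taxids` is hypothetical, and the first whitespace-delimited token of each header is taken as the genome name, so any trailing description is dropped.

```python
def add_kraken_taxids(fasta_in, fasta_out, genome_to_taxid):
    """Rewrite FASTA headers as >genome_name|kraken:taxid|XXXXX."""
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Assumption: the genome name is the first token of the header.
                name = line[1:].split()[0]
                # A KeyError here flags a genome with no assigned taxid.
                taxid = genome_to_taxid[name]
                dst.write(f">{name}|kraken:taxid|{taxid}\n")
            else:
                # Sequence lines are copied unchanged.
                dst.write(line)


# Example call (paths are placeholders):
# add_kraken_taxids("genomes.fasta", "genomes.kraken.fasta",
#                   {"MGYG000000001_1": 3012254})
```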
Step 1D: “inspect” the Kraken2 database to check that the taxonomic relationships in the database are consistent with what you aimed to specify.
kraken2-inspect --db path_to_database_folder --report-zero-counts --threads 8 > inspect.out
This command will generate an “inspection report” - we recommend checking the taxonomy that is specified for a few species in various domains.
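One way to spot-check a species is to rebuild its lineage from the inspection report. The sketch below assumes that inspect.out uses the Kraken2 report layout: tab-separated columns whose last three fields are rank code, taxid, and name, with the name indented by two spaces per taxonomic level, and with optional comment lines starting with `#`. The function name `lineage_from_inspect` is hypothetical.

```python
def lineage_from_inspect(inspect_path, target_taxid):
    """Return [(rank_code, taxid, name), ...] from root to target, or None."""
    stack = []  # stack[d] is the most recent taxon seen at depth d
    with open(inspect_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header comments and blank lines
            fields = line.rstrip("\n").split("\t")
            rank_code, taxid, raw_name = fields[-3:]
            name = raw_name.lstrip(" ")
            # Assumption: two leading spaces per taxonomic level.
            depth = (len(raw_name) - len(name)) // 2
            del stack[depth:]  # drop taxa that are not ancestors of this line
            stack.append((rank_code, taxid, name))
            if taxid == str(target_taxid):
                return list(stack)
    return None
```

For example, `lineage_from_inspect("inspect.out", 3012254)` returns the chain of taxa from the root down to that taxid, which you can compare against the lineage you intended to specify.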
bracken-build -d path_to_database_folder -t 10 -l 150
Adjust the threads and -l argument as necessary (this specifies the read length for the sequences that will be classified with this Bracken database). Note that you can create multiple Bracken DBs for each database, for each of the different read lengths you desire.
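If you need Bracken databases for several read lengths, the command above can be repeated in a loop. This is a minimal sketch that uses only the `-d`, `-t`, and `-l` flags shown above. The function names and the example read lengths are hypothetical.

```python
import subprocess


def bracken_build_commands(db_dir, read_lengths, threads=10):
    """Return one bracken-build command (as an argument list) per read length."""
    return [
        ["bracken-build", "-d", str(db_dir), "-t", str(threads), "-l", str(n)]
        for n in read_lengths
    ]


def build_bracken_databases(db_dir, read_lengths, threads=10):
    """Run bracken-build once per read length, stopping on the first failure."""
    for cmd in bracken_build_commands(db_dir, read_lengths, threads):
        subprocess.run(cmd, check=True)


# Example call (read lengths are placeholders):
# build_bracken_databases("path_to_database_folder", [100, 150])
```

Keeping the command construction separate from the execution makes it easy to print the commands first and confirm them before starting a long build.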
Use the calculate_genome_size.py script provided in the pipeline_scripts subfolder of this repo.
Usage: python calculate_genome_size.py /path/to/database/folder
There are two files you will need to create:
- host_prediction_to_genus.tsv - for use with the post_pipeline_scripts/collapse_viral_abundances_by_host.py script.
  - This tab-separated file should contain two columns, named species_taxa and Host genus. The first column should contain viral species taxids, and the second should contain the predicted host.
  - You can predict a host using many different tools, for example iPHoP. You do not need to assign a host for every viral species, so you can omit some viral species from the host_prediction_to_genus file.
  - Here are the first five lines of host_prediction_to_genus.tsv for the default Phanta database, as an example:
species_taxa Host genus
4005213 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Ruminococcus_D
4005409 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Ruminococcus_D
4005420 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Acutalibacteraceae;g__Ruminococcus_E
4005427 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__UBA738
- species_name_to_vir_score.txt - for use with the post_pipeline_scripts/calculate_lifestyle_stats/lifestyle_stats.R script.
  - This tab-separated file should contain two columns with no header.
  - The first column contains the viral species name; the second column contains the predicted ‘virulence score’ (from 0 to 1). To get this score, you can use BACPHLIP or a similar tool.
  - Here are the first five lines of species_name_to_vir_score.txt for the default Phanta database, as an example:
Yak enterovirus 0.8118782016546136
Mycobacterium phage Contagion 0.012499999999999956
Kadipiro virus 0.8118782016546136
Rotavirus C 0.8118782016546136
Alfalfa mosaic virus 0.8118782016546136
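Before running the post-processing scripts, it can help to check that both files follow the layouts described above. The following is only a sketch: the function names are hypothetical, and the checks cover just the column count, the header, and the value ranges stated above.

```python
import csv


def check_host_table(path):
    """Return a list of problems found in host_prediction_to_genus.tsv."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header != ["species_taxa", "Host genus"]:
            problems.append(f"unexpected header: {header}")
        for lineno, row in enumerate(reader, start=2):
            # Expect a numeric taxid and a non-empty host string.
            if len(row) != 2 or not row[0].isdigit() or not row[1]:
                problems.append(f"line {lineno}: {row}")
    return problems


def check_vir_score_table(path):
    """Return a list of problems found in species_name_to_vir_score.txt."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for lineno, row in enumerate(reader, start=1):
            try:
                name, score = row  # ValueError if not exactly two columns
                ok = bool(name) and 0.0 <= float(score) <= 1.0
            except ValueError:
                ok = False
            if not ok:
                problems.append(f"line {lineno}: {row}")
    return problems
```

Both functions return an empty list when no problems are found.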
- Cluster the genomes to the species-level using a 95% ANI threshold (85% alignment fraction), as recommended by MIUViG. In the default Phanta database, we chose to include all the genomes in the final database; we designated each genome as a “strain” of the relevant species cluster.
- Then assign higher ranks. As described in the methods of the Phanta paper, you can use clustering to assign some of the higher ranks (e.g., genus). However, as also mentioned in the methods, we recommend eventually merging the clustering-based taxonomy with a well-recognized viral taxonomy, such as ICTV.
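To illustrate how the 95% ANI and 85% alignment fraction thresholds translate into species clusters, here is a generic greedy centroid-based sketch. It is not necessarily the procedure used to build the default Phanta database. It assumes the pairwise ANI and alignment fraction values have already been computed with a tool of your choice, and it treats each pair as symmetric, which is a simplification. The function name and input layout are hypothetical.

```python
def greedy_species_clusters(genome_lengths, pairwise, min_ani=95.0, min_af=85.0):
    """Greedy centroid-based clustering.

    genome_lengths: dict mapping genome name -> length in bp.
    pairwise: dict mapping (genome_a, genome_b) -> (ani_percent, af_percent);
        a pair is looked up in either order.
    Returns a dict mapping each centroid to its members (centroid first).
    """
    def similar(a, b):
        hit = pairwise.get((a, b)) or pairwise.get((b, a))
        return hit is not None and hit[0] >= min_ani and hit[1] >= min_af

    clusters = {}
    # Visit longest genomes first, so each centroid is the longest member
    # of its cluster (ties broken by name for reproducibility).
    for genome in sorted(genome_lengths, key=lambda g: (-genome_lengths[g], g)):
        for centroid in clusters:
            if similar(genome, centroid):
                clusters[centroid].append(genome)
                break
        else:
            # No existing centroid is close enough: start a new species cluster.
            clusters[genome] = [genome]
    return clusters
```

Each resulting cluster can then be given one species-level taxid, with every member genome attached beneath it as a “strain”, as described above.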