Here are brief instructions following the refactoring done in August 2019 to use accession numbers rather than GI numbers. The original notes are below.
In the `data` directory it is expected that you will have

- the `*.dmp` NCBI taxonomy files from the NCBI taxonomy tarball (`taxdump.tar.gz`), and
- the `nucl_gb.accession2taxid.gz` file from the NCBI FTP site.

You can either get these by hand or else just run `make download` in the
`data` directory.
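If you would rather fetch the files by hand, a minimal sketch follows. The URLs are the standard NCBI taxonomy FTP locations, not taken from this repository, so check the `Makefile` in the `data` directory for the ones it actually uses:

```sh
# Fetch the taxonomy dump and the nucleotide accession-to-taxid mapping.
# These URLs are the usual NCBI locations; the Makefile is authoritative.
$ wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
$ wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
```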
Once you have the data files in place, you can just run `make`, which will
create a (currently 17GB) SQLite3 database file, `taxonomy.db`, that can
be used with the `AccessionLineageFetcher` Python class in dark-matter, or
with similar code that you write yourself.
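If you want to write that code yourself, here is a minimal sketch of an accession-to-lineage lookup using the `sqlite3` command line. The table and column names (`taxids`, `nodes`, `names`, `accession`, `taxid`, `parent_taxid`, `name`) and the example accession are assumptions made for illustration; check the scripts and the `Makefile` for the real schema:

```sh
# Walk from an accession number up to the root of the taxonomy (taxid 1),
# printing the name of every node on the lineage.
# All table and column names below are assumed, not taken from the scripts.
$ sqlite3 taxonomy.db "
    WITH RECURSIVE lineage(taxid) AS (
        SELECT taxid FROM taxids WHERE accession = 'NC_001477.1'
        UNION
        SELECT nodes.parent_taxid FROM nodes, lineage
        WHERE nodes.taxid = lineage.taxid AND nodes.taxid <> 1
    )
    SELECT names.name FROM names, lineage WHERE names.taxid = lineage.taxid;
"
```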
You can also just run `make xxx` (where `xxx` is one of `taxids`, `nodes`,
`names`, or `hosts`) in case you just want to recreate one of the tables in
the database.
If you want to do something else, you're on your own for the time being!
But the scripts in this directory, the `Makefile` in the `data` directory,
and the dark matter code will hopefully be instructive.
## Original notes

Here are scripts to help you create a database (MySQL or SQLite) from some of the NCBI's taxonomy data. This can be used with the `LineageFetcher` Python class in dark-matter, or with similar code that you write yourself.
There are two large files on the NCBI FTP site and you may not want them
both. You'll need at least one of them; it all depends on the GI numbers
you want to be able to look up taxonomy information for.
The download script and database creation scripts assume you want both files. If you don't, you can edit these scripts to remove the file you don't need.
To change `download.sh`, just remove one of the file names from the line
that says `for file in gi_taxid_nucl.dmp gi_taxid_prot.dmp`. To change the
create scripts, delete the line that imports the data from the file you
don't want (`gi_taxid_nucl.dmp` or `gi_taxid_prot.dmp`).
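For example, if you only wanted the nucleotide GI numbers, the edited loop line in `download.sh` would become:

```sh
# download.sh, edited to fetch only the nucleotide mapping:
for file in gi_taxid_nucl.dmp
```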
The `download.sh` script will download all the files you need from
the NCBI FTP site. If you already have what's needed (at least one of
`gi_taxid_nucl.dmp.gz`, `gi_taxid_prot.dmp.gz`, and `taxdump.tar.gz`)
you can skip this step, though you will need to uncompress the first two
and extract `names.dmp` and `nodes.dmp` from `taxdump.tar.gz` using, e.g.,
```sh
$ gunzip gi_*.gz
$ tar xfz taxdump.tar.gz names.dmp nodes.dmp
```
Adding all the data to the databases takes a lot of time. (And yes, in case you're wondering, the scripts load the data from the files before adding indices to the database tables.) It could take some hours. The input files are big (as of Sept 29, 2018):

```sh
$ du -s -h gi_taxid_*
 11G	gi_taxid_nucl.dmp
8.9G	gi_taxid_prot.dmp
```
Run:

```sh
$ create-sqlite.sh ncbi-taxonomy-sqlite.db
```

to make the database file (or give your own database filename on the command line). On my machine this creates a 38GB database file.
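Once built, you can sanity-check the database with a direct lookup. The table and column names here (`gi_taxid_nucl`, `gi`, `taxid`) are guesses based on the input file names, and the GI number is a placeholder; check `create-sqlite.sh` for the real schema:

```sh
# Look up the taxonomy id for a nucleotide GI number (placeholder value).
# Table and column names are assumptions; see create-sqlite.sh.
$ sqlite3 ncbi-taxonomy-sqlite.db 'SELECT taxid FROM gi_taxid_nucl WHERE gi = 12345;'
```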
Run:

```sh
$ create-mysql.sh [args] ncbi-taxonomy-mysql.db
```

You will probably need to add additional arguments (like `--user` and
`--password`). All arguments are simply passed to `mysql` on its command
line.
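For example, assuming your MySQL server needs a username and should prompt for a password (the username here is a placeholder):

```sh
# --user and --password are handed straight to mysql; a bare --password
# makes mysql prompt for the password interactively.
$ create-mysql.sh --user=alice --password ncbi-taxonomy-mysql.db
```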