SRA tools docker

Kenneth Durbrow edited this page Sep 2, 2021 · 6 revisions

Documentation and Help for the NCBI SRA Toolkit Docker image

The NCBI SRA Toolkit now provides a maintained Docker image: ncbi/sra-tools

Note: registry URL

When deploying the image in a container VM, please use the following path for the Docker Hub registry: registry.hub.docker.com/ncbi/sra-tools

User settings and changing the HOME environment variable

The toolkit expects to find the preconfigured settings file at ${HOME}/.ncbi/user-settings.mkfg. During docker build, the preconfigured settings file is placed at /root/.ncbi/user-settings.mkfg. If you (or your workflow engine) change the HOME environment variable, move or copy /root/.ncbi/user-settings.mkfg to ${HOME}/.ncbi/user-settings.mkfg. Otherwise, the toolkit reports that it requires configuration.

For example:

% mkdir $HOME/.ncbi
% docker run -t --rm -v ${HOME}:${HOME}:rw -w ${HOME} --env HOME ncbi/sra-tools cp /root/.ncbi/user-settings.mkfg .ncbi

You can verify the path of the configuration file with vdb-config -o n NCBI_SETTINGS. For example:

% docker run -t --rm -v ${HOME}:${HOME}:rw -w ${HOME} --env HOME ncbi/sra-tools vdb-config -o n NCBI_SETTINGS
NCBI_SETTINGS = "/home/user/.ncbi/user-settings.mkfg"
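
The copy step above can be wrapped in a small guard so that it only runs when the settings file is missing. This is a minimal sketch, not part of the toolkit: the function name ensure_sra_settings and the DOCKER override variable are assumptions introduced here for illustration (setting DOCKER=echo prints the command instead of running it).

```shell
# Hypothetical helper: copy the preconfigured settings file into a
# (possibly changed) HOME, but only if it is not already there.
# DOCKER is an assumed override; DOCKER=echo previews the command.
DOCKER="${DOCKER:-docker}"

ensure_sra_settings() {
    home_dir="$1"
    settings="${home_dir}/.ncbi/user-settings.mkfg"

    if [ -f "${settings}" ]; then
        echo "settings already present: ${settings}"
        return 0
    fi

    # Create the target directory on the host before mounting it.
    mkdir -p "${home_dir}/.ncbi" || return 1

    # Same command as in the example above, with HOME passed explicitly.
    "${DOCKER}" run -t --rm \
        -v "${home_dir}:${home_dir}:rw" -w "${home_dir}" \
        --env HOME="${home_dir}" \
        ncbi/sra-tools cp /root/.ncbi/user-settings.mkfg .ncbi
}
```

A workflow step could call ensure_sra_settings "${HOME}" before its first toolkit command.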

Running multiple related jobs

When running multiple related jobs on a cluster, you can (and probably should) point every instance to the same configuration file, for example by using a common HOME. You should also consider creating a common repository for SRA data and reference files. For example:

% mkdir /mnt/bigdisk/sra-data-repo
% mkdir -p /mnt/bigdisk/sratools-home/.ncbi
% export HOME=/mnt/bigdisk/sratools-home
% docker run -it --rm -v /mnt/bigdisk/sra-data-repo:/repo:rw -v ${HOME}:${HOME}:rw -w ${HOME} --env HOME ncbi/sra-tools
# cp /root/.ncbi/user-settings.mkfg .ncbi
# vdb-config --set /repository/user/main/public/root="/repo"
# prefetch ERR036591
... snip ...
# ls -l /repo/sra
total 2165156
-rw-r--r--    1 root     root     1807665268 Sep  2 15:28 ERR036591.sra
-rw-r--r--    1 root     root      17284537 Sep  2 15:28 ERR036591.sra.vdbcache

After prefetching all the data, you can run the processing jobs against the repository mounted read-only:

% export HOME=/mnt/bigdisk/sratools-home
% docker run -it --rm -v /mnt/bigdisk/sra-data-repo:/repo:ro -v ${HOME}:${HOME}:rw -w ${HOME} --env HOME ncbi/sra-tools
# fastq-dump ERR036591
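
For non-interactive batch runs, the same setup can be expressed as one container per accession. The sketch below assumes the paths from the example above. The names dump_accession, dump_all, SRA_HOME, SRA_REPO and the DOCKER override are hypothetical and exist only for this illustration.

```shell
# Hypothetical batch wrapper: one container per accession, all sharing
# the same HOME (settings) and the same read-only data repository.
DOCKER="${DOCKER:-docker}"                           # assumed override, e.g. DOCKER=echo
SRA_HOME="${SRA_HOME:-/mnt/bigdisk/sratools-home}"   # shared HOME from the example
SRA_REPO="${SRA_REPO:-/mnt/bigdisk/sra-data-repo}"   # shared repository from the example

dump_accession() {
    "${DOCKER}" run -t --rm \
        -v "${SRA_REPO}:/repo:ro" \
        -v "${SRA_HOME}:${SRA_HOME}:rw" -w "${SRA_HOME}" \
        --env HOME="${SRA_HOME}" \
        ncbi/sra-tools fastq-dump "$1"
}

dump_all() {
    for acc in "$@"; do
        # Stop at the first failure so a broken accession is not missed.
        dump_accession "${acc}" || { echo "failed: ${acc}" >&2; return 1; }
    done
}
```

For example, dump_all ERR036591 processes the run that was prefetched above. Passing HOME with --env HOME=value avoids depending on the calling shell's own HOME.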

Example usage:

Simple fasterq-dump

% docker run -t --rm -v $PWD:/output:rw -w /output ncbi/sra-tools fasterq-dump -e 2 -p SRR10985476
Unable to find image 'ncbi/sra-tools:latest' locally
latest: Pulling from ncbi/sra-tools
c9b1b535fdd9: Already exists 
0a6856f8fd06: Pull complete 
2d9bc7db21a2: Pull complete 
3de524257044: Pull complete 
Digest: sha256:631578b15625cc5390928772f1bf945847ce2981a81a95042729a47579396099
Status: Downloaded newer image for ncbi/sra-tools:latest
lookup :|-------------------------------------------------- 100%   
merge  : 16319508
join   :|-------------------------------------------------- 100%   
concat :|-------------------------------------------------- 100%   
spots read      : 14,965,183
reads read      : 14,965,183
reads written   : 14,965,183

Note these suggested options, which are used in the examples:

  • creating a host volume to write to: -v $PWD:/output:rw
  • setting the container working directory to the host volume: -w /output

Most tools write to the current working directory unless told otherwise, and you probably do not want the tools to write into the container's file system. So, please set the working directory to a host volume.
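
To avoid retyping these options, they can be bundled into a small shell function. This is a sketch only: sra_run and the DOCKER override variable are hypothetical names, not something the image or toolkit provides.

```shell
# Hypothetical wrapper: run any toolkit command with the current host
# directory mounted at /output and used as the working directory.
DOCKER="${DOCKER:-docker}"   # assumed override, e.g. DOCKER=echo to preview

sra_run() {
    "${DOCKER}" run -t --rm -v "${PWD}:/output:rw" -w /output ncbi/sra-tools "$@"
}
```

With this defined, sra_run fasterq-dump -e 2 -p SRR10985476 is equivalent to the command in the example above.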

prefetch + fasterq-dump

% docker run -t --rm -v $PWD:/output:rw -w /output ncbi/sra-tools prefetch SRR10985476

2020-06-23T18:07:35 prefetch.2.10.8: 1) Downloading 'SRR10985476'...
2020-06-23T18:07:35 prefetch.2.10.8:  Downloading via HTTPS...
2020-06-23T18:07:45 prefetch.2.10.8:  HTTPS download succeed
2020-06-23T18:07:45 prefetch.2.10.8:  'SRR10985476' is valid
2020-06-23T18:07:45 prefetch.2.10.8: 1) 'SRR10985476' was downloaded successfully
2020-06-23T18:08:27 prefetch.2.10.8: 'SRR10985476' has 454 unresolved dependencies
2020-06-23T18:08:27 prefetch.2.10.8: 2) Downloading 'ncbi-acc:NC_000001.11?vdb-ctx=refseq'...
2020-06-23T18:08:27 prefetch.2.10.8:  Downloading via HTTPS...
2020-06-23T18:08:33 prefetch.2.10.8:  HTTPS download succeed
2020-06-23T18:08:33 prefetch.2.10.8: 2) 'ncbi-acc:NC_000001.11?vdb-ctx=refseq' was downloaded successfully

...

2020-06-23T18:10:25 prefetch.2.10.8: 455) Downloading 'ncbi-acc:NW_004504305.1?vdb-ctx=refseq'...
2020-06-23T18:10:25 prefetch.2.10.8:  Downloading via HTTPS...
2020-06-23T18:10:25 prefetch.2.10.8:  HTTPS download succeed
2020-06-23T18:10:25 prefetch.2.10.8: 455) 'ncbi-acc:NW_004504305.1?vdb-ctx=refseq' was downloaded successfully
% docker run -t --rm -v $PWD:/output:rw -w /output ncbi/sra-tools fasterq-dump -p SRR10985476
lookup :|-------------------------------------------------- 100%   
merge  : 17976103
join   :|-------------------------------------------------- 100%   
concat :|-------------------------------------------------- 100%   
spots read      : 14,965,183
reads read      : 14,965,183
reads written   : 14,965,183

Please note that both commands are using the same host volume for the working directory. This allows the files that prefetch retrieved to be found by fasterq-dump.
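
The two steps can also be chained so that fasterq-dump only runs when prefetch succeeded. A minimal sketch, assuming the same mount options as above. The function name fetch_and_dump and the DOCKER override variable are hypothetical.

```shell
# Hypothetical helper: prefetch an accession, then convert it, using the
# same host volume for both steps so the downloaded files are found.
DOCKER="${DOCKER:-docker}"   # assumed override, e.g. DOCKER=echo to preview

fetch_and_dump() {
    acc="$1"
    # '&&' skips the conversion if the download step fails.
    "${DOCKER}" run -t --rm -v "${PWD}:/output:rw" -w /output \
        ncbi/sra-tools prefetch "${acc}" &&
    "${DOCKER}" run -t --rm -v "${PWD}:/output:rw" -w /output \
        ncbi/sra-tools fasterq-dump -p "${acc}"
}
```

For example: fetch_and_dump SRR10985476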

Known issues and workarounds:

TLS failures

We have seen TLS errors when running on AWS, like these:

2020-06-19T15:50:53 prefetch.2.10.7:  Downloading via HTTPS...
2020-06-19T15:50:53 prefetch.2.10.7 sys: mbedtls_ssl_get_verify_result returned 0x8 (  !! The certificate is not correctly signed by the trusted CA  )
2020-06-19T15:50:53 prefetch.2.10.7 int: connection failed while opening file within cryptographic module - Cannot KClientHttpRequestGET: /scratch/SRR5709848/SRR5709848.sra
2020-06-19T15:50:53 prefetch.2.10.7:  HTTPS download failed

The solution is to make the host's certificates visible inside the container:

docker run -v /etc/pki:/etc/pki:ro -v /etc/ssl:/etc/ssl:ro ...
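
Not every host has both directories (for example, /etc/pki is typical of Red Hat style distributions), and bind-mounting a host path that does not exist causes Docker to create it. The sketch below emits the mount options only for directories that are actually present. The function name cert_mounts is hypothetical.

```shell
# Hypothetical helper: print a read-only -v option for each certificate
# directory that exists on this host. With no arguments it checks the
# two directories named above.
cert_mounts() {
    [ "$#" -gt 0 ] || set -- /etc/pki /etc/ssl
    for dir in "$@"; do
        if [ -d "${dir}" ]; then
            printf '%s\n' "-v ${dir}:${dir}:ro"
        fi
    done
    return 0
}
```

The output is meant to be used unquoted so that it splits into separate arguments, e.g. docker run $(cert_mounts) ...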