08. prefetch and fasterq dump

Andrew Klymenko edited this page Sep 13, 2023 · 9 revisions

How to use prefetch and fasterq-dump to extract FASTQ-files from SRA run accessions

The combination of prefetch + fasterq-dump is the fastest way to extract FASTQ-files from SRA-accessions. The prefetch-tool downloads all necessary files to your computer. If a download does not succeed, the prefetch-tool can be invoked again. It will not start from the beginning every time; instead, it will pick up from where the last invocation failed.
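Because a repeated invocation resumes the download, a simple retry loop is often enough on an unstable connection. The following is a minimal sketch and not part of the toolkit: the function name prefetch_retry and the variables PREFETCH and RETRY_DELAY are assumptions made for illustration (PREFETCH only exists so that the command can be swapped out for a dry run).

```shell
# Sketch: re-run prefetch until it succeeds or the attempts are used up.
# prefetch_retry, PREFETCH and RETRY_DELAY are illustrative names, not toolkit features.
prefetch_retry() {
    local accession="$1"
    local max_attempts="${2:-5}"
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # A repeated invocation picks up where the previous one failed.
        if "${PREFETCH:-prefetch}" "$accession"; then
            return 0
        fi
        echo "prefetch attempt $attempt of $max_attempts failed for $accession" >&2
        # Wait before the next attempt, but not after the last one.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "${RETRY_DELAY:-10}"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

With this function defined, prefetch_retry SRR000001 3 would try the download up to three times.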

After the download, you have the option to test the downloaded data with the vdb-validate tool. After a successful download, there is no need for network-connectivity. You can move the folder created by prefetch to a different location and perform the conversion to the fastq-format somewhere else (for instance on a compute-cluster without internet access).

There are a couple of points here:

  • The prefetch-tool downloads to a directory named by accession. E.g. prefetch SRR000001 will create a directory named SRR000001 in the current directory. If you move the SRR000001 directory, do not rename it: the conversion-tool needs to find a directory with the original name.

  • If you don't have internet access - run vdb-config -i and make sure that Enable Remote Access is not checked.
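The steps described so far (optional validation, then moving the directory without renaming it) could be scripted roughly as follows. This is a sketch under assumptions: validate_and_move and the VDB_VALIDATE variable are hypothetical helpers introduced here, and the destination is whatever suits your system.

```shell
# Sketch: validate a prefetched accession, then move its directory unchanged.
# validate_and_move and VDB_VALIDATE are illustrative names, not toolkit features.
validate_and_move() {
    local accession="$1"      # e.g. SRR000001, a directory in the current directory
    local dest_parent="$2"    # directory that will contain the accession directory
    if [ ! -d "$accession" ]; then
        echo "no directory named $accession here; run prefetch first" >&2
        return 1
    fi
    # Optional integrity check of the downloaded data.
    "${VDB_VALIDATE:-vdb-validate}" "$accession" || return 1
    mkdir -p "$dest_parent" || return 1
    # Moving into dest_parent/ keeps the directory name equal to the accession.
    mv "$accession" "$dest_parent/"
}
```

The function moves the directory as a whole, so its name stays equal to the accession.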

Into what location will the prefetch save the downloaded files?

This will depend on the configuration of the toolkit. There are 3 options:

  1. in the current working directory
  2. in the user-repository
  3. user-defined location

You can choose between options 1 and 2 with the vdb-config-tool:

  • $vdb-config --prefetch-to-cwd
  • $vdb-config --prefetch-to-user-repo

You can also make this choice in the interactive mode of the 'vdb-config'-tool:

  • $vdb-config -i
    This will show a screen where you can make your selection on the 'TOOLS'-page.

The 3rd option, a user-defined location, is passed directly to the 'prefetch'-tool itself:

  • $prefetch SRR000001 -O /path/to/be/used
    Make sure the last directory of /path/to/be/used is the accession itself, e.g. prefetch SRR000001 -O /path/to/be/used/SRR000001. The SRA tools expect that all files of the run SRR000001 are stored in a directory having the same name as the accession: SRR000001. This is called "Accession as Directory".

Check the maximum-size limit of the 'prefetch'-tool

The prefetch-tool has a default maximum download-size of 20G. If the requested accession is bigger than 20G, you will need to increase that limit. You can specify an extremely high limit no matter how large the requested accession is.

You can also query the accession-size using the vdb-dump-tool with the --info option. For instance, vdb-dump SRR000001 --info tells you how large this accession is (among other information). The accession SRR000001 has 932,308,473 bytes, which is below the default limit, so no further action is necessary. The accession SRR1951777 has 410,112,373,995 bytes. To download this accession you have to lift the limit above that size:

  • $prefetch SRR1951777 --max-size 420g

You can specify the limit in:

  • kilobytes (default): --max-size 10 == --max-size 10k : 10 kilobytes,
  • megabytes: --max-size 10m : 10 megabytes,
  • gigabytes: --max-size 10g : 10 gigabytes,
  • terabytes: --max-size 10t : 10 terabytes,
  • unlimited: --max-size u.
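To decide whether --max-size is needed at all, the size reported by vdb-dump --info can be compared with the default limit. The sketch below performs only that comparison. It assumes that the default of 20G means 20 × 1024³ bytes, which this page does not state, and the function name exceeds_default_limit is made up for illustration.

```shell
# Assumption: the default limit "20G" is interpreted as 20 * 1024^3 bytes.
DEFAULT_LIMIT_BYTES=$((20 * 1024 * 1024 * 1024))

# Succeeds (exit status 0) if the given size in bytes is above the default limit.
# Thousands separators such as 410,112,373,995 are accepted.
exceeds_default_limit() {
    local bytes
    bytes=$(printf '%s' "$1" | tr -d ',')
    [ "$bytes" -gt "$DEFAULT_LIMIT_BYTES" ]
}

# The two sizes quoted in the text above.
for size in 932,308,473 410,112,373,995; do
    if exceeds_default_limit "$size"; then
        echo "$size bytes: above the default limit, pass --max-size"
    else
        echo "$size bytes: within the default limit"
    fi
done
```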

Extract fastq-file(s) from SRA-accessions

Before you perform the extraction, you should make a quick estimate of the hard-drive space required. The final fastq-files will be approximately 7 times the size of the accession. The fasterq-dump-tool needs temporary space (scratch space) of about 1.5 times the amount of the final fastq-files during the conversion. Overall, the space you need during the conversion is approximately 17 times the size of the accession (7 + 1.5 × 7 = 17.5). You can check how much space you have by running the command $df -h . in the directory you want to use. Under the 4th column (Avail), you see the amount of space you have available. Please take into consideration that there might be quotas set by your administrator which are not always visible. If the limit is exceeded, the 'fasterq-dump'-tool will fail and a message will be displayed.
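This estimate can be turned into a rough pre-flight check. The sketch below is only an illustration: required_bytes and has_enough_space are hypothetical helper names, and the factors are the approximations given above, not guarantees.

```shell
# Space needed during conversion: 7x for the fastq-files plus 1.5 * 7x scratch = 17.5x.
required_bytes() {
    # 17.5 = 35 / 2, kept in integer arithmetic (rounds down)
    echo $(( $1 * 35 / 2 ))
}

# Succeeds if the file system holding directory $2 (default: current directory)
# reports at least the estimated space as available. Quotas are not visible here.
has_enough_space() {
    local need avail_kb
    need=$(required_bytes "$1")
    # POSIX output format: 4th column of the 2nd line is the available space in KiB.
    avail_kb=$(df -Pk "${2:-.}" | awk 'NR == 2 { print $4 }')
    [ $(( avail_kb * 1024 )) -ge "$need" ]
}

# SRR000001 has 932,308,473 bytes according to vdb-dump --info.
required_bytes 932308473    # prints 16315398277
```

A call such as has_enough_space 932308473 /path/to/scratch could then guard the conversion; the path is a placeholder.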

The simplest way to run fasterq-dump is:

  • $fasterq-dump SRR000001

This assumes that you have previously 'prefetched' the accession into the current working directory. If the directory SRR000001 is not there, the tool will try to access the accession over the network. This will be much slower and might eventually fail due to network timeouts.
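To avoid the slow network fallback by accident, a wrapper can refuse to run unless the prefetched directory is present. This is a sketch only; dump_if_prefetched and the FASTERQ_DUMP variable are names invented for this example.

```shell
# Sketch: run fasterq-dump only if the accession directory exists locally.
# dump_if_prefetched and FASTERQ_DUMP are illustrative names, not toolkit features.
dump_if_prefetched() {
    local accession="$1"
    if [ ! -d "$accession" ]; then
        echo "directory $accession not found in $(pwd); run prefetch $accession first" >&2
        return 1
    fi
    "${FASTERQ_DUMP:-fasterq-dump}" "$accession"
}
```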

Notice that you use the accession as a command line argument. The tool will use the current directory as scratch-space and will also put the output-files there. When finished, the tool will delete all temporary files it created. You will now have 3 files in your working directory:

  • SRR000001.fastq
  • SRR000001_1.fastq
  • SRR000001_2.fastq

If you want to have the output files created in a different directory, use the --outdir option.

The fasterq-dump-tool performs a split-3 operation by default. The fasterq-dump-tool is not identical to the former fastq-dump-tool with regard to command line options. The following is a comparison between fastq-dump and fasterq-dump:

split-3

  • $fastq-dump SRR000001 --split-3 --skip-technical
  • $fasterq-dump SRR000001

split-spot

  • $fastq-dump SRR000001 --split-spot --skip-technical
  • $fasterq-dump SRR000001 --split-spot

split-files

  • $fastq-dump SRR000001 --split-files --skip-technical
  • $fasterq-dump SRR000001 --split-files

concatenated

  • $fastq-dump SRR000001
  • $fasterq-dump SRR000001 --concatenate-reads --include-technical

Important differences to fastq-dump include the following:

  • The -Z|--stdout option does not work for split-3 and split-files. The tool will fall back to producing files in these cases.
  • There is no --gzip|--bzip2 option. You have to compress your files explicitly after they have been written.
  • There is no -A option for the accession, only the ability to specify the accession or a path directly. The tool will extract the output-name from the given accession or path.
  • The fasterq-dump-tool does not take multiple accessions, just one.
  • There is no -N|--minSpotId and no -X|--maxSpotId option. The tool always processes the entire accession.
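Two of these differences, one accession per invocation and no built-in compression, can be worked around with a small loop. The sketch below shows one possible approach: dump_and_compress and the FASTERQ_DUMP variable are hypothetical names, and gzip is just one choice of compressor.

```shell
# Sketch: convert several accessions one by one and compress the results.
# dump_and_compress and FASTERQ_DUMP are illustrative names, not toolkit features.
dump_and_compress() {
    local outdir="$1"
    shift
    local accession f
    mkdir -p "$outdir" || return 1
    for accession in "$@"; do
        # One accession per invocation.
        "${FASTERQ_DUMP:-fasterq-dump}" "$accession" --outdir "$outdir" || return 1
        # Compress whichever of <acc>.fastq, <acc>_1.fastq, <acc>_2.fastq were written.
        for f in "$outdir/$accession.fastq" "$outdir/${accession}"_*.fastq; do
            if [ -e "$f" ]; then
                gzip "$f" || return 1
            fi
        done
    done
}
```

For example, dump_and_compress fastq_out SRR000001 SRR000002 would leave gzip-compressed fastq-files in the directory fastq_out (a placeholder name).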

Summary

prefetch

By default, prefetch <accession> will download the <accession> run file and its dependencies into the <accession> directory. E.g., prefetch SRR000001 will create a directory SRR000001 in the current directory.

If prefetch fails - run the same prefetch command again and the download will resume.

Running prefetch <accession> when the <accession> directory already exists will download missing reference sequence files into the <accession> directory.

Currently there is no way to download a missing vdbcache-file, which is needed to speed up the processing of some accessions. If a vdbcache-file is available remotely, it will be used. If there is no internet access and a vdbcache-file exists for a given accession but is missing locally, the conversion of the accession will take a significant amount of time.

fasterq-dump

By default, run fasterq-dump [options] <accession> in the same directory where you ran prefetch <accession>. The fastq-files will be created in the current directory. Use the --outdir option if you want these output-files to be created in a different directory.

If you need to move the result of the prefetch <accession> download, move the entire <accession> directory - don't rename it. Then cd to the parent directory of the <accession> directory and run the fasterq-dump-tool in this directory.

If you prefetched all files and don't have internet access - run vdb-config -i and turn off Remote Access.

N.B. Accessions that are mentioned in this document are run accessions. E.g., SRR000001, DRR000002, ERR000003.