Update documentation to include db_type #535

Merged on Oct 1, 2024 (13 commits)
29 changes: 17 additions & 12 deletions docs/usage.md
@@ -116,7 +116,9 @@ Databases can be supplied either in the form of a compressed `.tar.gz` archive o
nf-core/taxprofiler does not provide any databases by default, nor does it currently generate them for you. This must be performed manually by the user. See bottom of this section for more information of the expected database files, or the [building databases](usage/tutorials#retrieving-databases-or-building-custom-databases) tutorial.
:::

-The pipeline takes the paths and specific classification/profiling parameters of the tool of these databases as input via a four column comma-separated sheet.
+The pipeline takes the paths and specific classification/profiling parameters of the tool of these databases as input via a four (or five) column comma-separated sheet.
+
+The optional `db_type` column allows to use specific database/parameters against specific data types. By specifying if a database is for short-or long-reads (or even both), the samples sequenced with Illumina are combined with the short-read databases and the samples sequenced with Nanopore are combined with long-read databases. If `db_type` is not provided, it is assumed the database and parameters are applicable for both short and long read data.

:::warning
To allow user freedom, nf-core/taxprofiler does not check for mandatory or the validity of non-file database parameters for correct execution of the tool - excluding options offered via pipeline level parameters! Please validate your database parameters (cross-referencing [parameters](https://nf-co.re/taxprofiler/parameters), and the given tool documentation) before submitting the database sheet! For example, if you don't use the default read length - Bracken will require `-r <read_length>` in the `db_params` column.
@@ -127,17 +129,17 @@ An example database sheet can look as follows, where 7 tools are being used, and
`kraken2` will be run twice even though only having a single 'dedicated' database because specifying `bracken` implies first running `kraken2` on the `bracken` database, as required by `bracken`.

```csv
-tool,db_name,db_params,db_path
-malt,malt85,-id 85,/<path>/<to>/malt/testdb-malt/
-malt,malt95,-id 90,/<path>/<to>/malt/testdb-malt.tar.gz
-bracken,db1,;-r 150,/<path>/<to>/bracken/testdb-bracken.tar.gz
-kraken2,db2,--quick,/<path>/<to>/kraken2/testdb-kraken2.tar.gz
-krakenuniq,db3,,/<path>/<to>/krakenuniq/testdb-krakenuniq.tar.gz
-centrifuge,db1,,/<path>/<to>/centrifuge/minigut_cf.tar.gz
-metaphlan,db1,,/<path>/<to>/metaphlan/metaphlan_database/
-motus,db_mOTU,,/<path>/<to>/motus/motus_database/
-ganon,db1,,/<path>/<to>/ganon/test-db-ganon.tar.gz
-kmcp,db1,;-I 20,/<path>/<to>/kmcp/test-db-kmcp.tar.gz
+tool,db_name,db_params,db_type,db_path
Member

Maybe give a second example without the `db_type` column.

Collaborator Author

Added this for malt.

+malt,malt85,-id 85,short,/<path>/<to>/malt/testdb-malt/
+malt,malt95,-id 90,,/<path>/<to>/malt/testdb-malt.tar.gz
Member

I think that if the column is present, it has to be filled in (@LilyAnderssonLee, do you remember?). If you want both, you need `short;long` as before.

See my comment below about what I had actually meant.

Contributor

Yes, if the db_type column is included in the database.csv, it should be filled with one of the following values: short, long, or short;long. If the db_type column is missing from the database.csv, it will take the default short;long.

Collaborator Author

I have updated the PR based on your comments.

+bracken,db1,;-r 150,short,/<path>/<to>/bracken/testdb-bracken.tar.gz
+kraken2,db2,--quick,short,/<path>/<to>/kraken2/testdb-kraken2.tar.gz
+krakenuniq,db3,,short;long,/<path>/<to>/krakenuniq/testdb-krakenuniq.tar.gz
+centrifuge,db1,,short,/<path>/<to>/centrifuge/minigut_cf.tar.gz
+metaphlan,db1,,short,/<path>/<to>/metaphlan/metaphlan_database/
+motus,db_mOTU,,long,/<path>/<to>/motus/motus_database/
+ganon,db1,,short,/<path>/<to>/ganon/test-db-ganon.tar.gz
+kmcp,db1,;-I 20,short,/<path>/<to>/kmcp/test-db-kmcp.tar.gz
```
Member

Add a second csv example block, but without the `db_type` column (essentially the one from before you edited).

Sorry, this is what I meant earlier about having an example without it.

Collaborator Author

I updated the PR with two example blocks.
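
The `db_type` rules agreed in the review thread above (valid values `short`, `long`, or `short;long`; a missing column defaults to `short;long`; a present column must be filled) can be illustrated with a minimal Python sketch. The function name `parse_db_type` and the error message are hypothetical, and this is not the pipeline's own validation code, only a restatement of the rules as described by the reviewers.

```python
from typing import FrozenSet, Optional

# The three values named in the review thread, mapped to the read types they cover.
ALLOWED_DB_TYPES = {
    "short": frozenset({"short"}),
    "long": frozenset({"long"}),
    "short;long": frozenset({"short", "long"}),
}


def parse_db_type(raw: Optional[str]) -> FrozenSet[str]:
    """Return the read types a database applies to.

    `raw` is None when the sheet has no db_type column at all (assumption:
    the caller distinguishes a missing column from an empty cell).
    """
    if raw is None:
        # Column absent: fall back to the documented default, short;long.
        return ALLOWED_DB_TYPES["short;long"]
    try:
        return ALLOWED_DB_TYPES[raw.strip()]
    except KeyError:
        # Column present but empty or unrecognised: reject, as the reviewers describe.
        raise ValueError(
            f"db_type must be one of {sorted(ALLOWED_DB_TYPES)}, got {raw!r}"
        ) from None
```

Under these assumptions, `parse_db_type(None)` covers both read types, while `parse_db_type("")` raises a `ValueError`.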


:::warning
@@ -157,6 +159,7 @@ Column specifications are as follows:
| `tool` | Taxonomic profiling tool (supported by nf-core/taxprofiler) that the database has been indexed for [required]. Please note that `bracken` also implies running `kraken2` on the same database. |
| `db_name` | A unique name per tool for the particular database [required]. Please note that names need to be unique across both `kraken2` and `bracken` as well, even if re-using the same database. |
| `db_params` | Any parameters of the given taxonomic classifier/profiler that you wish to specify that the taxonomic classifier/profiling tool should use when profiling against this specific database. Can be empty to use taxonomic classifier/profiler defaults. Must not be surrounded by quotes [required]. We generally do not recommend specifying parameters here that turn on/off saving of output files or specifying particular file extensions - this should be already addressed via pipeline parameters. For Bracken databases, must at a minimum contain a `;` separating Kraken2 from Bracken parameters. |
+| `db_type` | A column to distinguish between short- and long-read databases. If the column is empty, the pipeline will assume all databases (and their settings specified in `db_params`!) will be applicable for both short and long read data [optional]. |
Member

@LilyAnderssonLee what are the valid values here? `short`, `long`, and `short;long`?

Contributor

You are right. And the default is short;long

sofstam marked this conversation as resolved.
| `db_path` | Path to the database. Can either be a path to a directory containing the database index files or a `.tar.gz` file which contains the compressed database directory with the same name as the tar archive, minus `.tar.gz` [required]. |
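
The `db_params` row above requires Bracken entries to contain a `;` separating Kraken2 from Bracken parameters, as in `;-r 150` in the example sheet. A minimal Python sketch of that split follows. It assumes the part before the `;` belongs to Kraken2 and the part after it to Bracken (inferred from the example, where `-r` is the Bracken read-length option); the helper name `split_bracken_params` is hypothetical and not taken from the pipeline.

```python
from typing import Tuple


def split_bracken_params(db_params: str) -> Tuple[str, str]:
    """Split a Bracken db_params cell into (kraken2_params, bracken_params)."""
    if ";" not in db_params:
        # The column specification says the separator is mandatory for Bracken rows.
        raise ValueError("Bracken db_params must contain ';' (e.g. ';-r 150')")
    # Split on the first ';' only; either side may be empty to use tool defaults.
    kraken2_params, bracken_params = db_params.split(";", 1)
    return kraken2_params.strip(), bracken_params.strip()


# Example from the database sheet: Kraken2 defaults, Bracken read length 150.
kraken2_part, bracken_part = split_bracken_params(";-r 150")
```

Here `kraken2_part` is an empty string and `bracken_part` is `-r 150`; a bare `;` yields two empty strings, meaning defaults for both tools.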

:::tip
@@ -165,6 +168,8 @@ You can also specify the same database directory/file twice (ensuring unique `db

nf-core/taxprofiler will automatically decompress and extract any compressed archives for you.

+The optional `db_type` column enables the use of specific databases or parameters for different data types. By specifying if a database is for short-reads, long-reads, or both, Illumina samples are combined with short-read databases, while Nanopore samples are combined with long-read databases.
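
The pairing described in this paragraph can be sketched in Python as follows. This is an illustration only: the dictionary keys, the platform labels `Illumina` and `Nanopore`, and the function `pair_samples_with_databases` are assumptions made for the example, not the pipeline's actual data model or code.

```python
from itertools import product

# Assumed mapping from sequencing platform to read type, following the text above.
PLATFORM_TO_READ_TYPE = {"Illumina": "short", "Nanopore": "long"}

# Hypothetical samples and a few rows from the example database sheet.
samples = [
    {"sample": "s1", "platform": "Illumina"},
    {"sample": "s2", "platform": "Nanopore"},
]
databases = [
    {"tool": "kraken2", "db_name": "db2", "db_type": "short"},
    {"tool": "motus", "db_name": "db_mOTU", "db_type": "long"},
    {"tool": "krakenuniq", "db_name": "db3", "db_type": "short;long"},
]


def pair_samples_with_databases(samples, databases):
    """Return (sample, tool, db_name) for every compatible sample/database pair."""
    pairs = []
    for sample, db in product(samples, databases):
        read_type = PLATFORM_TO_READ_TYPE[sample["platform"]]
        # A database matches when the sample's read type is listed in its db_type.
        if read_type in db["db_type"].split(";"):
            pairs.append((sample["sample"], db["tool"], db["db_name"]))
    return pairs


pairs = pair_samples_with_databases(samples, databases)
# s1 (short reads) pairs with kraken2/db2 and krakenuniq/db3;
# s2 (long reads) pairs with motus/db_mOTU and krakenuniq/db3.
```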

:::tip
Click the links in the list below for short quick-reference tutorials how to generate download 'pre-made' and/or custom databases for each tool.
:::
14 changes: 8 additions & 6 deletions docs/usage/tutorials.md
@@ -93,20 +93,22 @@ If you had placed your FASTQ files elsewhere, you would give the full path (i.e.
#### Database sheet

For the database(s), you also supply these via a `.csv` file.
-This 4 column table contains the tool the database has been built for, a database name, the parameters you wish reads to be queried against the given database with, and a path to a `.tar.gz` archive file or a directory containing the database files.
+This 4 (or 5) column table contains the tool the database has been built for, a database name, the parameters you wish reads to be queried against the given database with, an optional column to distinguish between short- and long-read databases, and a path to a `.tar.gz` archive file or a directory containing the database files.

Open a text editor, and create a file called `database.csv`.
Copy and paste the following csv file into the file and save it.

```csv title="database.csv"
-tool,db_name,db_params,db_path
-kraken2,db1,--quick,testdb-kraken2.tar.gz
-centrifuge,db2,,test-db-centrifuge.tar.gz
-centrifuge,db2_trimmed,--trim5 2 --trim3 2,test-db-centrifuge.tar.gz
-kaiju,db3,,kaiju/
+tool,db_name,db_params,db_type,db_path
+kraken2,db1,--quick,short,testdb-kraken2.tar.gz
+centrifuge,db2,,short,test-db-centrifuge.tar.gz
+centrifuge,db2_trimmed,--trim5 2 --trim3 2,long,test-db-centrifuge.tar.gz
+kaiju,db3,,short;long,kaiju/
```

You can see here we have specified the Centrifuge database twice, to allow comparison of different settings.
+We have also specified different profiling parameters depending on whether a database is for short-read or long-read use.
sofstam marked this conversation as resolved.
+If we don't specify this, the pipeline will assume all databases (and their settings specified in `db_params`!) will be applicable for both short and long read data.
Note that each database of the same tool has a unique name.
Furthermore, while the Kraken2 and Centrifuge databases have been supplied as `.tar.gz` archives, the Kaiju database has been supplied as a directory.
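
The point about unique names per tool can be checked before launching a run. The sketch below uses Python's standard `csv` module on the five-column tutorial sheet; the function name `duplicated_db_names` is hypothetical, and the check is a convenience for the user under these assumptions, not something the pipeline is stated to perform in this form.

```python
import csv
import io
from collections import Counter

# The five-column tutorial sheet from this diff, held as text for illustration.
SHEET = """tool,db_name,db_params,db_type,db_path
kraken2,db1,--quick,short,testdb-kraken2.tar.gz
centrifuge,db2,,short,test-db-centrifuge.tar.gz
centrifuge,db2_trimmed,--trim5 2 --trim3 2,long,test-db-centrifuge.tar.gz
kaiju,db3,,short;long,kaiju/
"""


def duplicated_db_names(sheet_text):
    """Return sorted (tool, db_name) pairs that occur more than once."""
    rows = csv.DictReader(io.StringIO(sheet_text))
    counts = Counter((row["tool"], row["db_name"]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


duplicates = duplicated_db_names(SHEET)
```

For the tutorial sheet `duplicates` is empty, because the two Centrifuge entries use the distinct names `db2` and `db2_trimmed` even though they point to the same archive.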
