4. Configuration Files for Workflows

If you don't want to create a whole new workflow, but merely tweak parameters in established workflows, you've come to the right page. The easiest way to do this is to use general and sample-specific config files.

For information on which parameters of workflows are adaptable, please refer to the established workflows page.

Here I will give an overview of how the configurable parameters for the basic workflow can be adapted by users without modifying the actual workflow.

All of the adjustable options in this workflow correspond to the Centrifuge module.

General (NOT Sample-Specific) Configuration of Workflow Parameters:

Here is the format for a general (not sample-specific) config file, passed to sheppard.py via the -a/--config argument:

parameter-identifier=parameter-value

For instance, a file with such configurations could look like this:

centrifuge_index = /path/to/centrifuge-index/
centrifuge_timelimit = 00:05:00
centrifuge_memory = 24
centrifuge_threads = 1
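
To make the format concrete, here is a minimal sketch (in Python) of how such a parameter-identifier=parameter-value file could be parsed. This is illustrative only, not sheppard.py's actual parsing code, and the file name general.config is a hypothetical example:

def parse_general_config(config_path):
    # Read parameter-identifier=parameter-value lines into a dict,
    # tolerating whitespace around the '=' as in the example above.
    params = {}
    with open(config_path) as config_file:
        for line in config_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            key, _, value = line.partition('=')
            params[key.strip()] = value.strip()
    return params

For the example file above, parse_general_config('general.config') would yield {'centrifuge_index': '/path/to/centrifuge-index/', 'centrifuge_timelimit': '00:05:00', 'centrifuge_memory': '24', 'centrifuge_threads': '1'}.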

Sample-Specific Configuration of Workflow Parameters:

And here is how one can set parameters individually for each sample (not that you would need to for the basic workflow, but you could):

sample_id <tab> parameter-identifier
sample-identifier <tab> sample-specific-value

The first line acts as a header listing the parameters you want to adjust; the first column of this line should be 'sample_id'.

Here is an example sample-specific configuration file which you can pass to sheppard.py using -s/--sample_config:

sample_id    centrifuge_memory    centrifuge_threads
SampleA            24                    1
SampleB            12                    2
SampleC            8                     3

So with such a sample-specific configuration provided, sheppard will run all three samples through the same workflow (in this case basic), but with different values for the two adjusted parameters: SampleA will run with 1 core and 24 GB per core, SampleB with 2 cores and 12 GB per core, and so on.
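
As a sketch of how this tab-delimited format maps onto per-sample settings (again illustrative, not sheppard.py's actual implementation; the file name samples.config is hypothetical):

import csv

def parse_sample_config(sample_config_path):
    # Map each sample_id to its parameter-identifier -> value pairs,
    # using the header row to name the columns.
    sample_params = {}
    with open(sample_config_path) as handle:
        for row in csv.DictReader(handle, delimiter='\t'):
            sample_id = row.pop('sample_id')
            sample_params[sample_id] = row
    return sample_params

For the example above, parse_sample_config('samples.config') would yield {'SampleA': {'centrifuge_memory': '24', 'centrifuge_threads': '1'}, 'SampleB': {'centrifuge_memory': '12', 'centrifuge_threads': '2'}, 'SampleC': {'centrifuge_memory': '8', 'centrifuge_threads': '3'}}.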

Wait, that's not all: you can also use multiple parameter sets for the same sample! That's right folks, you can use this for pain-free, though extremely computationally unfriendly, benchmarking. To see where this functionality becomes useful, imagine we were interested in Centrifuge results after subsampling reads to different depths (e.g. 10K reads, 100K reads, 1M reads, ...). This can be done with the fast_basic workflow, for instance, which has a configurable parameter called read_subsampling. Here is how our sample configuration file might look:

sample_id    read_subsampling
SampleA            10000
SampleA            100000
SampleA            1000000
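
Note that because a sample_id may now appear on multiple rows, a parser along the lines of the sketch above would need to collect a list of parameter sets per sample rather than a single one. Here is a hypothetical sketch, again not sheppard.py's actual code:

import csv
from collections import defaultdict

def parse_sample_config_multi(sample_config_path):
    # Each sample_id maps to a list of parameter sets, one per row,
    # so repeated sample_ids are kept instead of overwritten.
    sample_param_sets = defaultdict(list)
    with open(sample_config_path) as handle:
        for row in csv.DictReader(handle, delimiter='\t'):
            sample_id = row.pop('sample_id')
            sample_param_sets[sample_id].append(row)
    return dict(sample_param_sets)

For the benchmarking file above, this would yield {'SampleA': [{'read_subsampling': '10000'}, {'read_subsampling': '100000'}, {'read_subsampling': '1000000'}]}, i.e. three runs of SampleA through the workflow.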