5. Running sheppard.py

Rauf Salamzade edited this page Dec 22, 2020 · 1 revision

sheppard.py is one of the two main programs in the seQuoia suite. It currently uses Python's multiprocessing library to spawn a pool of jobs, where each job corresponds to the passage of one sample's sequencing data through a user-provided workflow. Each workflow, in turn, simply watches the jobs it submits to the cluster (the individual modules, which are what actually require computational resources), so tasks can be specified with different resource needs. Examples of tasks include FastQC and Centrifuge. Although workflows themselves are lightweight, they do have true plumbing capabilities that allow tasks to run simultaneously (parallelization of jobs within a sample). Given resource limits, however, seQuoia should not need this on the Broad servers: parallelizing across samples alone will use enough resources.
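The pooling idea described above can be sketched roughly as follows. This is a minimal illustration, not sheppard.py's actual code: `run_workflow`, `shepherd_samples`, and the sample names are hypothetical, and the real watcher would submit cluster jobs and poll them rather than sleep.

```python
import multiprocessing
import time


def run_workflow(sample_id):
    """Stand-in for one workflow: watch the cluster jobs for a single sample."""
    # Task names are the examples mentioned in the text; the loop body is a placeholder.
    for task in ("FastQC", "Centrifuge"):
        # A real watcher would submit `task` to the cluster and poll until it
        # finishes; here we only sleep briefly to mimic waiting.
        time.sleep(0.01)
    return sample_id, "done"


def shepherd_samples(sample_ids, poolsize):
    """Process samples in parallel, with at most `poolsize` workflows at a time."""
    with multiprocessing.Pool(processes=poolsize) as pool:
        return dict(pool.map(run_workflow, sample_ids))


if __name__ == "__main__":
    print(shepherd_samples(["sampleA", "sampleB", "sampleC"], poolsize=2))
```

Because each worker mostly waits on cluster jobs, the pool itself stays cheap on the login node even with a large pool size.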

Definitions:

  • sample - an individual sample (e.g. the genomic data for a strain, an aggregate biome, etc.)

  • workflow - QC and processing of a sample

  • task/module - individual job in a workflow

The usage for sheppard.py can be found at the bottom of the page.

To submit a run, start a screen session on one of the login nodes and run the command:

# load cluster of choice (UGER or UGES)
reuse UGES

# run sheppard using bash wrapper
# (--poolsize 30 limits processing to 30 samples at a time)
bash /path/to/seQuoia/bin/sheppard.sh \
    --meta full_meta_information.txt \
    --illumina sequencing_data/ \
    --workflow /path/to/workflow.py \
    --illumina_data_format illumina-paired \
    --outdir seQc_repos/ \
    --poolsize 30 \
    --cluster UGES

Currently, sheppard uses a custom-written cluster monitor named hUGE. When the cluster is specified as UGER, hUGE automatically runs jobs in the broad queue under the gscid project; when the cluster is set to UGES, it runs jobs in the gscid queue.
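The queue selection described above amounts to a small lookup. The sketch below is only an illustration of that mapping; `cluster_settings` is a hypothetical helper, not part of hUGE, and the project used on UGES is left unset because the text does not state it.

```python
def cluster_settings(cluster):
    """Return the queue and project described above for a given cluster name."""
    settings = {
        "UGER": {"queue": "broad", "project": "gscid"},
        # The text names only the queue for UGES, so the project is left unset.
        "UGES": {"queue": "gscid", "project": None},
    }
    if cluster not in settings:
        raise ValueError(f"unknown cluster {cluster!r}; expected UGER or UGES")
    return settings[cluster]


if __name__ == "__main__":
    print(cluster_settings("UGER"))
```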

The 4 Rules of Sheppard-ing Modules

There are 4 rules I aim to follow for each module (some apply only to processing modules, such as Cutadapt or Trimmomatic, which filter reads based on certain criteria):

  1. No module should alter the input FASTQ file passed to it. While I encourage people to use the "mgseq" role user account for storing sequencing data securely, I will aim to never modify a FASTQ file that is passed in (e.g. by uncompressing it). Rather, I tend to make a local copy and uncompress that if need be.
  2. Gzip compression of resulting FASTQ files is an option for each processing module and is enabled by default.
  3. Each module can be strung together with others or run independently via its respective Bash wrapper. This will be important if we decide to switch to a different pipeline schema, e.g. WDL or Nextflow.
  4. Logging should capture the executable commands run by each module. This is critical for efficient debugging of problems.
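
Rules 1, 2, and 4 above could look something like the following inside a module. This is a hedged sketch rather than seQuoia's own implementation: the helper names (`local_uncompressed_copy`, `gzip_result`, `run_logged`) and the logger setup are assumptions made for illustration.

```python
import gzip
import logging
import shutil
import subprocess
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("module")


def local_uncompressed_copy(fastq, workdir):
    """Rule 1: never touch the input; decompress into a private copy instead."""
    fastq, workdir = Path(fastq), Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    if fastq.suffix == ".gz":
        target = workdir / fastq.stem  # stem drops the trailing ".gz"
        with gzip.open(fastq, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        target = workdir / fastq.name
        shutil.copyfile(fastq, target)
    return target


def gzip_result(fastq, compress=True):
    """Rule 2: gzip the resulting FASTQ unless the caller opts out."""
    fastq = Path(fastq)
    if not compress:
        return fastq
    target = fastq.with_name(fastq.name + ".gz")
    with open(fastq, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    fastq.unlink()  # remove the uncompressed result (a local file, never the input)
    return target


def run_logged(cmd):
    """Rule 4: record the exact command before executing it."""
    log.info("Running: %s", " ".join(cmd))
    subprocess.run(cmd, check=True)
```

Keeping these helpers free of any workflow-specific state is also what makes rule 3 practical: a module built this way can be called on its own or chained with others.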

Usage

usage: sheppard.py [-h] -m META [-i ILLUMINA]
                   [-f {illumina-paired,illumina-single,bam,gp-directory}]
                   [-n NANOPORE] -w WORKFLOW -o OUTDIR [-p POOLSIZE]
                   [-a CONFIG] [-s SAMPLE_CONFIG] [-c CLUSTER]

    This program serves as a wrapper to setup a general sequencing 
    batch repository and acts as a sheppard for a pool of workflows running simultaneously. Should be started in a screen
    session or as an UGER/UGES instance.
    
    If started in a cluster node, please make sure to "use" (Dotkit lingo) that cluster for the individual processes as well. 
    
    There are currently four ways to input short read Illumina data:
        - illumina-paired: this is when you have forward and reverse reads
        - illumina-single: this is when you have single end reads
        - gp-directory : this is when you have a directory from the Broad Genomics Platform.
        - bam : this is when you just have the path to a bam (aligned/unaligned).
                
    WARNING: CURRENT IMPLEMENTATION OF CLUSTER SUBMISSION ASSUMES USER HAS PERMISSIONS TO GSCID PROJECT/QUEUE in UGES/UGER!
    

optional arguments:
  -h, --help            show this help message and exit
  -m META, --meta META  Input sample meta-data table. Tab-delimited. Headers necessary and first column must be called "sample_id" bearing the names of samples up until _R1.fastq (and _R2.fastq if paired-end data).
  -i ILLUMINA, --illumina ILLUMINA
                        Input directory containing sequencing data. Names should match sample_id column in input. Alternatively, user can provide a tab-delimited file listing the input files. No column headers, but first column must match sample_id from meta
  -f {illumina-paired,illumina-single,bam,gp-directory}, --illumina_data_format {illumina-paired,illumina-single,bam,gp-directory}
                        Specify the format for the illumina data.
  -n NANOPORE, --nanopore NANOPORE
                        A tab-delimited file with three columns: (1) sample_id, (2) Albacore results directory for run, and (3) sample barcode ID for run.
  -w WORKFLOW, --workflow WORKFLOW
                        Provide path to established workflow. Alternatively, provide your own workflow file!
  -o OUTDIR, --outdir OUTDIR
                        Path to the output directory.
  -p POOLSIZE, --poolsize POOLSIZE
                        Pool size. Number of samples to process simultaneously.
  -a CONFIG, --config CONFIG
                        Specify parameter configurations to change in workflow. Must match IDs used in workflow.
  -s SAMPLE_CONFIG, --sample_config SAMPLE_CONFIG
                        Specify parameter configurations to change in workflow, per sample. Config parameters must match IDs used in workflow and sample IDs must match those provided in meta input.
  -c CLUSTER, --cluster CLUSTER
                        specify cluster to use. Default is UGER.
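
Before launching a run, it can help to check that the meta table and the Illumina directory agree with the requirements stated in the help text for --meta and --illumina. The following is a minimal sketch of such a check, not something sheppard.py provides: `read_sample_ids` and `missing_fastqs` are hypothetical helpers, and accepting a ".gz" variant of each file name is an assumption.

```python
import csv
from pathlib import Path


def read_sample_ids(meta_path):
    """Return sample IDs from a tab-delimited meta table with a header row."""
    with open(meta_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, None)
        if not header or header[0] != "sample_id":
            raise ValueError('first column of the meta table must be "sample_id"')
        return [row[0] for row in reader if row]


def missing_fastqs(sample_ids, illumina_dir, paired=True):
    """List expected FASTQ files that are absent from the input directory."""
    suffixes = ["_R1.fastq", "_R2.fastq"] if paired else ["_R1.fastq"]
    present = {p.name for p in Path(illumina_dir).iterdir()}
    missing = []
    for sample_id in sample_ids:
        for suffix in suffixes:
            # Accept plain or gzipped files; the ".gz" variant is an assumption.
            if not {sample_id + suffix, sample_id + suffix + ".gz"} & present:
                missing.append(sample_id + suffix)
    return missing
```

For example, `missing_fastqs(read_sample_ids("full_meta_information.txt"), "sequencing_data/")` would return the paired-end files that the meta table expects but the directory lacks.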