Snakemake pipeline for assembling TCR data and performing a saturation analysis.
To download the pipeline from the command line, use the command:
git clone git@github.com:Ong-Research/SatTCR.git
The SatTCR pipeline requires:
- Docker: https://www.docker.com/
- Snakemake: https://snakemake.readthedocs.io/en/stable/
It uses Snakemake to schedule the jobs to run the pipeline, and every job is run in a different container.
To pull the Docker images used by the pipeline, run the following commands:
cd SatTCR
docker pull staphb/fastqc # FastQC image
docker pull staphb/multiqc # MultiQC image
docker pull staphb/trimmomatic # Trimmomatic image
docker build -t tcr/sat - < Dockerfile # R and Quarto image
docker pull ghcr.io/milaboratory/mixcr/mixcr:latest # MIXCR image
For MIXCR to work, it is necessary to get a license from https://mixcr.com/mixcr/getting-started/milm/ and save it into a file.
Create a comma-separated values (csv) file with 2 columns:
- sample_name: The name of the sample.
- sample_file: The prefix of the files up to, but not including, the _R1 and _R2 parts, e.g. if the pair of RNA-seq files is data/sample1_R1_L001.fastq.gz and data/sample1_R2_L001.fastq.gz, then this column is data/sample1.
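The sample file can also be generated from a list of fastq paths. The following is a minimal sketch, not part of the pipeline: the helper names `sample_rows` and `write_samples_csv` are hypothetical, and it assumes that the csv starts with a header row and that the sample name is the base name of the prefix.

```python
import csv
import io
import re


def sample_rows(fastq_paths):
    """Derive (sample_name, sample_file) rows from paired fastq paths."""
    rows = {}
    for path in fastq_paths:
        # Keep everything before the _R1 / _R2 marker as the prefix.
        match = re.match(r"(.+)_R[12](?=[_.])", path)
        if match is None:
            continue  # not a paired-end file name this sketch recognizes
        prefix = match.group(1)
        # Assumption: the sample name is the base name of the prefix.
        name = prefix.rsplit("/", 1)[-1]
        rows[name] = prefix
    return sorted(rows.items())


def write_samples_csv(rows, handle):
    """Write the two-column sample file (header row is an assumption)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample_name", "sample_file"])
    writer.writerows(rows)


paths = ["data/sample1_R1_L001.fastq.gz", "data/sample1_R2_L001.fastq.gz"]
buffer = io.StringIO()
write_samples_csv(sample_rows(paths), buffer)
print(buffer.getvalue())
```

To write to disk instead, pass an open file handle in place of the `io.StringIO` buffer.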
Edit the config/config.yaml file. This file is divided into sections to make the pipeline easier to configure:
General configuration parameters:
- threads: Maximum number of parallel threads used per process.
- samplefile: Location of the file with the samples.
- seed: Seed for random number generation and sequence sampling during the saturation analysis.
- run_*: Logical indicators that determine whether each stage of the pipeline is run.
- suffix: This relates to the samplefile. If the pair of RNA-seq files is data/sample1_R1_L001.fastq.gz and data/sample1_R2_L001.fastq.gz, the suffix is the part remaining after the R1/R2 parts, i.e. _L001.fastq.gz.
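The relationship between the sample_file prefix and the suffix can be sketched as follows. This is an illustration based on the example file names above, not the pipeline's own code, and the function names `paired_files` and `infer_suffix` are hypothetical.

```python
def paired_files(sample_file, suffix):
    """Return the (R1, R2) paths implied by a prefix and the configured suffix."""
    return (f"{sample_file}_R1{suffix}", f"{sample_file}_R2{suffix}")


def infer_suffix(r1_path, sample_file):
    """Recover the suffix from an R1 path, given the sample_file prefix."""
    marker = sample_file + "_R1"
    if not r1_path.startswith(marker):
        raise ValueError(f"{r1_path!r} does not start with {marker!r}")
    # Everything after the _R1 part is the suffix.
    return r1_path[len(marker):]


r1, r2 = paired_files("data/sample1", "_L001.fastq.gz")
print(r1)
print(r2)
print(infer_suffix(r1, "data/sample1"))
```

A quick check like this helps confirm that the prefix in the sample file and the suffix in config/config.yaml combine into file names that actually exist.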
Docker configuration parameters:
- run_line: This is the docker command used to run every rule.
- fastqc, multiqc, trimmomatic, rquarto and mixcr are the names of the images that were pulled before.
In general, it is not necessary to modify these parameters unless a different image name is used or there is a specific need to configure how docker runs on the user’s system.
Trimmomatic configuration parameters:
- trimmer: A vector with the trimmomatic configuration to use. More information is available at http://www.usadellab.org/cms/?page=trimmomatic. The general idea is to remove the low-quality nucleotides at the end of the sequences and to discard very short sequences.
MIXCR configuration parameters:
- params: The configuration line used to control MIXCR behavior. We used the following line to assemble the clonotypes analyzed in this manuscript: rna-seq --species dog -b imgt.202214-2 --rna. MIXCR provides a comprehensive list of preset configurations at https://mixcr.com/mixcr/reference/overview-built-in-presets/.
- license_file: Location of the file with the license. The pipeline uses this file to run MIXCR in a docker container.
Saturation configuration parameters:
- samples: A vector with the sample keys for which the saturation analysis is going to be run.
- block_size or nblocks: Either the number of sequences sampled per block or the number of blocks used to split the original sequence files.
- bootstrap_replicates: The number of times that the block bootstrap sampling procedure is repeated. This rule is computationally intensive, because in total (n_blocks - 1) x n_boot_reps pairs of sequence files are sampled, and MIXCR is then run on each pair of files.
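To get a feel for the cost before choosing these values, the number of sampled file pairs can be computed ahead of time. The sketch below reads the formula above as (n_blocks - 1) multiplied by n_boot_reps; the function names are hypothetical, and the enumeration of (block count, replicate) pairs is only an assumed way of listing the jobs, not the pipeline's own scheme.

```python
def bootstrap_pairs(n_blocks, n_boot_reps):
    """Number of sampled fastq pairs: (n_blocks - 1) * n_boot_reps."""
    if n_blocks < 1 or n_boot_reps < 0:
        raise ValueError("n_blocks must be >= 1 and n_boot_reps >= 0")
    return (n_blocks - 1) * n_boot_reps


def bootstrap_jobs(n_blocks, n_boot_reps):
    """Yield one (block_count, replicate) tuple per sampled pair (assumed layout)."""
    for block_count in range(1, n_blocks):
        for replicate in range(1, n_boot_reps + 1):
            yield (block_count, replicate)


# Example: 10 blocks and 20 bootstrap replicates.
print(bootstrap_pairs(10, 20))
print(len(list(bootstrap_jobs(10, 20))))
```

Since MIXCR runs once per pair, the total run time of the saturation stage grows roughly in proportion to this count.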
In the instructions below, the flag -c{k} stands for running the rule with {k} parallel threads.
- Quality control: snakemake -c{k} qc. The outputs of this rule are an html report generated with MultiQC and quality profiles generated with the R package dada2 (Callahan et al. 2016). Either of these analyses depicts quality score summaries at each position of the sequence files.
- Trim sequences: snakemake -c{k} trim. The outputs of this rule are the trimmed versions of every raw sequence file.
- Clonotype assembly with MIXCR: snakemake -c{k} mixcr. The output of this rule is a tsv file in the AIRR format (https://docs.airr-community.org/en/stable/datarep/overview.html) for every pair of RNA-seq files.
- Block bootstrap sampling: snakemake -c{k} saturation. This rule generates (n_blocks - 1) x n_boot_reps pairs of compressed fastq files.
- Saturation analysis: snakemake -c{k} saturation
- Generate the report: snakemake -c{k} report. This rule produces an html report compiled by quarto summarizing the results of the analysis.
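The stages above can also be chained from a small script. This is a minimal sketch rather than something shipped with the pipeline: the names `STAGES`, `build_commands` and `run_stages` are hypothetical, and it assumes that snakemake is on the PATH and that the script is started from the repository root.

```python
import subprocess

# Targets in the order listed above.
STAGES = ["qc", "trim", "mixcr", "saturation", "report"]


def build_commands(threads, stages=STAGES):
    """Build one snakemake command per stage, e.g. ['snakemake', '-c4', 'qc']."""
    return [["snakemake", f"-c{threads}", stage] for stage in stages]


def run_stages(threads, stages=STAGES):
    """Run each stage in order and stop at the first failure."""
    for command in build_commands(threads, stages):
        subprocess.run(command, check=True)


# Print the commands without running them.
for command in build_commands(4):
    print(" ".join(command))
```

Calling `run_stages(4)` would execute the five commands one after another, with `check=True` raising an error as soon as a stage fails.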