This is the novel-chassis alignment and QC pipeline. It trims paired reads, aligns to a reference genome, annotates aligned reads, aggregates annotated alignments into a dataframe of counts, pulls metadata from the science_table
, and runs QC metrics, which are then appended to the counts dataframes. An overview of the data flow is provided here:
The pipeline is triggered by a metadata upload from the labs, and each step of the analysis has a reactor/app pair. The reactor responds to events and submits jobs to an application, which runs on the HPC cluster; after the application completes, it messages the auditing reactor. The auditing reactor checks the integrity of the outputs for individual jobs, resubmits jobs when necessary, monitors job status, and handles the order of operations for data processing: for example, all pre-processing and alignment jobs must complete before they can be aggregated into a dataframe. The pipeline itself is kicked off by the ingest reactor in response to that metadata upload, and jobs report their status to the pipeline jobs manager (which in turn writes job status to the data catalog); a minimal sketch of a job submission is shown after the documentation links below. For more information about the cyberinfrastructure that these apps and reactors are built on, check out:
- TAPIS documentation
- SD2E Data Catalog documentation
- Abaco documentation, or the Abaco API Technical Reference
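To make the reactor-to-app hand-off concrete: a reactor ultimately submits a job definition against a registered app, and the jobs service runs it on the HPC system. The sketch below is purely illustrative, the app ID, inputs, and parameters are hypothetical, and it assumes the Agave/Tapis v2 CLI (jobs-submit) rather than the reactors' programmatic submission:
# Hypothetical job definition -- the real reactors build this payload
# programmatically from the uploaded metadata; app ID and paths are placeholders.
cat > job.json <<'EOF'
{
  "name": "novel-chassis-alignment-example",
  "appId": "novel_chassis_app-0.1.0",
  "archive": true,
  "inputs": {
    "read1": "agave://data-sd2e-community/example/sample_R1.fastq.gz",
    "read2": "agave://data-sd2e-community/example/sample_R2.fastq.gz"
  },
  "parameters": {
    "REF": "example_reference.fasta"
  }
}
EOF
# Submit the job through the Agave/Tapis jobs service
jobs-submit -F job.json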
Ignoring all the middleware and error checking (which makes up the majority of the code here), the actual commands for trimming, alignment, etc. are as follows. First, reads are trimmed with Trimmomatic; see near line 122 here:
java -jar ${TRIMMOMATIC_DIR}/trimmomatic-0.36.jar PE -phred33 \
${R1} ${R2} ${R1NAME}_trimmed.fastq ${R1NAME}_unpaired.fastq \
${R2NAME}_trimmed.fastq ${R2NAME}_unpaired.fastq \
ILLUMINACLIP:${TRIMMOMATIC_DIR}/adapters/${ADAPTER} \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 \
MINLEN:${MINLEN}
To learn more about these options and what they mean, please check the Trimmomatic documentation.
There is an optional step that runs sortmerna to remove rRNA from the read mates, but this option is turned off by default in the novel-chassis pipeline (a rough sketch of a sortmerna invocation is shown after the FastQC commands below). After trimming, the reads are run through FastQC to generate quality scores for each read mate:
${FASTQC_DIR}/fastqc ${R1NAME}_trimmed.fastq &
${FASTQC_DIR}/fastqc ${R2NAME}_trimmed.fastq
For more information on interpreting the reports generated by FastQC, check out the individual report guides here.
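For reference, the optional sortmerna step mentioned above looks roughly like the following. This is a sketch rather than the exact command in the app: the rRNA database, file names, and thread count are placeholders, and the flags shown are from the sortmerna 2.x series (the 4.x series uses a different interface):
# Sketch of the optional rRNA-filtering step with sortmerna 2.x (off by default).
# ${SORTMERNA_DB} and the interleaved ${MERGED_READS} file are hypothetical placeholders.
sortmerna --ref ${SORTMERNA_DB}/silva-bac-16s-id90.fasta,${SORTMERNA_DB}/silva-bac-16s-id90.idx \
--reads ${MERGED_READS} \
--aligned ${OUTNAME}_rRNA \
--other ${OUTNAME}_non_rRNA \
--fastx --paired_in -a 8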
Next, the trimmed reads are passed to the novel_chassis_app for alignment with bwa, see near line 67 here:
bwa mem -t 40 ${REF} ${forward} ${reverse} > ${TEMPDIR}/${outsample}.sam
For more on the options and commands available to bwa, see the BWA reference guide.
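Note that bwa mem assumes ${REF} has already been indexed (the reference data used by the app presumably ships with the index files). If you are running the alignment by hand against a fresh reference, the index is built once with:
# One-time index build for the reference; produces the .amb/.ann/.bwt/.pac/.sa files bwa mem reads
bwa index ${REF}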
After alignment with bwa, the resulting SAM file is sorted by the PICARD tools function SortSam, and then AddOrReplaceReadGroups is run on the sorted SAM. The final .bam that is used for quantifying gene expression is the *.RG.bam (a product of the AddOrReplaceReadGroups function):
java ${JAVAOPTS1} -jar ${PICARDDIR} \
SortSam SORT_ORDER=coordinate \
INPUT=${TEMPDIR}/${outsample}.sam \
OUTPUT=${TEMPDIR}/${outsample}.sorted.sam \
TMP_DIR=${TEMPDIR} VALIDATION_STRINGENCY=LENIENT
java ${JAVAOPTS1} -jar ${PICARDDIR} \
AddOrReplaceReadGroups \
${PICARDOPTS} \
RGLB=${outsample} \
RGPL=Illumina \
RGPU=${outsample} \
RGSM=${outsample} \
INPUT=${TEMPDIR}/${outsample}.sorted.sam \
OUTPUT=${TEMPDIR}/${outsample}.RG.bam
This pipeline uses PICARD 2.18.15, but I've found the GATK documentation to be the most helpful (GATK now ships with PICARD tools included); see here for the GATK Docs.
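For example, under GATK 4 the same sort step could be invoked through the gatk wrapper, which bundles the Picard tools. This is a sketch for orientation only, not what the pipeline actually runs:
# Illustrative only: the equivalent SortSam call via the GATK 4 wrapper
# (the pipeline itself calls the standalone PICARD jar as shown above)
gatk SortSam \
--INPUT ${TEMPDIR}/${outsample}.sam \
--OUTPUT ${TEMPDIR}/${outsample}.sorted.sam \
--SORT_ORDER coordinate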
For DNA-seq analysis, I would suggest using the .indel_cleaned.bam
that is output by the novel-chassis app. The indel_cleaned bam is produced from the RG.bam by using MarkDuplicates
from PICARD, and the IndelRealigner
from GATK:
java ${JAVAOPTS1} -jar ${PICARDDIR} \
MarkDuplicates \
TMP_DIR=${TEMPDIR} \
VALIDATION_STRINGENCY=LENIENT \
CREATE_INDEX=TRUE REMOVE_DUPLICATES=FALSE \
TAG_DUPLICATE_SET_MEMBERS=TRUE \
CREATE_MD5_FILE=TRUE \
INPUT=${TEMPDIR}/${outsample}.RG.bam \
OUTPUT=${TEMPDIR}/${outsample}.aligned.duplicates_marked.bam \
METRICS_FILE=${TEMPDIR}/${outsample}.duplicate_metrics.txt
java -Djava.io.tmpdir=${TEMPDIR} ${ASIO}=true ${TL}=50 ${HFL}=10 -Xmx5000m -jar ${GATK} \
-T IndelRealigner -U -R ${REF} \
-o ${TEMPDIR}/${outsample}.aligned.duplicates_marked.indel_cleaned.bam \
-I ${TEMPDIR}/${outsample}.aligned.duplicates_marked.bam \
-compress 1 \
-targetIntervals ${TARGETS}
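The ${TARGETS} intervals file consumed by IndelRealigner is presumably generated by GATK's RealignerTargetCreator; if you need to recreate it outside the app, the GATK 3 invocation looks roughly like this (a sketch reusing the variables above):
# Sketch: build the realignment target intervals that IndelRealigner consumes (GATK 3)
java ${JAVAOPTS1} -jar ${GATK} \
-T RealignerTargetCreator \
-R ${REF} \
-I ${TEMPDIR}/${outsample}.aligned.duplicates_marked.bam \
-o ${TARGETS}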
The novel-chassis-app does try to create .vcf files from this bam file; however, it often fails because this version of GATK cannot handle non-ATGC base pairs in the reference genome (N's, for example). Therefore, if you'd like to generate a VCF, I recommend using the GATK VCF app (or the DeepVariant VCF app). The GATK app runs the following commands:
gatk --java-options "-Xmx4g" HaplotypeCaller \
-R ${FASTA} \
-I ${BAM} \
-O ${OUTNAME}.g.vcf.gz \
-ERC GVCF
gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R ${FASTA} \
-V ${OUTNAME}.g.vcf.gz \
-O ${OUTNAME}.vcf.gz
gatk --java-options "-Xmx4g" VariantAnnotator \
-R ${FASTA} \
-I ${BAM} \
-V ${OUTNAME}.vcf.gz \
-O ${OUTNAME}.final_output.vcf \
-A Coverage
And DeepVariant is run with:
singularity exec ${IMAGE} \
/opt/deepvariant/bin/run_deepvariant \
--model_type=${MODEL_TYPE} \
--ref=${REF} \
--reads=${BAM} \
--output_vcf=output.vcf.gz \
--output_gvcf=output.g.vcf.gz \
--num_shards=24
OK, the next step for RNA-seq is annotating and aggregating counts. All the alignments are aggregated into a single dataframe by the RNA-seq postprocessing app. A JSON mapping of "$SAMPLE_ID": "$FILE_PATH" is sent to the application along with a .gff annotation. The app uses featureCounts to quantify and aggregate the alignments; see around line 103 here:
fc <- featureCounts(files=files_location,
annot.ext=gtf,
isGTFAnnotationFile=TRUE,
GTF.featureType="CDS",
GTF.attrType="Name",
isPairedEnd=TRUE,
requireBothEndsMapped=FALSE)
More documentation on the featureCounts
function is available in the Rsubread
documentation. This application also calculates TPM and FPKM counts from the raw counts file generated by Rsubread
with the following functions:
fpkm <- function(counts, lengths) {
  # counts: raw counts for one sample; lengths: feature lengths in bp
  pm <- sum(counts) / 1e6     # per-million scaling factor
  rpm <- counts / pm          # reads per million
  rpm / (lengths / 1000)      # normalize by feature length in kb
}
tpm <- function(counts, lengths) {
  rpk <- counts / (lengths / 1000)  # reads per kilobase
  coef <- sum(rpk) / 1e6            # per-million scaling of the RPK values
  rpk / coef
}
And the output is six counts dataframes: raw counts, FPKM, TPM, and transposed versions of those three outputs.
Finally, we arrive at the QC metadata application, which queries the science_table
and jobs
table for metadata, creates groups from the metadata and calculates correlations within groups, and pulls quality metrics out of the log files for preprocessing and alignment. The source code is here. The function for calculating correlations is near line 99 here.