NGS_RNA pipeline

Description of the different steps used in the RNA analysis pipeline

Gene expression quantification

The trimmed FASTQ files were aligned to a reference genome using STAR [1] with default settings. Before gene quantification, SAMtools [2] was used to sort the aligned reads. Gene-level quantification was performed with HTSeq-count [3] using --mode=union. The gene annotation database is included in the results directory, in the folder expression/. DESeq2 was used for differential expression analysis on the STAR BAM files. The experimental groups were taken from the 'condition' column in the samplesheet, which defines the distinct groups within the samples.
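The alignment, sorting and counting steps described above could look roughly like the following. This is a sketch, not the pipeline's actual protocol: the sample name and paths are hypothetical, paired-end input, name-sorting before counting and --stranded=no are assumptions, and the function only prints the commands so they can be inspected before anything is run.

```shell
#!/usr/bin/env bash
# Sketch: print the STAR -> samtools sort -> htseq-count commands for one sample.
# All paths and the sample name are placeholders, not values from the pipeline.

print_quant_commands() {
    local sample="$1"
    local genome_dir="$2"     # directory with a STAR genome index (assumed to exist)
    local annotation="$3"     # GTF annotation file

    # Align trimmed reads with STAR; everything except input/output is left at defaults.
    echo "STAR --genomeDir ${genome_dir}" \
         "--readFilesIn ${sample}_1.fq.gz ${sample}_2.fq.gz" \
         "--readFilesCommand zcat" \
         "--outSAMtype BAM Unsorted" \
         "--outFileNamePrefix ${sample}."

    # Sort by read name so htseq-count can pair mates (assumption: paired-end data).
    echo "samtools sort -n -o ${sample}.nsorted.bam ${sample}.Aligned.out.bam"

    # Count reads per gene with the union overlap mode.
    echo "htseq-count --format=bam --order=name --mode=union --stranded=no" \
         "${sample}.nsorted.bam ${annotation} > ${sample}.counts.txt"
}

print_quant_commands sampleA /path/to/star_index /path/to/genes.gtf
```

Printing instead of executing keeps the sketch safe to run on a machine without the tools installed; the strandedness setting in particular should be checked against the library preparation before the commands are used.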

Calculate QC metrics on raw and aligned data

Quality control (QC) metrics are calculated for the raw sequencing data using FastQC [4]. QC metrics for the aligned reads are calculated using Picard-tools [5] (CollectRnaSeqMetrics, MarkDuplicates and CollectInsertSizeMetrics) and SAMtools flagstat.
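As an illustration of the QC tools listed above, the sketch below prints one possible command per tool for a single sample. It assumes Picard's legacy KEY=VALUE argument syntax and an unstranded library; the sample name, the picard.jar location, the refFlat file and all output names are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Sketch: print QC commands for one sample. Paths and file names are placeholders.

print_qc_commands() {
    local sample="$1"
    local picard_jar="$2"   # location of picard.jar (site-specific)
    local ref_flat="$3"     # refFlat gene annotation needed by CollectRnaSeqMetrics
    local bam="${sample}.bam"

    # QC on the raw reads.
    echo "fastqc --outdir fastqc ${sample}.fq.gz"

    # QC on the aligned reads with Picard (legacy KEY=VALUE syntax assumed).
    echo "java -jar ${picard_jar} CollectRnaSeqMetrics INPUT=${bam} OUTPUT=${sample}.rna_metrics REF_FLAT=${ref_flat} STRAND_SPECIFICITY=NONE"
    echo "java -jar ${picard_jar} MarkDuplicates INPUT=${bam} OUTPUT=${sample}.dedup.bam METRICS_FILE=${sample}.dup_metrics"
    echo "java -jar ${picard_jar} CollectInsertSizeMetrics INPUT=${bam} OUTPUT=${sample}.insert_size_metrics HISTOGRAM_FILE=${sample}.insert_size.pdf"

    # Alignment summary counts.
    echo "samtools flagstat ${bam} > ${sample}.flagstat"
}

print_qc_commands sampleA /path/to/picard.jar /path/to/refFlat.txt
```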

Splicing event calling using Leafcutter

Leafcutter [7] quantifies variation in RNA splicing.

GATK variant calling

Variant calling was done using GATK [6]. First, the GATK tool SplitNCigarReads, developed specifically for RNA-seq, splits reads into exon segments (removing Ns but maintaining grouping information) and hard-clips any sequences overhanging into the intronic regions. Variant calling itself was done using HaplotypeCaller in GVCF mode. All samples were then jointly genotyped by running GenotypeGVCFs on the gVCFs of all samples together, creating a set of raw SNP and indel calls.
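The three stages above can be sketched as a short command sequence. The example assumes the GATK4 command-line style, in which per-sample gVCFs are merged with CombineGVCFs before GenotypeGVCFs; the pipeline's own GATK version and arguments may differ. The reference path, sample names and output names are hypothetical, and the function only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: print GATK4-style commands for RNA-seq variant calling on several samples.
# Reference path, sample names and output names are placeholders.

print_gatk_commands() {
    local ref="$1"; shift
    local combine_inputs=""
    local sample

    for sample in "$@"; do
        # Split reads spanning introns into exon segments.
        echo "gatk SplitNCigarReads -R ${ref} -I ${sample}.bam -O ${sample}.split.bam"
        # Call variants per sample in GVCF mode.
        echo "gatk HaplotypeCaller -R ${ref} -I ${sample}.split.bam -O ${sample}.g.vcf.gz -ERC GVCF"
        combine_inputs="${combine_inputs} -V ${sample}.g.vcf.gz"
    done

    # Merge the per-sample gVCFs, then genotype all samples jointly.
    echo "gatk CombineGVCFs -R ${ref}${combine_inputs} -O cohort.g.vcf.gz"
    echo "gatk GenotypeGVCFs -R ${ref} -V cohort.g.vcf.gz -O cohort.raw.vcf.gz"
}

print_gatk_commands /path/to/reference.fasta sampleA sampleB
```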

Results archive

The zipped archive contains the following data and subfolders:

alignment: merged BAM file with index, md5sums and alignment statistics (.Log.final.out)
expression: text files with gene-level quantification per sample and per project.
fastqc: FastQC output
qcmetrics: multiple QC metrics and images generated with Picard-tools or SAMtools flagstat.
leafcutter: Leafcutter and RegTools output files.
expression/Deseq2: differential expression analysis.
multiqc_data: combined MultiQC tables used for the MultiQC HTML report.
variants: variant calls made using GATK.
rawdata: raw sequence files in the form of gzipped FASTQ files (.fq.gz)

The root of the results directory contains the final QC report, README.txt, analysis results from each tool, and the samplesheet which formed the basis for this analysis.
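Since the archive ships md5sums alongside the data, it can be worth verifying them after download. The sketch below assumes that each checksum is stored in a file ending in .md5 in the same folder as the file it describes, in the format written by GNU md5sum; both are assumptions, so adjust the pattern to the actual file names in the archive.

```shell
#!/usr/bin/env bash
# Sketch: verify every *.md5 file in a folder with GNU md5sum.
# The *.md5 naming scheme is an assumption about the archive layout.

verify_md5sums() {
    local dir="$1"
    local failed=0
    local md5file

    for md5file in "${dir}"/*.md5; do
        [ -e "${md5file}" ] || continue   # no checksum files in this folder
        # Run inside the folder so relative file names in the .md5 file resolve.
        if ! (cd "${dir}" && md5sum --check --quiet "$(basename "${md5file}")"); then
            failed=1
        fi
    done
    return "${failed}"
}

# Small self-contained demonstration on a temporary folder.
demo_dir="$(mktemp -d)"
printf 'example\n' > "${demo_dir}/sample.bam"
(cd "${demo_dir}" && md5sum sample.bam > sample.bam.md5)
verify_md5sums "${demo_dir}" && echo "all checksums match"
rm -rf "${demo_dir}"
```

On the real archive the call would be something like verify_md5sums alignment, run from the root of the unzipped results directory.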

[1] Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29(1):15-21. doi: 10.1093/bioinformatics/bts635.
[2] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25(16):2078–2079.
[3] Anders S, Pyl PT, Huber W: HTSeq – A Python framework to work with high-throughput sequencing data. 2014:0–5.
[4] Andrews S (2010). FastQC: a Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ ${samtoolsVersion}
[5] Picard Sourceforge Web site. http://picard.sourceforge.net/ ${picardVersion}
[6] McKenna A et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 2010, 20:1297-303. Version: ${gatkVersion}
[7] Li YI, Knowles DA, Humphrey J, et al.: Annotation-free quantification of RNA splicing using LeafCutter. Nat Genet 2018, 50(1):151-158. doi:10.1038/s41588-017-0004-9

Manual

1) Copy raw data to the rawdata/ngs folder

scp -r SEQSTARTDATE_SEQ_RUNTEST_FLOWCELLXX username@yourcluster:${root}/groups/$groupname/${tmpDir}/rawdata/ngs/YOURDIR

2) Create a folder in the generatedscripts folder

mkdir ${root}/groups/$groupname/${tmpDir}/generatedscripts/TestRun

3) Copy samplesheet to generatedscripts folder

scp -r TestRun.csv username@yourcluster:/groups/$groupname/${tmpDir}/generatedscripts/

Note: the name of the folder should be the same as the name of the samplesheet (.csv) file.

Note 2: an example samplesheet can be found in $EBROOTNGS_RNA/templates/externalSamplesheet.csv
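The naming rule in the note above is easy to get wrong, so a quick check before generating scripts may help. The helper below is a hypothetical convenience function, not part of the pipeline; it only compares the folder name with the samplesheet name (without the .csv extension) and does not touch the file system.

```shell
#!/usr/bin/env bash
# Sketch: check that a project folder and its samplesheet share the same name.
# check_samplesheet_name is a hypothetical helper, not a pipeline script.

check_samplesheet_name() {
    local project_dir="$1"
    local samplesheet="$2"
    local project_name sheet_name

    project_name="$(basename "${project_dir}")"
    sheet_name="$(basename "${samplesheet}" .csv)"   # strip directory and .csv suffix

    if [ "${project_name}" = "${sheet_name}" ]; then
        echo "OK: ${project_name}"
    else
        echo "Mismatch: folder '${project_name}' vs samplesheet '${sheet_name}.csv'" >&2
        return 1
    fi
}

check_samplesheet_name generatedscripts/TestRun generatedscripts/TestRun.csv
# prints "OK: TestRun"
```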

4) Run the generate script

module load NGS_RNA
cd ${root}/groups/$groupname/${tmpDir}/generatedscripts/TestRun
cp $EBROOTNGS_RNA/generate_template.sh .
bash generate_template.sh
cd scripts

Note: If you want to run the pipeline locally, change the backend in the CreateInhouseProjects.sh script. Near the end of the script there is a line that starts with something like sh ${EBROOTMOLGENISMINCOMPUTE}/molgenis_compute.sh; search for -b slurm and change it into -b localhost.

bash submit.sh
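The backend change described in the note of this step could also be scripted rather than edited by hand. The sketch below keeps a backup and rewrites every occurrence of -b slurm; the function name and the stand-in script contents used in the demonstration are hypothetical, and the real CreateInhouseProjects.sh contains more than this one line.

```shell
#!/usr/bin/env bash
# Sketch: switch the Molgenis Compute backend from slurm to localhost in a script.
# switch_backend_to_localhost is a hypothetical helper, not part of the pipeline.

switch_backend_to_localhost() {
    local script="$1"
    # Keep the original next to the edited file in case the change must be undone.
    cp "${script}" "${script}.bak"
    sed 's/-b slurm/-b localhost/' "${script}.bak" > "${script}"
}

# Demonstration on a stand-in file containing a simplified version of the line.
demo_script="$(mktemp)"
printf '%s\n' 'sh ${EBROOTMOLGENISMINCOMPUTE}/molgenis_compute.sh -b slurm' > "${demo_script}"
switch_backend_to_localhost "${demo_script}"
cat "${demo_script}"
rm -f "${demo_script}" "${demo_script}.bak"
```

Writing through a .bak copy instead of using sed -i avoids the differences between GNU and BSD sed.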

5) Submit jobs

Navigate to the jobs folder. Its location is printed at the end of the previous step (step 4).

bash submit.sh
