Please visit this site to see web version of this git repository.
https://kennethjhan.github.io/Genome-Analysis-Tutorial/
This page will help you to learn how to make pipeline based on GATK3 BestPractice.
From this page, https://kennethjhan.github.io/Genome-Analysis-Tutorial/resource you can download resorce file such as reference fasta, samtools, bwa, picard, GATK3, snpEff and raw fastq file.
YES! This tutorial is based on GATK3 which is stable and wide used tool from Broad Institute.
I hope everyone learns how to analyze gene data and contributes scientific field.
IF YOU WANT GATK4-BEST PRACTICE PIPELINE PLEASE CLICK HERE.
ENJOY!!
Download all the resorces from this page. https://kennethjhan.github.io/Genome-Analysis-Tutorial/resource
# 1) Download BWA (bwa-0.7.17.tar.bz2)
$ wget https://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.17.tar.bz2
# 2) Extract bwa-0.7.17.tar.bz2
$ tar xf bwa-0.7.17.tar.bz2
# 3) Install BWA
$ make
# 4) Confirm Installation
$ ./bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <[email protected]>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
...
# 1) Download Samtools
$ wget https://downloads.sourceforge.net/project/samtools/samtools/1.9/samtools-1.9.tar.bz2
# 2) Extract samtools-1.9.tar.bz2
$ tar xf samtools-1.9.tar.bz2
# 3) Install samtools
$ ./configure --prefix=/YOUR/DIRECTORY/TO/INSTALL
$ make && make install
# 4) Confirm Installation
$ ./samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.9 (using htslib 1.9)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
...
# 1) Download Picard
$ wget https://github.com/broadinstitute/picard/releases/download/2.18.17/picard.jar
# 2) Run Picard
$ java -jar picard.jar
## RUNNING ON JAVA8 (NOT JAVA7)
# 1) Download GATK3
https://software.broadinstitute.org/gatk/download/archive
Download GenomeAnalysisTK-3.8-1-0-gf15c1c3ef.tar.bz2
# 2) Extract
$ tar xf GenomeAnalysisTK-3.8-1-0-gf15c1c3ef.tar.bz2
# 3) Run GATK3
$ java -jar gatk-package-4.0.11.0-local.jar
$ jaga -jar GenomeAnalysisTK.jar
## RUNNING ON JAVA8 (NOT JAVA7)
# 1) Download SnpEff
$ wget https://downloads.sourceforge.net/project/snpeff/snpEff_latest_core.zip
# 2) Extract
$ unzip snpEff_latest_core.zip
# 3) Run SnpEff
$ java -jar snpEff.jar
## RUNNING ON JAVA8 (NOT JAVA7)
$ bwa index -a bwtsw ucsc.hg19.fasta
[bwa_index] Pack FASTA... 28.10 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=6191387966, availableWord=447648912
[BWTIncConstructFromPacked] 10 iterations done. 99999998 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999998 characters processed.
……
[BWTIncConstructFromPacked] 690 iterations done. 6264217232 characters
processed.
[bwt_gen] Finished constructing BWT in 695 iterations.
[bwa_index] 2802.55 seconds elapse.
[bwa_index] Update BWT... 28.60 sec
[bwa_index] Pack forward-only FASTA... 16.69 sec
[bwa_index] Construct SA from BWT and Occ... 1029.86 sec
[main] Version: 0.7.15-r1140
[main] CMD: /Users/jhan/Tools/bwa-0.7.15/bwa index -a bwtsw ucsc.hg19.fasta
[main] Real time: 3929.996 sec; CPU: 3905.257 sec
$ ls -l
-rw-r--r-- 1 jhan staff 3.0G May 9 12:44 ucsc.hg19.fasta
-rw-r--r-- 1 jhan staff 8.4K May 9 13:34 ucsc.hg19.fasta.amb
-rw-r--r-- 1 jhan staff 3.9K May 9 13:34 ucsc.hg19.fasta.ann
-rw-r--r-- 1 jhan staff 2.9G May 9 13:33 ucsc.hg19.fasta.bwt
-rw-r--r-- 1 jhan staff 748M May 9 13:34 ucsc.hg19.fasta.pac
-rw-r--r-- 1 jhan staff 1.5G May 9 13:51 ucsc.hg19.fasta.sa
$ samtools faidx ucsc.hg19.fasta
$ ls –l
-rw-r--r-- 1 jhan staff 3.5K May 9 14:39 ucsc.hg19.
fasta.fai
$ java -jar picard.jar CreateSequenceDictionary REFERENCE=ucsc.hg19.fasta OUTPUT=ucsc.hg19.dict
[Tue May 09 14:53:07 KST 2017] picard.sam.
CreateSequenceDictionary REFERENCE=ucsc.hg19.fasta
OUTPUT=ucsc.hg19.dict TRUNCATE_NAMES_AT_WHITESPACE=true
NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false
VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_
RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_
FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue May 09 14:53:07 KST 2017] Executing as jhan@JOOHYUNui-
MacBook-Pro.local on Mac OS X 10.12.3 x86_64; Java
HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Picard version:
2.9.1-SNAPSHOT
[Tue May 09 14:53:25 KST 2017] picard.sam.
CreateSequenceDictionary done. Elapsed time: 0.30 minutes.
Runtime.totalMemory()=660602880
$ ls –l
-rw-r--r-- 1 jhan staff 12K May 9 14:53 ucsc.hg19.dict
#!/bin/bash
if [ $# -ne 1 ];then
echo "#usage: sh $0 [samplename]"
exit
fi
SAMPLE=$1
BWA="/data/etc/bwa/bwa"
SAMTOOLS="/data/etc/samtools/bin/samtools"
REFERENCE="/data/reference/ucsc.hg19.fasta"
JAVA="/usr/bin/java"
PICARD="/data/etc/picard/picard.jar"
GATK="/data/etc/gatk/GenomeAnalysisTK.jar"
SNPEFF="/data/etc/snpEff/snpEff.jar"
MILLS="/data/etc/bundle/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf"
A1KG="/data/etc/bundle/1000G_phase1.indels.hg19.sites.vcf"
DBSNP138="/data/etc/bundle/dbsnp_138.hg19.vcf"
$ bwa mem -R "@RG\tID:test\tSM:${SAMPLE}\tPL:ILLUMINA" ucsc.hg19.fasta ${SAMPLE}_1.filt.fastq.gz ${SAMPLE}_2.filt.fastq.gz > ${SAMPLE}.mapped.sam
$ samtools view –Sb ${SAMPLE}.mapped.sam > ${SAMPLE}.mapped.bam
$ samtools sort –o ${SAMPLE}.sorted.bam ${SAMPLE}.mapped.bam
$ samtools view ${SAMPLE}.mapped.bam | head
${SAMPLE}.26 131 chrX 26266805
${SAMPLE}.32 65 chr13 75944171
${SAMPLE}.32 129 chrX 138770110
${SAMPLE}.34 115 chr2 190054626
……
$ samtools view ${SAMPLE}.sorted.bam | head
${SAMPLE}.91192 115 chrM
……
${SAMPLE}.186880 65 chrM
${SAMPLE}.434476 115 chr1
${SAMPLE}.434476 179 chr1
# Sort chromosome ordered
${BWA} mem -R "@RG\tID:test\tSM:${SAMPLE}\tPL:ILLUMINA" ${REFERENCE} ${SAMPLE}_1.filt.fastq.gz ${SAMPLE}_2.filt.fastq.gz | ${SAMTOOLS} view -Sb - | ${SAMTOOLS} sort - > ${SAMPLE}.sorted.bam
$ ${JAVA} –jar picard.jar MarkDuplicates I=${SAMPLE}.
sorted.bam O=${SAMPLE}.markdup.bam
M=${SAMPLE}.markdup.metrics.txt
Futher resource: https://broadinstitute.github.io/picard/command-lineoverview.html#MarkDuplicates
$ samtools index ${SAMPLE}.markdup.bam
${JAVA} -jar ${GATK} -T RealignerTargetCreator -R ${REFERENCE} -I ${SAMPLE}.markdup.bam -known ${MILLS} -known ${A1KG} -o ${SAMPLE}.intervals #-L bed
${JAVA} -jar ${GATK} -T IndelRealigner -R ${REFERENCE} -I ${SAMPLE}.markdup.bam -known ${MILLS} -known ${A1KG} -targetIntervals ${SAMPLE}.intervals -o ${SAMPLE}.realign.bam #-L bed
${JAVA} -jar ${GATK} -T BaseRecalibrator -R ${REFERENCE} -I ${SAMPLE}.realign.bam -knownSites ${MILLS} -knownSites ${A1KG} -knownSites ${DBSNP138} -o ${SAMPLE}.table #-L bed
${JAVA} -jar ${GATK} -T PrintReads -R ${REFERENCE} -I ${SAMPLE}.realign.bam -o ${SAMPLE}.recal.bam -BQSR ${SAMPLE}.table #-L bed
${JAVA} -jar ${GATK} -T HaplotypeCaller -R ${REFERENCE} -I ${SAMPLE}.recal.bam --emitRefConfidence GVCF --dbsnp ${DBSNP138} -o ${SAMPLE}.g.vcf #-L bed
${JAVA} -jar ${GATK} -T GenotypeGVCFs -R ${REFERENCE} -V ${SAMPLE}.g.vcf -o ${SAMPLE}.raw.vcf #-L bed
# Select SNP
${JAVA} -jar ${GATK} -T SelectVariants -R ${REFERENCE} -V ${SAMPLE}.raw.vcf -o ${SAMPLE}.raw.snp.vcf --selectTypeToInclude SNP
# Filter SNP
${JAVA} -jar ${GATK} -T VariantFiltration -R ${REFERENCE} -V ${SAMPLE}.raw.snp.vcf -o ${SAMPLE}.filtered.snp.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName SNP_FILTER
# Select INDEL
${JAVA} -jar ${GATK} -T SelectVariants -R ${REFERENCE} -V ${SAMPLE}.raw.vcf -o ${SAMPLE}.raw.indel.vcf --selectTypeToInclude INDEL
# Filter INDEL
${JAVA} -jar ${GATK} -T VariantFiltration -R ${REFERENCE} -V ${SAMPLE}.raw.indel.vcf -o ${SAMPLE}.filtered.indel.vcf --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName INDEL_FILTER
# Combine SNPs and INDELs
${JAVA} -jar ${GATK} -T CombineVariants -R ${REFERENCE} -V ${SAMPLE}.filtered.snp.vcf -V ${SAMPLE}.filtered.indel.vcf -o ${SAMPLE}.filtered.vcf -genotypeMergeOptions UNSORTED
# Detailed Filter options are here. https://software.broadinstitute.org/gatk/documentation/article?id=2806
${JAVA} -jar -Xmx4g ${SNPEFF} -v hg19 ${SAMPLE}.filtered.vcf > ${SAMPLE}.filtered.annotated.vcf