**ANALYSIS OF DUPLEX SEQUENCING EXPERIMENT USING Y-SHAPED ADAPTORS**
**BAM file pre-processing**
This workflow assumes that the submitted library was of type 'Bidirectional Duplex-seq'. In this example, the library has been sequenced on one HiSeq 2500 run.
1. Extract tags from fastq files
```bash
python extract-tags.py -a /lustre/scratch119/casm/team78/ro4/drseq/70/1_R1.fastq -b /lustre/scratch119/casm/team78/ro4/drseq/70/1_R2.fastq -c 70#1R1.fastq -d 70#1R2.fastq -m 3 -s 4 -l 151
```
The script reads in one read1 and one read2 fastq file. For each read-pair, the script removes the NNN degenerate 3-nucleotide tag (-m), and skips invariant tag bases (-s). The pair of tags is appended to the read header using the rb (read barcode) and mb (mate barcode) auxiliary tags.
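To illustrate (the read name and tag sequences here are invented, and the exact layout depends on extract-tags.py), a processed read header carries the two tags in its comment field as SAM-style auxiliary tags, which bwa mem -C later copies into the alignment records:
```
@READ1 rb:Z:ACT mb:Z:GGA
```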
2. Compress fastq files
```bash
gzip 70#1R1.fastq
gzip 70#1R2.fastq
```
3. Align reads using bwa mem
```bash
bsub -q basement -G team78-grp -o out -e log -M20000 -R"span[hosts=1] select[mem>20000] rusage[mem=20000]" "/software/CGP/external-apps/bwa-0.7.5a/bwa mem -C /lustre/scratch117/casm/team78/ro4/hs37d5/hs37d5.fa 70#1R1.fastq.gz 70#1R2.fastq > 70#1.sam"
```
Note the use of the -C option, which carries the rb and mb auxiliary tags from the fastq headers through to the SAM records.
4. Sort and add auxiliary tags
```bash
bsub -q basement -G team78-grp -o out -e log -M20000 -R"span[hosts=1] select[mem>20000] rusage[mem=20000]" "/software/CGP/external-apps/biobambam2-2.0.76/bin/bamsormadup inputformat=sam rcsupport=1 threads=1 < 70#1.sam > 70#1.bam"
```
bamsormadup sorts the file and, with rcsupport=1, adds the rc (read-coordinate) and mc (mate-coordinate) auxiliary tags. Once satisfied that the job has completed successfully, delete the temporary log and out files. Also free disk space by deleting 70#1.sam.
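As an optional sanity check before deleting the SAM file, you can eyeball the first few alignments to confirm that the rc and mc tags are present:
```bash
samtools view 70#1.bam | head -n 3
```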
5. Mark optical duplicates
```bash
bsub -q basement -G team78-grp -o out -e log -M20000 -R"span[hosts=1] select[mem>20000] rusage[mem=20000]" "/software/CGP/external-apps/biobambam2-2.0.76/bin/bammarkduplicatesopt I=70#1.bam O=70#1.marked.bam"
```
Once satisfied that the job has completed successfully, delete the temporary log and out files. Also free disk space by deleting 70#1.bam.
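Optionally, summarise the flag counts to see how many reads are now marked as duplicates:
```bash
samtools flagstat 70#1.marked.bam
```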
6. Index
```bash
samtools index 70#1.marked.bam
```
**Table building**
This step builds very large tables. Once parameters are optimised, the code can be modified to avoid writing out tables and instead directly call variants. At present, we are unsure of parameters and so the table building and variant calling steps are separate.
1. Build tables for a specific interval\
Three BED files are used: 'noise' and 'snp' are used for downstream filtering, while the 'intervals' file contains the coordinates of hs37d5 divided into one-megabase chunks. One LSF job is created for each chunk in the 'intervals' file. This creates ~8,000 LSF jobs, which have low memory requirements and run in 15-30 minutes. Two BAM files are used. One is a matched-normal BAM, which has been aligned to hs37d5 using bwa mem. The second is a duplex BAM, which has been pre-processed as described in "BAM file pre-processing". The output from each LSF job is a BED file, where each row represents one position in one read-bundle. The output directory therefore contains around 8,000 BED files. In the following example, I assume that you run the scripts from within the tables directory and that you have created the output directory with mkdir.
```bash
genome="/lustre/scratch117/casm/team78/ro4/hs37d5/hs37d5.fa"
intervals="../bed/1.megabase.intervals.bed"
noise="../bed/NOISE.sorted.bed.gz"
snp="../bed/SNP.sorted.bed.gz"
dplx="70#1.marked.bam"
bulk="/nfs/cancer_ref01/nst_links/live/1321/BMb/BMb.sample.dupmarked.bam"
outpath="tables-70-1"
while IFS=$'\t' read -r -a coords
do
    log=$outpath"/"${coords[0]}.${coords[1]}.${coords[2]}".log"
    out=$outpath"/"${coords[0]}.${coords[1]}.${coords[2]}".out"
    bed=$outpath"/"${coords[0]}.${coords[1]}.${coords[2]}".bed"
    bsub -q normal -o $out -e $log -M200 -R"span[hosts=1] select[mem>200] rusage[mem=200]" "./dsa -A $bulk -B $dplx -C $snp -R $genome -D $noise -r ${coords[0]} -b ${coords[1]} -e ${coords[2]} > $bed"
done < $intervals
```
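As a quick first check (complementing the LSF check in the next step), the number of output BED files should roughly match the number of chunks in the intervals file, i.e. ~8,000; the path below assumes the outpath used above:
```bash
ls tables-70-1/*.bed | wc -l
```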
2. Check that LSF jobs have completed properly\
I use a Python script that returns errors for missing files or jobs that have not completed properly.
```bash
python ../tools/check_lsf_batch.py -d out/
```
**Grid search**
If you already know the parameters for variant calling then this script is not needed. If you are unsure, this script can help optimise parameters by iterating through 34,922 different parameter combinations. These cover three classes of filters:
- filters that operate at the read-bundle level (mean proportion of reads with 5' soft-clipped bases, AS-XS and NM);
- filters that operate on one pileup column in one read-bundle (bases from read 5', bases from read 3', proportion of reads with an indel, consensus base quality);
- filters that operate on one pileup column in the matched normal (number of reads on each strand, proportion of reads that are properly paired).
1. Call variants using grid-search parameters
This uses the BED tables built in the "Table building" section. The output is one csv file per BED file. I assume that you run this script from within the gridsearch folder, that the link to the tables directory is correct and that you have created the output directory with mkdir.
```bash
indir="tables-70-1"
outdir="gridsearch-70-1"
for fn in $indir/*.bed
do
    base1=$(basename $fn)
    base2=${base1%.*}
    out=$outdir"/"$base2".out"
    log=$outdir"/"$base2".log"
    csv=$outdir"/"$base2".csv"
    bsub -q normal -o $out -e $log -M200 -R"span[hosts=1] select[mem>200] rusage[mem=200]" "./gridsearch -B $fn > $csv"
done
```
2. Check that LSF jobs have completed properly\
To do this I use a Python script that returns errors for missing files or jobs that have not completed properly.
```bash
python ../tools/check_lsf_batch.py -d outpath/
```
3. Merge results from multiple csv files
```bash
python gridsearch_merge.py -d gridsearch-70-1/ > gridsearch-70-1.csv
```
4. Select parameters\
gridsearch-70-1.csv contains columns for each of the tested parameters and for variant, reference and mutation burden. You can use this file to select suitable parameters for variant calling; for example, parameters that result in mutation burdens within the expected range of a benchmarked sample.
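A minimal sketch of one way to make this selection, assuming the merged file is comma-separated with a header row; the column number and burden range below are hypothetical and must be adjusted to the actual file:
```bash
# Inspect the header row to find the index of the mutation-burden column.
head -n 1 gridsearch-70-1.csv | tr ',' '\n' | cat -n

# Keep the header plus rows whose burden (hypothetically column 12 here)
# falls within the range expected for the benchmarked sample.
awk -F',' -v col=12 'NR==1 || ($col >= 1e-7 && $col <= 5e-7)' gridsearch-70-1.csv > selected-parameters.csv
```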
**Variant calling**
This step calls variants and produces QC metrics. I assume that you run these scripts from within the variantcaller folder.
1. Run variantcaller\
This uses the BED tables built in the "Table building" section. There are many different parameters that can be used for filtering. To see all of these use ```./variantcaller -h```. You can optimise the parameters using a grid-search as described in the "Grid search" section. An example bash script follows.
```bash
indir="tables-70-1"
outdir="variantcaller-70-1"
for fn in $indir/*.bed
do
    base1=$(basename $fn)
    base2=${base1%.*}
    out=$outdir"/"$base2".out"
    log=$outdir"/"$base2".log"
    bsub -q normal -o $out -e $log -M200 -R"span[hosts=1] select[mem>200] rusage[mem=200]" "./variantcaller -B $fn -c 0 -q 90 -p 0 > $outdir/$base1"
done
```
2. Merge results from multiple output files\
This generates csv files that work with the R scripts.
```bash
python variantcaller_merge.py -d variantcaller-70-1/
```
3. Use R to generate tables and reports
```bash
Rscript variantcaller.R variantcaller-70-1/
```