-
Notifications
You must be signed in to change notification settings - Fork 11
/
version_notes.txt
1513 lines (1047 loc) · 53 KB
/
version_notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
04/04/19:
- Changed all versions to 1.1.4
- Tagged 1.1.4
04/02/2019;
When merging the RNA and DNA results, always provide as much info as possible.
- The --rnaOnly flag is currently not used. By default, merge the RNA and
DNA results in order to provide as much info as possible.
- Removed the --dnaHeaderOnly flag from the mergeRnaAndDnaFiles.py script.
Always try to merge the DNA results if they exist.
04/01/2019:
Fixed problems when the --dnaHeaderOnly flag is true (mergeRnaAndDnaFiles.py)
A KeyError was occurring when the --rnaOnly flag in filterRadia.py
was set to true resulting in the --dnaHeaderOnly flag here to
be set to true. When merging the rna and dna filters, we need
to check if only the dna header is loaded from the dnaFile.
In addition, the non-overlapping RNA calls that don’t pass the
germline and pseudogene filters were being omitted from the
final output. They are now added with the 'dnacall' filter
to show that they are non-passing.
03/26/2019:
Add the default . for each allele on allele specific tags. (myVCF.py)
Removed requirement for a header file. (createBlatFile.py, filterRadia.py)
Fixed pep8 issues. (filterRadia.py)
Fixed pep8 issues. (filterByReadSupport.py)
Added handling of pedigree tags in the metadata. (radia.py, filterByMpileupSupport.py)
- Fixed how the sample tags are output.
Check the stranded base when grouping reads by name (filterByReadSupport.py)
When grouping reads by name, be sure to check that the
stranded base is supporting the ref or alt from the call
being evaluated. Otherwise, this can be problematic when
there are multiple isoforms contributing to a call. For
example, the 'RGPD5' gene has 3 isoforms: one on the positive
strand and 2 on the negative strand. In the previous code,
the reads from all 3 isoforms would be grouped by the read
name. Then in the find_non_overlapping_reads() function,
it would be discovered that the reads overlapped and the
read with the highest base quality would be chosen. If all
overlapping reads had the same high base quality, then a
random read (e.g. the first read) would be chosen. Later,
the stranded base is compared to the genomic ref and alt.
Since the first overlapping read was randomly chosen, the
stranded base seemingly came from the positive strand and
did not match the genomic ref nor alt. In actuality, the
call came from the negative strand from reads that were
not randomly chosen. Now the code checks at the beginning
to make sure that the stranded base matches the genomic
ref or alt, and only groups reads by name that support
the call being evaluated.
When checking to see if a read supports the alt allele,
make sure it is supporting the alt allele from the call
being evaluated and not an additional alt allele.
Check the stranded base when grouping reads by name (createBlatFile.py)
The 'RGPD5' gene has 3 isoforms: one on the positive strand
and 2 on the negative strand. In the previous code, the reads
from all 3 isoforms would be grouped by the read name. Then
in the find_non_overlapping_reads() function, it would be
discovered that the reads overlapped and the read with the
highest base quality would be chosen. If all overlapping
reads had the same high base quality, then a random read
(e.g. the first read) was chosen. This is okay for all genes
that have all isoforms on the same strand, but for this one
gene, it resulted in no reads being output to the blatInput.fa
file. In the write_to_blat_file() function, the stranded base
is compared to the genomic ref and alts. Since the first
overlapping read was randomly chosen, the stranded base
seemingly came from the positive strand and did not match
the genomic ref nor alt. In actuality, the call came from
the negative strand from reads that were not randomly chosen.
Now the code checks at the beginning to make sure that
the stranded base matches the genomic ref or alt, and
only groups reads by name that support the call.
02/25/2019:
Find the stranded RNA base and correct read length (createBlatFile.py)
- The base on the read is being compared to the forward
genomic strand REF and ALT columns, so the base needs
to be converted to the forward genomic strand if necessary.
- Some of the read lengths were negative and based off of
the insert size. Now the actual read length is used.
- Need to check for empty lines before removing the newline.
- Create a blat file for all supported mod types.
Check the stranded base when grouping reads by name (createBlatFile.py)
The 'RGPD5' gene has 3 isoforms: one on the positive strand
and 2 on the negative strand. In the previous code, the reads
from all 3 isoforms would be grouped by the read name. Then
in the find_non_overlapping_reads() function, it would be
discovered that the reads overlapped and the read with the
highest base quality would be chosen. If all overlapping
reads had the same high base quality, then a random read
(e.g. the first read) was chosen. This is okay for all genes
that have all isoforms on the same strand, but for this one
gene, it resulted in no reads being output to the blatInput.fa
file. In the write_to_blat_file() function, the stranded base
is compared to the genomic ref and alts. Since the first
overlapping read was randomly chosen, the stranded base
seemingly came from the positive strand and did not match
the genomic ref nor alt. In actuality, the call came from
the negative strand from reads that were not randomly chosen.
Now the code checks at the beginning to make sure that
the stranded base matches the genomic ref or alt, and
only groups reads by name that support the call.
Properly handle RNA aligned to the transcriptome and other fixes (filterByBlat.py)
- The base on the read is being compared to the forward
genomic strand REF and ALT columns, so the base needs
to be converted to the forward genomic strand if necessary.
- Some of the read lengths were negative and based off of
the insert size. Now the actual read length is used.
- The start and stop coordinates on the blat hits can be
reversed, so check for overlaps in both directions.
- The previous code was keeping a list of 'other' values,
and then finding the min or max at the end, but it is much
easier to just keep the top 'overlapping' value and the
top 'other' value as each blat hit is processed. The old
way also had a bug if no 'other' value was set for the
max overlapping identity.
- The previous code loaded all of the blat hits into memory
at the beginning and then looped over the VCF lines to
process each call. In order to avoid using a ton of memory
by loading all the blat hits at once, now the VCF line and
the corresponding blat hits are loaded into memory one call
at a time, using python generators.
- The previous code required the blat alignment length to be
at least half the read length to set the maxOverlapIdentity,
but it did not make the same requirement to set the
maxOverlapEValue. Now both require the blat alignment length
to be at least half the read length.
- The transcript name, coordinate, and strand are needed
for RNA that is aligned to the transcriptome.
- The RNA secondary alignments flag is needed when the RNA
is aligned to the transcriptome.
- Skip the PSL header lines.
- Changed some parameter names to be consistent with
other code.
- Added an order of magnitude flag.
- Added ability to filter DNA somatic calls by blat.
- Need to check for empty lines before removing the newline.
Added some parameters for the filterByBlat.py command (filterRadia.py)
- The filterByBlat.py command needs the transcript name,
coordinate, and strand parameters for RNA that is
aligned to the transcriptome.
- The filterByBlat.py command needs the RNA secondary
alignments flag for RNA that is aligned to the
transcriptome.
- Include the header lines when extracting the passing
calls for filtering by blat.
- The latest version of snpEff has a new format. In
the future, use the new format, but for now stick
with the old one.
Use the length of the query sequence for the read length (filterByReadSupport.py)
In order to be consistent across the code, use the
length of the query sequence as the read length.
02/01/2019:
Avoid repeated filters in FILTER column (filterByReadSupport.py, myVCF.py)
Since we are filtering by DNA and by RNA, it is
possible to have a call fail for the same reason,
but we only want to report the filter once. The
filters should be a ‘set’ instead of a ‘list’ to
avoid the repeated filters.
Removed the max read depth parameter (createBlatFile.py)
The max read depth parameter was originally added to limit
the number of reads that were processed with the samtools
view command. The code has switched from the old samtools
view command to pysam, and pysam has a default max depth
of 8,000 reads. In order to be consistent with the code
in filterByReadSupport.py, remove the max depth parameter
that is specified here.
When calculating the score, check to see if the data exists (fitlerByReadSupport.py)
For some calls, there might not be data for all
samples (e.g. germline). Also, if the —outputAllData
flag is used, then all data is output even if no
variant exists.
Added ability to run installed commands (filterRadia.py)
In some environments, the RADIA package may be installed
(e.g. Continuous Integration), and the commands should be run
as ’command.py’ instead of ‘python /scriptsDir/command.py’.
In order to maintain backwards compatibility, an
-—ignoreScriptsDir flag has been added to run the commands
as ‘command.py’. Since the API currently requires the
script directory, the user should provide a dummy directory.
In a future version, the API will change such that no
dummy directory is necessary.
01/23/2019:
Only ignore nearby A>G mismatches on perfect reads for RNA editing (filterByReadSupport.py)
A>G RNA editing events occur in clusters. Theoretically, somatic
mutations could occur in regions where A>G editing clusters exist,
but practically speaking, ignoring A>G mismatches on perfect reads
for somatic mutations leads to an increase in false positives. It
is better to count the A>G mismatches as true mismatches for
somatic mutations and only ignore the A>G mismatches for RNA
editing events. In the future, it would be good to exclude high
confidence A>G RNA editing events for both somatic mutations and
RNA editing events, when counting mismatches on perfect reads.
Create an empty list if the "MMP" INFO tag is missing (filterByReadSupport.py)
This code calculates the Multi Mapping Percent (MMP) for all
calls that it analyses and stores it in the INFO. If the
"MMP" INFO tag does not exist, then just create an empty list
to be filled with the calculations.
01/17/2019:
Add metavar names for the transcript tags (filterRadia.py)
Fixed pep8 issues (filterByPybed.py, pybed.py)
Included renaming some functions
Count all mismatches on perfect reads except A>G editing (filterByReadSupport.py)
A>G RNA Editing events occur in clusters. The
previous code did not count the number of
mismatches when determining if a read was perfect
for all types of RNA editing. Since other types
of RNA editing are not known to occur in clusters,
now all non A>G mismatches are counted for all RNA
editing events.
Adjust the coordinate for the previous and next ref bases (filterByReadSupport.py)
For the previous and next ref bases stored in the INFO,
the VCFs are 1-based and pysam requires a 0-based coordinate
Fixed pep8 issues (filterByReadSupport.py)
Added read1, read2, reverse, and mate_reverse flags
to the readDict in order to debug the stranded RNA
11/29/2018:
Added previous and next reference bases to INFO (radia.py and filterByReadSupport.py)
Added DP4 field (radia.py)
- Added DP4 field: The 'Number' field in the header
for 'DP4' should be '4' and the VCF lines should
only specify the ref-forward, ref-reverse,
alt-forward and alt-reverse values of the passing
call, but we are allowing an unspecified number
of values for the 'DP4' field. The initial RADIA
script outputs variants at very low depth and
frequency, which results in many calls with multiple
alleles. Before filtering, it is impossible to know
which alternative allele will be part of the final
passing call when multiple alleles are present.
During the filtering process, we know which alternative
allele to choose from the passing call and could limit
the values at that point, but for all non-passing calls,
we would need to choose an arbitrary alternative allele
to limit it to just 4 values. In order to provide the
most complete information, all forward and reverse values
for all alleles is provided in the specified order.
This makes it easy to extract the final alt-forward and
alt-reverse values for any alternative allele.
- Added newline after p-value header line
- Changed the 'Number' for several INFO
and FORMAT fields from the generic '.'
to 'A' for all alternative alleles and
'R' for all alleles including the
reference.
Added generic read and write functions (radiaUtil.py)
- Added generic functions to get the read
and write file handlers.
- Fixed pep8 issues
Lower the min total depth from 10 to 8 (filterByMpileupSupport.py)
- Lower the minimum required total depth
from 10 to 8 for all samples
- Lower the minimum alternative allele
frequency for germline calls from
10% to 3%.
Added parameters for normal and tumor purity (filterByMpileupSupport.py)
The minimum alternative allele frequency will be
altered according to the purity parameters provided.
For somatic calls, the minimum variant frequency is
decreased in order to detect true somatic calls that
have a lower frequency due to the normal-content in
the tumor sample (low tumor purity).
For germline calls, the minimum variant frequency
is increased in order to avoid making false positive
germline calls due to tumor-content in the normal
sample (low normal purity).
Add overlaps with RADAR and DARNED (filterRadia.py)
- Add number of COSMIC overlaps to INFO
- Add overlaps with RADAR database
- Add overlaps with DARNED database
- Replaced filterByCoordinate.py script with
filterByPybed.py for dbSNP overlaps. The
later script takes care of small INDELs
that are present in dbSNP.
Add counts or text to INFO tags (filterByPybed.py)
- Added ability to add a count as the value of an
INFO tag
- Added ability to add a text as the value of an
INFO tag
10/17/2018:
Add target regions to INFO instead of FILTER (filterRadia.py)
Instead of filtering calls that are not in the
GENCODE gene regions, just add a tag to the INFO.
Renamed script to follow coding conventions (filterByReadSupport.py, myvcf.py)
Renamed myvcf.py to myVCF.py to adhere to
coding conventions.
When no ALT exists, use the default '.' (radia.py)
If the --outputAllData flag is used to output all data
even when no variant exists, then there might not be
an ALT allele. Per the VCF specifications, the column
should contain the default '.' value.
Added 'MMP' to the sample data (filterByReadSupport.py, radia.py, myvcf.py)
- Moved parsing of the FORMAT field and sample fields to
myvcf.py Parsing is needed in order to add the 'MMP'
field to the samples, and if a call passes, parsing is
needed for calculating the p-value and somatic score.
- Redid how the filtering method is called: pass in the
entire currData object to eliminate the number of method
parameters.
- Only calculate the multi-mapping percent for the specific
allele in the call instead of all alternative alleles.
- Moved scoring into the Club class
- Changed the way the INFO was parsed to be consistent
with rest of code
- Changed variable names to be consistent with rest of code
- Made the reference a single base instead of a list
Added multi-mapping filter (filterByReadSupport.py)
- If secondary RNA alignments are allowed, there will be
many false positive calls where nearly 100% of the reads
supporting the alternative allele are from secondary RNA
alignments (not primary alignments). Added a filter now
for calls where 90% or more of the reads supporting the
alternative allele come from secondary mappings. In other
words, at least 10% of the reads need to be the primary
mapping. This filter is mostly relevant for RNA variants.
- Renamed variable names to be consistent with rest of code
- Renamed filter names to be consistent
- The get_pvalue() method does not need the aMaxSize param
- If the number of VCF columns for a particular line in a
VCF does not equal the number of VCF header columns,
then output a warning and skip the line.
Downgrade warning message about multiple passing mod types (mergeRnaAndDnaFiles.py)
Very rarely, there are calls that have evidence of
a germline mutation (e.g. A>T) and then an RNA editing
event for a different alternative allele (e.g. A>G).
These calls will be output as germline mutations and
are likely false positives to be removed by other
filters.
Allow RNA_TUM_VARs to PASS (radiaCompare.py, mergeRnaAndDnaFiles.py, filterRadia.py)
- In general, RADIA can be used on tumor RNA without
any DNA to output RNA variants. Obviously, further
analysis is needed to classify the RNA variants as
germline, somatic, or editing. RNA variants will now
pass. If an RNA variant has enough total DNA (by
default 10 reads), then they will be classified as
germline, somatic, or editing based on the DNA. If
an RNA variant has little or no DNA, it may pass
as an RNA_TUM_VAR.
- The 'compareList' parameter for radiaCompare.py only
needs to compare somatic calls. Only the *passing*
RNA calls are being compared to the *passing* DNA
calls. There will never be a passing RNA Editing
or RNA variant in the DNA. Only somatic calls can
pass by both filtering methods. The overlaps file
will contain the somatic calls that passed by both
filtering methods. The non-overlaps file will contain
the RNA Rescue, RNA Editing, and RNA variant calls
that do not pass the DNA filters.
- Fixed propagating grep and awk error messages up
to the user. Also, if we pipe the grep output to
gzip, the return code and error messages from grep
get overwritten by gzip, and we no longer detect when
an error occurs.
- Fixed the command that selects the RNA variants for
further filtering. Instead of selecting somatic and
editing, just remove any germline. This will allow
the RNA_TUM_VARs to proceed. Also, changed the awk
command to be more flexible to allow variable sample
columns.
Reclassify germline mutations and update filter names (filterByMpileupSupport.py)
- Some germline variants that were homozygous for the ALT
base were being classified and filtered as somatic calls.
Now they are classified as germline and filtered as
germline. This code change resulted in a roughly five
percent increase in passing germline calls.
- The modification filters ('MF' in the INFO) provide the
filters for each modification type tested. Instead of
only providing the filters from this script, we want to
include filters (e.g. 'blck') from previous scripts
as well. In addition, even if a call passes this script,
previous filters should be added to the mod filters if
they exist.
- Updated filter names to be consistent with tags in the
FORMAT field. Updated some filter descriptions to
improve consistency.
Parse items in the INFO field that contain a '=' (myvcf.py)
- Most INFO field items have a Tag=Value(s) format,
but some annotations such as a silent protein change
specified as 'ProtCh=p.T394=' can exist. When
parsing the INFO field, catch these uncommon cases,
and parse them properly.
- Renamed some variables to be more consistent
Performance improvement when loading coordinates from a file (radia.py)
- If a coordinates file is given, the user can choose to
either load and loop through each coordinate separately,
or include the --loadCoordinatesRange flag to load all
the coordinates in the range and then process them each
separately. Including the flag can provide a great
performance improvement, depending on the number of
bases in the coordinate range. If the range is large,
a nominal performance improvement can be seen. If the
range is small, a great performance improvement can be
seen.
When --outputAllData is True, output all data that exists (radia.py)
- Previously when the --outputAllData flag was set,
the data was only output when it passed the basic
filters. We actually want to output all data,
if any data exists. The overall affect from this
change is not great. Mostly the calls that were
previously not output did not have enough ALT bases
(or no tumor data at all) and the 'mba' filter was
being set.
- The 'MQ0' tag in the INFO did not contain the sum
across all samples. It only had the count from the
last sample with data.
- Commented out very detailed debug messages
Query all coordinates when a coordinates file is given (radia.py)
If a coordinates file is provided, then we do not
want to break out of the processing loop when no
data is present.
Remove processing of 'N' bases (radia.py)
Very rarely, a read will contain the base 'N'.
There are usually no more than 2 and mostly just
1 read at a particular location containing an 'N'.
This happens mostly in the DNA reads and not very
often in the RNA reads.
Added INDEL filter and set max read depth to 8,000 (filterByMpileupSupport.py)
- Added INDEL filtering across all samples as well as
sample specific INDEL filtering. This filter greatly
improves performance, because it removes problematic
calls from further filtering scripts.
- Set the max total bases parameter for each sample
to 8,000. The samtools mpileup and pysam pileup
function have a parameter for the max read depth
(8,000 by default), but sometimes more reads are
selected. The read depth can be very high (in
the 100,000-200,000 range) in regions with many
indels, especially if RNA secondary alignments
are included. To improve performance, filter
out calls with more than 8,000 reads. We may
later add genes of interest to a white list to
avoid this filter being applied.
- Changed the description for the strand bias filters
- Code enhancement: group the general parameters for
a method together in a dictionary to reduce the
number of parameters in a method call.
Removed unused BLAT input file (filterByBlat.py)
The filterByBlat.py script actually does
not need the BLAT input file. It only uses
the results from the BLAT command stored
in the BLAT output file.
Removed old filterByPositionalBias.py code (filterRadia.py and createBlatFile.py)
The code for filtering by positional bias
has been moved to the filterByReadSupport.py
script and is no longer necessary as a
separate script. Eliminating this code
helps improve performance and simplifies
the filtering by positional bias process.
Use pysam to query bam files instead of samtools view (createBlatFile.py)
The samtools view command does not have a max
read depth parameter like the samtools mpileup
and pysam pileup functions. We can just select
the first 8,000 reads, but we are not guaranteed
that they are the same 8,000 reads selected by
the other commands. It is also cumbersome to
support 3 different ways of querying the same
data. In addition, the samtools view processing
takes much longer than the other functions.
Therefore, the samtools view command was replaced
with the pysam pileup function. Now the code is
similar to the code in the filterByReadSupport.py
script and performance has improved.
Removed obsolete code (createBlatFile.py)
Removed obsolete code that used the samtools view
command which has been replaced with the pysam
pileup function.
Commented out detailed debug messages (filterByReadSupport.py)
- To improve performance, some detailed debug
messages were commented out
- Add the reference bases to the debugging
- Changed variable names for pileup objects
to match style from rest of code
Propagate warning messages up and output them to logs (filterRadia.py)
- Propagate warning messages from subcommands up and
add them to the log files.
- Changed variable names for stdErr and stdOut
Added p-val, somatic score, and qual score (filterByReadSupport.py)
- Added a p-value, somatic score, and
qual score. Since computing scores is
computationally expensive, by default
a score is only calculated for passing
calls, but the user can flip the flag
to have all calls scored.
- Changed low_base_and_map_quals() to
low_base_or_map_quals() to make the
if statements clearer
- Changed variable names for pileup objects
to match style from rest of code
Minor code updates (myvcf.py)
- Stopped using python reserved words as var names
- Removed extra_headers that were never used
Process all reads from all transcript isoforms (filterByReadSupport.py)
- If calls are made on transcript aligned bams
instead of genome aligned bams, then calls from
multiple isoforms are merged into one call. When
filtering by read support, we need to query all
of the isoforms and then remove the duplicate
reads that mapped to more than one isoform.
- Simplified the ismut() and checkread() methods.
- Performance: eliminated looping over the reads
twice: once for the overlapping reads and then
once for the perfect reads.
Removed old soft-clipping filter (filterByReadSupport.py)
- A new soft-clipping filter was added a while
ago where a read is not 'perfect' if more than
10% of the read is soft-clipped.
- The old soft-clip filter kept track of the end
coordinate of a soft-clip at the beginning of a
read and the start coordinate of a soft-clip at
the end of a read. For all reads that were soft-
clipped at the beginning, the reads that had the
most common soft-clip end coordinate were no
longer considered 'perfect'. For all reads that
were soft-clipped at the end, the reads that had
the most common start coordinate were no longer
considered 'perfect'. Basically, all reads that
soft-clipped at the same coordinate (either at
the beginning or the end) were no longer
considered 'perfect'.
Added positional bias here (filterByReadSupport.py)
- We are looping through all of the reads here,
and it makes sense to calculate the positional
bias here.
- The positional bias filter was previously a
separate filtering script. In order to save
on performance and to simplify the filtering
process, it has now been added here.
Moved grouping reads by name to separate method (filterByReadSupport.py)
Created a group_reads_by_name() method and
moved that part of the code out of the
checkfilter() method for better readability.
No functional changes were made.
Moved checking neighborhood base quals to separate method (filterByReadSupport.py)
For performance reasons, the code to check the base quals
of the up- and down-stream neighbors of a mutation was
moved into a separate method.
Major bug-fixes for selecting the right base from the reads (createBlatFile.py)
- According to the SAM documentation, the POS column
(1-based leftmost mapping POSition) corresponds to
the first CIGAR operation that "consumes" a ref
base. The CIGAR operations that can consume a ref
base are: M, D, N, =, and X. The old code relied
on an 'M' for the first match. Now the code properly
accounts for all CIGAR operations that can consume
a ref base.
- The following CIGAR operations do not consume the
ref nor the query:
- 'H' (hard-clipping sequences not in the query)
- 'P' (silent deletion from padded ref)
- The following CIGAR operations consume both the
ref and the query sequence:
- 'M' (alignment match, sequence match or mismatch)
- '=' (sequence match)
- 'X' (sequence mismatch)
- The following CIGAR operations consume the ref
but not the query:
- 'D' (deletion)
- 'N' (a skipped region from the ref)
- The following CIGAR operations consume the query
but not the ref:
- 'I' (insertion)
- 'S' (soft-clipping)
- When looping through the reads, the goal is to
find the cigar operation and index of the query
sequence that aligns with the ref. The number
of ref bases that need to be consumed is the
transcript or genomic coordinate minus the
start of the read. Ref bases are consumed by
cigar ops: M, =, X, D, N. The query index
is increased when a query base is consumed
by an M, =, X, I, S.
- If calls are made on transcript aligned bams
instead of genome aligned bams, then calls from
multiple isoforms are merged into one call. When
creating the blat file, we need to query all
of the isoforms and then remove the duplicate
reads that mapped to more than one isoform.
- For consistent counting, only count the refs that
have a high enough base quality, since that is
how we count the alts.
- The samtools mpileup and pysam pileup functions
have a parameter for the max read depth (8,000 by
default). The samtools view command does not, but
it is a good idea to limit the number of reads that
are selected. The read depth can be very high (in
the 100,000-200,000 range) in regions with many
indels, especially if RNA secondary alignments
are included.
Query and count the reads with mapping quality zero (radia.py)
When initially executing the samtools mpileup
command, set the mapping quality and base
quality params to zero (-Q 0 -q 0). Then we
can count the number of reads that initially
had mapping quality zero. The sample specific
MapQual and BaseQual params are then used to
eliminate reads (by default both are set to 10).
Major performance enhancements (filterByReadSupport.py)
- There are many reasons for a read to be
deemed not 'perfect'. There are some easy
checks and some that take longer. Do all
easy checks first.
- The easy checks are for improper pairs,
insertions and deletions.
- Checking the neighbor base quals only
evaluates the 5 bases up- and down-stream
from the mutation, so do this before
checking the whole read.
- When evaluating each base in the read,
return as soon as the maximum number
of mutations are found.
- When evaluating each cigar operation,
return as soon as the maximum number
of mutations are found and/or when the
soft-clip percentage reaches the maximum
allowed.
- RNA-Editing events occur in clusters, so
do not apply the maximum mutations filter
to these calls.
- Fixed a bug with how the maximum soft-clip
percentage was calculated.
Fixed off-by-one error (filterByPositionalBias.py)
- We have enough reads when the depth is greater
than or equal to the cutoff. The previous code
was only checking for greater than the cutoff.
- Updated comments of example VCF lines by moving
the 'AD' and 'AF' fields to the front of the
FORMAT and sample columns.
- Updated comments of example VCF lines by converting
the 'INDEL' field into two separate fields 'INS' and
'DEL'.
Updated comments of example VCF lines (filterByReadSupport.py)
- Updated comments of example VCF lines by converting
the 'INDEL' field into two separate fields 'INS' and
'DEL'.
- Removed obsolete code.
Rearrange FORMAT field (radia.py)
- For improved readability, move the AD and AF fields
to the front of the FORMAT column.
- Update comments of example VCF lines by moving
the AD and AF fields to the front of the
FORMAT and sample columns.
- To provide more precise information, divide the
INDEL field into two separate INS and DEL fields.
- Update comments of example VCF lines by converting
the INDEL field into two separate fields INS and
DEL.
- Removed obsolete code.
- Moved 'AD' and 'AF' fields to the front
- Updated comments of example VCF lines by converting
the 'INDEL' field into two separate fields 'INS' and
'DEL'.
03/24/2018:
- Added mapping qualities (radia.py)
- Mapping quality scores are not output by samtools mpileup by
default, but we can add the -s option to output them
- Keep track of the number of mapping qualities that are 0 per
allele, and add them to the sample data using the 'MQ0' tag
- Keep track of the max mapping quality per allele, and add it
to the sample data using the 'MMQ' tag
- Keep track of the average mapping quality per allele, and add
it to the sample data using the 'MQA' tag
- Add the overall average mapping quality to the INFO column
- Add the total number of mapping quality zero reads to the
INFO column
- Added 'SST' to the INFO column to specify the somatic status
- Added mapping quality filters (filterRadia.py)
- Added '--rnaMpileupMinMapQual' and '--rnaMpileupMinAvgMapQual'
params to be passed through to the RNA mpileup filter
- When filtering RNA calls via mpileup:
- Set the MaxErrorPct to 100% for the DNA
- Make the 'MinAltAvgBaseQual' parameter sample specific
- Set the 'MinAltMapQual' and 'MinAltAvgMapQual' for the RNA
- When filtering RNA calls via the DNA:
- Set the filter field to 'PASS' so that these calls pass
when merged with the DNA calls. If you don't filter by
BLAT or by positional bias, then the filter column does
not get reset to passing.
- Added mapping quality filters (filterByMpileupSupport.py)
- Added sample specific mapping quality filters:
- Minimum average mapping quality
- Minimum mapping quality for at least 1 read
- Maximum percentage of reads with mapping quality 0
- Made the 'minAltAvgBaseQual' parameter sample specific
- Fixed header output: In general, each filter script adds
its own filters to the header, but this script is typically
run more than once, which sometimes lead to duplicate header
output. Now the filters from the input are parsed and no
longer duplicated.
- The overall QUAL score should be an integer. It was
being converted to a float here which is now fixed.
filterByReadSupport.py
- Updated descriptions for the perfect filters
mergeRnaAndDnaFiles.py
- Added an 'SST' field to the INFO column to easily