version_notes.txt

04/04/19:
- Changed all versions to 1.1.4
- Tagged 1.1.4


04/02/2019;

When merging the RNA and DNA results, always provide as much info as possible.
    
    - The --rnaOnly flag is currently not used.  By default, merge the RNA and
      DNA results in order to provide as much info as possible.
    
    - Removed the --dnaHeaderOnly flag from the mergeRnaAndDnaFiles.py script.
      Always try to merge the DNA results if they exist.


04/01/2019:

Fixed problems when the --dnaHeaderOnly flag is true (mergeRnaAndDnaFiles.py)
    
    A KeyError was occurring when the --rnaOnly flag in filterRadia.py
    was set to true resulting in the --dnaHeaderOnly flag here to
    be set to true.  When merging the rna and dna filters, we need
    to check if only the dna header is loaded from the dnaFile.
    In addition, the non-overlapping RNA calls that don’t pass the
    germline and pseudogene filters were being omitted from the
    final output. They are now added with the 'dnacall' filter
    to show that they are non-passing.


03/26/2019:

Add the default . for each allele on allele specific tags. (myVCF.py)


Removed requirement for a header file. (createBlatFile.py, filterRadia.py)


Fixed pep8 issues. (filterRadia.py)


Fixed pep8 issues. (filterByReadSupport.py)


Added handling of pedigree tags in the metadata. (radia.py, filterByMpileupSupport.py)

    - Fixed how the sample tags are output.


Check the stranded base when grouping reads by name (filterByReadSupport.py)

    When grouping reads by name, be sure to check that the
    stranded base is supporting the ref or alt from the call
    being evaluated. Otherwise, this can be problematic when
    there are multiple isoforms contributing to a call. For
    example, the 'RGPD5' gene has 3 isoforms: one on the positive
    strand and 2 on the negative strand.  In the previous code,
    the reads from all 3 isoforms would be grouped by the read
    name. Then in the find_non_overlapping_reads() function,
    it would be discovered that the reads overlapped and the
    read with the highest base quality would be chosen. If all
    overlapping reads had the same high base quality, then a
    random read (e.g. the first read) would be chosen. Later,
    the stranded base is compared to the genomic ref and alt.
    Since the first overlapping read was randomly chosen, the
    stranded base seemingly came from the positive strand and
    did not match the genomic ref nor alt. In actuality, the
    call came from the negative strand from reads that were
    not randomly chosen. Now the code checks at the beginning
    to make sure that the stranded base matches the genomic
    ref or alt, and only groups reads by name that support
    the call being evaluated.

    When checking to see if a read supports the alt allele,
    make sure it is supporting the alt allele from the call
    being evaluated and not an additional alt allele.


Check the stranded base when grouping reads by name (createBlatFile.py)

    The 'RGPD5' gene has 3 isoforms: one on the positive strand
    and 2 on the negative strand.  In the previous code, the reads
    from all 3 isoforms would be grouped by the read name. Then
    in the find_non_overlapping_reads() function, it would be
    discovered that the reads overlapped and the read with the
    highest base quality would be chosen.  If all overlapping
    reads had the same high base quality, then a random read
    (e.g. the first read) was chosen. This is okay for all genes
    that have all isoforms on the same strand, but for this one
    gene, it resulted in no reads being output to the blatInput.fa
    file. In the write_to_blat_file() function, the stranded base
    is compared to the genomic ref and alts.  Since the first
    overlapping read was randomly chosen, the stranded base
    seemingly came from the positive strand and did not match
    the genomic ref nor alt. In actuality, the call came from
    the negative strand from reads that were not randomly chosen.
    Now the code checks at the beginning to make sure that
    the stranded base matches the genomic ref or alt, and
    only groups reads by name that support the call.


02/25/2019:

Find the stranded RNA base and correct read length (createBlatFile.py)

    - The base on the read is being compared to the forward
      genomic strand REF and ALT columns, so the base needs
      to be converted to the forward genomic strand if necessary.

    - Some of the read lengths were negative and based off of
      the insert size.  Now the actual read length is used.

    - Need to check for empty lines before removing the newline.

    - Create a blat file for all supported mod types.


Check the stranded base when grouping reads by name (createBlatFile.py)

    The 'RGPD5' gene has 3 isoforms: one on the positive strand
    and 2 on the negative strand.  In the previous code, the reads
    from all 3 isoforms would be grouped by the read name. Then
    in the find_non_overlapping_reads() function, it would be
    discovered that the reads overlapped and the read with the
    highest base quality would be chosen.  If all overlapping
    reads had the same high base quality, then a random read
    (e.g. the first read) was chosen. This is okay for all genes
    that have all isoforms on the same strand, but for this one
    gene, it resulted in no reads being output to the blatInput.fa
    file. In the write_to_blat_file() function, the stranded base
    is compared to the genomic ref and alts.  Since the first
    overlapping read was randomly chosen, the stranded base
    seemingly came from the positive strand and did not match
    the genomic ref nor alt. In actuality, the call came from
    the negative strand from reads that were not randomly chosen.
    Now the code checks at the beginning to make sure that
    the stranded base matches the genomic ref or alt, and
    only groups reads by name that support the call.


Properly handle RNA aligned to the transcriptome and other fixes (filterByBlat.py)

    - The base on the read is being compared to the forward
      genomic strand REF and ALT columns, so the base needs
      to be converted to the forward genomic strand if necessary.

    - Some of the read lengths were negative and based off of
      the insert size.  Now the actual read length is used.

    - The start and stop coordinates on the blat hits can be
      reversed, so check for overlaps in both directions.

    - The previous code was keeping a list of 'other' values,
      and then finding the min or max at the end, but it is much
      easier to just keep the top 'overlapping' value and the
      top 'other' value as each blat hit is processed. The old
      way also had a bug if no 'other' value was set for the
      max overlapping identity.

    - The previous code loaded all of the blat hits into memory
      at the beginning and then looped over the VCF lines to
      process each call.  In order to avoid using a ton of memory
      by loading all the blat hits at once, now the VCF line and
      the corresponding blat hits are loaded into memory one call
      at a time, using python generators.

    - The previous code required the blat alignment length to be
      at least half the read length to set the maxOverlapIdentity,
      but it did not make the same requirement to set the
      maxOverlapEValue. Now both require the blat alignment length
      to be at least half the read length.

    - The transcript name, coordinate, and strand are needed
      for RNA that is aligned to the transcriptome.

    - The RNA secondary alignments flag is needed when the RNA
      is aligned to the transcriptome.

    - Skip the PSL header lines.

    - Changed some parameter names to be consistent with
      other code.

    - Added an order of magnitude flag.

    - Added ability to filter DNA somatic calls by blat.

    - Need to check for empty lines before removing the newline.


Added some parameters for the filterByBlat.py command (filterRadia.py)

    - The filterByBlat.py command needs the transcript name,
      coordinate, and strand parameters for RNA that is
      aligned to the transcriptome.

    - The filterByBlat.py command needs the RNA secondary
      alignments flag for RNA that is aligned to the
      transcriptome.

    - Include the header lines when extracting the passing
      calls for filtering by blat.

    - The latest version of snpEff has a new format.  In
      the future, use the new format, but for now stick
      with the old one.


Use the length of the query sequence for the read length (filterByReadSupport.py)

    In order to be consistent across the code, use the
    length of the query sequence as the read length.


02/01/2019:

Avoid repeated filters in FILTER column (filterByReadSupport.py, myVCF.py)

    Since we are filtering by DNA and by RNA, it is
    possible to have a call fail for the same reason,
    but we only want to report the filter once. The
    filters should be a ‘set’ instead of a ‘list’ to
    avoid the repeated filters.


Removed the max read depth parameter (createBlatFile.py)

    The max read depth parameter was originally added to limit
    the number of reads that were processed with the samtools
    view command. The code has switched from the old samtools
    view command to pysam, and pysam has a default max depth
    of 8,000 reads. In order to be consistent with the code
    in filterByReadSupport.py, remove the max depth parameter
    that is specified here.


When calculating the score, check to see if the data exists (fitlerByReadSupport.py)

    For some calls, there might not be data for all
    samples (e.g. germline). Also, if the —outputAllData
    flag is used, then all data is output even if no
    variant exists.


Added ability to run installed commands (filterRadia.py)

    In some environments, the RADIA package may be installed
    (e.g. Continuous Integration), and the commands should be run
    as ’command.py’ instead of ‘python /scriptsDir/command.py’.
    In order to maintain backwards compatibility, an
    -—ignoreScriptsDir flag has been added to run the commands
    as ‘command.py’.  Since the API currently requires the
    script directory, the user should provide a dummy directory.
    In a future version, the API will change such that no
    dummy directory is necessary.


01/23/2019:

Only ignore nearby A>G mismatches on perfect reads for RNA editing (filterByReadSupport.py)

    A>G RNA editing events occur in clusters. Theoretically, somatic
    mutations could occur in regions where A>G editing clusters exist,
    but practically speaking, ignoring A>G mismatches on perfect reads
    for somatic mutations leads to an increase in false positives. It
    is better to count the A>G mismatches as true mismatches for
    somatic mutations and only ignore the A>G mismatches for RNA
    editing events. In the future, it would be good to exclude high
    confidence A>G RNA editing events for both somatic mutations and
    RNA editing events, when counting mismatches on perfect reads.


Create an empty list if the "MMP" INFO tag is missing (filterByReadSupport.py)

    This code calculates the Multi Mapping Percent (MMP) for all
    calls that it analyses and stores it in the INFO.  If the
    "MMP" INFO tag does not exist, then just create an empty list
    to be filled with the calculations.


01/17/2019:

Add metavar names for the transcript tags (filterRadia.py)


Fixed pep8 issues (filterByPybed.py, pybed.py)
    Included renaming some functions


Count all mismatches on perfect reads except A>G editing (filterByReadSupport.py)

    A>G RNA Editing events occur in clusters.  The
    previous code did not count the number of
    mismatches when determining if a read was perfect
    for all types of RNA editing.  Since other types
    of RNA editing are not known to occur in clusters,
    now all non A>G mismatches are counted for all RNA
    editing events.


Adjust the coordinate for the previous and next ref bases (filterByReadSupport.py)

    For the previous and next ref bases stored in the INFO,
    the VCFs are 1-based and pysam requires a 0-based coordinate


Fixed pep8 issues (filterByReadSupport.py)

    Added read1, read2, reverse, and mate_reverse flags
    to the readDict in order to debug the stranded RNA


11/29/2018:

Added previous and next reference bases to INFO (radia.py and filterByReadSupport.py)


Added DP4 field (radia.py)

    - Added DP4 field:  The 'Number' field in the header
      for 'DP4' should be '4' and the VCF lines should 
      only specify the ref-forward, ref-reverse, 
      alt-forward and alt-reverse values of the passing
      call, but we are allowing an unspecified number  
      of values for the 'DP4' field.  The initial RADIA 
      script outputs variants at very low depth and 
      frequency, which results in many calls with multiple 
      alleles.  Before filtering, it is impossible to know 
      which alternative allele will be part of the final 
      passing call when multiple alleles are present.  
      During the filtering process, we know which alternative 
      allele to choose from the passing call and could limit 
      the values at that point, but for all non-passing calls, 
      we would need to choose an arbitrary alternative allele 
      to limit it to just 4 values.  In order to provide the 
      most complete information, all forward and reverse values 
      for all alleles is provided in the specified order.  
      This makes it easy to extract the final alt-forward and 
      alt-reverse values for any alternative allele.

    - Added newline after p-value header line

    - Changed the 'Number' for several INFO
      and FORMAT fields from the generic '.'
      to 'A' for all alternative alleles and 
      'R' for all alleles including the 
      reference.


Added generic read and write functions (radiaUtil.py)

    - Added generic functions to get the read
      and write file handlers.

    - Fixed pep8 issues


Lower the min total depth from 10 to 8 (filterByMpileupSupport.py)

    - Lower the minimum required total depth
      from 10 to 8 for all samples

    - Lower the minimum alternative allele
      frequency for germline calls from
      10% to 3%.


Added parameters for normal and tumor purity (filterByMpileupSupport.py)

    The minimum alternative allele frequency will be
    altered according to the purity parameters provided.
    
    For somatic calls, the minimum variant frequency is 
    decreased in order to detect true somatic calls that 
    have a lower frequency due to the normal-content in 
    the tumor sample (low tumor purity).
    
    For germline calls, the minimum variant frequency
    is increased in order to avoid making false positive
    germline calls due to tumor-content in the normal 
    sample (low normal purity).


Add overlaps with RADAR and DARNED (filterRadia.py)

    - Add number of COSMIC overlaps to INFO

    - Add overlaps with RADAR database

    - Add overlaps with DARNED database

    - Replaced filterByCoordinate.py script with
      filterByPybed.py for dbSNP overlaps.  The
      later script takes care of small INDELs
      that are present in dbSNP.


Add counts or text to INFO tags (filterByPybed.py)

    - Added ability to add a count as the value of an 
      INFO tag

    - Added ability to add a text as the value of an 
      INFO tag


10/17/2018:

Add target regions to INFO instead of FILTER (filterRadia.py)

    Instead of filtering calls that are not in the
    GENCODE gene regions, just add a tag to the INFO.


Renamed script to follow coding conventions (filterByReadSupport.py, myvcf.py)

    Renamed myvcf.py to myVCF.py to adhere to
    coding conventions.


When no ALT exists, use the default '.' (radia.py)

    If the --outputAllData flag is used to output all data
    even when no variant exists, then there might not be
    an ALT allele.  Per the VCF specifications, the column
    should contain the default '.' value.


Added 'MMP' to the sample data (filterByReadSupport.py, radia.py, myvcf.py)

    - Moved parsing of the FORMAT field and sample fields to
      myvcf.py  Parsing is needed in order to add the 'MMP' 
      field to the samples, and if a call passes, parsing is 
      needed for calculating the p-value and somatic score.

    - Redid how the filtering method is called:  pass in the
      entire currData object to eliminate the number of method
      parameters.

    - Only calculate the multi-mapping percent for the specific
      allele in the call instead of all alternative alleles.

    - Moved scoring into the Club class

    - Changed the way the INFO was parsed to be consistent
      with rest of code

    - Changed variable names to be consistent with rest of code

    - Made the reference a single base instead of a list


Added multi-mapping filter (filterByReadSupport.py)
    - If secondary RNA alignments are allowed, there will be
      many false positive calls where nearly 100% of the reads
      supporting the alternative allele are from secondary RNA
      alignments (not primary alignments).  Added a filter now
      for calls where 90% or more of the reads supporting the 
      alternative allele come from secondary mappings.  In other
      words, at least 10% of the reads need to be the primary
      mapping.  This filter is mostly relevant for RNA variants.

    - Renamed variable names to be consistent with rest of code
    
    - Renamed filter names to be consistent
    
    - The get_pvalue() method does not need the aMaxSize param
    
    - If the number of VCF columns for a particular line in a 
      VCF does not equal the number of VCF header columns, 
      then output a warning and skip the line.


Downgrade warning message about multiple passing mod types (mergeRnaAndDnaFiles.py)

    Very rarely, there are calls that have evidence of
    a germline mutation (e.g. A>T) and then an RNA editing
    event for a different alternative allele (e.g. A>G).
    These calls will be output as germline mutations and
    are likely false positives to be removed by other
    filters.


Allow RNA_TUM_VARs to PASS (radiaCompare.py, mergeRnaAndDnaFiles.py, filterRadia.py)
    - In general, RADIA can be used on tumor RNA without 
      any DNA to output RNA variants.  Obviously, further
      analysis is needed to classify the RNA variants as 
      germline, somatic, or editing.  RNA variants will now
      pass.  If an RNA variant has enough total DNA (by 
      default 10 reads), then they will be classified as
      germline, somatic, or editing based on the DNA.  If
      an RNA variant has little or no DNA, it may pass
      as an RNA_TUM_VAR.

    - The 'compareList' parameter for radiaCompare.py only
      needs to compare somatic calls.  Only the *passing* 
      RNA calls are being compared to the *passing* DNA 
      calls.  There will never be a passing RNA Editing
      or RNA variant in the DNA.  Only somatic calls can
      pass by both filtering methods.  The overlaps file
      will contain the somatic calls that passed by both
      filtering methods.  The non-overlaps file will contain
      the RNA Rescue, RNA Editing, and RNA variant calls 
      that do not pass the DNA filters.

    - Fixed propagating grep and awk error messages up
      to the user.  Also, if we pipe the grep output to 
      gzip, the return code and error messages from grep
      get overwritten by gzip, and we no longer detect when 
      an error occurs.

    - Fixed the command that selects the RNA variants for
      further filtering.  Instead of selecting somatic and
      editing, just remove any germline.  This will allow
      the RNA_TUM_VARs to proceed.  Also, changed the awk
      command to be more flexible to allow variable sample
      columns.


Reclassify germline mutations and update filter names (filterByMpileupSupport.py)

    - Some germline variants that were homozygous for the ALT
      base were being classified and filtered as somatic calls.
      Now they are classified as germline and filtered as
      germline.  This code change resulted in a roughly five
      percent increase in passing germline calls.

    - The modification filters ('MF' in the INFO) provide the
      filters for each modification type tested.  Instead of 
      only providing the filters from this script, we want to 
      include filters (e.g. 'blck') from previous scripts 
      as well.  In addition, even if a call passes this script,
      previous filters should be added to the mod filters if 
      they exist.

    - Updated filter names to be consistent with tags in the
      FORMAT field.  Updated some filter descriptions to
      improve consistency.


Parse items in the INFO field that contain a '=' (myvcf.py)

    - Most INFO field items have a Tag=Value(s) format,
      but some annotations such as a silent protein change
      specified as 'ProtCh=p.T394=' can exist.  When
      parsing the INFO field, catch these uncommon cases,
      and parse them properly.

    - Renamed some variables to be more consistent


Performance improvement when loading coordinates from a file (radia.py)

    - If a coordinates file is given, the user can choose to
      either load and loop through each coordinate separately, 
      or include the --loadCoordinatesRange flag to load all 
      the coordinates in the range and then process them each
      separately.  Including the flag can provide a great
      performance improvement, depending on the number of
      bases in the coordinate range.  If the range is large,
      a nominal performance improvement can be seen.  If the 
      range is small, a great performance improvement can be
      seen.


When --outputAllData is True, output all data that exists (radia.py)
 
    - Previously when the --outputAllData flag was set,
      the data was only output when it passed the basic
      filters.  We actually want to output all data,
      if any data exists.  The overall affect from this
      change is not great.  Mostly the calls that were
      previously not output did not have enough ALT bases
      (or no tumor data at all) and the 'mba' filter was 
      being set.

    - The 'MQ0' tag in the INFO did not contain the sum
      across all samples. It only had the count from the
      last sample with data.

    - Commented out very detailed debug messages


Query all coordinates when a coordinates file is given (radia.py)

    If a coordinates file is provided, then we do not
    want to break out of the processing loop when no
    data is present.


Remove processing of 'N' bases (radia.py)

    Very rarely, a read will contain the base 'N'.
    There are usually no more than 2 and mostly just
    1 read at a particular location containing an 'N'.
    This happens mostly in the DNA reads and not very
    often in the RNA reads.


Added INDEL filter and set max read depth to 8,000 (filterByMpileupSupport.py)

    - Added INDEL filtering across all samples as well as 
      sample specific INDEL filtering.  This filter greatly 
      improves performance, because it removes problematic
      calls from further filtering scripts.

    - Set the max total bases parameter for each sample
      to 8,000.  The samtools mpileup and pysam pileup 
      function have a parameter for the max read depth 
      (8,000 by default), but sometimes more reads are
      selected.  The read depth can be very high (in
      the 100,000-200,000 range) in regions with many 
      indels, especially if RNA secondary alignments 
      are included.  To improve performance, filter
      out calls with more than 8,000 reads.  We may
      later add genes of interest to a white list to
      avoid this filter being applied. 

    - Changed the description for the strand bias filters

    - Code enhancement:  group the general parameters for
      a method together in a dictionary to reduce the
      number of parameters in a method call.


Removed unused BLAT input file (filterByBlat.py)

    The filterByBlat.py script actually does
    not need the BLAT input file.  It only uses
    the results from the BLAT command stored
    in the BLAT output file.


Removed old filterByPositionalBias.py code (filterRadia.py and createBlatFile.py)

    The code for filtering by positional bias
    has been moved to the filterByReadSupport.py
    script and is no longer necessary as a 
    separate script.  Eliminating this code
    helps improve performance and simplifies
    the filtering by positional bias process.
    

Use pysam to query bam files instead of samtools view (createBlatFile.py)

    The samtools view command does not have a max
    read depth parameter like the samtools mpileup
    and pysam pileup functions.  We can just select
    the first 8,000 reads, but we are not guaranteed
    that they are the same 8,000 reads selected by
    the other commands.  It is also cumbersome to
    support 3 different ways of querying the same
    data.  In addition, the samtools view processing
    takes much longer than the other functions.
    Therefore, the samtools view command was replaced
    with the pysam pileup function.  Now the code is
    similar to the code in the filterByReadSupport.py
    script and performance has improved.


Removed obsolete code (createBlatFile.py)

    Removed obsolete code that used the samtools view
    command which has been replaced with the pysam
    pileup function.


Commented out detailed debug messages (filterByReadSupport.py)

    - To improve performance, some detailed debug
      messages were commented out

    - Add the reference bases to the debugging

    - Changed variable names for pileup objects
      to match style from rest of code


Propagate warning messages up and output them to logs (filterRadia.py)

    - Propagate warning messages from subcommands up and
      add them to the log files.

    - Changed variable names for stdErr and stdOut


Added p-val, somatic score, and qual score (filterByReadSupport.py)

    - Added a p-value, somatic score, and 
      qual score. Since computing scores is
      computationally expensive, by default 
      a score is only calculated for passing
      calls, but the user can flip the flag 
      to have all calls scored.

    - Changed low_base_and_map_quals() to
      low_base_or_map_quals() to make the
      if statements clearer

    - Changed variable names for pileup objects
      to match style from rest of code


Minor code updates (myvcf.py)

    - Stopped using python reserved words as var names

    - Removed extra_headers that were never used


Process all reads from all transcript isoforms (filterByReadSupport.py)

    - If calls are made on transcript aligned bams 
      instead of genome aligned bams, then calls from
      multiple isoforms are merged into one call.  When
      filtering by read support, we need to query all
      of the isoforms and then remove the duplicate
      reads that mapped to more than one isoform.

    - Simplified the ismut() and checkread() methods.

    - Performance:  eliminated looping over the reads
      twice:  once for the overlapping reads and then 
      once for the perfect reads.


Removed old soft-clipping filter (filterByReadSupport.py)

    - A new soft-clipping filter was added a while
      ago where a read is not 'perfect' if more than
      10% of the read is soft-clipped.

    - The old soft-clip filter kept track of the end 
      coordinate of a soft-clip at the beginning of a
      read and the start coordinate of a soft-clip at
      the end of a read.  For all reads that were soft-
      clipped at the beginning, the reads that had the
      most common soft-clip end coordinate were no
      longer considered 'perfect'.  For all reads that
      were soft-clipped at the end, the reads that had 
      the most common start coordinate were no longer
      considered 'perfect'.  Basically, all reads that
      soft-clipped at the same coordinate (either at 
      the beginning or the end) were no longer 
      considered 'perfect'.


Added positional bias here (filterByReadSupport.py)

    - We are looping through all of the reads here,
      and it makes sense to calculate the positional
      bias here.

    - The positional bias filter was previously a
      separate filtering script.  In order to save 
      on performance and to simplify the filtering
      process, it has now been added here.


Moved grouping reads by name to separate method (filterByReadSupport.py)

    Created a group_reads_by_name() method and
    moved that part of the code out of the 
    checkfilter() method for better readability. 
    No functional changes were made.


Moved checking neighborhood base quals to separate method (filterByReadSupport.py)

    For performance reasons, the code to check the base quals
    of the up- and down-stream neighbors of a mutation was
    moved into a separate method.


Major bug-fixes for selecting the right base from the reads (createBlatFile.py)

    - According to the SAM documentation, the POS column
      (1-based leftmost mapping POSition) corresponds to
      the first CIGAR operation that "consumes" a ref
      base. The CIGAR operations that can consume a ref
      base are: M, D, N, =, and X.  The old code relied
      on an 'M' for the first match. Now the code properly 
      accounts for all CIGAR operations that can consume 
      a ref base.

    - The following CIGAR operations do not consume the
      ref nor the query:
        - 'H' (hard-clipping sequences not in the query)
        - 'P' (silent deletion from padded ref)

    - The following CIGAR operations consume both the
      ref and the query sequence:
        - 'M' (alignment match, sequence match or mismatch)
        - '=' (sequence match)
        - 'X' (sequence mismatch)

    - The following CIGAR operations consume the ref
      but not the query:
        - 'D' (deletion)
        - 'N' (a skipped region from the ref)

    - The following CIGAR operations consume the query
      but not the ref:
        - 'I' (insertion)
        - 'S' (soft-clipping)

    - When looping through the reads, the goal is to 
      find the cigar operation and index of the query 
      sequence that aligns with the ref.  The number
      of ref bases that need to be consumed is the
      transcript or genomic coordinate minus the 
      start of the read.  Ref bases are consumed by
      cigar ops:  M, =, X, D, N.  The query index 
      is increased when a query base is consumed
      by an M, =, X, I, S.

    - If calls are made on transcript aligned bams 
      instead of genome aligned bams, then calls from
      multiple isoforms are merged into one call.  When
      creating the blat file, we need to query all
      of the isoforms and then remove the duplicate
      reads that mapped to more than one isoform.

    - For consistent counting, only count the refs that
      have a high enough base quality, since that is
      how we count the alts.

    - The samtools mpileup and pysam pileup functions
      have a parameter for the max read depth (8,000 by
      default).  The samtools view command does not, but 
      it is a good idea to limit the number of reads that
      are selected.  The read depth can be very high (in
      the 100,000-200,000 range) in regions with many 
      indels, especially if RNA secondary alignments 
      are included.


Query and count the reads with mapping quality zero (radia.py)

    When initially executing the samtools mpileup 
    command, set the mapping quality and base
    quality params to zero (-Q 0 -q 0).  Then we
    can count the number of reads that initially 
    had mapping quality zero.  The sample specific
    MapQual and BaseQual params are then used to
    eliminate reads (by default both are set to 10).


Major performance enhancements (filterByReadSupport.py)

    - There are many reasons for a read to be
      deemed not 'perfect'.  There are some easy
      checks and some that take longer.  Do all
      easy checks first.

    - The easy checks are for improper pairs,
      insertions and deletions.

    - Checking the neighbor base quals only
      evaluates the 5 bases up- and down-stream
      from the mutation, so do this before
      checking the whole read.

    - When evaluating each base in the read, 
      return as soon as the maximum number
      of mutations are found.

    - When evaluating each cigar operation,
      return as soon as the maximum number 
      of mutations are found and/or when the
      soft-clip percentage reaches the maximum
      allowed.

    - RNA-Editing events occur in clusters, so
      do not apply the maximum mutations filter
      to these calls.

    - Fixed a bug with how the maximum soft-clip
      percentage was calculated.


Fixed off-by-one error (filterByPositionalBias.py)

    - We have enough reads when the depth is greater
      than or equal to the cutoff. The previous code
      was only checking for greater than the cutoff.

    - Updated comments of example VCF lines by moving
      the 'AD' and 'AF' fields to the front of the 
      FORMAT and sample columns.

    - Updated comments of example VCF lines by converting
      the 'INDEL' field into two separate fields 'INS' and 
      'DEL'.


Updated comments of example VCF lines (filterByReadSupport.py)

    - Updated comments of example VCF lines by converting
      the 'INDEL' field into two separate fields 'INS' and 
      'DEL'.

    - Removed obsolete code.


Rearrange FORMAT field (radia.py)

    - For improved readability, move the AD and AF fields
      to the front of the FORMAT column.

    - Update comments of example VCF lines by moving
      the AD and AF fields to the front of the 
      FORMAT and sample columns.

    - To provide more precise information, divide the 
      INDEL field into two separate INS and DEL fields.

    - Update comments of example VCF lines by converting
      the INDEL field into two separate fields INS and 
      DEL.

    - Removed obsolete code.

    - Moved 'AD' and 'AF' fields to the front

    - Updated comments of example VCF lines by converting
      the 'INDEL' field into two separate fields 'INS' and 
      'DEL'.


03/24/2018:

- Added mapping qualities (radia.py)
    
    - Mapping quality scores are not output by samtools mpileup by
      default, but we can add the -s option to output them

    - Keep track of the number of mapping qualities that are 0 per
      allele, and add them to the sample data using the 'MQ0' tag
    
    - Keep track of the max mapping quality per allele, and add it
      to the sample data using the 'MMQ' tag

    - Keep track of the average mapping quality per allele, and add
      it to the sample data using the 'MQA' tag

    - Add the overall average mapping quality to the INFO column

    - Add the total number of mapping quality zero reads to the 
      INFO column

    - Added 'SST' to the INFO column to specify the somatic status

- Added mapping quality filters (filterRadia.py)

     - Added '--rnaMpileupMinMapQual' and '--rnaMpileupMinAvgMapQual'
       params to be passed through to the RNA mpileup filter

     - When filtering RNA calls via mpileup: 
        - Set the MaxErrorPct to 100% for the DNA 
        - Make the 'MinAltAvgBaseQual' parameter sample specific
        - Set the 'MinAltMapQual' and 'MinAltAvgMapQual' for the RNA 

     - When filtering RNA calls via the DNA:
        - Set the filter field to 'PASS' so that these calls pass
          when merged with the DNA calls.  If you don't filter by
          BLAT or by positional bias, then the filter column does
          not get reset to passing.

- Added mapping quality filters (filterByMpileupSupport.py)

    - Added sample specific mapping quality filters:
	- Minimum average mapping quality
	- Minimum mapping quality for at least 1 read
	- Maximum percentage of reads with mapping quality 0

    - Made the 'minAltAvgBaseQual' parameter sample specific

    - Fixed header output:  In general, each filter script adds 
      its own filters to the header, but this script is typically 
      run more than once, which sometimes lead to duplicate header 
      output. Now the filters from the input are parsed and no
      longer duplicated.

    - The overall QUAL score should be an integer.  It was 
      being converted to a float here which is now fixed.

filterByReadSupport.py
    - Updated descriptions for the perfect filters


mergeRnaAndDnaFiles.py
    - Added an 'SST' field to the INFO column to easily
      distinguish the type of call being made.

    - The RNA and DNA mpileup filters are now merged
      correctly.


01/24/2018:
- Deleted outdated code that re-queried base quality scores 

    Around 2 years ago, some RADIA results were generated
    without using the BAQ (Base Alignment Quality) scores from
    samtools. Then the switch was made to use BAQ which is
    recommended by default, and the base quality scores for 
    the old results needed to be updated. Instead of re-running
    the whole pipeline, it was faster to just re-filter and
    adjust the base quality scores only for the passing somatic
    calls. This code hasn't been used in years and is no longer
    needed.

- Change the default genotype from '.' to './.' for diploid calls

    If a sample has no data, then the default genotype should be '.' for 
    haploid calls (e.g. chrom Y) and './.' for diploid calls.

- Make BQ an Integer

    The BQ (Average Base Quality) field should be an integer instead of
    a float. This is the average of the BAQ Phred-scaled base quality
    scores returned from samtools.


01/23/2018:
- Add a trailing forward slash to the input directory if necessary
    
    If the user doesn't include a trailing forward slash on the input
    directory, then glob.glob won't be able to find any files to merge.

- Update the 'SS' field to be compliant with TCGA VCF version 1.1
    
    The TCGA VCF specification for the Somatic Status or 'SS'
    field changed from version 1.0 to version 1.1.
    
    TCGA VCF version 1.0:
    Variant status relative to non-adjacent Normal,
    0=wildtype,1=germline,2=somatic,3=LOH,4=unknown
    
    TCGA VCF version 1.1:
    Variant status relative to non-adjacent Normal,
    0=wildtype,1=germline,2=somatic,3=LOH,4=post-transcriptional
    modification,5=unknown
    
    Since RADIA was focused on the RNA from the start, it added a
    variant status for RNA-Editing as follows:
    Variant status relative to non-adjacent Normal,
    0=wildtype,1=germline,2=somatic,3=LOH,4=unknown,5=rnaEditing
    
    This fix updates the 'SS' field to be compliant with the TCGA
    VCF version 1.1 and handles all of the downstream filtering
    that may have used the 'SS' field.

- Allow snpEff errors to be reported when filtering
    
    By default, snpEff does not gzip the output. If we pipe the snpEff
    output to gzip, the return code and error messages from snpEff get
    overwritten by gzip, and we no longer detect when there's a problem
    with snpEff. Instead of piping the snpEff output to gzip, just keep
    the default output from snpEff in unzipped form. These temporary
    files are deleted in the end anyway. It's more important to know
    when snpEff fails.
    
    Note: with snpEff version 4.3, they do now accept input that has
    been gzipped. They also have a -o option for the output, but I'm
    not sure when that option was introduced, so I'm leaving the
    command as is where it writes to stdout and is piped to the
    output file.


01/22/2018:

- Updated to the latest hg19 SnpEff database
- Updated hg19 pseudo genes to latest Encode Gencode version (v27)
- Updated hg38 pseudo genes to latest Encode Gencode version (v27)
- Updated hg38 to the latest GENCODE version (v27)
- Updated hg19 to the latest GENCODE version (v27)
- Updated to the latest 1000 Genomes phase3 accessibiltiy mask files for hg19
- Updated to dbSNP 150 Common SNPs
- Updated from dbSNP 147 to 150  
     Switching from the entire dbSNP database to just the common SNPs.
     Common SNPs are mapped to just one location in the genome and
     have a minor allele frequency of at least 1%. This will allow
     us to find more RNA Rescue variants, especially those that may
     have been mis-classified as Germline variants in dbSNP. It will
     also allow in more false positives that need to be filtered by
     subsequent filters in the pypeline.

- Removed obsolete LOH code    
    - Calculating LOHs should occur in a script further down the pypeline


01/15/18:

- Updated copyright for 2018

- filterRadia.py
    - In order to merge the RNA and DNA filters, we need to pass 
      through the filtered RNA calls from the 
      filterByMpileupSupport.py command by adding
      anRnaFilename to the mergeRnaAndDnaFiles.py command

    - The user can specify a prefix for the temporary files, 
      this is useful when filtering on parts of chromosomes
    
    - The user can specify an outputFilename, otherwise a file 
      will be automatically created in the outputDir. The file 
      format will be (prefix + '_chr' + chrom + '.vcf') if a 
      prefix is specified or (patientId + '_chr' + chrom + '.vcf')
    
    - Added the --readSupportMinMapQual parameter to pass through 
      to the filterByReadSupport.py command
    
    - Added the transcript name, coordinate, and strand tags to 
      the createBlatFile.py and filterByReadSupport.py commands 
    
    - Added --rnaIncludeSecondaryAlignments to the createBlatFile.py 
      and filterByReadSupport.py commands
    
    - Added a --keepPreviousFilters option to the 
      filterByPositionalBias.py command

- mergeRnaAndDnaFiles.py
    - Add the RNA file with the mpileup filters and output the 
      superset of all RNA and DNA filters for a call that doesn't 
      pass
    
    - Merge the modification filters and filter types in the INFO

- filterByBlat.py
    - Update the modification filters and filter types with the 
      blat filter as well

- filterByPositionalBias.py
    - Added ability to keep the previous filters.  This is needed 
      in the pipeline in order to output the superset of all filters 
      applied to this call.
    
    - Update the modification filters and filter types with the 
      pbias filter as well

- createBlatFile.py:
    - Deal with overlapping read-pairs and avoid double counting
    
    - Added ability to include RNA secondary alignments
    
    - Added the transcript name, coordinate, and strand when 
      processing the transcriptome
    
    - Fix for issue #4: hard-clipping does not add to the cigarList, 
      so skipsBeforeFirstMatch should not be incremented

- filterByReadSupport.py
    - Deal with overlapping read-pairs and avoid double counting
    
    - Added ability to include RNA secondary alignments
    
    - Added the transcript name, coordinate, and strand when 
      processing the transcriptome
    
    - Return all filters that apply to a call, instead of just the 
      first filter
    
    - Added logging to the script
    
    - Fixed a bug that affects how the germline SNPs were counted: 
	
	- A read can only have a certain number of mismatches across 
	  its entirety, but we don't want to count legitimate 
	  mismatches from germline SNPs.  
	
	- Keep track of the transcript germline SNPs separately.
    
    - Made several parameters user-defined:  
    	- minBaseQual
	- minMapQual
	- numNeighborBases
	- minNeighborBaseQual
	- maxMutsPerRead
	- maxReadSoftClipPct
	- maxStrandBias
	- minStrandBiasDepth
	- minPerfectReads
	- minPerfectReadsPct
    
    - Added a perfect read percent filter
    
    - Added a maximum percentage of bases that can be soft-clipped 
      for a read to be considered perfect, default = 10%
    
    - Added a positional bias on perfect reads after a certain 
      depth is reached
    
    - Fixed how the header is output so that INFO and FILTER 
      are grouped together

- filterByMpileupSupport.py

    - Merge the new INFO and FILTER headers with the existing ones
    
    - Output the superset of all filters for all modification types 
      in the FILTER column
    
    - In order to see why each mod type failed, add the mod types and
      mod filters for each mod type to the INFO column

    - Always allow one error read before applying the MaxErrPct filter
    
    - Added ability to filter based on the RNA_NOR_VAR and RNA_TUM_VAR 
      types
    
    - Insert dummy filters for things that don't make sense:
	
	- Now that we are outputting the superset of filters, 
	  it will be easier to see why a call failed
	
	- We need the dummy filters so that the calls fail 
	  and are not processed any further in the pipeline.
	
	- If you have a NOR_EDIT or RNA_NOR_VAR mod type but 
	  the --filterUsingRNA flag is not set, then the filter 
	  was set to 'rnmntb'. This filter is misleading and 
	  has been changed to 'rnacall'.
	
	- If you have a TUM_EDIT or RNA_TUM_VAR mod type but 
	  the --filterUsingRNA flag is not set, then the filter 
	  was set to 'rtmntb'. This filter is misleading and 
	  has been changed to 'rnacall'. 
	
	- If you have a GERM call and the --filterUsingRNA flag 
	  is set, but no normal RNA exists, then the filter was 
	  set to 'rnmntb'. This filter is misleading and has 
	  been changed to 'dnacall'.

	- If a call has a dummy filter, then it should also have the
          other filters to show why it failed.  For example, a call
          with the 'rnacall' filter should also show why it failed
          in the DNA.
    
    - In pre_filter_mod_types:
	
	- If the radia.py script is run with lower variant
	  allele depth and percent requirements than the 
	  filterByMpileupSupport.py script (default), 
	  then you can have situations where calls were 
	  mis-labeled and should be considered as a 
	  different modification type.
	
	- If RADIA is first run on the RNA alone (e.g. if 
	  the RNA is aligned to the transcript first for 
	  variant calling), and then filtered on the DNA 
	  plus RNA, then all RNA calls will be labeled as 
	  RNA_NOR_VAR and RNA_TUM_VAR, and we need 
	  to re-classify them using the DNA and RNA).
	    
	      - Check to see if an RNA_NOR_VAR could be a 
	        GERM or NOR_EDIT
	      
	      - Check to see if an RNA_TUM_VAR could be a 
	        SOM or TUM_EDIT
	      
	      - Check to see if a NOR_EDIT could be a GERM (a lot 
	        of support in the normal RNA and a little support 
		in the normal DNA)
	      
	      - Check to see if a NOR_EDIT could be a TUM_EDIT (not 
	        enough support for a GERM or NOR_EDIT, but enough 
		support in the tumor RNA)

-radia.py
    - Completely changed how the selection and processing of 
      coordinates is done:
        
	- If only a chromosome is specified, then the 
	  selection is done on the whole chromosome
	
	- If a chromosome, start and stop are specified, 
	  then the selection is done only on the region 
	  on the chromosome for the start and stop
	
	- If a coordinatesFilename is specified, then selection 
	  is done on each (chrom, start, stop) specified
    
    - Added ability to input a coordinates file that specifies 
      single coordinates or coordinate ranges to query 
      (3 column tab delimited (chr, start, stop)).  
    
    - The -c or --chromSizesFilename has been deprecated.  
    	
	- The user has to specify at least one fasta file with the 
	  fasta index file that contains the lengths of the 
	  chromosome sizes, so this parameter is not needed.  
    	
	- Instead, the -c or --coordinatesFilename are now used to 
	  specify single coordinates or coordinate ranges to query 
	  (3 column tab delimited (chr, start, stop)).
    
    - Fixed problem with user specifying .mpileup files as input. 
      The .mpileup files were ignored and instead the .bam files 
      were queried.
    
    - If we find a variant, make sure the possible parent allele 
      count is above the previous aMinAltNumBases
    
    - To improve performance, added new processing of raw reads 
      that significantly reduces runtime
    
    - Added a break for the case when no more data is available
    
    - Added the modification filters 'MF' and modification filter 
      types 'MFT' to the header
    
    - Added an option to suppress the outputting of the header
    
    - Added an option to output all data at a position, no matter 
      if a variant is present or not
    
    - Added some flags for outputting VCF lines when debugging
    
    - Added ability to query one single position.
    
    - Added ability to include RNA secondary alignments.
    
    - Added analyte to the SAMPLE tag in the header.
    
    - The AD format field should be an Integer instead of a Float.


01/11/18:

- mergeChroms.py
    - Only output header information if it exists


04/12/17:
- Delete the temporary files when filtering and not debugging
  The debug flag was set for both log==DEBUG and log==INFO.  
  This was useful when debugging to allow another layer of 
  granularity where some info messages were only output while 
  debugging.  Ultimately, this becomes convoluted, so all 
  'info while debugging' messages got changed to regular 
  debug messages, and the overall debug flag is set when 
  the user specifies log==DEBUG.

- Changes to radia.py
    - Fixed problem with user specifying .mpileup files as input
        - The .mpileup files were ignored and instead the 
	  .bam files were queried. Now a user can specify 
	  an .mpileup file, and it will be used instead of the .bam file
	
	- Added a break for the case when no more data is available
	
	- Added some flags for outputting VCF lines when debugging


04/06/17:
- Changes to radia.py
	- Added ability to query one single position:
	
	- Added ability to include secondary alignments
	
	- Added analyte to the SAMPLE tag in the header
	
	- The AD format field should be an Integer instead of a Float

- Changes to filterByMpileup.py
	- Added ability to filter based on the RNA_NOR_VAR and 
	  RNA_TUM_VAR types


04/04/17:
- createBlatFile.py

    - Deal with the samtools subprocess command in a better way by 
      using the communicate() method. The error messages weren't 
      getting raised due to the debug flag.  Now they are returned 
      to the filterRadia.py calling script.  Also, there was a 
      problem using the samtoolsCall.kill() method.  Not 100% sure 
      why, but it probably had to do with the environment changes 
      on the server.

- radia.py 
    
    - Changed the entries with no data from '.' to 
      '.:.:.:.:.:.:.:.:.'. This is TCGA compliant.

    - Added an --outputFilename option to the 
      filterByReadSupport.py script and changed 
      the command in the filterRadia.py script to 
      use it. Now error messages will be returned 
      to the filterRadia.py script.

    - The filterRadia.py script should call the .py 
      files instead of the .pyc files.


10/16/15:
- samtools now allows for all IUPAC nucleotide codes for INDELs.  We 
  need to account for that in the convert_raw_reads(...) method.  
  Users reported the following error:
      ERROR   Traceback: convert_raw_reads() Error at coordinate 
      hs37d5:31525191. The length 32 of the final pileup of reads 
      is != the length 9 of the final quality scores.  Original 
      Pileup=TTTT-2MWT-2MWT-2MWt-2mwT-2MWT-2MWT-2MWt-2mwTTTTTtTTTTT
      tttttTTTTt, Final Pileup=TTTTTTtTTTtTTTTTtTTTTTtttttTTTTt, 
      Original QualScores=888;;;;;;CB;BBCA;ABAABAAA@@@<>A@, Final 
      QualScores=888;;BBA<


10/15/15:
- Starting with pysam version 0.8.3, the query_position attribute 
  for a PileupRead is None if is_del or is_refskip are true.  This 
  lead to the following error:
  
  Traceback (most recent call last):
    File "filterByReadSupport.py", line 524, in <module>
        rnaFilter += club.checkfilter(curr_data.chrom, curr_data.pos-1, 
	curr_data.alt, curr_data.info["SS"], modType, "RNA")
    File "filterByReadSupport.py", line 386, in checkfilter
	if ismut(pileupread, chrom, pos, fasta, alt) and 
	checkread(pileupread):
    File "filterByReadSupport.py", line 217, in ismut
        if pileupread.alignment.seq[pileupread.query_position] == 
	fastafile.fetch(chrom, pos, pos+1):
    TypeError: string indices must be integers, not NoneType


06/10/15:
- Changed all versions to 1.1.3

- Added source code to public repo

- Tagged 1.1.3


05/12/15:
- Tagged 1.1.2

- Changed all versions to 1.1.2

- Require pysam version 0.8.1 or higher to deal with API 
  inconsistencies

- filterByReadSupport.py uses query_position instead of 
  qpos to deal with pysam API inconsistencies

- Added ability to specify pileups as input instead of bams 


03/15/15:
- Committed version 1.1.1

- Require pysam version 0.7.5 to stick with code using qpos


02/23/15:
- Changed RADIA VCF header version to 1.1.1 

- Added mageTab.py (tagged as v1.1.1)


02/06/15:
- Tagged v1.1.1

- Changed BLAT filter to use both the e-value and the identity 
  scores.  A read is valid if it has the highest of either of 
  the scores.

- Changed mpileup filter to re-classify germline mutations without 
  a high enough alternative allele percentage as possibly somatic.


12/05/14:
- Tagged v1.1.0

- Removed LOH code from radia.py

- Added script to merge all chroms

- Removed start and stop coordinates from header

- Fixed the summary numbers from radia mod types

- Fixed radia.py so that the version string will be printed when 
  --version is specified

- Added the source prefix (e.g. "DNA_NORMAL") to the debug messages 
  so you can see which file is being referenced

- Changed error message in radia.py that -a and -z cannot be equal, 
  -z must be larger


11/18/14:
- Removed dbSNP 130, 132


11/03/14:
- Tagged v1.0.0

- Added ability to include the number of filters that a variant 
  overlaps with (e.g. number of self-chains)

- Added BLAT bug fix

- Added RNA editing filtering