Make --indel-bias more sensitive to indels #1648

Open. Wants to merge 2 commits into base: develop
19 changes: 11 additions & 8 deletions NEWS
@@ -55,6 +55,9 @@ Changes affecting specific commands:
--rf, --incl-flags STR|INT Required flags: skip reads with mask bits unset
--ff, --excl-flags STR|INT Filter flags: skip reads with mask bits set

- New option --no-indelQ-tweaks to increase sensitivity for indels, especially
in long reads.

* bcftools query

- Make the `--samples` and `--samples-file` options work also in the `--list-samples`
@@ -78,7 +81,7 @@ Changes affecting specific commands:
Changes affecting the whole of bcftools, or multiple commands:

* New `--regions-overlap` and `--targets-overlap` options which address
a long-standing design problem with subsetting VCF files by region.
a long-standing design problem with subsetting VCF files by region.
BCFtools recognize two sets of options, one for streaming (`-t/-T`) and
one for index-gumping (`-r/-R`). They behave differently, the first
includes only records with POS coordinate within the regions, the other
@@ -106,11 +109,11 @@ Changes affecting specific commands:
by using `-c INFO/END`.

- add a new '.' modifier to control wheter missing values should be carried
over from a tab-delimited file or not. For example:
over from a tab-delimited file or not. For example:

-c TAG .. adds TAG if the source value is not missing. If TAG
exists in the target file, it will be overwritten

-c .TAG .. adds TAG even if the source value is missing. This
can overwrite non-missing values with a missing value
and can create empty VCF fields (`TAG=.`)
@@ -239,7 +242,7 @@ Changes affecting specific commands:
* bcftools +fill-tags:

- Generalization and better support for custom functions that allow
adding new INFO tags based on arbitrary `-i, --include` type of
adding new INFO tags based on arbitrary `-i, --include` type of
expressions. For example, to calculate a missing INFO/DP annotation
from FORMAT/AD, it is possible to use:

@@ -303,7 +306,7 @@ Changes affecting specific commands:

- Atomization of AD and QS tags now correctly updates occurrences of duplicate
alleles within different haplotypes

- Fix a bug in atomization of Number=A,R tags

* bcftools reheader:
@@ -315,7 +318,7 @@ Changes affecting specific commands:
- A wider range of genotypes can be set by the plugin by allowing
specifying custom genotypes. For example, to force a heterozygous
genotype it is now possible to use expressions like:

c:'m|M'
c:0/1
c:0
@@ -327,7 +330,7 @@ Changes affecting specific commands:
- Better handling of ambiguous keys such as INFO/AF and CSQ/AD. The
`-p, --annot-prefix` option is now applied before doing anything else
which allows its use with `-f, --format` and `-c, --columns` options.

- Some consequence field names may not constitute a valid tag name, such
as "pos(1-based)". Newly field names are trimmed to exclude brackets.

@@ -457,7 +460,7 @@ Changes affecting specific commands:

* bcftools csq:

- Fix a bug wich caused incorrect FORMAT/BCSQ formatting at sites with too
- Fix a bug wich caused incorrect FORMAT/BCSQ formatting at sites with too
many per-sample consequences

- Fix a bug which incorrectly handled the --ncsq parameter and could clash
6 changes: 4 additions & 2 deletions bam2bcf.h
@@ -1,7 +1,8 @@
/* bam2bcf.h -- variant calling.
mplp.indel_bias = 1.01;

Copyright (C) 2010-2012 Broad Institute.
Copyright (C) 2012-2021 Genome Research Ltd.
Copyright (C) 2012-2022 Genome Research Ltd.

Author: Heng Li <[email protected]>

@@ -99,7 +100,8 @@ typedef struct __bcf_callaux_t {
uint16_t *bases; // 5bit: unused, 6:quality, 1:is_rev, 4:2-bit base or indel allele (index to bcf_callaux_t.indel_types)
errmod_t *e;
void *rghash;
float indel_bias; // adjusts indel score threshold; lower => call more.
float indel_bias_inverted; // adjusts indel score threshold, 1/--indel-bias, so lower => call more.
int no_indelQ_tweaks;
} bcf_callaux_t;

// per-sample values
16 changes: 11 additions & 5 deletions bam2bcf_indel.c
@@ -1,7 +1,7 @@
/* bam2bcf_indel.c -- indel caller.

Copyright (C) 2010, 2011 Broad Institute.
Copyright (C) 2012-2014,2016-2017, 2021 Genome Research Ltd.
Copyright (C) 2012-2014,2016-2017,2021-2022 Genome Research Ltd.

Author: Heng Li <[email protected]>

@@ -540,7 +540,7 @@ static int bcf_cgp_align_score(bam_pileup1_t *p, bcf_callaux_t *bca,
}

// used for adjusting indelQ below
l = (int)(100. * sc / (qend - qbeg) + .499) * bca->indel_bias;
l = (int)((100. * sc / (qend - qbeg) + .499) * bca->indel_bias_inverted);
*score = sc<<8 | MIN(255, l);

rep_ele *reps, *elt, *tmp;
@@ -623,8 +623,14 @@ static int bcf_cgp_compute_indelQ(int n, int *n_plp, bam_pileup1_t **plp,
seqQ = est_seqQ(bca, types[sc[0]&0x3f], l_run);
}
tmp = sc[0]>>6 & 0xff;

// Don't know how this indelQ reduction threshold of 111 was derived,
// but it does not function well for longer reads that span multiple
// events.
//
// reduce indelQ
indelQ = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ + .499);
if ( !bca->no_indelQ_tweaks )
indelQ = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ + .499);

// Doesn't really help accuracy, but permits -h to take
// affect still.
@@ -633,7 +639,7 @@ static int bcf_cgp_compute_indelQ(int n, int *n_plp, bam_pileup1_t **plp,
if (seqQ > 255) seqQ = 255;
p->aux = (sc[0]&0x3f)<<16 | seqQ<<8 | indelQ; // use 22 bits in total
sumq[sc[0]&0x3f] += indelQ < seqQ? indelQ : seqQ;
// fprintf(stderr, "pos=%d read=%d:%d name=%s call=%d indelQ=%d seqQ=%d\n", pos, s, i, bam1_qname(p->b), types[sc[0]&0x3f], indelQ, seqQ);
// fprintf(stderr, " read=%d:%d name=%s call=%d indelQ=%d seqQ=%d\n", s, i, bam_get_qname(p->b), types[sc[0]&0x3f], indelQ, seqQ);
}
}
// determine bca->indel_types[] and bca->inscns
@@ -922,7 +928,7 @@ int bcf_call_gap_prep(int n, int *n_plp, bam_pileup1_t **plp, int pos,
fprintf(stderr, "pos=%d type=%d read=%d:%d name=%s "
"qbeg=%d tbeg=%d score=%d\n",
pos, types[t], s, i, bam_get_qname(p->b),
qbeg, tbeg, sc);
qbeg, tbeg, score[K*n_types + t]);
#endif
}
}
51 changes: 29 additions & 22 deletions doc/bcftools.txt
@@ -213,7 +213,7 @@ specific commands to see if they apply.
*--regions-overlap* '0'|'1'|'2'::
This option controls how overlapping records are determined:
set to *0* if the VCF record has to have POS inside a region
(this corresponds to the default behavior of *-t/-T*);
(this corresponds to the default behavior of *-t/-T*);
set to *1* if also overlapping records with POS outside a region
should be included (this is the default behavior of *-r/-R*); or set
to *2* to include only true overlapping variation (compare
@@ -278,7 +278,7 @@ The program ignores the first column and the last indicates sex (1=male, 2=femal

*-T, --targets-file* \[^]'FILE'::
Same *-t, --targets*, but reads regions from a file. Note that *-T*
cannot be used in combination with *-t*.
cannot be used in combination with *-t*.
+
With the *call -C* 'alleles' command, third column of the targets file must
be comma-separated list of alleles, starting with the reference allele.
@@ -478,7 +478,7 @@ Add or remove annotations.
*--single-overlaps*::
use this option to keep memory requirements low with very large annotation
files. Note, however, that this comes at a cost, only single overlapping intervals
are considered in this mode. This was the default mode until the commit
are considered in this mode. This was the default mode until the commit
af6f0c9 (Feb 24 2019).

*--threads* 'INT'::
@@ -633,7 +633,7 @@ demand. The original calling model can be invoked with the *-c* option.
text file with sample names in the first column and group names in the second column. If '-' is
given instead, no HWE assumption is made at all and single-sample calling is performed. (Note that
in low coverage data this inflates the rate of false positives.) The *-G* option requires the presence of
per-sample FORMAT/QS or FORMAT/AD tag generated with *bcftools mpileup -a QS* (or *-a AD*).
per-sample FORMAT/QS or FORMAT/AD tag generated with *bcftools mpileup -a QS* (or *-a AD*).

*-g, --gvcf* 'INT'::
output also gVCF blocks of homozygous REF calls. The parameter 'INT' is the
@@ -892,7 +892,7 @@ depth information, such as INFO/AD or FORMAT/AD. For that, consider using the

*-H, --haplotype* '1'|'2'|'R'|'A'|'I'|'LR'|'LA'|'SR'|'SA'|'1pIu'|'2pIu'::
choose which allele from the FORMAT/GT field to use (the codes are case-insensitive):

'1';;
the first allele, regardless of phasing

@@ -1018,8 +1018,8 @@ depth information, such as INFO/AD or FORMAT/AD. For that, consider using the
==== GEN/SAMPLE conversion:
*-G, --gensample2vcf* 'prefix' or 'gen-file','sample-file'::
convert IMPUTE2 output to VCF. One of the ID columns ("SNP ID" or "rsID" in
https://www.cog-genomics.org/plink/2.0/formats#gen) must be of the form
"CHROM:POS_REF_ALT" to detect possible strand swaps.
https://www.cog-genomics.org/plink/2.0/formats#gen) must be of the form
"CHROM:POS_REF_ALT" to detect possible strand swaps.
{nbsp} +
When the *--vcf-ids* option is given, the other column (autodetected) is used
to fill the ID column of the VCF.
@@ -1279,7 +1279,7 @@ output VCF and are ignored for the prediction analysis.
#
# Attributes required for
# gene lines:
# - ID=gene:<gene_id>
# - ID=gene:<gene_id>
# - biotype=<biotype>
# - Name=<gene_name> [optional]
#
@@ -1553,7 +1553,7 @@ Without the *-g* option, multi-sample cross-check of samples in 'query.vcf.gz' i
that average score is used to determine the top matches, not absolute values.

*--no-HWE-prob*::
Disable calculation of HWE probability to reduce memory requirements with
Disable calculation of HWE probability to reduce memory requirements with
comparisons between very large number of sample pairs.

*-p, --pairs* 'LIST'::
@@ -1622,11 +1622,11 @@ Without the *-g* option, multi-sample cross-check of samples in 'query.vcf.gz' i
// present, a constant value '99' is used for the unseen genotypes. With
// *-G*, the value '1' can be used instead; the discordance value then
// gives exactly the number of differing genotypes.
//
//
// ERR, error rate;;
// Pairwise error rate calculated as number of differences divided
// by the total number of comparisons.
//
//
// CLUSTER, TH, DOT;;
// In presence of multiple samples, related samples and outliers can be
// identified by clustering samples by error rate. A simple hierarchical
@@ -1861,7 +1861,7 @@ For "vertical" merge take a look at *<<concat,bcftools concat>>* or *<<norm,bcft
alternate alleles relevant (local) for the current sample. The number 'INT' gives the
maximum number of alternate alleles that can be included in the PL tag. The default value
is 0 which disables the feature and outputs values for all alternate alleles.

*-m, --merge* 'snps'|'indels'|'both'|'all'|'none'|'id'::
The option controls what types of multiallelic records can be created:
----
@@ -2150,8 +2150,8 @@ INFO/DPR .. Deprecated in favor of INFO/AD; Number of high-quality bases for

1.12 -Q13 -h100 -m1
illumina [ default values ]
ont -B -Q5 --max-BQ 30 -I
pacbio-ccs -D -Q5 --max-BQ 50 -F0.1 -o25 -e1 -M99999
ont -B -Q5 --max-BQ 30 --no-indelQ-tweaks -I
pacbio-ccs -D -Q5 --max-BQ 50 --no-indelQ-tweaks -F0.1 -o25 -e1 -M99999

*--ar, --ambig-reads* 'drop'|'incAD'|'incAD0'::
What to do with ambiguous indel reads that do not span an entire
@@ -2195,6 +2195,13 @@ INFO/DPR .. Deprecated in favor of INFO/AD; Number of high-quality bases for
Note that although the window size approximately corresponds to the maximum
indel size considered, it is not an exact threshold [110]

*--no-indelQ-tweaks*::
Increase sensitivity of indel calling, especially from long reads.
The indel calling algorithm was designed for short reads and uses heuristics
to estimate the maximum tolerable deviation of the query sequence
from the reference. However, for long reads this sometimes leads to incorrect
rejection of valid indels.

Contributor

I think we should describe what it's doing as well as what the overall intention is.

From what I understand (this has taken a good hour and I thought it was bugged initially):

Score is top 18 bits total score, 8 bits normalised score, 6 bits type.
So score>>6 & 0xff is normalised score.

The ?: is scaling score from score(tmp=0) down to 0 (tmp>=111). I think high score is bad, but this means high scores get mapped to 0 while low scores stay as they are, which feels back to front?

Regardless, it's basically taking into account the normalised per-base score rather than the total alignment score. This is important as not all alignments are the same length and so the total score varies. (This variation is more likely on short reads than long reads, as it's much more likely you're near the end of a read when it's short)
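
A minimal sketch of the bit layout described above, assuming the packed value holds 18 bits of total score, then 8 bits of normalised score, then 6 bits of type index. The names `pack_score`, `unpack_score` and `packed_score_t` are hypothetical and do not exist in bcftools; only the shifts and masks follow the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration of the layout discussed in this thread:
 *   bits 14..31  total alignment score (18 bits)
 *   bits  6..13  normalised per-base score, capped at 255 (8 bits)
 *   bits  0..5   indel type index (6 bits)
 */
typedef struct {
    uint32_t total;
    uint32_t norm;
    uint32_t type;
} packed_score_t;

uint32_t pack_score(uint32_t total, uint32_t norm, uint32_t type)
{
    assert(type < 64);            /* type index must fit in 6 bits */
    if (norm > 255) norm = 255;   /* same cap as MIN(255, l) in the diff */
    return (total & 0x3ffff) << 14 | (norm & 0xff) << 6 | (type & 0x3f);
}

packed_score_t unpack_score(uint32_t sc)
{
    packed_score_t u;
    u.total = sc >> 14;           /* top 18 bits */
    u.norm  = sc >> 6 & 0xff;     /* same expression as tmp in the diff */
    u.type  = sc & 0x3f;          /* same mask as sc[0]&0x3f in the diff */
    return u;
}
```

Under this assumed layout, `unpack_score(sc).norm` is the value that the diff reads as `sc[0]>>6 & 0xff`.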

Member Author

> I think we should describe what it's doing as well as what the overall intention is.

Please suggest wording you'd be happy with.

> Score is top 18 bits total score, 8 bits normalised score, 6 bits type. So score>>6 & 0xff is normalised score.

That is correct.

> The ?: is scaling score from score(tmp=0) down to 0 (tmp>=111). I think high score is bad, but this means high scores get mapped to 0 while low scores stay as they are, which feels back to front?

Yes. It takes the normalized score of the best alignment; a higher score means a worse alignment. If it is too high (>111), the indel is considered an artefact. If lower, the indelQ is scaled linearly so that a perfect alignment (score=0) leaves the indelQ untouched.
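
The scaling described in this reply can be sketched as a standalone function. This is an illustration only: `scale_indelq` and its parameter names are hypothetical, and the arithmetic is copied from the expression in the `bcf_cgp_compute_indelQ` hunk, with the `no_tweaks` argument standing in for `bca->no_indelQ_tweaks`.

```c
#include <assert.h>

/* Hypothetical helper mirroring the diff:
 *   if ( !bca->no_indelQ_tweaks )
 *       indelQ = tmp > 111? 0 : (int)((1. - tmp/111.) * indelQ + .499);
 * norm_score is the normalised alignment score (higher means worse).
 */
int scale_indelq(int norm_score, int indelq, int no_tweaks)
{
    assert(indelq >= 0);
    if (no_tweaks) return indelq;        /* --no-indelQ-tweaks: leave untouched */
    if (norm_score > 111) return 0;      /* too poor: treated as an artefact */
    /* linear scaling: score 0 keeps indelQ, score 111 reduces it to 0 */
    return (int)((1. - norm_score / 111.) * indelq + .499);
}
```

For example, `scale_indelq(0, 40, 0)` keeps the quality at 40, while `scale_indelq(112, 40, 0)` drops it to 0; with `no_tweaks` set, the input quality is returned unchanged whatever the score.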

*-I, --skip-indels*::
Do not perform INDEL calling

@@ -2256,7 +2263,7 @@ the *<<fasta_ref,--fasta-ref>>* option is supplied.
See also *--atom-overlaps* and *--old-rec-tag*.

*--atom-overlaps* '.'|'*'::
Alleles missing because of an overlapping variant can be set either
Alleles missing because of an overlapping variant can be set either
to missing (.) or to the star alele (*), as recommended by
the VCF specification. IMPORTANT: Note that asterisk is expaneded
by shell and must be put in quotes or escaped by a backslash:
Expand Down Expand Up @@ -2286,7 +2293,7 @@ the *<<fasta_ref,--fasta-ref>>* option is supplied.
can swap alleles and will update genotypes (GT) and AC counts,
but will not attempt to fix PL or other fields. Also note, and this
cannot be stressed enough, that 's' will NOT fix strand issues in
your VCF, do NOT use it for that purpose!!! (Instead see
your VCF, do NOT use it for that purpose!!! (Instead see
<http://samtools.github.io/bcftools/howtos/plugin.af-dist.html> and
<http://samtools.github.io/bcftools/howtos/plugin.fixref.html>.)

@@ -2330,7 +2337,7 @@ the *<<fasta_ref,--fasta-ref>>* option is supplied.

*--old-rec-tag* 'STR'::
Add INFO/STR annotation with the original record. The format of the
annotation is CHROM|POS|REF|ALT|USED_ALT_IDX.
annotation is CHROM|POS|REF|ALT|USED_ALT_IDX.

*-o, --output* 'FILE'::
see *<<common_options,Common Options>>*
@@ -2949,11 +2956,11 @@ Transition probabilities:

*-M, --rec-rate* 'FLOAT'::
constant recombination rate per bp. In combination with *--genetic-map*,
the *--rec-rate* parameter is interpreted differently, as 'FLOAT'-fold increase of
the *--rec-rate* parameter is interpreted differently, as 'FLOAT'-fold increase of
transition probabilities, which allows the model to become more sensitive
yet still account for recombination hotspots. Note that also the range
of the values is therefore different in both cases: normally the
parameter will be in the range (1e-3,1e-9) but with *--genetic-map*
parameter will be in the range (1e-3,1e-9) but with *--genetic-map*
it will be in the range (10,1000).

*-o, --output* 'FILE'::
@@ -3192,7 +3199,7 @@ Convert between VCF and BCF. Former *bcftools subset*.
Note that filter options below dealing with counting the number of alleles
will, for speed, first check for the values of AC and AN in the INFO column to
avoid parsing all the genotype (FORMAT/GT) fields in the VCF. This means
that a filter like '--min-af 0.1' will be calculated from INFO/AC and INFO/AN
that a filter like '--min-af 0.1' will be calculated from INFO/AC and INFO/AN
when available or FORMAT/GT otherwise. However, it will not attempt to use any other existing
field, like INFO/AF for example. For that, use '--exclude AF<0.1' instead.

@@ -3411,7 +3418,7 @@ to require that all alleles are of the given type. Compare
* array subscripts (0-based), "*" for any element, "-" to indicate a range. Note that
for querying FORMAT vectors, the colon ":" can be used to select a sample and an
element of the vector, as shown in the examples below

INFO/AF[0] > 0.3 .. first AF value bigger than 0.3
FORMAT/AD[0:0] > 30 .. first AD value of the first sample bigger than 30
FORMAT/AD[0:1] .. first sample, second AD value
@@ -3524,7 +3531,7 @@ used on the result. For example, when querying "TAG=1,2,3,4", it will be evaluat

TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)

COUNT(GT="hom")=0 .. no homozygous genotypes at the site
COUNT(GT="hom")=0 .. no homozygous genotypes at the site

AVG(GQ)>50 .. average (arithmetic mean) of genotype qualities bigger than 50
