Make --indel-bias more sensitive to indels #1648

pd3 · 2022-01-28T08:22:53Z

by switching off an existing heuristics which reduces indelQ
and, in the extreme case, has the power to discard an indel
entirely when the reference alignment has low score.

This is a problem for long reads, so newly --indel-bias 1.01
is added to 'ont' and 'pacbio-ccs' presets.

Tentative fix to #1459

by switching off an existing heuristics which reduces indelQ and, in the extreme case, has the power to discard an indel entirely when the reference alignment has low score. This is a problem for long reads, so newly `--indel-bias 1.01` is added to 'ont' and 'pacbio-ccs' presets. Tentative fix to #1459

jkbonfield · 2022-01-28T09:54:30Z

Unfortunately repurposing an existing option is trashing the intention of the original option.

(Edit: updated with entirety of chr1.)

Chr1 on SynDip with --indel-bias 1.0 (the default) gives:

SNP          Q>0 /   Q>=30 / Filtered
SNP   TP  265363 /  264144 /  264062
SNP   FP    5666 /    4052 /    3341
SNP   GT     676 /     565 /     529
SNP   FN    4292 /    5511 /    5593

InDel TP   38284 /   37737 /   37611
InDel FP    1669 /    1267 /    1172
InDel GT    1632 /    1555 /    1522
InDel FN    7752 /    8299 /    8425

With --indel-bias 1.1, to favour higher recall rate at the expense of lower precision gives:

InDel TP   38349 /   37832 /   37702
InDel FP    1723 /    1309 /    1207
InDel GT    1626 /    1551 /    1517
InDel FN    7687 /    8204 /    8334

So we see at filtered Q30 we have +35 FP and -91 FN (plus 5 fewer with the incorrect genotype assignment). It's a small shift, but it was only a small tweak to the bias value too. It's quite a favourable FP vs FN tradeoff too if you prefer more recall.

Pushing it further, to --indel-bias 4.0 we see:

InDel TP   38669 /   38285 /   38145
InDel FP    2237 /    1740 /    1591
InDel GT    1637 /    1591 /    1560
InDel FN    7367 /    7751 /    7891

That's a substantial shift. Vs the default mode it's +419FP vs -534FN. Closer to parity between the +/-.

With this PR and --indel-bias 1.1 we get:

InDel TP   38758 /   38427 /   38287
InDel FP    2614 /    2026 /    1860
InDel GT    1663 /    1622 /    1590
InDel FN    7278 /    7609 /    7749

Vs the baseline that's now +654FP and -676FN. It may be beneficial depending on what people most highly prioritise, but it's removed a dial from peoples control and has gone all-in basically.

The evidence here is that we need to evaluate the impact of such changes. At the very least, this ought to be a new option so the ability to tune for FP vs FN priority is retained. I suspect the (somewhat hacky) heuristic needs some major tuning. I don't know what it's keying into and why it works for short read data and not long read, but it's likely overly sensitive to either error rates or read lengths, so the long term solution is probably to rewrite it. (I'm OK with a short term fix to optionally simply skip this line of code though.)

…lling The option switches off an existing heuristics which reduces indelQ and, in the extreme case, has the power to discard an indel entirely when the reference alignment has low score. This is a problem for long reads, so newly `--no-indelQ-tweaks` is added to 'ont' and 'pacbio-ccs' presets. Tentative fix to #1459

pd3 · 2022-02-08T14:46:26Z

This is now restructured into a separate option --no-indelQ-tweaks. Sadly, it does not help always, still investigating..

jkbonfield · 2022-02-10T16:09:32Z

doc/bcftools.txt

+    to estimate the maximum tolerable deviation of the query sequence
+    from the reference. However, for long reads this sometimes leads to incorrect
+    rejection of valid indels.
+


I think we should describe what it's doing as well as what the overall intention is.

From what I understand (this has taken a good hour and I thought it was bugged initially):

Score is top 18-bits total score, 8 bits normalised score, 6 bits type.
So score>>6 & 0xff is normalised score.

The ?: is scaling score from score(tmp=0) down to 0 (tmp>=111). I think high score is bad, but this means high scores get mapped to 0 while low scores say as they are, which feels back to front?

Regardless, it's basically taking into account the normalised per-base score rather than the total alignment score. This is important as not all alignments are the same length and so the total score varies. (This variation is more likely on short reads than long reads, as it's much more likely you're near the end of a read when it's short)

I think we should describe what it's doing as well as what the overall intention is.

Please suggest wording you'd be happy with

Score is top 18-bits total score, 8 bits normalised score, 6 bits type. So score>>6 & 0xff is normalised score.

That is correct.

The ?: is scaling score from score(tmp=0) down to 0 (tmp>=111). I think high score is bad, but this means high scores get mapped to 0 while low scores say as they are, which feels back to front?

Yes. It takes the normalized score of the best alignment, higher score means worse alignment. If it's too high (>111), the indel is considered an artefact. If lower, the indelQ is scaled linearly so that perfect alignment (score=0) leaves the indelQ untouched.

jkbonfield · 2022-02-11T09:57:21Z

My big concern with this is whether it'll actually last more than a few months as a half-solution.

It's not working right now on all the data files, so we need to find a better fix, which I'm researching. It's quite possible that an improved consensus construction algorithm in the bam2bcf_indel code may make this option completely redundant.

That work isn't going to appear in our next release, so if this option worked more robustly then I'd be inclined to add it anyway. However the fact it seems somewhere around a 50/50 success rate at rescuing these missing indels, while also having a major jump to false positives, makes me wonder if we should simply reject this in favour of a better implementation down the road.

pd3 · 2022-02-11T09:58:47Z

In the light of all recent tests I agree, this is not a viable solution.

jkbonfield reviewed Feb 10, 2022

View reviewed changes

jkbonfield mentioned this pull request Mar 15, 2022

Improvements to indel calling. #1679

Closed

pd3 force-pushed the develop branch from ab5de54 to 2191405 Compare February 13, 2023 10:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make --indel-bias more sensitive to indels #1648

Make --indel-bias more sensitive to indels #1648

pd3 commented Jan 28, 2022

jkbonfield commented Jan 28, 2022 •

edited

Loading

pd3 commented Feb 8, 2022

jkbonfield Feb 10, 2022

pd3 Feb 11, 2022

jkbonfield commented Feb 11, 2022 •

edited

Loading

pd3 commented Feb 11, 2022

Make --indel-bias more sensitive to indels #1648

Are you sure you want to change the base?

Make --indel-bias more sensitive to indels #1648

Conversation

pd3 commented Jan 28, 2022

jkbonfield commented Jan 28, 2022 • edited Loading

pd3 commented Feb 8, 2022

jkbonfield Feb 10, 2022

Choose a reason for hiding this comment

pd3 Feb 11, 2022

Choose a reason for hiding this comment

jkbonfield commented Feb 11, 2022 • edited Loading

pd3 commented Feb 11, 2022

jkbonfield commented Jan 28, 2022 •

edited

Loading

jkbonfield commented Feb 11, 2022 •

edited

Loading