Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make --indel-bias more sensitive to indels #1648

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from
Open

Make --indel-bias more sensitive to indels #1648

wants to merge 2 commits into from

Conversation

pd3
Copy link
Member

@pd3 pd3 commented Jan 28, 2022

by switching off an existing heuristics which reduces indelQ
and, in the extreme case, has the power to discard an indel
entirely when the reference alignment has low score.

This is a problem for long reads, so newly --indel-bias 1.01
is added to 'ont' and 'pacbio-ccs' presets.

Tentative fix to #1459

by switching off an existing heuristics which reduces indelQ
and, in the extreme case, has the power to discard an indel
entirely when the reference alignment has low score.

This is a problem for long reads, so newly `--indel-bias 1.01`
is added to 'ont' and 'pacbio-ccs' presets.

Tentative fix to #1459
@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 28, 2022

Unfortunately repurposing an existing option is trashing the intention of the original option.

(Edit: updated with entirety of chr1.)

Chr1 on SynDip with --indel-bias 1.0 (the default) gives:

SNP          Q>0 /   Q>=30 / Filtered
SNP   TP  265363 /  264144 /  264062
SNP   FP    5666 /    4052 /    3341
SNP   GT     676 /     565 /     529
SNP   FN    4292 /    5511 /    5593

InDel TP   38284 /   37737 /   37611
InDel FP    1669 /    1267 /    1172
InDel GT    1632 /    1555 /    1522
InDel FN    7752 /    8299 /    8425

With --indel-bias 1.1, to favour higher recall rate at the expense of lower precision gives:

InDel TP   38349 /   37832 /   37702
InDel FP    1723 /    1309 /    1207
InDel GT    1626 /    1551 /    1517
InDel FN    7687 /    8204 /    8334

So we see at filtered Q30 we have +35 FP and -91 FN (plus 5 fewer with the incorrect genotype assignment). It's a small shift, but it was only a small tweak to the bias value too. It's quite a favourable FP vs FN tradeoff too if you prefer more recall.

Pushing it further, to --indel-bias 4.0 we see:

InDel TP   38669 /   38285 /   38145
InDel FP    2237 /    1740 /    1591
InDel GT    1637 /    1591 /    1560
InDel FN    7367 /    7751 /    7891

That's a substantial shift. Vs the default mode it's +419FP vs -534FN. Closer to parity between the +/-.

With this PR and --indel-bias 1.1 we get:

InDel TP   38758 /   38427 /   38287
InDel FP    2614 /    2026 /    1860
InDel GT    1663 /    1622 /    1590
InDel FN    7278 /    7609 /    7749

Vs the baseline that's now +654FP and -676FN. It may be beneficial depending on what people most highly prioritise, but it's removed a dial from peoples control and has gone all-in basically.

The evidence here is that we need to evaluate the impact of such changes. At the very least, this ought to be a new option so the ability to tune for FP vs FN priority is retained. I suspect the (somewhat hacky) heuristic needs some major tuning. I don't know what it's keying into and why it works for short read data and not long read, but it's likely overly sensitive to either error rates or read lengths, so the long term solution is probably to rewrite it. (I'm OK with a short term fix to optionally simply skip this line of code though.)

…lling

The option switches off an existing heuristics which reduces indelQ
and, in the extreme case, has the power to discard an indel
entirely when the reference alignment has low score.

This is a problem for long reads, so newly `--no-indelQ-tweaks`
is added to 'ont' and 'pacbio-ccs' presets.

Tentative fix to #1459
@pd3
Copy link
Member Author

pd3 commented Feb 8, 2022

This is now restructured into a separate option --no-indelQ-tweaks. Sadly, it does not help always, still investigating..

to estimate the maximum tolerable deviation of the query sequence
from the reference. However, for long reads this sometimes leads to incorrect
rejection of valid indels.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should describe what it's doing as well as what the overall intention is.

From what I understand (this has taken a good hour and I thought it was bugged initially):

Score is top 18-bits total score, 8 bits normalised score, 6 bits type.
So score>>6 & 0xff is normalised score.

The ?: is scaling score from score(tmp=0) down to 0 (tmp>=111). I think high score is bad, but this means high scores get mapped to 0 while low scores say as they are, which feels back to front?

Regardless, it's basically taking into account the normalised per-base score rather than the total alignment score. This is important as not all alignments are the same length and so the total score varies. (This variation is more likely on short reads than long reads, as it's much more likely you're near the end of a read when it's short)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should describe what it's doing as well as what the overall intention is.

Please suggest wording you'd be happy with

Score is top 18-bits total score, 8 bits normalised score, 6 bits type. So score>>6 & 0xff is normalised score.

That is correct.

The ?: is scaling score from score(tmp=0) down to 0 (tmp>=111). I think high score is bad, but this means high scores get mapped to 0 while low scores say as they are, which feels back to front?

Yes. It takes the normalized score of the best alignment, higher score means worse alignment. If it's too high (>111), the indel is considered an artefact. If lower, the indelQ is scaled linearly so that perfect alignment (score=0) leaves the indelQ untouched.

@jkbonfield
Copy link
Contributor

jkbonfield commented Feb 11, 2022

My big concern with this is whether it'll actually last more than a few months as a half-solution.

It's not working right now on all the data files, so we need to find a better fix, which I'm researching. It's quite possible that an improved consensus construction algorithm in the bam2bcf_indel code may make this option completely redundant.

That work isn't going to appear in our next release, so if this option worked more robustly then I'd be inclined to add it anyway. However the fact it seems somewhere around a 50/50 success rate at rescuing these missing indels, while also having a major jump to false positives, makes me wonder if we should simply reject this in favour of a better implementation down the road.

@pd3
Copy link
Member Author

pd3 commented Feb 11, 2022

In the light of all recent tests I agree, this is not a viable solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants