-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make --indel-bias more sensitive to indels #1648
base: develop
Are you sure you want to change the base?
Conversation
by switching off an existing heuristics which reduces indelQ and, in the extreme case, has the power to discard an indel entirely when the reference alignment has low score. This is a problem for long reads, so newly `--indel-bias 1.01` is added to 'ont' and 'pacbio-ccs' presets. Tentative fix to #1459
Unfortunately repurposing an existing option is trashing the intention of the original option. (Edit: updated with entirety of chr1.) Chr1 on SynDip with
With
So we see at filtered Q30 we have +35 FP and -91 FN (plus 5 fewer with the incorrect genotype assignment). It's a small shift, but it was only a small tweak to the bias value too. It's quite a favourable FP vs FN tradeoff too if you prefer more recall. Pushing it further, to
That's a substantial shift. Vs the default mode it's +419FP vs -534FN. Closer to parity between the +/-. With this PR and
Vs the baseline that's now +654FP and -676FN. It may be beneficial depending on what people most highly prioritise, but it's removed a dial from peoples control and has gone all-in basically. The evidence here is that we need to evaluate the impact of such changes. At the very least, this ought to be a new option so the ability to tune for FP vs FN priority is retained. I suspect the (somewhat hacky) heuristic needs some major tuning. I don't know what it's keying into and why it works for short read data and not long read, but it's likely overly sensitive to either error rates or read lengths, so the long term solution is probably to rewrite it. (I'm OK with a short term fix to optionally simply skip this line of code though.) |
…lling The option switches off an existing heuristics which reduces indelQ and, in the extreme case, has the power to discard an indel entirely when the reference alignment has low score. This is a problem for long reads, so newly `--no-indelQ-tweaks` is added to 'ont' and 'pacbio-ccs' presets. Tentative fix to #1459
This is now restructured into a separate option |
to estimate the maximum tolerable deviation of the query sequence | ||
from the reference. However, for long reads this sometimes leads to incorrect | ||
rejection of valid indels. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should describe what it's doing as well as what the overall intention is.
From what I understand (this has taken a good hour and I thought it was bugged initially):
Score is top 18-bits total score, 8 bits normalised score, 6 bits type.
So score>>6 & 0xff is normalised score.
The ?: is scaling score from score(tmp=0) down to 0 (tmp>=111). I think high score is bad, but this means high scores get mapped to 0 while low scores say as they are, which feels back to front?
Regardless, it's basically taking into account the normalised per-base score rather than the total alignment score. This is important as not all alignments are the same length and so the total score varies. (This variation is more likely on short reads than long reads, as it's much more likely you're near the end of a read when it's short)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should describe what it's doing as well as what the overall intention is.
Please suggest wording you'd be happy with
Score is top 18-bits total score, 8 bits normalised score, 6 bits type. So score>>6 & 0xff is normalised score.
That is correct.
The ?: is scaling score from score(tmp=0) down to 0 (tmp>=111). I think high score is bad, but this means high scores get mapped to 0 while low scores say as they are, which feels back to front?
Yes. It takes the normalized score of the best alignment, higher score means worse alignment. If it's too high (>111), the indel is considered an artefact. If lower, the indelQ is scaled linearly so that perfect alignment (score=0) leaves the indelQ untouched.
My big concern with this is whether it'll actually last more than a few months as a half-solution. It's not working right now on all the data files, so we need to find a better fix, which I'm researching. It's quite possible that an improved consensus construction algorithm in the bam2bcf_indel code may make this option completely redundant. That work isn't going to appear in our next release, so if this option worked more robustly then I'd be inclined to add it anyway. However the fact it seems somewhere around a 50/50 success rate at rescuing these missing indels, while also having a major jump to false positives, makes me wonder if we should simply reject this in favour of a better implementation down the road. |
In the light of all recent tests I agree, this is not a viable solution. |
by switching off an existing heuristics which reduces indelQ
and, in the extreme case, has the power to discard an indel
entirely when the reference alignment has low score.
This is a problem for long reads, so newly
--indel-bias 1.01
is added to 'ont' and 'pacbio-ccs' presets.
Tentative fix to #1459