Canu + minimap for correction on Guppy v3.6 reads: uncorrected has better error profile #1715
Comments
Yes, we've seen some similar effects. I think with the increase in quality of 3.6.0 data, it makes sense to assemble them uncorrected. We're also exploring strategies similar to those used for HiFi data (compress, correct isolated errors, mask systematic errors). You can try it by using the options:
@gringer Any such plots or outcomes from Bonito basecalled datasets?
Sorry, missed this question. I've done a super-accuracy recall (with guppy v5.1.15 / 6.1) with my mouse cDNA reads, but not yet the nippo mtDNA reads. I'll put that on my mental list to do shortly.
I've created an updated, slightly more curated dataset that filters on the following conditions:
Hopefully this will exclude most of the nuclear mitochondrial copies, if they exist. Files can be found here:
Now for the accuracy comparisons. I'm checking the assembled genome using LAST (with default mapping parameters, because I expect it to be very similar to the corrected assembly), and measuring the gap-compressed identity of the longest alignment to the longest assembled contig (likely less than 100% of the genome).
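To make the metric concrete, here is a minimal Python sketch of gap-compressed identity, assuming the usual definition in which each run of consecutive gap columns counts as a single difference. It works on a pair of gapped alignment strings; the function name and the example sequences are hypothetical and are not taken from the LAST output used above.

```python
def gap_compressed_identity(ref_aln: str, query_aln: str) -> float:
    """Return matches / (matches + mismatches + gap openings).

    Both arguments are rows of one pairwise alignment, with '-' for gaps.
    A run of consecutive gap columns in the same row counts once.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned strings must have equal length")

    matches = mismatches = gap_opens = 0
    prev_gap = None  # row holding the gap in the previous column, if any

    for r, q in zip(ref_aln.upper(), query_aln.upper()):
        if r == "-" and q == "-":
            continue  # column carries no information; skip it
        if r == "-" or q == "-":
            gap_row = "ref" if r == "-" else "query"
            if gap_row != prev_gap:
                gap_opens += 1  # only the first column of a gap run counts
            prev_gap = gap_row
        else:
            prev_gap = None
            if r == q:
                matches += 1
            else:
                mismatches += 1

    total = matches + mismatches + gap_opens
    return matches / total if total else 0.0


if __name__ == "__main__":
    # A two-base insertion counts as one event: 8 matches, 1 gap opening.
    print(gap_compressed_identity("ACGT--ACGT", "ACGTTTACGT"))
    # One mismatch and one single-base deletion: 6 matches out of 8 events.
    print(gap_compressed_identity("ACGTACGT", "ACCTAC-T"))
```

Counting gap runs rather than gap columns keeps one long indel from dominating the score, which matters when most of the residual errors are deletions.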
So, at least for these tests, the longest-running canu assemblies didn't produce the best results, and there didn't seem to be any difference between using minimap2 or mhap for the overlapper. The best outcome was reached by doing pre-filtering of reads by mean accuracy, then assembling with correction. This seems inconsistent with my first result, but bear in mind that these are a subset of reads (trying to exclude as much haplotypic variation as possible), and canu may perform better with its own idea of "corrected" reads, even if they are less accurate relative to the reference, because the developers have optimised assembly for those reads. As something of an encore, I tried to see if I could get medaka + LAST to improve on the bestReads_mm2 assembly:
According to
In other words, yes, medaka was able to very slightly improve the accuracy of the assembly (bearing in mind that medaka is designed to work best with Flye and minimap2, rather than canu and LAST). I have now tried this with minimap2 instead of LAST, and unlike LAST, it couldn't make the assembly any more accurate:
A pileup of uncorrected Guppy v3.6 reads produces a better guess at the true sequence than a pileup of corrected reads. The errors are in deletion variants and only in specific locations; for the most part, reads are more accurately called after correction (especially at SNPs). I acknowledge that this difference may be purely in the mapper (I used minimap, rather than the default).
I've started re-assembling my January 2017 reads from Nippostrongylus brasiliensis, recalled using Guppy v3.6, with Canu v2.0. We've got a pretty accurate mitochondrial genome that's been previously assembled from Illumina-corrected nanopore reads (confirmed by our own Illumina cDNA reads, and by the Sanger Institute's own Illumina reads), so I've been using that as a yardstick to measure the accuracy of basecalled unmethylated reads. I use LAST with a trained alignment matrix for this mapping, because it seems to have better mapping error profiles in comparison to minimap2.
Here is a combined coverage / variant plot showing Canu-corrected reads:
And here is a combined coverage / variant plot showing the uncorrected version of those same reads:
The identified variants with frequency >40% on both the forward and reverse strand are indicated on the outer portion of the plot. Only deletion variants exist in both plots. Note that there are more identified variants for the corrected reads (11) than for the uncorrected reads (1). Based on Illumina reads that I've looked at, I believe that uncorrected variant to be a true reflection of polymorphic mitochondrial sequence, rather than an error.
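The strand-consistency rule described here can be sketched in a few lines of Python. This is only an illustration of the >40% on both strands threshold, not the actual visualisation script; the `VariantCount` record, its field names, and the example counts are all hypothetical.

```python
from typing import List, NamedTuple


class VariantCount(NamedTuple):
    """Per-strand pileup counts for one candidate variant (hypothetical layout)."""
    position: int
    variant: str       # e.g. "del" for a deletion
    fwd_variant: int   # forward-strand reads supporting the variant
    fwd_depth: int     # forward-strand coverage at this position
    rev_variant: int   # reverse-strand reads supporting the variant
    rev_depth: int     # reverse-strand coverage at this position


def strand_consistent(variants: List[VariantCount],
                      min_freq: float = 0.4) -> List[VariantCount]:
    """Keep variants whose frequency exceeds min_freq on BOTH strands."""
    kept = []
    for v in variants:
        if v.fwd_depth == 0 or v.rev_depth == 0:
            continue  # cannot confirm on a strand that has no coverage
        fwd_freq = v.fwd_variant / v.fwd_depth
        rev_freq = v.rev_variant / v.rev_depth
        if fwd_freq > min_freq and rev_freq > min_freq:
            kept.append(v)
    return kept


if __name__ == "__main__":
    candidates = [
        VariantCount(1200, "del", 180, 320, 170, 330),  # high on both strands
        VariantCount(5400, "del", 200, 320, 20, 330),   # one strand only
    ]
    for v in strand_consistent(candidates):
        print(v.position, v.variant)
```

A signal that passes on one strand only, such as the second record, is dropped, which is how this kind of check screens out strand-specific artefacts.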
This is not an issue, as such. More of an observation, just in case it's helpful for improving and/or speeding up Canu.
In case anyone wants to have a look at these reads, the uncorrected Guppy v3.6-called reads that map to the mitochondrial genome can be found here. Coverage for the mitochondrial genome is about 650X.
In order to map reads to the mitochondrial genome I use LAST. Here's an example command sequence:
I have a semi-automated visualisation script that shows me coverage and variant frequencies, and determines any observed variants that are consistent on forward-mapped and reverse-mapped reads. I use this consistency check to exclude methylation-related variant signals (because it's been my experience that typically only one strand is methylated in the same place):
command:
~/install/canu/canu-2.0/Linux-amd64/bin/canu -nanopore-raw called_Nb_CFED_65bptrim_guppy_3.6.0.fq.gz -p Nb_ONTCFED_guppy360_65bpTrim -d Nb_ONTCFED_guppy360_65bpTrim genomeSize=400M corOverlapper=minimap
version: Canu 2.0
system: Debian Linux desktop
Linux elegans 5.6.0-1-amd64 #1 SMP Debian 5.6.7-1 (2020-04-29) x86_64 GNU/Linux