Canu + minimap for correction on Guppy v3.6 reads: uncorrected has better error profile #1715

gringer · 2020-05-15T14:17:59Z

A pileup of uncorrected Guppy v3.6 reads produces a better guess at the true sequence compared to corrected reads. The errors are in deletion variants and only in specific locations; for the most part, reads are more accurately called after correction (especially at SNPs). I acknowledge that this difference may be purely in the mapper (I used mmap, rather than the default).

I've started re-assembling my January 2017 reads from Nippostrongylus brasiliensis, recalled using Guppy v3.6, using Canu v2.0. We've got a pretty accurate mitochondrial genome that's been previously assembled from Illumina-corrected nanopore reads (confirmed by our own Illumina cDNA reads, and by the Sanger Institute's own Illumina reads), so I've been using that as a yard stick to measure the accuracy of basecalled unmethylated reads. I use LAST with a trained alignment matrix for this mapping, because it seems to have better mapping error profiles in comparison to minimap2.

Here is a combined coverage / variant plot showing Canu-corrected reads:

And here is a combined coverage / variant plot showing the uncorrected version of those same reads:

The identified variants with frequency >40% on both the forward and reverse strand are indicated on the outer portion of the plot. Only deletion variants exist in both plots. Note that there are more identified variants for the corrected reads (11) than for the uncorrected reads (1). Based on Illumina reads that I've looked at, I believe that uncorrected variant to be a true reflection of polymorphic mitochondrial sequence, rather than an error.

This is not an issue, as such. More of an observation, just in case it's helpful for improving and/or speeding up Canu.

In case anyone wants to have a look at these reads, the uncorrected Guppy v3.6-called reads that map to the mitochondrial genome can be found here. Coverage for the mitochondrial genome is about 650X.

In order to map reads to the mitochondrial genome I use LAST. Here's an example command sequence:

lastal -P 10 -p ~/db/last/bc.mat ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta uncorrectedReads_vs_Nb_mtDNA.fa.gz | last-split | maf-convert sam | samtools view -h --reference ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta | samtools sort > mtDNA_called_uncorrected_vs_nippo_mtDNA.bam

I have a semi-automated visualisation script that shows me coverage and variant frequencies, and determines any observed variants that are consistent on forward-mapped and reverse-mapped reads. I use this consistency check to exclude methylation-related variant signals (because it's been my experience that typically only one strand is methylated in the same place):

samtools view mtDNA_called_uncorrected_vs_nippo_mtDNA.bam -F 0x10 -b | samtools mpileup --reference ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta -d 100000 -Q 0 - | ~/scripts/readstomper.pl -c > stompedCounts_mtDNA_called_uncorrected_vs_nippo_mtDNA_fwd.csv
samtools view mtDNA_called_uncorrected_vs_nippo_mtDNA.bam -f 0x10 -b | samtools mpileup --reference ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta -d 100000 -Q 0 - | ~/scripts/readstomper.pl -c > stompedCounts_mtDNA_called_uncorrected_vs_nippo_mtDNA_rev.csv
~/scripts/stomp_plotter.r -f stompedCounts_fwd_4T1_ρ0SC_vs_chrM_9457.csv -r stompedCounts_rev_4T1_ρ0SC_vs_chrM_9457.csv -circular 16299 -adj 9457 -log -type png -scale -max 60000 -nodels -mindepth 2

command:
~/install/canu/canu-2.0/Linux-amd64/bin/canu -nanopore-raw called_Nb_CFED_65bptrim_guppy_3.6.0.fq.gz -p Nb_ONTCFED
_guppy360_65bpTrim -d Nb_ONTCFED_guppy360_65bpTrim genomeSize=400M corOverlapper=minimap

version: Canu 2.0

system: Debian Linux desktop
Linux elegans 5.6.0-1-amd64 #1 SMP Debian 5.6.7-1 (2020-04-29) x86_64 GNU/Linux

The text was updated successfully, but these errors were encountered:

skoren · 2020-05-15T14:20:27Z

Yes, we've seen some similar effects. I think with the increase in quality of 3.6.0 data, it makes sense to assemble them uncorrected.

We're exploring similar strategies to HiFi data (compress, correct isolated errors, mask systematic errors) as well. You can try it by using the options:

-trim-assemble batOptions="-eg 0.12 -sb 0.01 -dg 12 -db 12 -dr 6 -ca 1000 -cp 10" "correctedErrorRate=0.12" -pacbio-hifi <your nanopore fastq>

harish0201 · 2021-04-05T17:28:28Z

@gringer Any such plots or outcomes from Bonito basecalled datasets?

gringer · 2022-04-28T21:59:10Z

Sorry, missed this question. I've done a super-accuracy recall (with guppy v5.1.15 / 6.1) with my mouse cDNA reads, but not yet the nippo mtDNA reads. I'll put that on my mental list to do shortly.

gringer · 2022-05-02T05:37:09Z

I've created an updated dataset, with a slightly more curated dataset that filters on the following conditions:

length >= 3kb
target (mitochondrial) match length >= 50% OR
query (read) match length >= 90%

Hopefully this will exclude most of the nuclear mitochondrial copies, if they exist. Files can be found here:

corrected reference chromosome (13 kB): Nb_mtDNA_MRSR_corrected.fasta
fast5 file (20 MB): reads_long_nippo_mtDNA.fast5
guppy v5.1.15 called reads (1.4 MB): guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz
100 most-accurate reads (according to guppy; 463 kB): 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz

gringer · 2022-05-02T06:08:13Z

Now for the accuracy comparisons. I'm checking the assembled genome using LAST (with default mapping parameters, because I expect it to be very similar to the corrected assembly), and measuring the gap-compressed identity of the longest alignment to the longest assembled contig (likely less than 100% of the genome).

# default parameters, all reads
$ canu-2.2/bin/canu -nanopore guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_allReads_default -d canu_allReads_default genomeSize=13k
# GCI: 99.87% (17 errors)
# default parameters, 100 most accurate reads
$ canu-2.2/bin/canu -nanopore 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_bestReads_default -d canu_bestReads_default genomeSize=13k
# GCI: 99.91% (12 errors)
# @skoren's recommended options, all reads
$ canu-2.2/bin/canu -p canu_allReads_skOpts -d canu_allReads_skOpts genomeSize=13k  -trim-assemble batOptions="-eg 0.12 -sb 0.01 -dg 12 -db 12 -dr 6 -ca 1000 -cp 10" "correctedErrorRate=0.12" -pacbio-hifi guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz
# GCI: 99.84% (21 errors) [... overlapErrorAdjustment took 270 s / 920 s, much longer than others]
# trim & assemble only, all reads
$ canu-2.2/bin/canu -trim-assemble -nanopore-corrected guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_allReads_trimAssemble -d canu_allReads_trimAssemble genomeSize=13k
# GCI: 99.85% (20 errors) [overlapErrorAdjustment took 35s]
# trim & assemble only, 100 most accurate reads
$ canu-2.2/bin/canu -trim-assemble -nanopore-corrected 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_bestReads_trimAssemble -d canu_bestReads_trimAssemble genomeSize=13k
# GCI: 99.83% (21 errors)
# minimap2, trim & assemble only, 100 most accurate reads
$ canu-2.2/bin/canu -trim-assemble -nanopore-corrected 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_bestReads_mm2TrimAssemble -d canu_bestReads_mm2TrimAssemble genomeSize=13k corOverlapper=minimap
# GCI: 99.83% (21 errors)
# minimap2 [with correction], 100 most accurate reads
$ canu-2.2/bin/canu -nanopore 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_bestReads_mm2 -d canu_bestReads_mm2 genomeSize=13k corOverlapper=minimap
# GCI: 99.91% (12 errors)

So, at least for these tests, the longest-running canu assemblies didn't produce the best results, and there didn't seem to be any difference between using minimap2 or mhap for the overlapper. The best outcome was reached by doing pre-filtering of reads by mean accuracy, then assembling with correction.

This seems inconsistent with my first result, but bear in mind that these are a subset of reads (trying to exclude as much haplotypic variation as possible), and canu may perform better with its own idea of "corrected" reads, even if they are less accurate relative to the reference, because the developers have optimised assembly for those reads.

As something of an encore, I tried to see if I could get medaka + LAST to improve on the bestReads_mm2 assembly:

$ lastdb -uNEAR -R01 canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta
$ last-train -P 10 -Q 1 canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz > trained.mat
$ lastal -p trained.mat -P 10 -j 7 canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz | maf-convert sam | samtools view -b -T canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta | samtools sort > calls_to_draft.bam
$ samtools index calls_to_draft.bam
$ mkdir -p results
$ medaka consensus calls_to_draft.bam results/contig1.hdf --model r941_min_sup_g507
$ medaka stitch results/contig1.hdf canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta results/polished.assembly.fasta
$ lastal Nb_mtDNA_MRSR_corrected.fasta results/polished.assembly.fasta | last-map-probs | ~/scripts/maf2csv.pl | cut -d ',' -f 14,15
# GCI: 99.93 (10 errors)

According to diffseq, the errors are all deletions. This is a known/expected issue for nanopore reads due to the increased chance of skipped signal, and it makes some sense to be a bit consistent because certain sequences may make DNA unwinding a little bit harder or easier:

# [I doubled the genome to make sure I could get a contiguous section of matching sequence. Duplicated matches have been removed]
########################################
# Program: diffseq
# Rundate: Tue  3 May 2022 01:11:48
# Commandline: diffseq
#    [-asequence] dbl_Nb_mtDNA_MRSR_corrected.fasta
#    [-bsequence] results/polished.assembly.fasta
# Report_format: diffseq
# Report_file: nb_mtdna.diffseq
# Additional_files: 2
# 1: Nb_mtDNA.diffgff (Feature file for first sequence)
# 2: tig00000001.diffgff (Feature file for second sequence)
########################################

#=======================================
#
# Sequence: Nb_mtDNA     from: 1   to: 26710
# HitCount: 14
#
# Compare: tig00000001     from: 1   to: 25293
# 
# Nb_mtDNA overlap starts at 1
# tig00000001 overlap starts at 7909
# 
#
#=======================================
Nb_mtDNA 730-730 Length: 1
Sequence: T
Sequence: 
tig00000001 8637 Length: 0

Nb_mtDNA 2842-2842 Length: 1
Sequence: T
Sequence: 
tig00000001 10748 Length: 0

Nb_mtDNA 3562-3563 Length: 2
Sequence: TT
Sequence: 
tig00000001 11467 Length: 0

Nb_mtDNA 4193-4193 Length: 1
Sequence: T
Sequence: 
tig00000001 12096 Length: 0

Nb_mtDNA 4919-4919 Length: 1
Sequence: T
Sequence: 
tig00000001 12821 Length: 0

Nb_mtDNA 5201-5201 Length: 1
Sequence: T
Sequence: 
tig00000001 13102 Length: 0

Nb_mtDNA 8253-8253 Length: 1
Sequence: T
Sequence: 
tig00000001 16153 Length: 0

Nb_mtDNA 8881-8882 Length: 2
Sequence: TT
Sequence: 
tig00000001 16780 Length: 0

Nb_mtDNA 9530-9530 Length: 1
Sequence: T
Sequence: 
tig00000001 17427 Length: 0

Nb_mtDNA 12316-12316 Length: 1
Sequence: T
Sequence: 
tig00000001 20212 Length: 0

In other words, yes, medaka was able to very slightly improve the accuracy of the assembly (bearing in mind that medaka is designed to work best with Flye and minimap2, rather than canu and LAST).

I have now tried this with minimap2 instead of LAST, and unlike LAST, it couldn't make the assembly any more accurate:

$ minimap2 -a Nb_mtDNA_MRSR_corrected.fasta results/polished.assembly.fasta | samtools sort > mm2_calls_to_draft.bam
$ samtools index mm2_calls_to_draft.bam
$ medaka consensus mm2_calls_to_draft.bam results/mm2_contig1.hdf --model r941_min_sup_g507
$ medaka stitch results/mm2_contig1.hdf canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta results/mm2_polished.assembly.fasta
...
[Copying contig 'tig00000001' verbatim from input]

gringer · 2022-05-02T13:06:01Z

Revisiting those plots... canu does a really good job at read correction, essentially eliminating the random SNP errors, and most INDELs. The problem (as previously observed) is that it leaves behind some consistent errors:

skoren mentioned this issue Feb 1, 2021

Low level of divergence in the sample being assembled #1889

Closed

Yutang-ETH mentioned this issue Jul 20, 2021

Assemble a diploid assembly with ONT long reads #1985

Closed

thallinger mentioned this issue Jun 5, 2022

Canu parameter settings to assemble ONT R10.4.1 with SQK-LSK112 data #2131

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Canu + minimap for correction on Guppy v3.6 reads: uncorrected has better error profile #1715

Canu + minimap for correction on Guppy v3.6 reads: uncorrected has better error profile #1715

gringer commented May 15, 2020 •

edited

Loading

skoren commented May 15, 2020 •

edited

Loading

harish0201 commented Apr 5, 2021

gringer commented Apr 28, 2022 •

edited

Loading

gringer commented May 2, 2022 •

edited

Loading

gringer commented May 2, 2022 •

edited

Loading

gringer commented May 2, 2022

Canu + minimap for correction on Guppy v3.6 reads: uncorrected has better error profile #1715

Canu + minimap for correction on Guppy v3.6 reads: uncorrected has better error profile #1715

Comments

gringer commented May 15, 2020 • edited Loading

skoren commented May 15, 2020 • edited Loading

harish0201 commented Apr 5, 2021

gringer commented Apr 28, 2022 • edited Loading

gringer commented May 2, 2022 • edited Loading

gringer commented May 2, 2022 • edited Loading

gringer commented May 2, 2022

gringer commented May 15, 2020 •

edited

Loading

skoren commented May 15, 2020 •

edited

Loading

gringer commented Apr 28, 2022 •

edited

Loading

gringer commented May 2, 2022 •

edited

Loading

gringer commented May 2, 2022 •

edited

Loading