Assemble a diploid assembly with ONT long reads #1985

Yutang-ETH · 2021-07-20T14:33:49Z

Hi,

Thank you very much for developing Canu and also thank you very much for supporting the scientific community.

I have ONT long reads (42x coverage), basecalled with Guppy v4, for a highly heterozygous diploid plant genome. Recently, I tried to reconstruct both haplotypes of the genome using a read-based phasing approach. However, with a mosaic haploid assembly as the reference, alignment bias prevents me from cleanly splitting the reads into two haplotype groups. I am now thinking about assembling a diploid assembly directly with Canu, without collapsing haplotypes, but I am confused by the batOptions settings.

In issue #1715, you recommended -trim-assemble batOptions="-eg 0.12 -sb 0.01 -dg 12 -db 12 -dr 6 -ca 1000 -cp 10" "correctedErrorRate=0.12" -pacbio-hifi. Is this suitable for my case? My raw ONT long reads have a mean read accuracy of around Q10 (about 10% error), and all reads are above Q7 (about 20% error).
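For reference, the Q-score figures quoted above follow from the standard Phred relationship, error = 10^(-Q/10). The snippet below is only a minimal sketch of that conversion; the function name is illustrative, and it says nothing about how well basecaller Q scores reflect the true read error.

```python
def phred_to_error_rate(q: float) -> float:
    """Convert a Phred quality score to an error probability: 10^(-Q/10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # The two thresholds mentioned in the post.
    for q in (7, 10):
        print(f"Q{q}: {phred_to_error_rate(q):.1%} error")
```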

My questions are:

1. Is it feasible to use Canu to assemble both haplotypes with my current data, given the error rate and coverage?
2. If so, do I need to adjust the parameters you recommended in the batOptions above? Which ones should I tune?

Could you please give a little hint? Thank you very much.

Best wishes,
Yutang

skoren · 2021-07-23T21:03:27Z

Typically Nanopore data is not accurate enough for this unless the haplotypes are very diverged (over 2-3% between them). Below that, you'll end up mixing haplotypes and will likely get some regions collapsed and others (the more diverged ones) uncollapsed.
I wouldn't rely on the Q score to estimate read error; it is likely not that accurate. The error rate Canu uses is also measured in homopolymer-compressed space, where the reads will be more accurate. You could get a better estimate of the error rate by aligning some compressed reads to a collapsed, compressed assembly. You'd then set the batOptions and correctedErrorRate to double that estimate. With newer Guppy versions we've typically seen 5-6% error (and lower with Guppy 5+), which is where the 12% comes from. If your error rate is much above 6-8%, I would suggest using the standard option with correction.
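The estimation recipe above can be sketched roughly as follows. This is an illustration under stated assumptions, not Canu's own code: the function names are hypothetical, the alignment counts are made-up example numbers, and the error-rate definition (non-matching columns over all alignment columns) is one common convention that your aligner's summary may or may not share. In the recipe quoted from issue #1715, the value 0.12 appears in both `-eg` and `correctedErrorRate`.

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases to a single base (AAACC -> AC)."""
    return "".join(base for base, _run in groupby(seq))


def error_rate_from_alignment(matches: int, mismatches: int,
                              insertions: int, deletions: int) -> float:
    """Errors divided by all alignment columns (an assumed, common definition)."""
    errors = mismatches + insertions + deletions
    return errors / (errors + matches)


def doubled_error_rate(estimate: float) -> float:
    """Double the measured read error rate, as suggested in the reply above."""
    return round(2 * estimate, 4)


if __name__ == "__main__":
    print(homopolymer_compress("AAACCGTTTT"))
    # Hypothetical counts from aligning compressed reads to a compressed assembly.
    estimate = error_rate_from_alignment(matches=940, mismatches=30,
                                         insertions=15, deletions=15)
    print(f"estimated error: {estimate:.2%}, "
          f"value to try: {doubled_error_rate(estimate)}")
```

Both the reads and the draft assembly would need to be compressed the same way before alignment, so that the measured rate is in the same space as the one Canu uses.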

Yutang-ETH · 2021-07-23T21:20:11Z

Thank you very much for your reply.

Our target genome is very diverged (I don't know the exact divergence between the two haplotypes, but it is probably 2-3% or more).
I also corrected the ONT long reads using FMLRC2 with short Illumina reads from the same sample. The mean error rate could be below 10%, or even close to 5%, but since I don't have a high-quality reference genome, I cannot estimate it more accurately.
If I try the standard option with correction, I expect it will take a very long time: my colleague tried that before and could not get it to finish. So I am thinking about skipping the error correction step.

Anyway, thank you very much for your input.
Best wishes,
Yutang

skoren · 2021-07-23T21:21:59Z

You do have a draft assembly to align to, though, right? I would suggest trying that to get an estimate of the error. If you have short-read data, you can also estimate the haplotype diversity using GenomeScope. If you want to separate haplotypes, I wouldn't use any type of corrected read. Correction will not preserve haplotype phasing within reads, so you'll end up with mixed-haplotype reads, which makes haplotype separation from the assembly impossible.

skoren · 2021-07-23T21:33:44Z

I'm closing since you have a way forward, but feel free to comment on this issue about how (or whether) the run worked.

Yutang-ETH · 2021-07-23T21:41:27Z

Yes, I have a mosaic assembly produced by Flye from the same ONT data. It was polished with two rounds of Flye's internal polishing and one additional round of polishing with POLCA using Illumina short reads. I think I can use this assembly to estimate the error rate of my ONT long reads.
I also compared the k-mer profiles of the corrected ONT reads and the Illumina short reads. Almost all k-mers present in the short reads are also present in the corrected ONT reads, which is why I believe the error rate of the corrected long reads is low. However, I cannot get an exact error rate from the k-mer comparison. Losing haplotype information after correction is definitely a concern for me, but the k-mer comparison suggests that the haplotypes in the reads were preserved after correction.
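The kind of k-mer comparison described here can be sketched as a containment score: the fraction of distinct short-read k-mers that also occur in the corrected long reads. The code below is a toy illustration with hypothetical function names and in-memory sequences; a real analysis would use a dedicated k-mer counter. A high score shows that k-mers from both haplotypes survive correction somewhere in the read set, but it cannot show that each individual read still carries a single haplotype, which is the concern raised in the earlier reply.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int) -> set:
    """Distinct k-mers of seq, each stored as the smaller of itself and its reverse complement."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def kmer_containment(query_seqs, target_seqs, k: int) -> float:
    """Fraction of distinct query k-mers that are also present in the target."""
    query = set().union(*(canonical_kmers(s, k) for s in query_seqs))
    target = set().union(*(canonical_kmers(s, k) for s in target_seqs))
    if not query:
        return 0.0
    return len(query & target) / len(query)


if __name__ == "__main__":
    short_reads = ["ACGTTGCA"]         # toy stand-in for Illumina reads
    corrected_long = ["GGACGTTGCATT"]  # toy stand-in for corrected ONT reads
    print(kmer_containment(short_reads, corrected_long, k=4))
```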
Does Canu's error correction step preserve haplotypes? If so, I may try the standard option with correction, specifying raw Nanopore input rather than HiFi, and also add the batOptions. Do you think I can get a more or less diploid assembly this way? If not, we will consider switching from ONT long reads to PacBio HiFi reads.
Thank you very much,
Best wishes,
Yutang

skoren closed this as completed Jul 23, 2021

thallinger mentioned this issue Jun 5, 2022

Canu parameter settings to assemble ONT R10.4.1 with SQK-LSK112 data #2131

Closed
