Assemble a diploid assembly with ONT long reads #1985

Closed
Yutang-ETH opened this issue Jul 20, 2021 · 5 comments

@Yutang-ETH

Hi,

Thank you very much for developing Canu and also thank you very much for supporting the scientific community.

I have some ONT long reads (42x), basecalled with Guppy v4, for a highly heterozygous diploid plant genome. Recently, I tried to reconstruct both haplotypes of the genome using a read-based phasing approach. However, with a mosaic haploid assembly as the reference, I cannot cleanly split the reads into two haplotype groups because of alignment bias. I am now thinking about assembling a diploid assembly directly with Canu, without collapsing haplotypes, but I am confused by the batOptions parameter.

In issue #1715, you recommended -trim-assemble batOptions="-eg 0.12 -sb 0.01 -dg 12 -db 12 -dr 6 -ca 1000 -cp 10" "correctedErrorRate=0.12" -pacbio-hifi. Is this suitable for my case? My raw ONT long reads have a mean read accuracy of around Q10 (about 10% error), and all reads have accuracy above Q7 (about 20% error).
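For reference, the Q-to-error conversion used in the paragraph above follows the standard Phred definition, P = 10^(-Q/10). A minimal sketch (the function name is illustrative, not part of Canu or Guppy):

```python
def phred_to_error_rate(q: float) -> float:
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Q7 is close to 20% error, Q10 is 10%, Q20 is 1%.
for q in (7, 10, 20):
    print(f"Q{q}: {phred_to_error_rate(q):.1%} expected error")
```

This only converts the score the basecaller reports; it says nothing about how well that score matches the true error rate of the reads.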

My questions are:

  1. Is it feasible to use Canu to assemble both haplotypes with my current data, given the error rate and coverage?
  2. If so, do I need to adjust the parameters you recommended in the batOptions above? Which ones should I tune?

Could you please give a little hint? Thank you very much.

Best wishes,
Yutang

@skoren
Member

skoren commented Jul 23, 2021

  1. Typically, Nanopore data is not accurate enough for this unless your genome is very diverged (over 2-3% between the haplotypes). Below this, you'll end up mixing haplotypes and would likely have some regions collapsed and others (the more diverged regions) uncollapsed.

  2. I wouldn't rely on the Q score to estimate read error; it is likely not that accurate. The error rate Canu uses is also in homopolymer-compressed space, so a measured rate in that space will be more accurate. You could get a better estimate of the error rate by aligning some compressed reads to a collapsed, compressed assembly. You'd then set the batOptions and correctedErrorRate to double that estimate. With new Guppy versions we've typically seen 5-6% error (and lower with Guppy 5+), so that's where the 12% comes from. If your error rate is much above 6-8%, then I would suggest using the standard option with correction.
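As a rough illustration of the two ideas in this reply, the sketch below collapses homopolymer runs in a sequence and doubles a measured error estimate. Both function names are hypothetical helpers, not Canu code, and the reply does not say which individual batOptions values should change, so the sketch only computes the doubled rate.

```python
from itertools import groupby


def compress_homopolymers(seq: str) -> str:
    """Collapse each run of identical bases to a single base (AAACCG -> ACG)."""
    return "".join(base for base, _run in groupby(seq))


def doubled_error_rate(estimated: float) -> float:
    """Rule of thumb from the reply: use twice the measured read error rate."""
    return round(2 * estimated, 4)


print(compress_homopolymers("AAACCGTTTTA"))  # ACGTA
print(doubled_error_rate(0.06))              # 0.12
```

Reads and the collapsed assembly would both need to be compressed the same way before alignment for the measured rate to be comparable.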

@Yutang-ETH
Author

Thank you very much for your reply.

Our target genome is very diverged (I don't know the exact divergence between the two haplotypes, but it is probably 2-3% or larger).
I also corrected the ONT long reads using FMLRC2 with short Illumina reads from the same sample. The mean error rate could be less than 10%, or even close to 5%, but since I don't have a high-quality reference genome, I cannot estimate it more accurately.
If I try the standard option with correction, I believe it will take a very long time; my colleague tried that before and could not get it to finish. So I am thinking about skipping the error correction step.

Anyway, thank you very much for your input.
Best wishes,
Yutang

@skoren
Member

skoren commented Jul 23, 2021

You have a draft assembly to align to though, right? I would suggest trying that to get an estimate of error. If you have short-read data, you can also estimate the haplotype diversity using GenomeScope. If you want to separate haplotypes, I wouldn't use any type of corrected read. Correction will not preserve haplotype phasing within reads, so you'll end up with mixed-haplotype reads, which means haplotype separation from the assembly will be impossible.
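One way to turn read-to-draft alignments into an error estimate is to sum the match and block-length columns of a PAF file. The sketch below assumes the alignments were produced with base-level alignment (for example, minimap2 run with `-c`), so that column 10 (residue matches) and column 11 (alignment block length) reflect real matches; the function name and the length filter are illustrative choices, not something specified in this thread.

```python
def paf_error_rate(paf_lines, min_block_len=1000):
    """Estimate read error as 1 - (sum of matches / sum of block lengths)."""
    matches = 0
    aligned = 0
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated records
        n_match, block_len = int(fields[9]), int(fields[10])
        if block_len < min_block_len:
            continue  # short alignments give noisy estimates
        matches += n_match
        aligned += block_len
    if aligned == 0:
        raise ValueError("no alignments passed the length filter")
    return 1 - matches / aligned


# Two synthetic PAF records (hypothetical read and contig names).
example = [
    "read1\t10500\t100\t10100\t+\tctg1\t500000\t2000\t12000\t9400\t10000\t60",
    "read2\t5200\t50\t5050\t-\tctg1\t500000\t30000\t35000\t4700\t5000\t60",
]
print(f"{paf_error_rate(example):.3f}")  # 0.060
```

Against a mosaic draft, true haplotype differences are counted as errors too, so the figure is an upper bound on read error rather than an exact value.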

@skoren skoren closed this as completed Jul 23, 2021
@skoren
Member

skoren commented Jul 23, 2021

I'm closing since you have a way forward but feel free to comment on this issue regarding how/if the run worked.

@Yutang-ETH
Author

Yes, I have a mosaic assembly produced by Flye from the same ONT data. It was polished with two rounds of Flye's internal polishing and one additional round of POLCA using Illumina short reads. I think I can use this assembly to estimate the error rate of my ONT long reads.
I also compared the k-mer profiles of the corrected ONT reads and the Illumina short reads. Almost all k-mers present in the short reads are also present in the corrected ONT reads, which is why I believe the error rate of the corrected long reads is low. However, I cannot get an exact error rate from the k-mer comparison. Losing haplotype information after correction is definitely a concern for me, but from the k-mer comparison it looks like the haplotypes in the reads were preserved after correction.
Does Canu's error correction step preserve haplotypes? If so, I may try the standard option with correction, specifying raw Nanopore reads rather than HiFi, and I would also add the batOptions. Do you think I can get a more or less diploid assembly this way? If not, we will consider switching from ONT long reads to PacBio HiFi reads.
Thank you very much,
Best wishes,
Yutang
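The k-mer comparison described in this comment can be sketched as a containment check: what fraction of the distinct k-mers seen in the short reads also occur in the corrected long reads. This is a toy, in-memory illustration with hypothetical function names; real data would need a dedicated k-mer counter, and the tool actually used in the comparison is not named in this thread.

```python
def canonical_kmers(seq, k):
    """Return the set of canonical k-mers (min of k-mer and its reverse complement)."""
    comp = str.maketrans("ACGT", "TGCA")
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(comp)[::-1]
        kmers.add(min(kmer, rc))
    return kmers


def containment(short_read_seqs, long_read_seqs, k=21):
    """Fraction of distinct short-read k-mers that also occur in the long reads."""
    short = set().union(*(canonical_kmers(s, k) for s in short_read_seqs))
    long_ = set().union(*(canonical_kmers(s, k) for s in long_read_seqs))
    if not short:
        raise ValueError("no k-mers found in the short reads")
    return len(short & long_) / len(short)


print(containment(["ACGTACGGA"], ["TTACGTACGGATT"], k=5))  # 1.0
```

A high containment value shows that the expected k-mers are present somewhere in the corrected reads. It does not show that variants from one haplotype stay together within a single read, which is the phasing concern raised earlier in the thread.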
