Fragmented and large assembly #2343
Comments
The larger size is expected; the assembly likely contains both haplotypes of a diploid genome (see https://canu.readthedocs.io/en/latest/faq.html#my-genome-size-and-assembly-size-are-different-help). You can see that about 500 Mbp is already flagged as bubbles (alt haplotype). The rest is likely too diverged to be flagged automatically, so you'd need to rely on a tool like purge_dups. As for the fragmentation, the coverage looks really low in the k-mer histogram. The primary peak is between 6x and 10x, which is too low for a good assembly. What coverage were you inputting? Is this a clonal sample or a collection of individuals?
Thanks for the prompt reply, Sergey. This genome has puzzled me quite a bit. Total input HiFi data is ~60x (assuming a ~1.2 Gbp genome size, though it could be around 2 Gbp). The GenomeScope profile of the same organism from the short-read data is here
Note this is a Cladocopium sp., where the ploidy and duplication levels are not clear. Any thoughts on how to proceed would be very useful to me.
The GenomeScope results imply a genome larger than 1.2 Gbp, but also that the haplotypes are extremely similar (if it is diploid), as there are very few single-copy k-mers. You'd probably benefit from a larger k-mer size for GenomeScope, such as k=31 instead of 19.

The HiFi assembly implies an even larger genome size. Canu used 50x * 1.2 Gbp = 60 Gbp of reads and the observed coverage is somewhere around 8x, which gives roughly 7 Gbp of total sequence, or a 3.5 Gbp genome if it is diploid. HiFi assembly is very sensitive to variation, though, so it makes me wonder whether the inputs for the Illumina and HiFi data are the same. Is it possible the Illumina sample is more clonal than the HiFi sample?

Either way, I'd increase either genomeSize or maxInputCoverage, since right now Canu is only using 50x * 1.2 Gbp, so you have more data that was not used in the assembly. After that, your best option is probably to rely on core genes or purge_dups to determine whether there is haplotype duplication in the assembly. You could also try verkko and look at the resulting assembly graphs to see whether there is diploid structure (though the result would likely be less continuous, as verkko only produces phased outputs while Canu can produce a pseudo-haplotype).
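The coverage arithmetic in this reply can be checked with a short calculation. This is only a sketch under the figures quoted in the thread: a 1.2 Gbp assumed genome size, a 50x input cap, ~60x of total HiFi data, and an observed k-mer peak of about 8x (the middle of the 6-10x range mentioned earlier). The function and variable names are hypothetical and are not part of Canu.

```python
# Back-of-the-envelope genome size check using the numbers quoted in the thread.
# All names here are illustrative; none of them come from Canu itself.

def bases_sampled_gbp(coverage: float, genome_size_gbp: float) -> float:
    """Read data (in Gbp) implied by a coverage value at an assumed genome size."""
    return coverage * genome_size_gbp


def implied_total_sequence_gbp(bases_gbp: float, observed_coverage: float) -> float:
    """Total sequence (in Gbp) that the reads would cover at the observed depth."""
    return bases_gbp / observed_coverage


ASSUMED_GENOME_GBP = 1.2   # genomeSize given on the command line
COVERAGE_CAP = 50          # input coverage cap mentioned in the reply
INPUT_COVERAGE = 60        # total HiFi data reported by the poster
OBSERVED_PEAK = 8          # assumed k-mer peak, roughly mid-range of 6-10x

used_gbp = bases_sampled_gbp(COVERAGE_CAP, ASSUMED_GENOME_GBP)
available_gbp = bases_sampled_gbp(INPUT_COVERAGE, ASSUMED_GENOME_GBP)
unused_gbp = available_gbp - used_gbp

total_gbp = implied_total_sequence_gbp(used_gbp, OBSERVED_PEAK)
haploid_if_diploid_gbp = total_gbp / 2

print(f"reads used:          {used_gbp:.1f} Gbp")
print(f"reads left unused:   {unused_gbp:.1f} Gbp")
print(f"implied total:       {total_gbp:.1f} Gbp")
print(f"haploid, if diploid: {haploid_if_diploid_gbp:.2f} Gbp")
```

With a peak of exactly 8x the implied total comes out slightly above the rounded 7 Gbp quoted above; the estimate moves with wherever the peak is placed in the 6-10x range, so it is best read as an order-of-magnitude check rather than a measurement.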
I've found issues with Canu when trying to carry out population genome sequencing; there's just too much population variation to construct long consensus contigs. |
Could you please give me your interpretation of this log file from an algal assembly attempt, and advise how to improve assembly contiguity for this highly heterozygous algal genome?
It is canu 2.2.
canu -assemble -p algae -d ./ genomeSize=1.2g -pacbio-hifi ../01_Data/hifi_decontamianted.fq useGrid=true gridOptions="--time=02-00:00:00 "
The assembly stats are below for reference. Note that the assembly is quite large, as the expected genome size is around 1.2 Gbp.
Thanks a lot in advance.