Any advice on how to create a haplotype-resolved assembly from the phased FastQ reads #27

mdondrup opened this issue Jan 23, 2024 · 4 comments

@mdondrup

I now have phased nanopore reads from nPhase. Thank you very much for the tool and your help so far.

Now the task is to generate a first haplotype-resolved assembly. The phased reads are distributed into 191 clusters of different sizes. For a tetraploid organism with 16 chromosomes, that is approximately 3 clusters per expected haplotype (191 clusters / 64 expected haplotype-chromosomes ≈ 3), which looks like a good number to me. However, the clusters vary greatly in size (from 1 to several thousand reads) and total length (from 87 bp to 57 Mbp!). So my plan is to:

1. filter out any cluster < 10 kb as well as any singleton clusters,
2. sort the clusters by chromosome (as given in the .tsv),
3. assemble each remaining cluster individually with Flye or Necat, similar to what is described in O'Donnell et al. 2023.

But I am not sure whether I understand the methods used correctly.
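Concretely, something along these lines is what I have in mind for step 1 (a rough sketch only; the per-cluster FastQ names and the clusters/ directory are placeholders for the actual nPhase output, and it assumes seqkit is installed):

```bash
# Keep only clusters with more than one read and at least 10 kb of total sequence.
# Cluster FastQ paths are placeholders for the nPhase phased-read output.
mkdir -p filtered
for fq in clusters/*.fastq.gz; do
    # seqkit stats -T prints a header line plus one tab-separated data row:
    # file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
    read -r nreads totlen < <(seqkit stats -T "$fq" | awk 'NR==2 {print $4, $5}')
    if [ "$nreads" -gt 1 ] && [ "$totlen" -ge 10000 ]; then
        cp "$fq" filtered/
    fi
done
```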

Thank you very much for your help.

@OmarOakheart
Copy link
Owner

That is pretty good contiguity! I think you can proceed in a few different ways.

First, there's the option to run nphase cleaning, which is documented here: https://github.com/OmarOakheart/nPhase?tab=readme-ov-file#nphase-cleaning

nphase cleaning --sampleName Cleaning_1 --resultFolder /path/to/outputFolder/Individual_1 --longReads /path/to/Individual_1_longReads.fastq.gz --threads 8

The idea behind nPhase cleaning is to address a few common issues with the default output of nPhase. By default, it filters out 2% of results, which should normally get rid of the tiny clusters you observe, and it attempts to redistribute reads from large clusters.

For example, you may observe that chromosome 2 looks like it has three copies, but when you look at the coverage plot you notice that one of the clusters has twice as much coverage as the other two. nphase cleaning tries to take that largest cluster and cut it in half. In theory the two resulting clusters cover the same information, but this can sometimes help with contiguity.

Otherwise, if you only have one organism for which to create a haplotype-resolved assembly, I would go through the data manually.

You can look at the automatically generated plots, or at the raw .tsv files with R or your favorite datavis software, and determine which clusters you want to merge together prior to assembly with Flye or Necat (I personally found SMARTdenovo to perform quite well).

For example, if you look at figure 3 in the nPhase paper, each cluster is colored according to its haplotype of origin, and you can see the yellow haplotype is fragmented. But, given the contiguity of the other three clusters, you can reasonably assume they belong together. If you have any situations like this, I would merge the FastQs together.
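In practice that's just concatenating the cluster FastQs and assembling the result, e.g. something like this (a sketch only; the cluster file names are placeholders for whichever clusters you decide belong together):

```bash
# Merge two clusters judged to be the same haplotype (gzip files can be
# concatenated directly), then assemble the merged reads with Flye.
cat chr04_cluster_2.fastq.gz chr04_cluster_5.fastq.gz > chr04_hapA.fastq.gz
flye --nano-raw chr04_hapA.fastq.gz --out-dir flye_chr04_hapA --threads 8
```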

You should also look at the discordance plots to see if there are any particularly noisy clusters on which you may want to re-run nPhase.

The steps you detailed would also be sufficient, but I think if you do the above you can reach higher contiguity and be more confident in your data.

Also, 57 Mbp! The reads must be very long, or the heterozygosity very well distributed. One thing to bear in mind is that nPhase doesn't look at long reads that don't overlap heterozygous positions, so any long stretches of your organism's genome with no variation are likely to show up as empty spaces in the output.

Please don't hesitate to reach out again for anything else I can assist with.

@OmarOakheart
Owner

One more thing that comes to mind: if your organism is a tetraploid and you have no reason to suspect any aneuploidy, you might find flopp (https://github.com/bluenote-1577/flopp, paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8892958/) to be a more effective tool for your purposes. I haven't tested it extensively, but it's incredibly memory- and runtime-efficient and more contiguous than nPhase (which means more false positives, but fewer regions left unpredicted). Its main drawback is that it isn't ploidy-agnostic, but if you're sure you have 4 copies of every chromosome, you might find it interesting to test.
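If you do try it, the basic invocation looks roughly like this (from memory of the flopp README, so please check `flopp --help` for the exact flags; it works from your long reads aligned to the reference plus a VCF of called SNPs, and the file names below are placeholders):

```bash
# Phase a tetraploid sample with flopp from an alignment (BAM) and SNP calls (VCF).
# Flag names are as I remember them from the flopp README; verify with flopp --help.
flopp -b longreads_vs_ref.bam -c variants.vcf -p 4 -o sample_flopp_phasing.txt
```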

Best,
Omar

@mdondrup
Author

Dear Omar,

Thank you very much for all the suggestions. I ran the cleaning step with an extra --fillGaps 1. This reduced the number of clusters to 80, i.e. exactly 5 per chromosome on average. Six chromosomes have 4 clusters, one has 3 clusters, and the rest have 5-7 (I am attaching the cleaned plots).
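For reference, the cleaning command I ran was along these lines (paths and the sample name are placeholders; the only change from the README example above is the added --fillGaps 1):

```bash
nphase cleaning --sampleName Kveik_sample_6_cleaned \
    --resultFolder /path/to/outputFolder/Kveik_sample_6 \
    --longReads /path/to/Kveik_sample_6_longReads.fastq.gz \
    --threads 8 --fillGaps 1
```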

The largest cluster now contains 56 Mb of total sequence (I checked again; the largest cluster before cleaning was actually 71 Mb). I assume the cause of the large clusters is high coverage in some regions: the raw data is from a single PromethION cell that gave around 140X coverage, plus 100X Illumina PE reads. Heterozygous sites are at roughly 0.5% of the genome. Ploidy had been determined to be 4 by flow cytometry and I have no specific reason to doubt that result; however, some complex SVs, CNVs and aneuploidy may be expected. I will certainly try out flopp too, although, since it isn't ploidy-agnostic, I might miss discovering aneuploidy with it. I think we will have to sequence a few more strains to find out whether this observation is real.

I am now wondering whether I can merge a few more clusters manually. On NC_001141.2 and NC_001147.6 there are clusters that only marginally overlap another cluster which has mostly a gap in the same region. But I wonder whether there was contradicting evidence from discordant pairs of SNPs at these overlaps, and that was the reason the algorithm kept them separate.

(Btw: we initially tried SMARTdenovo for the collapsed assembly but had some difficulty getting a proper assembly out of it. It produced 90k contigs, while Necat produced 20. Possibly we did something wrong, so I will try again.)

[Attached plots: Kveik_sample_6_coverageVis, Kveik_sample_6_phasedVis, Kveik_sample_6_discordanceVis]

@OmarOakheart
Owner

All your discordance plots look pretty good to me, so you definitely haven't reached the point where you've merged too many clusters together. I would definitely merge for NC_001147.6; also, if nPhase refused to merge, you may want to check for evidence of a translocation at those boundaries. Sometimes it is due to too high an increase in discordance, as you suspect.

NC_001141.2 is a little more uncertain to me: the most highly covered cluster in the middle of that chromosome seems to have absorbed reads from the clusters whose coverage drops to nearly zero. It may be that there's low heterozygosity in this region of the genome, or a 3:1 situation.

You might want to re-run the analysis with 60-80X long-read coverage and aim for a high mean read length; anything above 15 kb should be good. If using all 140X means your mean read length is 10 kb or lower, you should definitely re-run with nPhase partial and a downsampled long-read FastQ, such that you increase the mean read length as much as you can.
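One way to do that downsampling (just a sketch, not part of nPhase itself; it uses Filtlong, https://github.com/rrwick/Filtlong, and the target-bases value assumes a ~12 Mb yeast-sized reference times ~70X coverage, so adjust it to your genome size and coverage target):

```bash
# Keep the best ~840 Mb of reads that are at least 15 kb long
# (drop --min_length if that discards too much of your data).
filtlong --min_length 15000 --target_bases 840000000 all_reads.fastq.gz \
    | gzip > downsampled_70x.fastq.gz
```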
