-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract phased variants or reads #23
Comments
I'm glad you find nPhase useful! There's no function to turn nPhase results into a VCF or tag the long reads of a BAM file with a haplotype directly. |
The aim would be to reassemble the phased reads for each scaffold similar to Garg et al. 2021. We have HiFi data that would be used for variant calling and ONT ultra-long reads that can be added for phasing. The phased VCF would be used with WhatsHap-Haplotag, and tagged reads would be split into tagged groups per scaffold and assembled with Hifiasm (Hifi) or Shasta (ONT). |
nPhase doesn't output haplotagged BAM files but it does output a fastQ file for each predicted haplotig, so you could still assemble the long reads this way, but I can see that not outputting a phased VCF does get in the way of you separating and assembling the short reads. As far as I know, whatshap haplotag and split don't handle polyploids, and I'm not aware of another tool that currently does that (though maybe whatshap will soon whatshap/whatshap#408 ) If you really need to split the short reads to assemble them, maybe you can start by assembling the phased long reads, then merging all of the long read scaffolds and mapping the short reads onto them in order to separate them that way and later assemble them? It's a bit clunky but would require the least amount of custom development |
A phased VCF would do the trick! We have an allotetraploid genome with very little variation within sub-genomes so I can cluster the tracks and do the haplotag splitting in two steps. |
Hi Adam, Sorry for the delays in responses, I'm still not sure how you plan to cluster the tracks together so it's a little difficult to anticipate what sort of phased VCF format would work. The main issue is that there isn't a standard for creating a phased VCF for a polyploid. The PS tag that currently exists is an explicitly diploid format, and the alternatives that have been developed by tools like whatshap or flopp use the fact that they start by assigning a ploidy to the sample. For example if you use flopp and tell it that you have a tetraploid and therefore k=4, it will reliably output 4 columns for each variable position. So then you use a script for downstream analysis that works on the flopp output, using the same workflow would fail or give you messy results with an nPhase output because nPhase allows for aneuploidy and this wouldn't have been explicitly taken into account. I did write you a (relatively untested) python function that takes the nPhase output (OUTPUT_FOLDER/Phased/$prefix_variants.tsv) and a VCF file and produces a version of a phased VCF that works with the nPhase output. I'm just not sure if I designed it in a way that's appropriate for your intended purpose. So you'll get a VCF line that looks like this:
The way to interpret this is the HX field tells you the names of the different haplotypes that have a base in this position for this chromosome, the HY field tells you what the base is for each haplotype listed in HX. This means for example that cluster3 is associated with 0, which is the REF allele A, and cluster12 is associated with 2, which is the ALT allele C. I know this format doesn't exist anywhere else, so I imagine this isn't very helpful to you, but maybe this still helps clears something up and leads you in the right direction Here's the python script:
|
Hi Omar,
Thanks for sharing this great tool. I was wondering if there is a way to export a phased vcf or tag long reads with their haplotype group as in Whatshap-haplotag?
The text was updated successfully, but these errors were encountered: