nPhase on one chromosome #12
Comments
Hi Elsa,

For my paper I simply subset the data. To run nPhase on one chromosome at a time, you can split your reference genome into one FASTA file per chromosome, then run one instance of nPhase per chromosome, changing the reference sequence each time. A quick way to split your genome into one FASTA file per chromosome is to use pyfaidx (source: https://www.biostars.org/p/173723/).

That said, there are a few details to consider. For example, if you map your full dataset to chromosome 1 alone, some reads that would have mapped better to another chromosome might still map to it, which would introduce some inaccuracies. A more cumbersome approach would be to first map and variant call all of your reads, then manually subset the results per chromosome and use them as input to nPhase partial, again with one FASTA file per chromosome. nPhase doesn't have any associated tools that would really help with that, though.

I also manually subset the VCF file to reduce the heterozygosity level to 1%, which you may find useful as well. I think that for a tetraploid, any more than 60-80X coverage would be excessive, so feel free to downsample if you have a very large amount of data.

There's a lot that can be done to improve how fast nPhase runs. You can check the status of the related issues in the near future to see whether any significant updates have been made to help with genomes of that size.

Keep me posted if there's anything else I can help with, and I hope you get good results!

Best,
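For anyone who wants to do the per-chromosome split without extra dependencies, here is a minimal sketch in plain Python. It is not the pyfaidx approach from the linked post, just a standard-library stand-in. The function name `split_fasta_by_record` and the `<name>.fasta` output naming are assumptions made for illustration.

```python
from pathlib import Path


def split_fasta_by_record(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file in out_dir.

    Hypothetical helper: output files are named after the first word of
    each header line, e.g. ">chr01 some description" -> chr01.fasta.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # A new record starts: close the previous output file.
                if out is not None:
                    out.close()
                fields = line[1:].split()
                # Fall back to a numbered name if the header is empty.
                name = fields[0] if fields else f"record{len(written) + 1}"
                # Avoid creating subdirectories from names containing "/".
                path = out_dir / f"{name.replace('/', '_')}.fasta"
                out = open(path, "w")
                written.append(path)
            # Lines before the first header (if any) are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Each file returned by `split_fasta_by_record("genome.fasta", "per_chromosome")` can then serve as the reference for its own nPhase run. Keep in mind the mapping caveat above: reads from other chromosomes may still align to a single-chromosome reference.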
Hi Omar,
Hi Elsa, nPhase partial requires a sorted SAM file, not a BAM file, because my code reads the individual lines and I didn't use a library that can parse the binary BAM format. Normally, when you run the full nPhase pipeline, it generates the correct long-read SAM file, which you can reuse for nPhase partial.
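A quick sanity check before handing a file to nPhase partial might look like the sketch below. It rejects files that start with the gzip magic bytes (BAM is BGZF-compressed, so it starts with them) and reports the `SO:` sort-order tag from the SAM `@HD` header line. The function name `sam_sort_order` is hypothetical, and whether nPhase itself reads this header tag is not something the thread confirms, so treat this purely as a pre-flight check.

```python
def sam_sort_order(path):
    """Return the SO: value from a text SAM header, or None if absent.

    Raises ValueError if the file looks compressed (e.g. BAM) rather
    than plain-text SAM. Hypothetical helper, not part of nPhase.
    """
    with open(path, "rb") as fh:
        # gzip/BGZF files (including BAM) start with bytes 0x1f 0x8b.
        if fh.read(2) == b"\x1f\x8b":
            raise ValueError(f"{path} looks compressed (BAM?), not text SAM")
    with open(path) as fh:
        for line in fh:
            if not line.startswith("@"):
                break  # header section is over, alignments begin
            if line.startswith("@HD"):
                # Header fields are tab-separated TAG:VALUE pairs.
                for field in line.rstrip("\n").split("\t")[1:]:
                    if field.startswith("SO:"):
                        return field[3:]
    return None
```

If this raises, the file can usually be converted to text with `samtools view -h`. A return value of `None` or `unsorted` suggests that the file should be sorted first.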
Hi,
I am working on tetraploid potato and trying to run nPhase on my dataset, but as you mention yourself, it is very time-consuming, and I had to terminate the process after it had run for 8 days. I was wondering whether there is any way to parallelize the process, for example by running it one chromosome at a time as you do in your paper, though I haven't figured out how to do so. Is there an option for this in the algorithm, or did you just subset the dataset before running it?
Best regards,
Elsa