nphase takes a long time at GATK step #26
Comments
That's odd, maybe the --threads argument isn't being passed properly? I'm not sure what's causing this. The quickest solution, I think, is to avoid running GATK through nPhase altogether by providing a VCF file directly (one you generated previously by running GATK yourself) with the following command:
It should work with any VCF generated by GATK, but filter out any INDELs first: nPhase only tries to phase SNPs.
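As a rough illustration of the INDEL filtering mentioned above, here is a minimal Python sketch that keeps header lines and drops every record whose REF or ALT allele is not a single base. The function names are made up for this example and are not part of nPhase or GATK. It also assumes that a multi-allelic site with any non-SNP allele should be dropped entirely, which may be stricter than necessary.

```python
def is_snp_record(line: str) -> bool:
    """Return True if a VCF data line contains only single-base REF/ALT alleles."""
    fields = line.rstrip("\n").split("\t")
    ref, alts = fields[3], fields[4].split(",")
    bases = {"A", "C", "G", "T"}
    return ref.upper() in bases and all(alt.upper() in bases for alt in alts)


def keep_snps(vcf_lines):
    """Yield header lines unchanged and only the SNP data lines."""
    for line in vcf_lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#") or is_snp_record(line):
            yield line


# Small in-memory example (hypothetical records)
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chrI\t100\t.\tA\tG\t50\tPASS\t.\n",    # SNP, kept
    "chrI\t200\t.\tAT\tA\t50\tPASS\t.\n",   # deletion, dropped
    "chrI\t300\t.\tC\tT,G\t50\tPASS\t.\n",  # multi-allelic SNP, kept
]
filtered = list(keep_snps(example))
```

For a real file, the same generator could be fed an open file handle and its output written to a new VCF. Dedicated tools can do this filtering as well; the sketch only shows the logic.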
Sorry for the delay in my response. I was investigating whether I could use the precomputed VCF files. However, I noticed some differences in the workflow that may make recomputing necessary, or may even explain the difference in run time. Firstly, I ran MarkDuplicates on the aligned BAM file:
The question is: is it acceptable or required to run MarkDuplicates for nPhase? I didn't find that step in the pipeline code. Therefore I called gatk HaplotypeCaller:
I thought that using MarkDuplicates first may make the step much faster.
nPhase only uses the short reads to obtain a VCF with high-confidence variant positions, so MarkDuplicates (and any filtering steps you wish to perform) is fine. As for the precomputed VCF files, I think they would need to be converted first, but I'm not sure. I'd have to take a look at the differences between output formats and check whether my code would be affected. It may be easier to simply try it; it would result in an error if anything is out of place.
After some more attempts to debug this, I conclude that this is in part a technical problem with GATK and the /tmp directory setup on that specific machine (an issue with GATK registered here). The first problem is that the /tmp directory on that server is mounted noexec.
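To check whether a machine has the same /tmp setup, one option is to inspect the mount options. The sketch below parses text in the format of /proc/mounts (Linux only) and reports whether a given mount point carries the noexec option. The helper names are illustrative, and the sketch only matches exact mount points: if /tmp is not a separate mount, it inherits its options from the parent file system, which this does not resolve.

```python
def mount_options(mount_point: str, mounts_text: str) -> set:
    """Return the option set for mount_point from /proc/mounts-style text."""
    options = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = set(fields[3].split(","))  # the last matching entry wins
    return options


def is_noexec(mount_point: str, mounts_text: str) -> bool:
    """True if mount_point is listed and mounted with the noexec option."""
    return "noexec" in mount_options(mount_point, mounts_text)


# Hypothetical sample in /proc/mounts format
sample = (
    "tmpfs /tmp tmpfs rw,nosuid,nodev,noexec,relatime 0 0\n"
    "/dev/sda1 /scratch ext4 rw,relatime 0 0\n"
)
tmp_is_noexec = is_noexec("/tmp", sample)
```

On a real Linux system, `mounts_text` would come from `open("/proc/mounts").read()`.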
This can be averted by passing Java options on the command line. When running GATK through the nPhase pipeline, I set
The second part is that the number of threads used is set via
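For anyone running GATK directly rather than through nPhase, the following sketch shows one way to assemble a HaplotypeCaller command that points Java's temporary directory away from a noexec /tmp. It is not the exact setting used in this thread, and not how nPhase invokes GATK internally. The file names and the `/scratch/tmp` path are placeholders. The optional `--native-pair-hmm-threads` argument is included as my own assumption about a relevant GATK4 performance setting; it may or may not be the one referred to above.

```python
import shlex


def haplotypecaller_command(reference, bam, out_vcf, tmp_dir, hmm_threads=None):
    """Build a GATK4 HaplotypeCaller argument list using a custom temp directory."""
    cmd = [
        "gatk",
        # Make the JVM use a directory that permits execution of native libraries
        "--java-options", f"-Djava.io.tmpdir={tmp_dir}",
        "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", out_vcf,
        "--tmp-dir", tmp_dir,
    ]
    if hmm_threads is not None:
        # Assumption: thread count for the native PairHMM implementation
        cmd += ["--native-pair-hmm-threads", str(hmm_threads)]
    return cmd


# Placeholder inputs, for illustration only
cmd = haplotypecaller_command(
    "ref.fasta", "sample.bam", "sample.vcf", "/scratch/tmp", hmm_threads=8
)
print(shlex.join(cmd))  # shell-ready string; could also be passed to subprocess.run
```

Building the command as a list keeps quoting out of the picture when it is handed to `subprocess.run(cmd, check=True)`.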
I am now running nPhase on my real data and noticed it takes a very long time to finish. My data is yeast ONT reads with ~120X coverage and Illumina reads with ~100X coverage.
It seems nPhase is stuck in the GATK HaplotypeCaller step of the short reads, because this is the only process that is running (at only ~100% CPU):
This is the full command line I am using to run it:
The machine has 144 threads and 2TB RAM, so it should be fine. I also remember that variant calling based on the short reads didn't take that long when I ran GATK directly. Could one set some performance options for Java/GATK to increase the speed?