nphase takes a long time at GATK step #26

mdondrup · 2024-01-12T13:31:27Z

I am now running nPhase on my real data and noticed it takes a very long time to finish. My data is yeast ONT reads with ~120X coverage and Illumina reads with ~100X coverage.

It seems nPhase is stuck in the GATK HaplotypeCaller step of the short reads, because this is the only process that is running (at only ~100% CPU):

michaeld  40243  40242 98 Jan02 ?        9-21:53:35 java -Dsamjdk.use_async_io_read_samtools=false 
-Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 
-jar /Home/ii/michaeld/miniconda2/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar HaplotypeCaller 
-R ../../reference_genome/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna -ploidy 2 -I ./nphase_out/Kveik_sample_6/Mapped/shortReads/Kveik_sample_6.final.bam -O ./nphase_out/Kveik_sample_6/VariantCalls/shortReads/Kveik_sample_6.vcf

This is the full command line I am using to run it:

nohup nice -2 nphase pipeline --sampleName Kveik_sample_6 --longReads sample6_cleaned_trimmed.fastq.gz 
--longReadPlatform  ont --R1 ../../Illumina_seq/Fastq/6-1_S8_R1_001.fastq.gz --R2 ../../Illumina_seq/Fastq/6-1_S8_R2_001.fastq.gz  \
--reference  ../../reference_genome/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna --output ./nphase_out --threads 120 >nphase_nohup.out &

The machine has 144 threads and 2TB RAM, so it should be fine, also I remember that variant calling based on the short reads didn't take that long when I ran GATK directly. Possibly, one could set some performance options for java/gatk to inscrease the
speed?

The text was updated successfully, but these errors were encountered:

OmarOakheart · 2024-01-12T17:09:06Z

That's odd, maybe the --threads argument isn't being passed properly? I'm not sure what's causing this.

The quickest solution, I think, is you can avoid running GATK through nPhase by providing a VCF file directly (which you will have generated previously by running GATK) with the following command:

nphase partial --sampleName SAMPLE_NAME --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE --vcf VCF_FILE --longReadPlatform {ont,pacbio} --threads 8

It should work with any VCF generated by GATK, but filter out any INDELs first, nPhase only tries to phase SNPs

mdondrup · 2024-01-18T13:54:59Z

Sorry for the delay in my response. I was investigating whether I could use the precomputed VCF files. However, I noticed some differences in the workflow which may make recomputing necessary or may even explain the difference in run-time.

Firstly, I ran MarkDuplicates on the aligned bam file:

gatk --java-options "-Xmx200G -Djava.io.tmpdir=$TMPDIR" MarkDuplicatesSpark --spark-master local[60] -R $REF -I $BAM -O bam/$BASE-dedup.bam

The question is: is it acceptable or required to run MarkDuplicates for nPhase? I didn't find that step in the pipeline code.
Secondly, I generated files in gVCF format, but these might be possible to convert.

Therefore I called gatk HaplotypeCaller:

gatk --java-options "-XX:ConcGCThreads=10 -XX:ParallelGCThreads=10 -Xmx500G -Djava.io.tmpdir=$TMPDIR" \ HaplotypeCaller --ploidy $P --native-pair-hmm-threads 60 -R $REF \ -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation \ --tmp-dir $TMPDIR \ -I bam/$BASE-dedup.bam -O vcf/$BASE.Ploidy${P}.g.vcf -ERC GVCF

I thought that using MarkDuplicates first may make the step much faster.

OmarOakheart · 2024-01-18T14:35:34Z

nPhase only uses the short reads into to obtain a VCF with high-confidence variant positions, MarkDuplicates (and any filtering steps you wish to perform) is fine. I think they would need to be converted first, I'm not sure. I'd have to take a look at the differences between output formats and check if my code would be affected, it may be more easier to simply try it, it would result in an error if anything's out of place.

mdondrup · 2024-01-23T09:55:59Z

After some more attempts to debug this, I conclude that this is in part a technical problem of GATK and the /tmp directory setup on that specific machine (an issue with GATK registered here) The first problem is that the /tmp directory on that server is mounted noexec.
If I run HaplotypeCaller with the default settings, I get the following error messages in the log:

15:18:46.056 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/
Home/ii/michaeld/miniconda2/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-pac
kage-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
15:18:46.058 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/tmp/libgkl_utils9418239050694741169.so: /tmp/libgkl_utils9
418239050694741169.so: failed to map segment from shared object: Operation not permitted)
15:18:46.058 WARN  IntelPairHmm - Intel GKL Utils not loaded
15:18:46.058 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported

This can be averted by passing Java options on the command-line. When running GATK through the nPhase pipeline, I set export _JAVA_OPTIONS=-Djava.io.tmpdir=/export/scratch/tmp/ before running nPhase. This is also described in the GATK manual. I have seen a few reports where AVX-support could not be found even though the processor supports it.

The second part is that the number of threads used is set via --native-pair-hmm-threads and the default seems to be 4. So I set --native-pair-hmm-threads 60 and it again significantly speeds up. Possibly, the threads parameter could be passed down to GATK this way.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nphase takes a long time at GATK step #26

nphase takes a long time at GATK step #26

mdondrup commented Jan 12, 2024 •

edited

Loading

OmarOakheart commented Jan 12, 2024

mdondrup commented Jan 18, 2024

OmarOakheart commented Jan 18, 2024

mdondrup commented Jan 23, 2024 •

edited

Loading

nphase takes a long time at GATK step #26

nphase takes a long time at GATK step #26

Comments

mdondrup commented Jan 12, 2024 • edited Loading

OmarOakheart commented Jan 12, 2024

mdondrup commented Jan 18, 2024

OmarOakheart commented Jan 18, 2024

mdondrup commented Jan 23, 2024 • edited Loading

mdondrup commented Jan 12, 2024 •

edited

Loading

mdondrup commented Jan 23, 2024 •

edited

Loading