
segfault, caught bus error, non-existent physical address #188

Open
ahwanpandey opened this issue May 13, 2024 · 6 comments

Comments

@ahwanpandey

Hello,

Thanks for this tool.

I submitted some "run_numbat" jobs to our cluster. They seem to have output all the result files, and the plots and data files all appear to be there.

[screenshot: output folder contents]

But the job's std err has a bunch of errors. The job State says "OUT_OF_MEMORY", yet the exit code is 0, which would indicate success. Also, the Memory Utilized is reported as 470.93 GB.

Job ID: 19079833
Cluster: rosalind
User/Group: [email protected]/apandey
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 1-11:19:41
CPU Efficiency: 53.46% of 2-18:04:48 core-walltime
Job Wall-clock time: 04:07:48
Memory Utilized: 470.93 GB
Memory Efficiency: 470.93% of 100.00 GB

I've attached the log and the std err below:

log.txt
Numbat.AOCS_055_2_0.Step2_run_numbat.19079833.papr-res-compute01.err.txt

Here is my R sessionInfo()

> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /usr/lib64/libblas.so.3.4.2
LAPACK: /usr/lib64/liblapack.so.3.4.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4        data.table_1.15.4  sp_2.1-3           SeuratObject_4.1.0 Seurat_4.1.1       numbat_1.4.0       Matrix_1.5-4      

loaded via a namespace (and not attached):
  [1] Rtsne_0.17            colorspace_2.1-0      ggtree_3.4.0          deldir_2.0-4          scistreer_1.2.0       ggridges_0.5.6        fs_1.6.4              aplot_0.2.2          
  [9] spatstat.data_3.0-4   rstudioapi_0.16.0     leiden_0.4.3.1        listenv_0.9.1         farver_2.1.1          graphlayouts_1.1.1    ggrepel_0.9.5         fansi_1.0.6          
 [17] hahmmr_1.0.0          codetools_0.2-20      splines_4.2.0         cachem_1.0.8          polyclip_1.10-6       jsonlite_1.8.8        RhpcBLASctl_0.23-42   ica_1.0-3            
 [25] cluster_2.1.6         png_0.1-8             rgeos_0.5-9           uwot_0.2.2            spatstat.sparse_3.0-3 sctransform_0.4.1     ggforce_0.4.2         shiny_1.8.1.1        
 [33] compiler_4.2.0        httr_1.4.7            fastmap_1.1.1         lazyeval_0.2.2        cli_3.6.2             later_1.3.2           tweenr_2.0.3          htmltools_0.5.8.1    
 [41] tools_4.2.0           igraph_2.0.3          gtable_0.3.5          glue_1.7.0            reshape2_1.4.4        RANN_2.6.1            fastmatch_1.1-4       Rcpp_1.0.12          
 [49] scattermore_1.2       vctrs_0.6.5           ape_5.8               nlme_3.1-164          progressr_0.14.0      ggraph_2.2.1          lmtest_0.9-40         spatstat.random_3.2-3
 [57] stringr_1.5.1         globals_0.16.3        mime_0.12             miniUI_0.1.1.1        lifecycle_1.0.4       irlba_2.3.5.1         phangorn_2.11.1       goftest_1.2-3        
 [65] future_1.33.2         MASS_7.3-57           zoo_1.8-12            scales_1.3.0          tidygraph_1.3.1       spatstat.core_2.4-4   spatstat.utils_3.0-4  promises_1.3.0       
 [73] parallel_4.2.0        RColorBrewer_1.1-3    pbapply_1.7-2         memoise_2.0.1         reticulate_1.36.1     gridExtra_2.3         ggplot2_3.5.1         ggfun_0.1.4          
 [81] yulab.utils_0.1.4     rpart_4.1.23          stringi_1.8.3         tidytree_0.4.6        rlang_1.1.3           pkgconfig_2.0.3       matrixStats_1.3.0     parallelDist_0.2.6   
 [89] lattice_0.22-6        tensor_1.5            ROCR_1.0-11           purrr_1.0.2           htmlwidgets_1.6.4     treeio_1.20.0         patchwork_1.2.0       cowplot_1.1.3        
 [97] tidyselect_1.2.1      parallelly_1.37.1     RcppAnnoy_0.0.22      plyr_1.8.9            logger_0.3.0          magrittr_2.0.3        R6_2.5.1              generics_0.1.3       
[105] DBI_1.2.2             mgcv_1.9-1            pillar_1.9.0          withr_3.0.0           fitdistrplus_1.1-11   abind_1.4-5           survival_3.6-4        tibble_3.2.1         
[113] future.apply_1.11.2   KernSmooth_2.23-22    utf8_1.2.4            spatstat.geom_3.2-9   plotly_4.10.4         viridis_0.6.5         grid_4.2.0            digest_0.6.35        
[121] xtable_1.8-4          tidyr_1.3.1           httpuv_1.6.15         gridGraphics_0.5-1    RcppParallel_5.1.7    munsell_0.5.1         viridisLite_0.4.2     ggplotify_0.1.2      
[129] quadprog_1.5-8       

This happens with all the samples I have run so far (about 20). I am just attaching the output of one sample as a reference. The samples have anywhere from 6,000 to 22,000 cells. For example, here is another sample's std err and log:

log.txt
Numbat.AOCS_060_2_9.Step2_run_numbat.19079835.papr-res-compute02.err.txt

I'm not sure if all of this is normal behaviour for the tool or if something is wrong.

Thanks so much,
Ahwan

@teng-gao
Collaborator

Hmm. The error message below is suspicious. I would guess it's a problem related to general memory management on your jobs/cluster.

slurmstepd: error: Detected 167 oom_kill events in StepId=19079835.batch. Some of the step tasks have been OOM Killed.

@ahwanpandey
Author

Hi @teng-gao, thanks for the reply. I will try running one sample with just 1 thread/core and see what that looks like. Is there anything in particular you think I could ask the cluster folks regarding their memory management? I believe Numbat is the only software/tool I have used where I've seen this type of error on our cluster.
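
For reference, here is roughly how I plan to call it for the single-thread test (just a sketch; the input objects are placeholders for my own sample data, and I'm assuming ncores is the argument that controls the number of workers):

library(numbat)

out = run_numbat(
    count_mat,            # gene x cell count matrix for this sample (placeholder)
    ref_hca,              # built-in HCA expression reference
    df_allele,            # allele count data frame for this sample (placeholder)
    genome = "hg38",
    ncores = 1,           # single worker, to see if the OOM kills go away
    out_dir = "numbat_out/AOCS_055_2_0"
)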

Thanks,
Ahwan

@ahwanpandey
Author

To be clear, I've had segfaults and out-of-memory issues before that were fixed by providing more memory, but this seems different. Also, the memory utilised reported in the job status is way too high for all the jobs:

Job ID: 19079833
Cluster: rosalind
User/Group: [email protected]/apandey
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 1-11:19:41
CPU Efficiency: 53.46% of 2-18:04:48 core-walltime
Job Wall-clock time: 04:07:48
Memory Utilized: 470.93 GB
Memory Efficiency: 470.93% of 100.00 GB

Below are some jobs with their States and Memory Utilized. Strangely, one of them says State: COMPLETED and its std err has none of the errors mentioned above, even though it used a lot more memory than I asked for.

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 692.82 GB
Memory Efficiency: 692.82% of 100.00 GB

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 724.67 GB
Memory Efficiency: 724.67% of 100.00 GB

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 433.26 GB
Memory Efficiency: 433.26% of 100.00 GB

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 488.84 GB
Memory Efficiency: 488.84% of 100.00 GB

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 594.87 GB
Memory Efficiency: 594.87% of 100.00 GB

State: COMPLETED (exit code 0)
Memory Utilized: 301.44 GB
Memory Efficiency: 301.44% of 100.00 GB

State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 561.10 GB
Memory Efficiency: 561.10% of 100.00 GB

@ahwanpandey
Author

ahwanpandey commented May 16, 2024

OK, running with just one thread has no issues. Note that I am just using the default "run_numbat" with "ref_hca" as the reference, but I will be trying a custom reference as well. The results are vastly different from the multi-threaded run, which probably makes sense since a lot of the threads were internally killed by SLURM.

Do you think Numbat could benefit from having some form of error handling for these multi-threaded memory issues, so that a run doesn't look like it completed when it didn't? I was initially testing in an interactive session and didn't realise this was happening in the background. There was no hint in the R terminal about memory issues or threads being killed; the output folder just looked like everything completed without issues, until I submitted the script as a job and checked the std err. I'm not saying this is happening to others, but it seems possible that other users could have this happening without their knowledge. Hence the suggestion that Numbat notify the user or error out.

But again I might be totally wrong and this could just be a very specific issue with the cluster I am using! I'll talk to the cluster admins about this but would love to hear if you have any specific thoughts on what they could look at as a start.

Single-thread stats and results/logs/err for a sample:

Job ID: 19083767
Cluster: rosalind
User/Group: [email protected]/apandey
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 14:47:00
CPU Efficiency: 99.99% of 14:47:06 core-walltime
Job Wall-clock time: 14:47:06
Memory Utilized: 32.64 GB
Memory Efficiency: 32.64% of 100.00 GB

bulk_clones_final.png
log.txt
Numbat.AOCS_080_2_2.Step2_run_numbat.19083767.papr-res-compute215.err.txt

Multi-thread stats and results/logs/err for the same sample:

Job ID: 19079838
Cluster: rosalind
User/Group: [email protected]/apandey
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 11:45:45
CPU Efficiency: 43.14% of 1-03:16:00 core-walltime
Job Wall-clock time: 01:42:15
Memory Utilized: 366.78 GB
Memory Efficiency: 366.78% of 100.00 GB

bulk_clones_final.png
log.txt
Numbat.AOCS_080_2_2.Step2_run_numbat.19079838.papr-res-compute215.err.txt

@ahwanpandey
Author

I had a chat with our cluster admin and just wanted to share some thoughts with you.

It seems Numbat's memory usage behaves as follows:

  • A sample completes successfully with a max memory usage of ~32 GB when run with 1 thread
  • The same sample needs roughly 32 GB × 16 = 512 GB when run with 16 threads

Am I understanding this right?

It seems similar to the issue described here:
MonashBioinformaticsPlatform/RNAsik-pipe#39

So if the above is correct, do you think Numbat should stop the run if any thread fails and exit with an overall error code? Something like a consensus exit code: 0 if all threads succeeded, non-zero otherwise. It would also help to have some indication in the log file that an error occurred during the run.
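
To illustrate what I mean (just a rough sketch, not Numbat's actual code): if the parallel step is run with something like parallel::mclapply, workers that get OOM-killed come back as NULL results, so the parent process could check for that and abort instead of carrying on. Here chunks, process_chunk and n_threads are hypothetical placeholders:

results = parallel::mclapply(chunks, process_chunk, mc.cores = n_threads)

# any worker killed by the OOM killer returns NULL instead of a result
failed = which(vapply(results, is.null, logical(1)))
if (length(failed) > 0) {
    stop(sprintf("%d of %d parallel workers returned no result (possibly OOM-killed); aborting",
                 length(failed), length(results)))
}

That way a killed worker would surface both in the log and in the script's exit code, rather than silently producing different results.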

Thanks!
Ahwan

@ahwanpandey
Author

OK, so running with 4 threads and allocating 160 GB let me run the Numbat jobs successfully; I checked the std err for each and there were no memory issues. Also, the SLURM config on our cluster is set up to allow a job to go a little over its memory request, depending on the requests/usage of other jobs on the node. Using 16 threads goes way over and starts killing threads, as mentioned in the original issue.
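
For reference, this is roughly the combination that worked for me (a sketch; the inputs and output path are placeholders, the 160 GB comes from the --mem request in the sbatch script, and I'm assuming ncores should match the --cpus-per-task request):

out = run_numbat(
    count_mat, ref_hca, df_allele,
    genome  = "hg38",
    ncores  = 4,          # matches the sbatch --cpus-per-task=4 request
    out_dir = out_dir     # placeholder output directory
)

Job stats for those 4-thread runs: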

State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 210.72 GB
Memory Efficiency: 131.70% of 160.00 GB

State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 203.70 GB
Memory Efficiency: 127.31% of 160.00 GB

State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 170.53 GB
Memory Efficiency: 106.58% of 160.00 GB

State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 181.04 GB
Memory Efficiency: 113.15% of 160.00 GB

State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 159.93 GB
Memory Efficiency: 99.96% of 160.00 GB
