segfault, caught bus error, non-existent physical address #188
Hmm. The error message below is suspicious. I would guess it's a problem related to general memory management on your jobs/cluster.
Hi @teng-gao. Thanks for the reply. I will try running one sample with just 1 thread/core and see how that goes. Is there anything in particular you think I could ask the cluster folks about regarding their memory management? Numbat is the only software/tool I have used on our cluster that has produced this type of error. Thanks,
To be clear, I have had segfaults and out-of-memory issues before that were fixed by requesting more memory, but this seems different. The memory utilised reported in the job status is also far too high for all the jobs (Job ID: 19079833). Below are some jobs and their states. Strangely, one of them says State: COMPLETED and its stderr has none of the errors mentioned above, even though it used a lot more memory than I requested.

State: OUT_OF_MEMORY (exit code 0)
State: OUT_OF_MEMORY (exit code 0)
State: OUT_OF_MEMORY (exit code 0)
State: OUT_OF_MEMORY (exit code 0)
State: OUT_OF_MEMORY (exit code 0)
State: COMPLETED (exit code 0)
State: OUT_OF_MEMORY (exit code 0)
OK, running with just one thread has no issues. Note that I am just using the default run_numbat with ref_hca as the reference, but I will be trying a custom reference as well. The results are vastly different, which probably makes sense since a lot of the threads were killed internally by SLURM.

Do you think Numbat could benefit from some form of error handling for these multi-threaded memory issues, so that a run does not look like it completed when it didn't? I was initially testing in an interactive session and didn't realise this was happening in the background. There was no hint in the R terminal about memory issues or threads being killed; the output folder just looks like everything completed without issues. I only noticed once I submitted the script as a job and checked the stderr. I'm not saying this is happening to others, but it seems possible that some users could hit this without knowing, hence the suggestion that Numbat notify the user or error out. Then again, I might be totally wrong and this could just be a very specific issue with the cluster I am using. I'll talk to the cluster admins about this, but I would love to hear if you have any specific thoughts on what they could look at as a start.

Single-thread stats and results/logs/err for a sample:
bulk_clones_final.png

Multi-thread stats and results/logs/err for the same sample:
bulk_clones_final.png
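For context, here is a minimal sketch of the kind of single-threaded call I am running. The object names count_mat and df_allele are placeholders for the actual per-sample inputs, and the arguments reflect my understanding of the documented defaults:

```r
library(numbat)

# Minimal single-threaded run; count_mat and df_allele are placeholders
# for the sample's gene-by-cell count matrix and allele data frame.
out <- run_numbat(
    count_mat,
    lambdas_ref = ref_hca,  # built-in HCA expression reference
    df_allele,
    genome  = "hg38",
    ncores  = 1,            # one worker: no forked children for SLURM to kill
    out_dir = "results/sample_single_thread"
)
```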
I had a chat with our cluster admin and just wanted to share some thoughts. It seems Numbat makes the following memory assumptions:

Am I understanding this right? It seems to be similar to what is described below:

So if the above is correct, do you think Numbat should stop the run if any thread fails, and then exit with an overall error exit code? Something like a consensus exit code: 0 if all threads succeeded, non-zero otherwise. It would also help to have some indication in the log file that an error occurred during the run. Thanks!
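To illustrate the failure mode I suspect (this is a generic sketch, not Numbat's actual internals): with parallel::mclapply(), a forked worker killed by the OOM killer produces only a warning in the parent, and its result comes back as NULL (or a try-error object), so the session can still exit 0 with partial results unless the caller checks for them:

```r
library(parallel)

# Hypothetical work items and worker function, for illustration only
chunks <- split(1:100, rep(1:16, length.out = 100))
process_chunk <- function(x) sum(x)

results <- mclapply(chunks, process_chunk, mc.cores = 16)

# Workers killed by the OOM killer come back as NULL entries;
# workers that raised an R error come back as try-error objects
failed <- vapply(
    results,
    function(r) is.null(r) || inherits(r, "try-error"),
    logical(1)
)

if (any(failed)) {
    # "Consensus" exit: abort loudly if any worker died or errored
    stop(sum(failed), " of ", length(failed), " workers failed")
}
```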
OK, so running with 4 threads and allocating 160 GB let the Numbat jobs run successfully. I checked the stderr for each and there were no memory issues. Also, the SLURM config on our cluster is set up so that a job can go slightly over its memory request, depending on the requests/usage of other jobs on the node. Using 16 threads goes way over and starts killing threads, as mentioned in the original issue.
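In other words, the budget seems to work out roughly like this (a back-of-the-envelope sketch; the single-thread peak figure is hypothetical, so read the real one from the sacct MaxRSS of a 1-thread run):

```r
# Rough budgeting sketch, assuming each forked worker can grow to
# roughly the single-threaded peak memory footprint
total_alloc_gb        <- 160
peak_single_thread_gb <- 35   # hypothetical; check sacct MaxRSS for a 1-thread run
max_safe_workers      <- floor(total_alloc_gb / peak_single_thread_gb)  # -> 4
```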
Hello,
Thanks for this tool.
I submitted some run_numbat jobs to our cluster. They appear to have output all the result files, and the plots and data files all seem to be there.
But the stderr of the job output has a bunch of errors. The job state says OUT_OF_MEMORY, yet the exit code is 0, meaning it was treated as successful. Also, the memory utilised is 470.93 GB.
I've attached the log and the stderr:
log.txt
Numbat.AOCS_055_2_0.Step2_run_numbat.19079833.papr-res-compute01.err.txt
Here is my R sessionInfo()
This happens with all the samples I have run so far (about 20); I am just attaching the output of one sample as a reference. The samples have anywhere from 6,000 to 22,000 cells. For example, here is another sample's stderr and log:
log.txt
Numbat.AOCS_060_2_9.Step2_run_numbat.19079835.papr-res-compute02.err.txt
I am not sure whether all of this is normal behaviour for the tool, or whether something is wrong.
Thanks so much,
Ahwan