slurm scheduler not working with slurm-wlm-torque qstat #280

nemartins · 2023-01-03T16:49:05Z

I'm trying to run bpipe on a slurm cluster.
This cluster does not have qstat installed, so the pipeline never progresses to the next step.
I've tried to use qstat from slurm-wlm-torque package, but there's no xml output option.

Is it possible to create a SlurmStatusMonitor that removes the dependency on qstat's XML output, or that uses the native Slurm tools (sstat, scontrol)?

Thanks in advance,
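
For illustration, a minimal sketch of what a status lookup based on a native Slurm tool might look like, assuming `scontrol show job` prints `KEY=VALUE` pairs that include a `JobState` field. The function names `parse_job_state` and `slurm_job_state` are hypothetical and are not part of bpipe.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `scontrol show job` output on stdin and print
# the value of its JobState field (for example RUNNING or COMPLETED).
parse_job_state() {
    # Put each KEY=VALUE token on its own line, then pick out JobState.
    tr -s ' \t' '\n' | awk -F= '$1 == "JobState" { print $2; exit }'
}

# Hypothetical wrapper: look up a single job by id with the native tool.
slurm_job_state() {
    scontrol show job "$1" | parse_job_state
}
```

Calling `slurm_job_state 12345` would print only the state for that job. This queries one job per call, so it only removes the qstat dependency and does nothing for efficiency.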

ssadedin · 2023-01-05T22:46:08Z

Interesting that you are the first to run into this (or at least the first to report it). I guess it must be unusual for SLURM clusters to not have qstat installed.

The implementation that does not require the XML output option is still supported, so I think we should be able to convince the SLURM executor to use it. As you say, it'd be nice if we can make something more efficient but as a fallback it should work.

I don't have a good test method for SLURM right now. I will see if I can set up something that uses AWS ParallelCluster so I can get this working properly.

Sorry for the trouble. I will look into what to do.

ssadedin · 2023-01-07T11:25:30Z

@nemartins I have just put in a commit that I think should fix the SLURM issue, at least in a basic sense. I was able to create a test cluster on AWS and confirmed that it seemed to work. If you are able to build from master and try it out, that would be great. Otherwise, let me know and I can provide you with a build, or you can test it with the next release. Thanks for reporting this issue!

nemartins · 2023-01-07T21:07:34Z

Thank you for looking into this!
In the meantime I was able to hack together a small script that converts the output to the XML format the executor expects, and it worked OK.

I will try to build it from master and run it on the cluster early next week.

Thanks again

ssadedin · 2023-01-08T00:11:18Z

That's a very clever way to solve it. It would be interesting to see the script if you are willing to share it. It could allow us to use the pooled status monitor with SLURM, which would be a better solution. The current solution issues an individual job status command for every active job every minute or so, which is not very scalable; that is why the pooled status monitor, which queries multiple jobs at a time, was introduced.

Let me know how it goes!
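
To make the difference concrete, here is a rough sketch of a pooled query using native Slurm tooling, assuming `squeue` is available on the cluster. The helper names `join_job_ids` and `pooled_job_states` are hypothetical; this is not how bpipe's pooled status monitor is implemented, only an illustration of asking for many jobs in one command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join job ids with commas, e.g. "101 102" -> "101,102".
join_job_ids() {
    local IFS=,
    printf '%s\n' "$*"
}

# Hypothetical pooled query: a single squeue call covering every id given.
# --noheader drops the title row; %i is the job id and %T the full state name.
pooled_job_states() {
    squeue --noheader --jobs="$(join_job_ids "$@")" --format='%i %T'
}
```

`pooled_job_states 101 102 103` runs one command instead of three. Jobs that have already left the queue may not be listed, so a real monitor would need some fallback for those.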

nemartins · 2023-01-09T18:49:38Z

I've run bpipe from master, and it works well. Thank you for the quick solution!

Here's the script I came up with. It could probably be much simpler and more elegant, but it was a rushed job and the first time I've used jq/yq.

# Drop the first argument (presumably the XML flag the executor passes) and keep the rest.
params=("${@:2}")
# Query job status; the header rewrites below match the qstat table ("Job id", "Time Use").
qstat "${params[@]}" |\
     sed -e 's|Job id|JobID|g' -e 's|Time Use|TimeUse|g' |\
     csvtk space2tab --comment-char '-' |\
     csvtk csv2json -t |\
     jq -c '.[]' |\
     jq -n 'reduce inputs as $line ({};. + { ("DataZ"+$line.JobID) : { "Job": {"Job_Id": ($line.JobID),"job_state": ($line.S)}} })' |\
     yq -o xml |\
     sed -e 's|DataZ.*>|Data>|g' |\
     tr -d "\n" | tr -d ' ' |\
     awk -v RS='</Data>' -v ORS='</Data>\n' ' {print}'

Best
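
As a possible simplification of the script above, the same conversion can be sketched with awk alone. This assumes the qstat table has the columns Job id, Name, User, Time Use, S and Queue, with a dashed separator row and no spaces inside any value, and it mirrors the one-`<Data>`-element-per-line layout that the pipeline above appears to produce. The function name `qstat_table_to_xml` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical converter: qstat-style table on stdin, one <Data> element per job on stdout.
# Assumes the job id is the first field and the state (S) the second-to-last field.
qstat_table_to_xml() {
    awk '
        # Skip the header row, the dashed separator row and blank rows.
        NR == 1 || /^-/ || NF < 2 { next }
        {
            printf "<Data><Job><Job_Id>%s</Job_Id><job_state>%s</job_state></Job></Data>\n", $1, $(NF-1)
        }
    '
}
```

It could be used as `qstat "${params[@]}" | qstat_table_to_xml`. No XML escaping is done, which is only safe while job ids and states contain no markup characters.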
