slurm scheduler not working with slurm-wlm-torque qstat #280

Open
nemartins opened this issue Jan 3, 2023 · 5 comments

Comments

@nemartins

I'm trying to run bpipe on a slurm cluster.
This cluster does not have qstat installed, so the pipeline never progresses to the next step.
I've tried to use qstat from slurm-wlm-torque package, but there's no xml output option.

Is it possible to create a SlurmStatusMonitor that removes the dependency on the qstat XML output, or that uses the native SLURM tools (sstat, scontrol)?

Thanks in advance,
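
For illustration, a status check based on the native tools could be sketched roughly as follows. This is a minimal sketch and not bpipe's implementation: it assumes that `scontrol show job <id>` prints whitespace-separated `Key=Value` pairs including a `JobState=` entry, and the function names `parse_job_state` and `slurm_job_state` are hypothetical.

```shell
# Extract the JobState value from `scontrol show job <id>` output on stdin.
# Assumes whitespace-separated Key=Value pairs, e.g. "JobState=RUNNING Reason=None".
parse_job_state() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^JobState=/) {
                sub(/^JobState=/, "", $i)   # keep only the value
                print $i
                exit
            }
        }
    }'
}

# Hypothetical wrapper: print the state of one job, or UNKNOWN when scontrol
# reports nothing (for example, the job is no longer known to the controller).
slurm_job_state() {
    state=$(scontrol show job "$1" 2>/dev/null | parse_job_state)
    echo "${state:-UNKNOWN}"
}
```

Keeping the parser separate from the `scontrol` call means it can be checked against captured output without access to a cluster.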

@ssadedin
Owner

ssadedin commented Jan 5, 2023

Interesting that you are the first to run into this (or at least the first to report it). I guess it must be unusual for SLURM clusters to not have qstat installed.

The implementation that does not require the XML output option is still supported, so I think we should be able to convince the SLURM executor to use it. As you say, it'd be nice if we can make something more efficient but as a fallback it should work.

I don't have a good test method for SLURM right now - I will see if I can set up something that uses AWS parallelcluster so I can get this working properly.

Sorry for the trouble; I will look into what to do.

@ssadedin
Owner

ssadedin commented Jan 7, 2023

@nemartins I have just put in a commit that I think should fix the SLURM issue, at least in a basic sense. I was able to create a test cluster on AWS and confirmed that it seemed to work. If you are able to build from master and try it out, that would be great. Otherwise, let me know and I can provide you with a build, or you can test it with the next release. Thanks for reporting this issue!

@nemartins
Author

Thank you for looking into this!
In the meantime I was able to hack together a small script to pipe the output to the xml format that the executor expects and it worked ok.

I will try to build it from master and run it on the cluster early next week.

Thanks again

@ssadedin
Owner

ssadedin commented Jan 8, 2023

That's a very clever way to solve it. It would be interesting to see the script if you are willing to share it. It could allow us to use the pooled status monitor with SLURM, which would be a better solution: the current solution issues an individual job status command for every active job every minute or so, which is not very scalable. That is why the pooled status monitor, which queries multiple jobs at a time, was introduced.

Let me know how it goes!
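
To make the difference concrete, a pooled query with native SLURM tools might look something like the sketch below: one `squeue` call covers every active job instead of one status command per job. This is only an illustration, not how bpipe's pooled status monitor is implemented. The function names are hypothetical, and how `squeue -j` behaves when some of the listed ids have already been purged should be checked on the target cluster.

```shell
# Join positional arguments with commas, the list format that `squeue -j` accepts.
join_ids() {
    (IFS=,; echo "$*")
}

# Read "<jobid> <state>" lines on stdin (as printed by: squeue -h -o '%i %T')
# and print one "<jobid> <state>" line per requested id. Ids missing from the
# input are reported as UNKNOWN, since finished jobs drop out of the queue.
report_states() {
    awk -v ids="$(join_ids "$@")" '
        { state[$1] = $2 }
        END {
            n = split(ids, want, ",")
            for (i = 1; i <= n; i++) {
                id = want[i]
                print id, ((id in state) ? state[id] : "UNKNOWN")
            }
        }'
}

# Hypothetical pooled poll: a single squeue call for all active jobs.
poll_jobs() {
    squeue -h -j "$(join_ids "$@")" -o '%i %T' 2>/dev/null | report_states "$@"
}
```

Calling `poll_jobs 101 102 103` once per polling interval would replace three separate status commands with one.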

@nemartins
Author

I've run bpipe from master, and it works well. Thank you for the quick solution!

Here's the script I've come up with. It could probably be much simpler and more elegant, but it was a rush job and the first time I've used jq/yq.

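# Drop the first argument (bash slice of "$@") and pass the rest to qstat.
export params="${@:2:1000}"
# Convert qstat's plain-text table into one <Data>...</Data> XML record per job:
#   sed    removes spaces from the "Job id" and "Time Use" header names
#   csvtk  turns the space-aligned table into TSV (skipping the dashed row), then JSON
#   jq     emits one object per row, then keys each row as DataZ<JobID>
#   yq     renders the JSON as XML; the second sed renames DataZ<JobID> tags to Data
#   tr/awk strip whitespace and print one </Data>-terminated record per line
qstat $params |\
     sed -e 's|Job id|JobID|g' -e 's|Time Use|TimeUse|g' |\
     csvtk space2tab --comment-char '-' |\
     csvtk csv2json -t |\
     jq -c .[] |\
     jq -n 'reduce inputs as $line ({};. + { ("DataZ"+$line.JobID) : { "Job": {"Job_Id": ($line.JobID),"job_state": ($line.S)}} })' |\
     yq -o xml |\
     sed -e 's|DataZ.*>|Data>|g' |\
     tr -d "\n" | tr -d "\ " |\
     awk -v RS='</Data>' -v ORS='</Data>\n' ' {print}'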

Best
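
As a possible simplification, the same conversion can be sketched with awk alone, without csvtk, jq, or yq. This is a rough sketch under stated assumptions: the record shape (`<Data><Job><Job_Id>..</Job_Id><job_state>..</job_state></Job></Data>`, one per line) is inferred from the pipeline above rather than checked against what bpipe's parser requires, the state is assumed to be the fifth whitespace-separated column of the qstat table, and the function name `qstat_to_xml` is hypothetical.

```shell
# Convert a plain-text qstat table on stdin into one XML record per job.
# Assumes columns "Job id, Name, Username, Time Use, S, Queue" with no spaces
# inside the values, so the job id is field 1 and the state is field 5.
qstat_to_xml() {
    awk '
        /^Job id/ { next }   # header row
        /^-/      { next }   # dashed separator row
        NF >= 5 {
            printf "<Data><Job><Job_Id>%s</Job_Id><job_state>%s</job_state></Job></Data>\n", $1, $5
        }'
}
```

Usage would mirror the original wrapper, for example `qstat "${@:2}" | qstat_to_xml` in bash.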
