gpu grafana panel #3622
Comments
I will note that the ability to tie a given GPU to a job is specific to OSC and does not come from exporters like the one NVIDIA provides. We have a job prolog that writes out a static metric so that we can use PromQL to tie a job to the GPUs assigned to it. The tools from the community and NVIDIA can only monitor a node's GPUs as a whole.
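As a rough illustration of that idea (not OSC's actual prolog), the sketch below writes one static sample per assigned GPU into a directory read by node exporter's textfile collector, then builds a PromQL expression that joins a dcgm-exporter metric against it. The metric name `job_gpu_info`, the `jobid` label, the file name, and the join labels `host` and `gpu` are all assumptions made for the example; a real deployment would use whatever labels its exporters actually share.

```python
import os
import tempfile


def render_job_gpu_metrics(job_id: str, gpu_ids: list[str]) -> str:
    """Render one static gauge sample per GPU assigned to the job."""
    lines = [
        # job_gpu_info is a hypothetical metric name chosen for this sketch.
        "# HELP job_gpu_info GPU assigned to a batch job",
        "# TYPE job_gpu_info gauge",
    ]
    for gpu in gpu_ids:
        lines.append(f'job_gpu_info{{jobid="{job_id}",gpu="{gpu}"}} 1')
    return "\n".join(lines) + "\n"


def write_textfile_metric(directory: str, job_id: str, gpu_ids: list[str]) -> str:
    """Atomically write the samples to <directory>/job_gpu_<job_id>.prom."""
    path = os.path.join(directory, f"job_gpu_{job_id}.prom")
    # Write to a temporary file first so the collector never reads a partial file;
    # the textfile collector only parses files ending in .prom.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        fh.write(render_job_gpu_metrics(job_id, gpu_ids))
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter may run as another user
    os.replace(tmp, path)
    return path


def build_job_gpu_query(
    job_id: str,
    dcgm_metric: str = "DCGM_FI_DEV_GPU_UTIL",
    join_labels: tuple[str, ...] = ("host", "gpu"),  # assumed shared labels
) -> str:
    """Build a PromQL join that keeps only the GPUs assigned to job_id."""
    on = ", ".join(join_labels)
    return (
        f"{dcgm_metric} * on({on}) group_left(jobid) "
        f'job_gpu_info{{jobid="{job_id}"}}'
    )
```

A prolog would call `write_textfile_metric` with the scheduler's job ID and assigned GPU indices, and a matching epilog would delete the file. Because each sample has the value 1, multiplying by it leaves the GPU metric unchanged while restricting the result to that job's GPUs.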
Thanks. I've commented on the Discourse thread asking if they can share how they do it. Maybe there's some documentation we could add for this.
Don't know if you're following the Discourse post, but it seems they built an exporter, https://github.com/plazonic/nvidia_gpu_prometheus_exporter/tree/master, with some additional prolog and epilog pieces.
Ah custom exporter, we just use the one from NVIDIA: https://github.com/NVIDIA/dcgm-exporter. Our prolog script:
The metrics are written to a location that is picked up by node exporter. Epilog:
PromQL from our dashboards that tie a job to a given GPU:
With how our Grafana integration works, we could integrate this now, I think. The GPU panels are already part of the "OnDemand Clusters" dashboard we use for CPU and memory.
I think we'd just need some mechanism to show the GPU panels only when the job is a GPU job. The schema for cluster YAML in OnDemand would just need to handle one or two more keys, maybe like:
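The gating idea in this comment could be sketched roughly as below. This is not OnDemand code and not the schema proposed above: the panel keys `gpu_util` and `gpu_memory`, the panel IDs, and the `gpu_count` argument are hypothetical, chosen only to show panels being filtered by whether the job requested GPUs.

```python
# Hypothetical keys that a cluster config might add alongside "cpu" and "memory".
GPU_PANEL_KEYS = ("gpu_util", "gpu_memory")


def visible_panels(panels: dict[str, int], gpu_count: int) -> dict[str, int]:
    """Return the panels to embed for a job, hiding GPU panels for non-GPU jobs."""
    if gpu_count > 0:
        return dict(panels)
    return {name: pid for name, pid in panels.items() if name not in GPU_PANEL_KEYS}


# Example: panel name -> Grafana panel ID (IDs are made up for illustration).
panels = {"cpu": 20, "memory": 24, "gpu_util": 30, "gpu_memory": 31}
cpu_only_job = visible_panels(panels, gpu_count=0)
gpu_job = visible_panels(panels, gpu_count=2)
```

With this shape, clusters that never configure the GPU keys behave exactly as before, and CPU-only jobs on GPU-capable clusters do not show empty GPU panels.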
Someone on Discourse is asking for gpu panel support in the Activejobs Grafana integration: https://discourse.openondemand.org/t/grafana-in-ood-ability-to-embed-other-panels/3575
Given the rise of AI and the growing demand for GPUs, we should likely support this case, even if most jobs don't use GPUs.