gpu grafana panel #3622

Open
johrstrom opened this issue Jun 21, 2024 · 6 comments

@johrstrom
Contributor

Someone on Discourse is asking for GPU panel support in the Active Jobs Grafana integration.
https://discourse.openondemand.org/t/grafana-in-ood-ability-to-embed-other-panels/3575

Given the rise of AI and the growing demand for GPUs, we should likely support this case, even if many or most jobs don't use GPUs.

@osc-bot added this to the Backlog milestone on Jun 21, 2024
@treydock
Contributor

I will note that the ability to tie a given GPU to a job is something very specific to OSC and does not come from exporters like the one NVIDIA provides. We have a job prolog that writes out a static metric in such a way that we can use PromQL to tie a job to the GPUs assigned to it. The tools from the community and NVIDIA can only monitor a whole node's GPUs.
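
For illustration, such a static metric might look like the following (hypothetical job ID and GPU index; the real metric is generated by the prolog script shared later in this thread):

slurm_job_gpu_info{jobid="123456",gpu="0"} 1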

@johrstrom
Copy link
Contributor Author

Thanks. I've commented on the Discourse thread asking if they can share how they do it. Maybe there's some documentation we could add for the same.

@johrstrom
Contributor Author

I don't know if you're following the Discourse post, but it seems they built a custom exporter, https://github.com/plazonic/nvidia_gpu_prometheus_exporter/tree/master, along with some additional prolog and epilog scripts.

@treydock
Contributor

Ah, a custom exporter; we just use the one from NVIDIA: https://github.com/NVIDIA/dcgm-exporter.
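
For anyone wanting to try dcgm-exporter itself, its README shows a container-based quick start along these lines (image tag elided; this is illustrative, not from this thread):

# Run NVIDIA's dcgm-exporter and scrape its metrics endpoint.
docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
curl localhost:9400/metrics   # exposes DCGM_FI_DEV_GPU_UTIL and related series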

Our prolog script:

if [ "x${CUDA_VISIBLE_DEVICES}" != "x" ]; then
  GPU_INFO_PROM=${METRICS_DIR}/slurm_job_gpu_info-${SLURM_JOB_ID}.prom
  cat > $GPU_INFO_PROM.$$ <<EOF
# HELP slurm_job_gpu_info GPU Assigned to a SLURM job
# TYPE slurm_job_gpu_info gauge
EOF

  OIFS=$IFS
  IFS=','
  for gpu in $CUDA_VISIBLE_DEVICES ; do
    echo "slurm_job_gpu_info{jobid=\"${SLURM_JOB_ID}\",gpu=\"${gpu}\"} 1" >> $GPU_INFO_PROM.$$
  done
  IFS=$OIFS

  /bin/mv -f $GPU_INFO_PROM.$$ $GPU_INFO_PROM
fi

exit 0

The metrics are written to a directory that is picked up by node exporter's textfile collector.
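
As a sketch of that wiring (the directory path below is a placeholder, not from this thread), node exporter just needs its textfile collector pointed at the same directory the prolog writes to:

# METRICS_DIR in the prolog must match this directory.
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector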

Epilog:

# Epilog: remove the job's metric file so the slurm_job_gpu_info series
# disappears once the job ends.
GPU_INFO_PROM=${METRICS_DIR}/slurm_job_gpu_info-${SLURM_JOB_ID}.prom
rm -f "${GPU_INFO_PROM}"

exit 0

@treydock
Contributor

PromQL from our dashboards that ties a job to a given GPU:

DCGM_FI_DEV_GPU_UTIL{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
DCGM_FI_DEV_MEM_COPY_UTIL{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
DCGM_FI_DEV_FB_USED{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
max((DCGM_FI_DEV_FB_FREE{cluster="$cluster",host=~"$host"} + DCGM_FI_DEV_FB_USED{cluster="$cluster",host=~"$host"}) * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}) by (cluster)
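
To spell out how the join works (my reading of the queries above, with made-up values): slurm_job_gpu_info is always 1, so multiplying by it with ON(host,gpu) filters the per-node DCGM series down to just the GPUs assigned to the job:

# dcgm-exporter reports every GPU on the node:
DCGM_FI_DEV_GPU_UTIL{host="gpu01",gpu="0"} 87
DCGM_FI_DEV_GPU_UTIL{host="gpu01",gpu="1"} 12
# the prolog metric lists only the job's GPUs (value always 1):
slurm_job_gpu_info{host="gpu01",jobid="123456",gpu="0"} 1
# the product with ON(host,gpu) keeps only matching series:
{host="gpu01",gpu="0"} 87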

Given how our Grafana integration works, I think we could integrate this now. The GPU panels are already part of the "OnDemand Clusters" dashboard we use for CPU and memory.

[Screenshot attached, 2024-06-22]

@treydock
Contributor

I think we'd just need some mechanism to show the GPU panels only when the job is a GPU job. The schema for the cluster YAML in OnDemand would just need to handle one or two more keys, maybe like the following (a fuller sketch follows the snippet):

cpu: 20
memory: 24
gpu-util: <num for panel>
gpu-mem: <num for panel>
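
For reference, a sketch of where those keys might land, assuming the existing v2.custom.grafana.dashboard.panels layout in the cluster YAML (the host, uid, and gpu panel IDs below are placeholders/hypothetical):

v2:
  custom:
    grafana:
      host: "https://grafana.example.com"  # placeholder
      orgId: 1
      dashboard:
        name: "ondemand-clusters"
        uid: "<dashboard uid>"
        panels:
          cpu: 20
          memory: 24
          gpu-util: 42  # hypothetical panel ID
          gpu-mem: 43   # hypothetical panel ID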
