gpu grafana panel #3622

Open
johrstrom opened this issue Jun 21, 2024 · 6 comments

@johrstrom
Contributor

Someone on Discourse is asking for GPU panel support in the Active Jobs Grafana integration.
https://discourse.openondemand.org/t/grafana-in-ood-ability-to-embed-other-panels/3575

Given the rise of AI and the growing demand for GPUs, we should likely support this case, even if many or most jobs don't use GPUs.

@osc-bot added this to the Backlog milestone on Jun 21, 2024
@treydock
Contributor

I will note that the ability to tie a given GPU to a job is something very specific to OSC and does not come from exporters like the one NVIDIA provides. We have a job prolog that writes out a static metric in such a way that we can use PromQL to tie a job to the GPUs assigned to it. The tools from the community and NVIDIA can only monitor a whole node's GPUs.
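
For illustration, such a static metric might look like the following (hypothetical job ID and GPU index; the real metric is generated by the prolog script shared later in this thread):

slurm_job_gpu_info{jobid="123456",gpu="0"} 1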

@johrstrom
Copy link
Contributor Author

Thanks. I've commented on the Discourse thread asking if they can share how they do it. Maybe there's some documentation we could add for the same.

@johrstrom
Contributor Author

I don't know if you're following the Discourse post, but it seems they built a custom exporter, https://github.com/plazonic/nvidia_gpu_prometheus_exporter/tree/master, along with some additional prolog and epilog scripts.

@treydock
Contributor

Ah, a custom exporter; we just use the one from NVIDIA: https://github.com/NVIDIA/dcgm-exporter.
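
For anyone wanting to try dcgm-exporter itself, its README shows a container-based quick start along these lines (image tag elided; this is illustrative, not from this thread):

# Run NVIDIA's dcgm-exporter and scrape its metrics endpoint.
docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
curl localhost:9400/metrics   # exposes DCGM_FI_DEV_GPU_UTIL and related series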

Our prolog script:

if [ "x${CUDA_VISIBLE_DEVICES}" != "x" ]; then
  GPU_INFO_PROM=${METRICS_DIR}/slurm_job_gpu_info-${SLURM_JOB_ID}.prom
  cat > $GPU_INFO_PROM.$$ <<EOF
# HELP slurm_job_gpu_info GPU Assigned to a SLURM job
# TYPE slurm_job_gpu_info gauge
EOF

  OIFS=$IFS
  IFS=','
  for gpu in $CUDA_VISIBLE_DEVICES ; do
    echo "slurm_job_gpu_info{jobid=\"${SLURM_JOB_ID}\",gpu=\"${gpu}\"} 1" >> $GPU_INFO_PROM.$$
  done
  IFS=$OIFS

  /bin/mv -f $GPU_INFO_PROM.$$ $GPU_INFO_PROM
fi

exit 0

The metrics are written to a directory that is picked up by node exporter's textfile collector.
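
As a sketch of that wiring (the directory path below is a placeholder, not from this thread), node exporter just needs its textfile collector pointed at the same directory the prolog writes to:

# METRICS_DIR in the prolog must match this directory.
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector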

Epilog:

# Epilog: remove the job's metric file so the slurm_job_gpu_info series
# disappears once the job ends.
GPU_INFO_PROM=${METRICS_DIR}/slurm_job_gpu_info-${SLURM_JOB_ID}.prom
rm -f "${GPU_INFO_PROM}"

exit 0

@treydock
Contributor

PromQL from our dashboards that ties a job to a given GPU:

DCGM_FI_DEV_GPU_UTIL{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
DCGM_FI_DEV_MEM_COPY_UTIL{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
DCGM_FI_DEV_FB_USED{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
max((DCGM_FI_DEV_FB_FREE{cluster="$cluster",host=~"$host"} + DCGM_FI_DEV_FB_USED{cluster="$cluster",host=~"$host"}) * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}) by (cluster)
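
To spell out how the join works (my reading of the queries above, with made-up values): slurm_job_gpu_info is always 1, so multiplying by it with ON(host,gpu) filters the per-node DCGM series down to just the GPUs assigned to the job:

# dcgm-exporter reports every GPU on the node:
DCGM_FI_DEV_GPU_UTIL{host="gpu01",gpu="0"} 87
DCGM_FI_DEV_GPU_UTIL{host="gpu01",gpu="1"} 12
# the prolog metric lists only the job's GPUs (value always 1):
slurm_job_gpu_info{host="gpu01",jobid="123456",gpu="0"} 1
# the product with ON(host,gpu) keeps only matching series:
{host="gpu01",gpu="0"} 87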

Given how our Grafana integration works, I think we could integrate this now. The GPU panels are already part of the "OnDemand Clusters" dashboard we use for CPU and memory.

[Screenshot attached, 2024-06-22]

@treydock
Contributor

I think we'd just need some mechanism to show the GPU panels only when the job is a GPU job. The schema for the cluster YAML in OnDemand would just need to handle one or two more keys, maybe like the following (a fuller sketch follows the snippet):

cpu: 20
memory: 24
gpu-util: <num for panel>
gpu-mem: <num for panel>
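
For reference, a sketch of where those keys might land, assuming the existing v2.custom.grafana.dashboard.panels layout in the cluster YAML (the host, uid, and gpu panel IDs below are placeholders/hypothetical):

v2:
  custom:
    grafana:
      host: "https://grafana.example.com"  # placeholder
      orgId: 1
      dashboard:
        name: "ondemand-clusters"
        uid: "<dashboard uid>"
        panels:
          cpu: 20
          memory: 24
          gpu-util: 42  # hypothetical panel ID
          gpu-mem: 43   # hypothetical panel ID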
