Sy/slurm integration #18893

Draft
wants to merge 17 commits into
base: master
29 changes: 19 additions & 10 deletions .codecov.yml
@@ -314,18 +314,18 @@ coverage:
target: 75
flags:
- kong
KubeVirt_Handler:
KubeVirt_API:
target: 75
flags:
- kubevirt_handler
- kubevirt_api
KubeVirt_Controller:
target: 75
flags:
- kubevirt_controller
KubeVirt_API:
KubeVirt_Handler:
target: 75
flags:
- kubevirt_api
- kubevirt_handler
Kube_DNS:
target: 75
flags:
@@ -550,6 +550,10 @@ coverage:
target: 75
flags:
- singlestore
Slurm:
target: 75
flags:
- slurm
Snowflake:
target: 75
flags:
@@ -1175,21 +1179,21 @@ flags:
paths:
- kubernetes_state/datadog_checks/kubernetes_state
- kubernetes_state/tests
kubevirt_handler:
kubevirt_api:
carryforward: true
paths:
- kubevirt_handler/datadog_checks/kubevirt_handler
- kubevirt_handler/tests
- kubevirt_api/datadog_checks/kubevirt_api
- kubevirt_api/tests
kubevirt_controller:
carryforward: true
paths:
- kubevirt_controller/datadog_checks/kubevirt_controller
- kubevirt_controller/tests
kubevirt_api:
kubevirt_handler:
carryforward: true
paths:
- kubevirt_api/datadog_checks/kubevirt_api
- kubevirt_api/tests
- kubevirt_handler/datadog_checks/kubevirt_handler
- kubevirt_handler/tests
kyototycoon:
carryforward: true
paths:
@@ -1410,6 +1414,11 @@ flags:
paths:
- singlestore/datadog_checks/singlestore
- singlestore/tests
slurm:
carryforward: true
paths:
- slurm/datadog_checks/slurm
- slurm/tests
snmp:
carryforward: true
paths:
20 changes: 20 additions & 0 deletions .github/workflows/test-all.yml
@@ -3154,6 +3154,26 @@ jobs:
minimum-base-package: ${{ inputs.minimum-base-package }}
pytest-args: ${{ inputs.pytest-args }}
secrets: inherit
j1e78283:
uses: ./.github/workflows/test-target.yml
with:
job-name: Slurm
target: slurm
platform: linux
runner: '["ubuntu-22.04"]'
repo: "${{ inputs.repo }}"
python-version: "${{ inputs.python-version }}"
standard: ${{ inputs.standard }}
latest: ${{ inputs.latest }}
agent-image: "${{ inputs.agent-image }}"
agent-image-py2: "${{ inputs.agent-image-py2 }}"
agent-image-windows: "${{ inputs.agent-image-windows }}"
agent-image-windows-py2: "${{ inputs.agent-image-windows-py2 }}"
test-py2: ${{ inputs.test-py2 }}
test-py3: ${{ inputs.test-py3 }}
minimum-base-package: ${{ inputs.minimum-base-package }}
pytest-args: ${{ inputs.pytest-args }}
secrets: inherit
j06ca546:
uses: ./.github/workflows/test-target.yml
with:
4 changes: 4 additions & 0 deletions slurm/CHANGELOG.md
@@ -0,0 +1,4 @@
# CHANGELOG - Slurm

<!-- towncrier release notes start -->

60 changes: 60 additions & 0 deletions slurm/README.md
@@ -0,0 +1,60 @@
# Agent Check: Slurm

## Overview

This check monitors [Slurm][1] through the Datadog Agent.

Include a high level overview of what this integration does:
- What does your product do (in 1-2 sentences)?
- What value will customers get from this integration, and why is it valuable to them?
- What specific data will your integration monitor, and what's the value of that data?

## Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the [Autodiscovery Integration Templates][3] for guidance on applying these instructions.

### Installation

The Slurm check is included in the [Datadog Agent][2] package.
No additional installation is needed on your server.

### Configuration

1. Edit the `slurm.d/conf.yaml` file in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your Slurm performance data. See the [sample slurm.d/conf.yaml][4] for all available configuration options.

2. [Restart the Agent][5].

### Validation

[Run the Agent's status subcommand][6] and look for `slurm` under the Checks section.

## Data Collected

### Metrics

See [metadata.csv][7] for a list of metrics provided by this integration.

### Events

The Slurm integration does not include any events.

### Service Checks

The Slurm integration does not include any service checks.

## Troubleshooting

Need help? Contact [Datadog support][9].


[1]: **LINK_TO_INTEGRATION_SITE**
[2]: https://app.datadoghq.com/account/settings/agent/latest
[3]: https://docs.datadoghq.com/agent/kubernetes/integrations/
[4]: https://github.com/DataDog/integrations-core/blob/master/slurm/datadog_checks/slurm/data/conf.yaml.example
[5]: https://docs.datadoghq.com/agent/guide/agent-commands/#start-stop-and-restart-the-agent
[6]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information
[7]: https://github.com/DataDog/integrations-core/blob/master/slurm/metadata.csv
[8]: https://github.com/DataDog/integrations-core/blob/master/slurm/assets/service_checks.json
[9]: https://docs.datadoghq.com/help/
133 changes: 133 additions & 0 deletions slurm/assets/configuration/spec.yaml
@@ -0,0 +1,133 @@
name: Slurm
files:
- name: slurm.yaml
options:
- template: init_config
options:
- name: slurm_binaries_dir
description: |
The directory in which the Slurm binaries are located. These are mainly:
sinfo, squeue, sacct, sdiag and sshare.
value:
type: string
example: /usr/bin/
- template: init_config/default
- template: instances
options:
- name: collect_sinfo_stats
description: Whether or not to collect statistics from the sinfo command.
value:
type: boolean
example: True
- name: collect_sdiag_stats
description: Whether or not to collect statistics from the sdiag command.
value:
type: boolean
example: True
- name: collect_squeue_stats
description: Whether or not to collect statistics from the squeue command.
value:
type: boolean
example: True
- name: collect_sacct_stats
description: Whether or not to collect statistics from the sacct command.
value:
type: boolean
example: True
- name: collect_sshare_stats
description: Whether or not to collect statistics from the sshare command.
value:
type: boolean
example: True
- name: collect_gpu_stats
description: Whether or not to collect GPU statistics when Slurm is configured to use GPUs.
value:
type: boolean
example: False
- name: sinfo_collection_level
description: |
The level of detail to collect from the sinfo command. The default is 1 (basic). Available options are 1, 2
and 3. Level 1 collects data only for partitions. Level 2 collects data from individual nodes. Level 3 also
collects data from individual nodes, but is more verbose and includes data such as CPU and memory usage as
reported by the OS, as well as additional tags.
value:
type: integer
example: 1
- name: sinfo_path
description: Full path to the sinfo binary.
value:
type: string
example: /usr/bin/sinfo
- name: sacct_path
description: Full path to the sacct binary.
value:
type: string
example: /usr/bin/sacct
- name: sdiag_path
description: Full path to the sdiag binary.
value:
type: string
example: /usr/bin/sdiag
- name: sshare_path
description: Full path to the sshare binary.
value:
type: string
example: /usr/bin/sshare
- name: squeue_path
description: Full path to the squeue binary.
value:
type: string
example: /usr/bin/squeue
- name: debug_sinfo_stats
description: |
Whether or not to enable debug logging for the sinfo command. This will log the output of the sinfo command
to the agent log.
value:
type: boolean
example: False
hidden: True
- name: debug_sdiag_stats
description: |
Whether or not to enable debug logging for the sdiag command. This will log the output of the sdiag command
to the agent log.
value:
type: boolean
example: False
hidden: True
- name: debug_squeue_stats
description: |
Whether or not to enable debug logging for the squeue command. This will log the output of the squeue command
to the agent log.
value:
type: boolean
example: False
hidden: True
- name: debug_sshare_stats
description: |
Whether or not to enable debug logging for the sshare command. This will log the output of the sshare command
to the agent log.
value:
type: boolean
example: False
hidden: True
- name: debug_sacct_stats
description: |
Whether or not to enable debug logging for the sacct command. This will log the output of the sacct command
to the agent log.
value:
type: boolean
example: False
hidden: True
- template: instances/default
overrides:
min_collection_interval.description: |
This changes the collection interval of the check. For more information, see:
https://docs.datadoghq.com/developers/write_agent_check/#collection-interval

Most Slurm metrics are collected by calling the Slurm binaries. Depending on the size of the Slurm cluster,
this can be a very expensive operation. It is recommended to set this to a value higher than the default 15
seconds, but this can be adjusted based on the size of the cluster and the desired granularity of the metrics.
min_collection_interval.value.display_default: 15
min_collection_interval.value.default: 60
min_collection_interval.value.example: 60
min_collection_interval.enabled: true
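
A minimal sketch of how the options in this spec could fit together. The diff does not show `check.py`, so two things here are assumptions rather than the PR's behavior: that a per-instance `<command>_path` overrides `slurm_binaries_dir`, and that every `collect_<command>_stats` flag defaults to enabled. `resolve_commands` is a hypothetical helper written only for illustration.

```python
import posixpath

# Commands that have both a collect_<command>_stats flag and a <command>_path option in the spec.
SLURM_COMMANDS = ("sinfo", "squeue", "sacct", "sdiag", "sshare")
DEFAULT_BINARIES_DIR = "/usr/bin/"  # the example value given for slurm_binaries_dir


def resolve_commands(init_config: dict, instance: dict) -> dict:
    """Map each enabled Slurm command to the binary path a check could execute.

    Assumed precedence (not confirmed by the diff): a per-instance
    `<command>_path` wins over `<slurm_binaries_dir>/<command>`.
    """
    binaries_dir = init_config.get("slurm_binaries_dir", DEFAULT_BINARIES_DIR)
    resolved = {}
    for command in SLURM_COMMANDS:
        # Assumption: collectors are on unless explicitly disabled.
        if not instance.get(f"collect_{command}_stats", True):
            continue
        explicit_path = instance.get(f"{command}_path")
        resolved[command] = explicit_path or posixpath.join(binaries_dir, command)
    return resolved


if __name__ == "__main__":
    init_config = {"slurm_binaries_dir": "/opt/slurm/bin"}
    instance = {"collect_sacct_stats": False, "sinfo_path": "/usr/local/bin/sinfo"}
    for name, path in resolve_commands(init_config, instance).items():
        print(f"{name}: {path}")
```

Resolving paths once, up front, keeps the per-run work down to executing the binaries, which is the expensive part the `min_collection_interval` note warns about.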
1 change: 1 addition & 0 deletions slurm/assets/dashboards/slurm_overview.json
@@ -0,0 +1 @@
{}
1 change: 1 addition & 0 deletions slurm/assets/service_checks.json
@@ -0,0 +1 @@
[]
1 change: 1 addition & 0 deletions slurm/changelog.d/1.added
@@ -0,0 +1 @@
Initial Release

Check failure on line 1 in slurm/changelog.d/1.added

GitHub Actions / run / Check PR

Your changelog entry has the wrong PR number. To fix this please run: mv slurm/changelog.d/1.added slurm/changelog.d/18893.added
4 changes: 4 additions & 0 deletions slurm/datadog_checks/__init__.py
@@ -0,0 +1,4 @@
# (C) Datadog, Inc. 2024-present
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
__path__ = __import__('pkgutil').extend_path(__path__, __name__) # type: ignore
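
The `extend_path` line above makes `datadog_checks` a pkgutil-style namespace package, so separately installed distributions can each contribute subpackages under the same top-level name. The sketch below demonstrates that mechanism in isolation. The `demo_checks` namespace, the two fake distribution directories, and the helper names are made up for illustration and are not part of this PR.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Same one-liner as in the diff; each distribution ships an identical copy.
NAMESPACE_INIT = "__path__ = __import__('pkgutil').extend_path(__path__, __name__)\n"


def make_distribution(root: Path, namespace: str, subpackage: str) -> None:
    """Create <root>/<namespace>/<subpackage>/ with the shared namespace __init__.py."""
    pkg_dir = root / namespace / subpackage
    pkg_dir.mkdir(parents=True)
    (root / namespace / "__init__.py").write_text(NAMESPACE_INIT)
    (pkg_dir / "__init__.py").write_text(f"NAME = {subpackage!r}\n")


def import_from_two_roots(namespace: str = "demo_checks") -> list:
    """Lay out two fake distributions in separate directories and import from both."""
    with tempfile.TemporaryDirectory() as tmp:
        roots = [Path(tmp) / "dist_a", Path(tmp) / "dist_b"]
        make_distribution(roots[0], namespace, "base")
        make_distribution(roots[1], namespace, "slurm")
        sys.path[:0] = [str(root) for root in roots]
        importlib.invalidate_caches()
        try:
            # extend_path scans sys.path and adds dist_b/<namespace> to __path__,
            # so the second subpackage resolves even though dist_a is found first.
            base = importlib.import_module(f"{namespace}.base")
            slurm = importlib.import_module(f"{namespace}.slurm")
            return [base.NAME, slurm.NAME]
        finally:
            for root in roots:
                sys.path.remove(str(root))


if __name__ == "__main__":
    print(import_from_two_roots())
```

Without the `extend_path` line, only the first `demo_checks` directory found on `sys.path` would be searched, and the import of the subpackage living in the second directory would fail.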
4 changes: 4 additions & 0 deletions slurm/datadog_checks/slurm/__about__.py
@@ -0,0 +1,4 @@
# (C) Datadog, Inc. 2024-present
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
__version__ = '0.0.1'
7 changes: 7 additions & 0 deletions slurm/datadog_checks/slurm/__init__.py
@@ -0,0 +1,7 @@
# (C) Datadog, Inc. 2024-present
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
from .__about__ import __version__
from .check import SlurmCheck

__all__ = ['__version__', 'SlurmCheck']