Ported monitoring stack to k3s #449

Open
wants to merge 102 commits into
base: feature/k3s-ansible-init
Changes from 92 commits
Commits
108fa7c
Added prometheus operator role compatible with state_dir (still needs…
wtripp180901 Sep 19, 2024
8f2977c
Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring
wtripp180901 Sep 20, 2024
9836ef8
Added node selectors for non-exporter pods
wtripp180901 Sep 20, 2024
d790b2b
Added services for monitoring
wtripp180901 Sep 20, 2024
10af75d
WIP porting prometheus rolevars
wtripp180901 Sep 24, 2024
7b29a3b
Added ingress for monitoring services
wtripp180901 Sep 25, 2024
b959e92
Refactored + re-enabled external labels (not sure if working)
wtripp180901 Sep 25, 2024
560eb96
replaced monitoring in site.yml and fixed sslip IPs
wtripp180901 Sep 25, 2024
0106a95
Added slurm exporter service to k3s
wtripp180901 Sep 25, 2024
a4dca77
Added ood exporter to k3s
wtripp180901 Sep 26, 2024
6081d77
added grafana metrics
wtripp180901 Sep 26, 2024
84fd355
fixed alertmanager status
wtripp180901 Sep 26, 2024
e2d1c62
Dashboards now installed into k3s (dataources not configured yet)
wtripp180901 Sep 27, 2024
cce35a9
Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring
wtripp180901 Oct 1, 2024
7afdc1d
Added slurmstats datasource
wtripp180901 Oct 1, 2024
b3020ca
enabled ips for monitoring services (except prometheus)
wtripp180901 Oct 1, 2024
0dff07f
Added grafana to state directory and made port configurable
wtripp180901 Oct 2, 2024
f7e555b
grafana can now be reverse proxied by ood
wtripp180901 Oct 3, 2024
d142a9f
Ported grafana rolevars
wtripp180901 Oct 7, 2024
7fa3609
Added slack integration default
wtripp180901 Oct 7, 2024
96edb79
Ported alertmanager rolevars
wtripp180901 Oct 7, 2024
123c573
Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring
wtripp180901 Oct 8, 2024
b13311a
removed k3s ingress
wtripp180901 Oct 8, 2024
01718ee
Services now exposed/proxied via nodeports
wtripp180901 Oct 8, 2024
74bd3ba
Removed grafana servicemonitor and moved nodeports to helm config
wtripp180901 Oct 9, 2024
e724b5d
grafana admin now definable
wtripp180901 Oct 9, 2024
9c359d9
Now adds additional rules correctly
wtripp180901 Oct 9, 2024
04a5bf3
Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring
wtripp180901 Oct 10, 2024
7f4862c
Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring
wtripp180901 Oct 11, 2024
2b97d32
Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring
wtripp180901 Oct 11, 2024
337c101
Removed monitoring binaries from build
wtripp180901 Oct 14, 2024
dd1e464
bump for CI test
wtripp180901 Oct 14, 2024
cc6bef1
ported node-exporter vars
wtripp180901 Oct 14, 2024
21a8d25
non-atomic helm install for ci test
wtripp180901 Oct 14, 2024
e1355de
Merge branch 'feature/k3s-monitoring' of github.com:stackhpc/ansible-…
wtripp180901 Oct 14, 2024
76842f1
fixed hostnames not recognised by selector and defaulted slack integr…
wtripp180901 Oct 14, 2024
185eafb
fixed k3s hostnames properly
wtripp180901 Oct 14, 2024
10d4e93
increased control node CI memory
wtripp180901 Oct 15, 2024
d1e8c0a
Refactored monitoring config and removed redundant groups
wtripp180901 Oct 15, 2024
96723f8
updated dashboard defaults
wtripp180901 Oct 15, 2024
bf9a473
fixed caas cluster name
wtripp180901 Oct 15, 2024
fdb5c23
nodeselectors now use custom labels
wtripp180901 Oct 15, 2024
4bffe4b
fixed (?) grafana zenith proxy
wtripp180901 Oct 16, 2024
b0f856e
bump images
wtripp180901 Oct 16, 2024
12e1166
added old recording rules to defaults
wtripp180901 Oct 16, 2024
9ab06a6
Merge branch 'feature/k3s-monitoring' of github.com:stackhpc/ansible-…
wtripp180901 Oct 16, 2024
5f89be8
fixed openhpc dashboard
wtripp180901 Oct 17, 2024
2d1dab5
Refactored and fixed slack integration
wtripp180901 Oct 18, 2024
43f27a5
removed unused config options
wtripp180901 Oct 18, 2024
bb928ad
review suggestions
wtripp180901 Oct 18, 2024
db91120
updated defaults
wtripp180901 Oct 18, 2024
34a779b
Merge branch 'feature/k3s-monitoring' of github.com:stackhpc/ansible-…
wtripp180901 Oct 18, 2024
7e1370d
removed grafana data volume
wtripp180901 Oct 18, 2024
934ec7a
set default dashboard to slurm exporter
wtripp180901 Oct 18, 2024
a03f7f9
added play to remove unwanted default dashboards
wtripp180901 Oct 18, 2024
886c22d
updated grafana groupvars
wtripp180901 Oct 18, 2024
c4fa2a6
added node exporter collection config
wtripp180901 Oct 21, 2024
3e1f019
removed unenforced volume size config option
wtripp180901 Oct 21, 2024
8d242f7
ondemand grafana proxying now conditional on ondemand having groups d…
wtripp180901 Oct 22, 2024
e6fbda8
standardised control ip resolution
wtripp180901 Oct 22, 2024
8b93aa2
Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring
wtripp180901 Oct 22, 2024
e6a4e4b
reduced collectors to minimal set
wtripp180901 Oct 22, 2024
e5dff96
updated docs
wtripp180901 Oct 22, 2024
ce90ab0
Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring
wtripp180901 Oct 22, 2024
4c44261
bump images
wtripp180901 Oct 23, 2024
29614fe
Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring
wtripp180901 Oct 23, 2024
f93348e
monitoring stack images now pre-pulled
wtripp180901 Oct 23, 2024
e8d2e81
moved monitoring pre-pulls to role
wtripp180901 Oct 23, 2024
311bbbc
Merge branch 'feature/k3s-monitoring' of github.com:stackhpc/ansible-…
wtripp180901 Oct 23, 2024
d002454
fixed build typo
wtripp180901 Oct 24, 2024
991613e
removed unused groupvars
wtripp180901 Oct 24, 2024
b4b69b5
removed cloudalchemy roles from install
wtripp180901 Oct 24, 2024
b0f48fd
bump images
wtripp180901 Oct 24, 2024
ec57a21
fixed some incompatibilities with old metrics
wtripp180901 Oct 24, 2024
df946dd
Merge branch 'feature/k3s-monitoring' of github.com:stackhpc/ansible-…
wtripp180901 Oct 24, 2024
a0edab7
removed container internal networking devices from grafana
wtripp180901 Oct 24, 2024
8acc2b5
openhpc dashboard now job agnostic
wtripp180901 Oct 25, 2024
ee945b5
added local copy of slurm exporter dashboard without container networ…
wtripp180901 Oct 25, 2024
46d95ef
set default dashboard to slurm jobs
wtripp180901 Oct 25, 2024
b2b673f
added ansible to migrate cloudalchemy data to KPS
wtripp180901 Oct 25, 2024
d712f69
updated docs
wtripp180901 Oct 25, 2024
d0c8781
merge conflicts
wtripp180901 Oct 29, 2024
f23c2fc
merge
wtripp180901 Oct 29, 2024
2d16356
cleaned up dashboard role
wtripp180901 Oct 29, 2024
c6b221e
moved image pre-pull list to rolevar
wtripp180901 Oct 29, 2024
d1c915e
doc changes + opensearch datasource now based on opensearch group
wtripp180901 Oct 29, 2024
b6be009
made kps default dashboards more configurable
wtripp180901 Oct 29, 2024
206134d
bump image up to date with main
wtripp180901 Oct 29, 2024
8364eb8
newline
wtripp180901 Oct 29, 2024
603e818
bumped caas minimum control node ram
wtripp180901 Oct 29, 2024
c4a4847
reduced disk footprint of container pe-pulls
wtripp180901 Nov 11, 2024
040e569
merge
wtripp180901 Nov 11, 2024
a2540f2
moved image pulls to tasks
wtripp180901 Nov 12, 2024
cd281f3
moved prometheus install to host group
wtripp180901 Nov 12, 2024
774f608
Review docs suggestions
wtripp180901 Nov 12, 2024
80a0e21
Merge branch 'feature/k3s-monitoring' of github.com:stackhpc/ansible-…
wtripp180901 Nov 12, 2024
6506e7e
added readme link
wtripp180901 Nov 12, 2024
a6d8edc
file name and defaults changes
wtripp180901 Nov 13, 2024
5864b56
disambiguated default addresses
wtripp180901 Nov 13, 2024
15b77db
separated prometheus recording and alerting rules
wtripp180901 Nov 13, 2024
acf0c0d
adding alertmanager docs
wtripp180901 Nov 13, 2024
b7d9c48
merge
wtripp180901 Nov 13, 2024
2 changes: 2 additions & 0 deletions ansible/.gitignore
@@ -62,5 +62,7 @@ roles/*
 !roles/k3s/**
 !roles/k9s/
 !roles/k9s/**
+!roles/kube_prometheus_stack
+!roles/kube_prometheus_stack/**
 !roles/lustre/
 !roles/lustre/**
76 changes: 5 additions & 71 deletions ansible/fatimage.yml
@@ -110,12 +110,6 @@
         tasks_from: install.yml
       when: "'filebeat' in group_names"

-    - import_role:
-        # can't only run cloudalchemy.node_exporter/tasks/install.yml as needs vars from preflight.yml and triggers service start
-        # however starting node exporter is ok
-        name: cloudalchemy.node_exporter
-      when: "'node_exporter' in group_names"
-
     - name: openondemand exporter
       dnf:
         name: ondemand_exporter
@@ -128,72 +122,12 @@
       vars:
         slurm_exporter_state: stopped
       when: "'slurm_exporter' in group_names"

-- hosts: prometheus
-  become: yes
-  gather_facts: yes
-  tasks:
-    - import_role:
-        name: cloudalchemy.prometheus
-        tasks_from: preflight.yml
-
-    # can't run cloudalchemy.prometheus/tasks/install.yml as it triggers a unit start
-    # so below is a partial extraction of this:
-    - name: create prometheus system group
-      group:
-        name: prometheus
-        system: true
-        state: present
-
-    - name: create prometheus system user
-      user:
-        name: prometheus
-        system: true
-        shell: "/usr/sbin/nologin"
-        group: prometheus
-        createhome: false
-        home: "{{ prometheus_db_dir }}"
-
-    - name: download prometheus binary to local folder
-      become: false
-      get_url:
-        url: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-{{ go_arch }}.tar.gz"
-        dest: "/tmp/prometheus-{{ prometheus_version }}.linux-{{ go_arch }}.tar.gz"
-        checksum: "sha256:{{ __prometheus_checksum }}"
-      register: _download_archive
-      until: _download_archive is succeeded
-      retries: 5
-      delay: 2
-
-    - name: unpack prometheus binaries
-      become: false
-      unarchive:
-        remote_src: yes
-        src: "/tmp/prometheus-{{ prometheus_version }}.linux-{{ go_arch }}.tar.gz"
-        dest: "/tmp"
-        creates: "/tmp/prometheus-{{ prometheus_version }}.linux-{{ go_arch }}/prometheus"
-
-    - name: propagate official prometheus and promtool binaries
-      copy:
-        remote_src: yes
-        src: "/tmp/prometheus-{{ prometheus_version }}.linux-{{ go_arch }}/{{ item }}"
-        dest: "{{ _prometheus_binary_install_dir }}/{{ item }}"
-        mode: 0755
-        owner: root
-        group: root
-      with_items:
-        - prometheus
-        - promtool
-
-- hosts: grafana
-  become: yes
-  gather_facts: yes
-  tasks:
-    - name: Include distribution variables for cloudalchemy.grafana
-      include_vars: "{{ appliances_repository_root }}/ansible/roles/cloudalchemy.grafana/vars/redhat.yml"
-    - import_role:
-        name: cloudalchemy.grafana
-
+    - name: kube prometheus stack
+      import_role:
+        name: kube_prometheus_stack
+        tasks_from: install.yml
+      when: "'prometheus' in group_names"

 - name: Run post.yml hook
   vars:
44 changes: 6 additions & 38 deletions ansible/monitoring.yml
@@ -33,11 +33,13 @@
     - import_role:
         name: filebeat

-- name: Deploy node_exporter
-  hosts: node_exporter
-  tags: node_exporter
+- name: Install monitoring helm chart
+  hosts: prometheus
+  become: true
+  tags: prometheus
   tasks:
-    - import_role: name=cloudalchemy.node_exporter
+    - ansible.builtin.import_role:
+        name: kube_prometheus_stack

 - name: Deploy OpenOndemand exporter
   hosts: openondemand
@@ -57,37 +59,3 @@
   tasks:
     - import_role:
         name: slurm_exporter
-
-- name: Setup core monitoring software
-  hosts: prometheus
-  tags: prometheus
-  tasks:
-    - name: Check for existing prometheus binaries
-      stat:
-        path: /usr/local/bin/{{ item }}
-      register: prometheus_binaries
-      loop:
-        - prometheus
-        - promtool
-    - name: Skip prometheus install if prometheus binaries exist and prometheus_version not defined
-      # i.e. if prometheus_version isn't defined we don't care, so use what's already there
-      set_fact:
-        prometheus_skip_install: "{{ false if prometheus_version is defined else true }}"
-      when: "{{ (prometheus_binaries.results | map(attribute='stat') | map(attribute='exists')) + [prometheus_skip_install is not defined] }}"
-    - import_role:
-        name: cloudalchemy.prometheus
-
-- name: Deploy grafana
-  hosts: grafana
-  tags: grafana
-  tasks:
-    - assert:
-        that: vault_grafana_admin_password is defined
-        fail_msg: "Must define vault_grafana_admin_password - use `ansible-playbook generate-passwords.yml` to generate a set of passwords"
-    - include_role:
-        name: cloudalchemy.grafana
-      vars:
-        # We use internal roles to register the dashboards as the role does not support all options that we require.
-        grafana_dashboards: []
-    - import_role: # done in same play so it can use handlers from cloudalchemy.grafana
-        name: grafana-dashboards
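
For orientation only: the monitoring.yml changes above replace the per-service cloudalchemy plays with a single play that imports the kube_prometheus_stack role. The role's tasks are not shown in this part of the diff, so the task below is a hypothetical sketch of what a Helm-based install could look like, not the PR's actual implementation. It assumes the kubernetes.core collection and the helm binary are available on the target host; the release name and namespace are illustrative. The Grafana admin password variable is the one asserted by the removed "Deploy grafana" play.

```yaml
# Hypothetical sketch, NOT taken from roles/kube_prometheus_stack in this PR.
# Assumes: kubernetes.core collection installed, helm binary present on the host.
- name: Install kube-prometheus-stack with Helm (illustrative)
  become: true
  kubernetes.core.helm:
    name: kube-prometheus-stack             # release name (assumed)
    chart_ref: kube-prometheus-stack
    chart_repo_url: https://prometheus-community.github.io/helm-charts
    release_namespace: monitoring           # namespace (assumed)
    create_namespace: true
    kubeconfig: /etc/rancher/k3s/k3s.yaml   # default k3s kubeconfig location
    wait: true
    release_values:
      grafana:
        # Variable name comes from the removed "Deploy grafana" play above.
        adminPassword: "{{ vault_grafana_admin_password }}"
```

With the play tagged `prometheus` as in the diff, a task like this would run whenever the playbook is invoked with that tag against hosts in the `prometheus` group.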
15 changes: 0 additions & 15 deletions ansible/roles/grafana-dashboards/defaults/main.yml

This file was deleted.
