Ported monitoring stack to k3s #449

wtripp180901 · 2024-10-14T09:10:08Z

The binary installs of the monitoring stack (Prometheus, node exporter, Grafana, Alertmanager) have been removed from both the site run and the fatimage build. The site run now installs the kube-prometheus-stack Helm chart into the k3s cluster instead. The container images are pre-pulled by podman and exported into k3s during the fatimage build.
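
To illustrate the pre-pull step, here is a minimal sketch, not the role's actual tasks. It assumes images are fetched with `podman pull`, written out with `podman save`, and placed in `/var/lib/rancher/k3s/agent/images/`, the directory from which k3s imports image tarballs when it starts. The helper names and the image reference are hypothetical.

```python
import re
from pathlib import Path

# Assumption: k3s imports image tarballs found in this directory on startup.
K3S_IMAGES_DIR = Path("/var/lib/rancher/k3s/agent/images")


def tarball_name(image: str) -> str:
    """Turn an image reference into a filesystem-safe tarball name."""
    return re.sub(r"[^A-Za-z0-9_.-]", "_", image) + ".tar"


def prepull_commands(images: list[str], dest: Path = K3S_IMAGES_DIR) -> list[list[str]]:
    """Build the podman commands that pull each image and save it for k3s."""
    commands = []
    for image in images:
        commands.append(["podman", "pull", image])
        commands.append(
            ["podman", "save", "--output", str(dest / tarball_name(image)), image]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical image reference, for illustration only.
    for cmd in prepull_commands(["quay.io/prometheus/node-exporter:v1.8.2"]):
        print(" ".join(cmd))
```

The sketch only builds the command lines, so it can be inspected without podman installed; running them is left to whatever drives the image build.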

As a consequence, the grafana, alertmanager and node exporter groups have been removed, and the associated roles are now all managed by the prometheus role, whose name is short for kube_prometheus_stack.

Also reduced the metrics collected by node exporter to the minimal set described in docs/monitoring-and-logging.README.md; that set was documented but not previously implemented.

Note that because of how OOD's proxying interacts with Grafana's server config and Kubernetes, enabling OOD means that Grafana is only accessible through the OOD proxy. In the caas environment, this means that accessing Grafana requires authenticating with OOD's basic auth. Therefore, accessing Grafana through caas no longer logs you in as the admin user; you access the dashboards anonymously instead.
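
As a rough illustration of the Grafana settings involved, the sketch below builds a Helm values fragment that serves Grafana from a proxy sub-path and enables anonymous, read-only access. It is an assumption about the shape of such a config, not the values this PR ships: the `grafana_values` helper and the example proxy path are hypothetical, while `root_url`, `serve_from_sub_path` and the `auth.anonymous` section are standard Grafana settings.

```python
import json


def grafana_values(proxy_path: str) -> dict:
    """Build a kube-prometheus-stack values fragment for proxied, anonymous Grafana."""
    return {
        "grafana": {
            "grafana.ini": {
                "server": {
                    # Serve Grafana under the proxy's sub-path rather than "/".
                    "root_url": f"%(protocol)s://%(domain)s{proxy_path}",
                    "serve_from_sub_path": True,
                },
                "auth.anonymous": {
                    # Anonymous users get read-only access to dashboards.
                    "enabled": True,
                    "org_role": "Viewer",
                },
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical proxy path, for illustration only.
    print(json.dumps(grafana_values("/node/control/30001/"), indent=2))
```

The output is JSON, which is also valid YAML, so it could be written to a file and passed to Helm as a values file.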


wtripp180901 · 2024-10-29T09:04:46Z

https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/11570541407

misclick

wtripp180901 · 2024-10-29T11:41:08Z

no image changes since last build so last commit should be ready to merge barring review changes

wtripp180901 · 2024-11-11T16:56:01Z

https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/11782814296

sjpb

Please note in the PR (with better wording!) that

the "prometheus" group is basically short for the "kube-prometheus-stack" group!
the monitoring link in CaaS now accesses grafana with anonymous auth (b/c it has to go via OOD), so CaaS users can't change their dashboards

environments/common/inventory/group_vars/all/prometheus.yml

environments/common/inventory/group_vars/all/monitoring.yml

environments/common/inventory/group_vars/all/defaults.yml

docs/monitoring-and-logging.md

ansible/roles/kube_prometheus_stack/tasks/install.yml

ansible/fatimage.yml

ansible/roles/kube_prometheus_stack/defaults/main/install.yml


sjpb · 2024-11-15T13:23:03Z

@wtripp180901 not a high priority but would be nice to know if this PR reduces the size of the data in the image. And/or whether we can reduce the required root disk size at all - which isn't the same thing, b/c e.g. dnf caches which we throw away require additional size during build.

I think you'd need qemu-img info to see the former. And monitoring disk usage during build to see the latter.
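
A minimal sketch of the first measurement, assuming `qemu-img` is on the PATH and the image is a local file. It reads the `virtual-size` and `actual-size` fields from `qemu-img info --output=json`; the function names are hypothetical.

```python
import json
import subprocess

GIB = 1024 ** 3


def summarise_image_info(info_json: str) -> dict:
    """Extract the image format and sizes (in GiB) from qemu-img's JSON output."""
    info = json.loads(info_json)
    return {
        "format": info.get("format"),
        # Size of the disk as seen by the guest.
        "virtual_gib": round(info["virtual-size"] / GIB, 2),
        # Space the image file actually occupies on the host.
        "actual_gib": round(info["actual-size"] / GIB, 2),
    }


def image_info(path: str) -> dict:
    """Run qemu-img info on an image file and summarise the result."""
    result = subprocess.run(
        ["qemu-img", "info", "--output=json", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return summarise_image_info(result.stdout)
```

Comparing `actual_gib` for an image built before and after this PR would show whether the data in the image shrank. It says nothing about peak disk usage during the build, which needs separate monitoring.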

wtripp180901 added 30 commits September 19, 2024 13:28

Added prometheus operator role compatible with state_dir (still needs…

108fa7c

… more affinity)

Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring

8f2977c

Added node selectors for non-exporter pods

9836ef8

Added services for monitoring

d790b2b

WIP porting prometheus rolevars

10af75d

Added ingress for monitoring services

7b29a3b

Refactored + re-enabled external labels (not sure if working)

b959e92

replaced monitoring in site.yml and fixed sslip IPs

560eb96

Added slurm exporter service to k3s

0106a95

Added ood exporter to k3s

a4dca77

added grafana metrics

6081d77

fixed alertmanager status

84fd355

Dashboards now installed into k3s (dataources not configured yet)

e2d1c62

Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring

cce35a9

Added slurmstats datasource

7afdc1d

enabled ips for monitoring services (except prometheus)

b3020ca

Added grafana to state directory and made port configurable

0dff07f

grafana can now be reverse proxied by ood

f7e555b

Ported grafana rolevars

d142a9f

Added slack integration default

7fa3609

Ported alertmanager rolevars

96edb79

Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring

123c573

removed k3s ingress

b13311a

Services now exposed/proxied via nodeports

01718ee

Removed grafana servicemonitor and moved nodeports to helm config

74bd3ba

grafana admin now definable

e724b5d

Now adds additional rules correctly

9c359d9

Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring

04a5bf3

Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring

7f4862c

Merge branch 'feature/k3s-ansible-init' into feature/k3s-monitoring

2b97d32

wtripp180901 added 2 commits October 29, 2024 08:58

merge conflicts

d0c8781

merge

f23c2fc

bertiethorpe previously approved these changes Oct 29, 2024


wtripp180901 added 6 commits October 29, 2024 10:22

cleaned up dashboard role

2d16356

moved image pre-pull list to rolevar

c6b221e

doc changes + opensearch datasource now based on opensearch group

d1c915e

made kps default dashboards more configurable

b6be009

bump image up to date with main

206134d

newline

8364eb8

bumped caas minimum control node ram

603e818

wtripp180901 mentioned this pull request Nov 11, 2024

SELinux not disabled by default, causes Prometheus install to fail #432

Open

wtripp180901 linked an issue Nov 11, 2024 that may be closed by this pull request

SELinux not disabled by default, causes Prometheus install to fail #432

Open

wtripp180901 added 2 commits November 11, 2024 16:49

reduced disk footprint of container pe-pulls

c4a4847

merge

040e569

sjpb requested changes Nov 12, 2024


wtripp180901 and others added 10 commits November 12, 2024 14:00

moved image pulls to tasks

a2540f2

moved prometheus install to host group

cd281f3

Review docs suggestions

774f608

Co-authored-by: Steve Brasier <[email protected]>

Merge branch 'feature/k3s-monitoring' of github.com:stackhpc/ansible-…

80a0e21

…slurm-appliance into feature/k3s-monitoring

added readme link

6506e7e

file name and defaults changes

a6d8edc

disambiguated default addresses

5864b56

separated prometheus recording and alerting rules

15b77db

adding alertmanager docs

acf0c0d

merge

b7d9c48
