feat: add multi-cluster metrics and alerting post #86

Merged 11 commits on Sep 6, 2023

content/posts/multi-cluster-monitoring-and-alerting/index.md:
---
authors: ["mhrabovcin"]
title: "Multi-cluster monitoring and alerting"
date: 2023-09-05T12:22:45+02:00
tags: ["dkp", "multicluster", "monitoring", "alerting", "prometheus", "thanos"]
excerpt: How to integrate metrics into a Go application and how to act on those metrics.
feature_image: mitchel-boot-hOf9BaYUN88-unsplash.jpg
---

Metrics and monitoring are essential for ensuring the health and performance of Go applications. Metrics are data points
that represent the state of an application, such as CPU usage, memory usage, and requests per minute. Monitoring is the
process of collecting, storing, and analyzing metrics to identify problems and trends. By monitoring their applications,
developers can proactively identify and fix problems before they impact users.

The D2iQ Kubernetes Platform (DKP) comes with a monitoring stack consisting of [Prometheus][], which collects
metrics, [Grafana][], which visualizes and presents the metrics, [Alertmanager][], which handles acting on metrics and
sending alerts to third-party services, and [Thanos][], which provides multi-cluster functionality on top of
[Prometheus][]. These components are configured to provide monitoring and alerting for applications launched on DKP.

## How to emit metrics from a Go application

To emit metrics from a Go application, use the [Prometheus library][], which provides a number
of helpful functions. There are multiple types of metrics that an application can emit.
The metric type defines how Prometheus interprets the data.

Prometheus has four types of metrics:

* *Counters*: Counters are metrics that can only increase or reset to zero. They are typically used to track things like
the number of requests, the number of errors, and the number of bytes transferred.
* *Gauges*: Gauges are metrics that can go up and down. They are typically used to track things like the current memory
usage, the current CPU usage, and the number of concurrent connections.
* *Histograms*: Histograms are used to track the distribution of values. They are typically used to track things like the
response time of requests, the size of requests, and the number of errors.
* *Summaries*: Summaries are similar to histograms, but they also track the quantiles of the distribution. This can be
useful for identifying outliers and understanding the tail of the distribution.
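The difference between the first two types can be sketched without any library. The following is a toy illustration of the counter and gauge semantics only, not the actual client_golang API:

```go
package main

import "fmt"

// counter only ever increases or resets to zero.
type counter struct{ v float64 }

func (c *counter) Inc()   { c.v++ }
func (c *counter) Reset() { c.v = 0 }

// gauge can move in both directions.
type gauge struct{ v float64 }

func (g *gauge) Set(v float64) { g.v = v }
func (g *gauge) Add(d float64) { g.v += d }

func main() {
	var reqs counter
	reqs.Inc()
	reqs.Inc()   // counters only go up...
	reqs.Reset() // ...or back to zero (e.g. on process restart)

	var mem gauge
	mem.Set(512)
	mem.Add(-128) // gauges can decrease
	fmt.Println(reqs.v, mem.v) // 0 384
}
```

Because counters only increase, Prometheus can detect a reset (the value dropping) and still compute correct rates; a gauge carries no such guarantee, which is why the two are distinct types.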

An example Go application that exposes an `/increment` HTTP handler to increase a counter metric looks
like this:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Create a counter metric.
	requests := prometheus.NewCounter(
		prometheus.CounterOpts{
			Name: "my_app_requests_total",
			Help: "The total number of requests.",
		},
	)
	// Register the counter with Prometheus.
	prometheus.MustRegister(requests)

	http.HandleFunc("/increment", func(w http.ResponseWriter, req *http.Request) {
		requests.Inc()
		fmt.Fprintf(w, "OK")
	})
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8090", nil)
}
```

For more details, see the [Prometheus tutorial](https://prometheus.io/docs/tutorials/instrumenting_http_server_in_go/) on
how to integrate the Prometheus library and emit metrics.

## How to integrate metrics with DKP

Prometheus uses a pull model to collect metrics. This means that Prometheus actively fetches metrics from its targets,
rather than the targets pushing metrics to Prometheus. In the example above, the Prometheus handler only registers an
HTTP handler on the `/metrics` path; it does not actively push collected metrics anywhere.

When an application runs in a Pod on a Kubernetes cluster, it is usually exposed via a `Service` resource that makes
the Pod's network ports available to the rest of the cluster. The Prometheus operator that ships with DKP provides the
`ServiceMonitor` and `PodMonitor` custom resource types (CRDs), which configure a Prometheus instance to include a
particular Kubernetes service among the scrape targets from which Prometheus reads the metrics data.

When creating your own `ServiceMonitor`, it is necessary to include the `prometheus.kommander.d2iq.io/select: "true"` label on
the resource. Based on this label, the default Prometheus instance installed on DKP picks up the `ServiceMonitor`
configuration. The Prometheus operator can run multiple Prometheus instances, and the label selector is used to
associate service monitors with a particular Prometheus instance.

In the example below, the `ServiceMonitor` instructs the DKP Prometheus to scrape data from a `Service` that matches the
label `app: my-go-app`, on the port named `http` and the `/metrics` path.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-go-app-monitor
  namespace: my-namespace
  labels:
    prometheus.kommander.d2iq.io/select: "true"
spec:
  selector:
    matchLabels:
      app: my-go-app
  endpoints:
  - port: http
    interval: 1s
    path: /metrics
    scheme: http
```

To confirm that metrics are being scraped by Prometheus, visit `https://<CLUSTER_DOMAIN>/dkp/prometheus/graph` and enter
`my_app_requests_total` into the console to see the metrics.

## Alerting

Alerting is the process of notifying users when a metric or set of metrics exceeds a predefined threshold. This can be
used to proactively identify problems before they impact users. DKP comes with an [Alertmanager][] installation that can be
used to receive alerts from Prometheus and send them to a variety of notification channels, such as email, Slack, or
PagerDuty. Alertmanager itself does not create any alerts; it only propagates and routes notifications based on
routing rules. Alerts are created by Prometheus, which continuously evaluates metric values and fires alerts by calling
the Alertmanager API.

To create an alert definition, use the `PrometheusRule` resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-example-rule
  namespace: my-namespace
  labels:
    prometheus.kommander.d2iq.io/select: "true"
spec:
  groups:
  - name: my-app
    rules:
    - alert: MyAppExampleRule
      annotations:
        description: ""
        summary: Number of requests is over a threshold
      expr: |-
        my_app_requests_total > 5
      for: 1m
      labels:
        severity: critical
```

The Prometheus operator applies this configuration to the default Prometheus, which is configured to push alerts to
Alertmanager. When the number of requests goes over 5, Prometheus creates a new alert by calling the Alertmanager
HTTP API. The `PrometheusRule` can be created on the DKP management cluster or on attached clusters.

## Multi-cluster monitoring and alerting

The DKP metrics stack is configured by default to collect metrics across all managed clusters and make them available in
the central DKP management cluster. For that purpose, DKP comes with Thanos, a tool that extends Prometheus's
capabilities for collecting and storing metrics across multiple clusters. The Thanos Query component can be used to
query metrics from multiple Prometheus servers, which makes it possible to view and analyze metrics from all of the
clusters in a single place.

To make alerting possible on metrics from all clusters, it is necessary to enable the Thanos Ruler component. Thanos
Ruler is a component of Thanos that evaluates Prometheus recording and alerting rules against a chosen query API
and then sends the results directly to remote storage. It can make alerting possible across multiple clusters
by evaluating rules against a central query API that aggregates data from all of the clusters.

To enable Ruler on the DKP management cluster, add the following ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-prometheus-stack-overrides
  namespace: kommander
data:
  values.yaml: |
    thanosRuler:
      enabled: true
      thanosRulerSpec:
        queryEndpoints:
        - http://thanos-query.kommander:10902/
        alertmanagersUrl:
        - http://kube-prometheus-stack-alertmanager.kommander:9093
        ruleSelector:
          matchLabels:
            role: thanos-alerts
        routePrefix: /dkp/ruler
        externalPrefix: /dkp/ruler
      ingress:
        enabled: true
        paths:
        - /dkp/ruler
        annotations:
          traefik.ingress.kubernetes.io/router.middlewares: "kommander-forwardauth@kubernetescrd"
          traefik.ingress.kubernetes.io/router.tls: "true"
```

And apply the override to the `kube-prometheus-stack` `AppDeployment`.

```sh
cat <<EOF | kubectl patch --type merge -n kommander appdeployments kube-prometheus-stack --patch-file=/dev/stdin
spec:
  configOverrides:
    name: kube-prometheus-stack-overrides
EOF
```

The configuration above deploys the Thanos Ruler on the DKP management cluster, exposes its UI at the
`https://<CLUSTER_DOMAIN>/dkp/ruler` URL, and limits Ruler to rules that carry the `role: thanos-alerts` label.
The Ruler is configured the same way as Prometheus, using the `PrometheusRule` resource. Limiting the configuration to
a specific label makes it possible to select which rules are applied to Prometheus and which are applied to
the Thanos Ruler. The exact same `PrometheusRule` resources can now create alerts for data coming from multiple
clusters.

The final result looks like this.

![Diagram](multi-cluster-metrics-alerting.drawio.svg)

## Conclusion

Multi-cluster monitoring is an important feature of DKP because it allows you to monitor multiple Kubernetes
clusters from a single pane of glass. This can help you identify and troubleshoot problems that affect multiple
clusters, and plan for the capacity needs of your clusters. DKP gives administrators the flexibility to define and
deploy various monitoring and alerting configurations per cluster or in a single centralized location.

[Prometheus]: https://prometheus.io
[Grafana]: https://grafana.com/grafana/
[Alertmanager]: https://prometheus.io/docs/alerting/latest/alertmanager/
[Thanos]: https://thanos.io
[Prometheus library]: https://pkg.go.dev/github.com/prometheus/client_golang