Link checker #6215
Merged: 3 commits, Oct 1, 2024

86 changes: 43 additions & 43 deletions architecture-decision-record/026-Managed-Prometheus.md
@@ -16,12 +16,12 @@ Use [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/)

It's good operational practice to have good 'observability'. This includes monitoring, achieved by regularly checking the metrics, or health numbers, of the running containers. The time-series data collected can be shown as graphs or other indicators in a dashboard, and evaluated against rules which trigger alerts to the operators. Typical uses by operators include:

* to become familiar with the typical quantity of resources consumed by their software
* to be alerted to deteriorating health, so that they can fix it, before it becomes an incident
* being alerted to an incident, to be able to react quickly, not just when users flag it
* during an incident getting an at-a-glance overview of where problems exist
* after an incident to understand what went wrong, and help review the actions taken during the response
* reviewing long-term patterns of health
- becoming familiar with the typical quantity of resources their software consumes
- being alerted to deteriorating health, so they can fix it before it becomes an incident
- being alerted to an incident, so they can react quickly rather than waiting for users to flag it
- getting an at-a-glance overview of where problems exist during an incident
- understanding what went wrong after an incident, and helping review the actions taken during the response
- reviewing long-term patterns of health

### Choice of Prometheus

@@ -35,9 +35,9 @@ So overall we are happy to stick with Prometheus.

Prometheus is set up to monitor the whole of Cloud Platform, including:

* Tenant containers
* Tenant AWS resources
* Kubernetes cluster. kube-prometheus
- Tenant containers
- Tenant AWS resources
- The Kubernetes cluster itself, via kube-prometheus

Prometheus is configured to store 24 hours of data, which is enough to support most use cases. The data is also sent on to Thanos, which efficiently stores 1 year of metrics data and makes it available for queries using the same PromQL syntax.

@@ -47,18 +47,19 @@ Alertmanager uses the Prometheus data when evaluating its alert rules.

The Prometheus container has not run smoothly in recent months:

* **Performance (resolved)** - There were some serious performance issues - alert rules were taking too long to evaluate against the Prometheus data, however this was successfully alleviated by increasing the disk iops, so is not a remaining concern.
- **Performance (resolved)** - There were some serious performance issues: alert rules were taking too long to evaluate against the Prometheus data. However, this was successfully alleviated by increasing the disk IOPS, so it is not a remaining concern.

* **Custom node group** - Being a single Prometheus instance for monitoring the entire platform, it consumes a lot of resources. We've put it on a dedicated node, so it has the full resources. And it needs more memory than other nodes, which means it needs a custom node group, which is a bit of extra management overhead.
- **Custom node group** - Because it is a single Prometheus instance monitoring the entire platform, it consumes a lot of resources. We've put it on a dedicated node so it has that node's full resources, and it needs more memory than other nodes, which means it needs a custom node group and a bit of extra management overhead.

* **Scalability** - Scaling in this vertical way is not ideal - scaling up is not smooth and eventually we'll hit a limit of CPU/memory/iops. There are options to shard - see below.
- **Scalability** - Scaling vertically like this is not ideal: scaling up is not smooth, and eventually we'll hit a limit of CPU/memory/IOPS. There are options to shard - see below.

We also need to address:

* **Management overhead** - Managed cloud services are generally preferred to self-managed because the cost tends to be amortized over a large customer base and be far cheaper than in-house staff. And people with ops skills are at a premium. The management overhead is:
* for each of Prometheus, kube-prometheus
- **Management overhead** - Managed cloud services are generally preferred to self-managed because the cost tends to be amortized over a large customer base, making them far cheaper than in-house staff, and people with ops skills are at a premium. The management overhead is:

* **High availability** - We have a single instance of Prometheus, simply because we've not got round to choosing and implementing a HA arrangement yet. This risks periods of outage where we don't collect metrics data. Although the impact on the use cases is not likely to be very disruptive, there is some value in fixing this up.
- for each of Prometheus, kube-prometheus

- **High availability** - We have a single instance of Prometheus, simply because we've not yet got round to choosing and implementing an HA arrangement. This risks periods of outage during which we don't collect metrics data. Although the impact on the use cases is not likely to be very disruptive, there is some value in fixing this.

### Options for addressing the concerns

@@ -82,21 +83,20 @@ Resilience: AMP is relatively isolated against cluster issues. The data kept in

Lock-in: the configuration syntax and other interfaces are the same or similar to our existing self-hosted Prometheus, so we maintain low lock-in / migration cost.


### Existing install

The 'monitoring' namespace is configured in [components terraform](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/components.tf#L115-L138) calling the [cloud-platform-terraform-monitoring module](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring). This [installs](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring/blob/main/prometheus.tf#L88) the [kube-prometheus-stack Helm chart](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/README.md) / [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) (among other things).
The 'monitoring' namespace is configured in [components terraform](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components/components.tf#L115-L138) calling the [cloud-platform-terraform-monitoring module](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring). This [installs](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring/blob/main/prometheus.tf#L88) the [kube-prometheus-stack Helm chart](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/README.md) / [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) (among other things).
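For orientation, a minimal sketch of what that module call might look like is below. This is a hedged illustration: the version ref, input names and values are assumptions for this sketch, not copied from the real components.tf.

```
# Illustrative sketch only: a monitoring module call of the kind made in
# components.tf. The ref, inputs and values here are assumed, not real.
module "monitoring" {
  source = "github.com/ministryofjustice/cloud-platform-terraform-monitoring?ref=x.y.z"

  alertmanager_slack_receivers = var.alertmanager_slack_receivers
  pagerduty_config             = var.pagerduty_config
  enable_thanos_helm_chart     = true
  cluster_domain_name          = local.cluster_base_domain_name
}
```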

[kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) contains a number of things:

* [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) - adds kubernetes-native wrappers for managing Prometheus
* CRDs for install: Prometheus, Alertmanager, Grafana, ThanosRuler
* CRDs for configuring: ServiceMonitor, PodMonitor, Probe, PrometheusRule, AlertmanagerConfig
- allows specifying monitoring targets using kubernetes labels
* Kubernetes manifests
* Grafana dashboards
* Prometheus rules
* example configs for: node_exporter, scrape targets, alerting rules for cluster issues
- [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) - adds kubernetes-native wrappers for managing Prometheus
  - CRDs for install: Prometheus, Alertmanager, Grafana, ThanosRuler
  - CRDs for configuring: ServiceMonitor, PodMonitor, Probe, PrometheusRule, AlertmanagerConfig
    - allows specifying monitoring targets using kubernetes labels
- Kubernetes manifests
- Grafana dashboards
- Prometheus rules
- example configs for: node_exporter, scrape targets, alerting rules for cluster issues

High Availability - not implemented (yet).

@@ -105,8 +105,8 @@ https://github.com/ministryofjustice/cloud-platform/issues/1749#issue-587058014

Prometheus config is held in k8s resources:

* ServiceMonitor
* PrometheusRule - alerting
- ServiceMonitor
- PrometheusRule - alerting rules (an illustrative example follows)
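For illustration, here is roughly what one of these resources looks like, written as a Terraform `kubernetes_manifest` resource to keep the examples in one language (tenants usually apply the equivalent YAML directly). All names, namespaces and labels below are made up.

```
# Hedged example of a ServiceMonitor wrapped in Terraform's kubernetes_manifest
# resource; names, namespace and labels are illustrative only.
resource "kubernetes_manifest" "example_service_monitor" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "ServiceMonitor"
    metadata = {
      name      = "my-app"
      namespace = "my-namespace"
    }
    spec = {
      selector = {
        matchLabels = { app = "my-app" }
      }
      endpoints = [
        { port = "metrics", interval = "30s" }
      ]
    }
  }
}
```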

## How it would work with AMP

@@ -122,23 +122,23 @@ Storage: - you can throw as much data at it. Instead there is a days limit of 15

Alertmanager:

* AMP has an Alertmanager-compatible option, which we'd use with the same rules
* Sending alerts would need to us to configure: create SNS topic that forwards to user Slack channels
- AMP has an Alertmanager-compatible option, which we'd use with the same rules
- Sending alerts would need us to configure an SNS topic that forwards to user Slack channels (sketched below)
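A hedged sketch of that wiring, assuming the AWS provider's `aws_prometheus_*` resources and an existing AMP workspace resource named `aws_prometheus_workspace.this`; the topic name, region and routing are illustrative, and the SNS-to-Slack forwarding itself is not shown.

```
# Illustrative only: an SNS topic for alerts plus an AMP Alertmanager
# definition that routes every alert to it. Names and region are assumed.
resource "aws_sns_topic" "prometheus_alerts" {
  name = "cloud-platform-prometheus-alerts"
}

resource "aws_prometheus_alert_manager_definition" "this" {
  workspace_id = aws_prometheus_workspace.this.id

  definition = <<-EOT
    alertmanager_config: |
      route:
        receiver: default
      receivers:
        - name: default
          sns_configs:
            - topic_arn: ${aws_sns_topic.prometheus_alerts.arn}
              sigv4:
                region: eu-west-1
  EOT
}
```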

Grafana:

* Amazon Managed Grafana has no terraform support yet so just setup in AWS console. So in the meantime we stick with self-managed Grafana, which works fine.
- Amazon Managed Grafana has no Terraform support yet, so it can only be set up in the AWS console. In the meantime we stick with self-managed Grafana, which works fine.

Prometheus web interface - previously AMP was headless, but it now comes with a web interface.

Prometheus Rules and Alerts:

* In our existing cluster:
* we get ~3500 Prometheus rules from: https://github.com/kubernetes-monitoring/kubernetes-mixin
* kube-prometheus compiles it to JSON and applies it to the cluster
* So for our new cluster:
* we need to do the same thing for our new cluster. But let's avoid using kube-prometheus. Just copy what it does.
* when we upgrade the prometheus version, we'll manually [run the jsonnet config generation](https://github.com/kubernetes-monitoring/kubernetes-mixin#generate-config-files), and paste the resulting rules into our terraform module e.g.: https://github.com/ministryofjustice/cloud-platform-terraform-amp/blob/main/example/rules.tf
- In our existing cluster:
  - we get ~3500 Prometheus rules from: https://github.com/kubernetes-monitoring/kubernetes-mixin
  - kube-prometheus compiles them to JSON and applies them to the cluster
- So for our new cluster:
  - we need to do the same thing, but let's avoid using kube-prometheus and just copy what it does
  - when we upgrade the Prometheus version, we'll manually [run the jsonnet config generation](https://github.com/kubernetes-monitoring/kubernetes-mixin#generate-config-files) and paste the resulting rules into our terraform module (sketched below), e.g.: https://github.com/ministryofjustice/cloud-platform-terraform-amp/blob/main/example/rules.tf
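A hedged sketch of how those pasted rules could then be loaded into AMP. The `aws_prometheus_rule_group_namespace` resource exists in the AWS provider, but the file path and workspace reference here are assumptions for this sketch.

```
# Illustrative only: loading generated kubernetes-mixin rules into an AMP
# rule group namespace. File name and workspace reference are assumed.
resource "aws_prometheus_rule_group_namespace" "kubernetes_mixin" {
  name         = "kubernetes-mixin"
  workspace_id = aws_prometheus_workspace.this.id
  data         = file("${path.module}/resources/prometheus-rules.yaml")
}
```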

### Still to figure out

@@ -152,8 +152,8 @@ Look at scale and costs. Ingestion: $1 for 10m samples

Prices (Ireland):

* EU-AMP:MetricSampleCount - $0.35 per 10M metric samples for the next 250B metric samples
* EU-AMP:MetricStorageByteHrs - $0.03 per GB-Mo for storage above 10GB
- EU-AMP:MetricSampleCount - $0.35 per 10M metric samples for the next 250B metric samples
- EU-AMP:MetricStorageByteHrs - $0.03 per GB-Mo for storage above 10GB
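As a rough, illustrative calculation (an assumed workload, not a quote): scraping 1 million active series every 30 seconds is roughly 86 billion samples a month, which on the $0.35 per 10M tier works out to around $3,000 a month for ingestion, before storage.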

#### Region

@@ -163,10 +163,10 @@ AMP is not released in the London region yet (at the time of writing, 3/11/21).

We should check our usage of these related components, and whether we still need them in the new cluster:

* CloudWatch exporter
* Node exporter
* ECR exporter
* Pushgateway
- CloudWatch exporter
- Node exporter
- ECR exporter
- Pushgateway

#### Showing alerts

@@ -178,4 +178,4 @@ Or maybe we can give users read-only access to the console, for their team's SNS

#### Workspace as a service?

We could offer users a Prometheus workspace to themselves - a full monitoring stack that they fully control. Just a terraform module they can run. Maybe this is better for everyone, than a centralized one, or just for some specialized users - do some comparison?
We could offer users a Prometheus workspace of their own - a full monitoring stack that they fully control, delivered as a Terraform module they can run (a hedged sketch follows). Maybe this is better than a centralized one for everyone, or just for some specialized users - do some comparison?
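A hedged sketch of what that could look like as a single module call a team runs in their own namespace's terraform; the module source, ref and inputs are assumptions, not a settled interface.

```
# Illustrative only: a per-team AMP workspace as one module call.
# The module source, ref and inputs are assumed for this sketch.
module "team_prometheus" {
  source = "github.com/ministryofjustice/cloud-platform-terraform-amp?ref=x.y.z"

  team_name            = "my-team"
  namespace            = "my-namespace"
  alert_sns_topic_name = "my-team-prometheus-alerts"
}
```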
4 changes: 2 additions & 2 deletions runbooks/source/add-concourse-to-cluster.html.md.erb
@@ -40,8 +40,8 @@ terraform plan -var "enable_oidc_associate=false"
terraform apply -var "enable_oidc_associate=false"
```

- Go to [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components).
Amend the following file and remove the count line from the [concourse module](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/components.tf#L2).
- Go to [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components).
Amend the following file and remove the count line from the [concourse module](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components/components.tf#L2).
- Apply the terraform module to your test cluster

```
2 changes: 1 addition & 1 deletion runbooks/source/add-new-receiver-alert-manager.html.md.erb
@@ -22,7 +22,7 @@ You must have the below details from the development team.

## Creating a new receiver set

1. Fill in the template with the details provided from development team and add the array to [`terraform.tfvars`](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/terraform.tfvars) file.
1. Fill in the template with the details provided by the development team and add the array (an illustrative entry is shown below) to the [`terraform.tfvars`](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components/terraform.tfvars) file.
The `terraform.tfvars` file is encrypted, so you have to run `git-crypt unlock` to view its contents.
Check the [git-crypt documentation in the user guide](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/other-topics/git-crypt-setup.html#git-crypt) for more information on how to set up git-crypt.
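An illustrative shape for such an entry (the variable name and fields below are examples only; use the template and the details provided by the development team):

```
# Illustrative receiver entry in terraform.tfvars; the variable name and
# fields are examples only.
alertmanager_receivers = [
  {
    severity = "my-team-alerts"
    webhook  = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
    channel  = "#my-team-alerts"
  },
]
```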

6 changes: 3 additions & 3 deletions runbooks/source/auth0-rotation.html.md.erb
@@ -34,7 +34,7 @@ $ terraform apply

## 2) Apply changes within components (terraform)

Execute `terraform plan` inside [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components)
Execute `terraform plan` inside [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components)
to ensure the changes match the resources below; if they do, apply them:

```
@@ -63,9 +63,9 @@ In order to verify that the changes were successfully applied, follow the checkl
## 4) Update Manager cluster within components (terraform)

Our pipelines read auth0 credentials from a K8S secret inside the manager cluster. This secret is updated through concourse's TF module variables `tf_provider_auth0_client_secret` and `tf_provider_auth0_client_id` in
[cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/terraform.tfvars](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/terraform.tfvars)
[cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components/terraform.tfvars](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components/terraform.tfvars)

Switch to manager cluster and Execute `terraform plan` inside [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components)
Switch to the manager cluster and execute `terraform plan` inside the [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components)
to ensure the changes match the resources below; if they do, apply them:

```
2 changes: 1 addition & 1 deletion runbooks/source/container-images.html.md.erb
@@ -129,7 +129,7 @@ This depends on several factors; some of them are:
| docker.io/grafana/grafana:10.4.0 | 🟠 | v11.1.0| [v11.1.0](https://github.com/grafana/grafana/releases/tag/v11.1.0) | [60.4.0](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/Chart.yaml#L26) |
| ministryofjustice/prometheus-ecr-exporter:0.2.0 | 🟢 | managed by us | n/a | [0.4.0](https://github.com/ministryofjustice/cloud-platform-helm-charts/blob/main/prometheus-ecr-exporter/Chart.yaml#L5) |
| ghcr.io/nerdswords/yet-another-cloudwatch-exporter:v0.61.2 | 🟢 | v0.61.2 | [v0.61.2](https://github.com/nerdswords/yet-another-cloudwatch-exporter/releases) | [0.38.0](https://github.com/nerdswords/helm-charts/releases) |
| quay.io/kiwigrid/k8s-sidecar:1.26.1 | 🟢 | v1.26.4 | [v1.26.4](https://github.com/kiwigrid/k8s-sidecar/releases/tag/1.26.4) | [60.4.0](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/Chart.yaml#L26) |
| quay.io/kiwigrid/k8s-sidecar:1.26.1 | 🟢 | v1.26.2 | [v1.26.2](https://github.com/kiwigrid/k8s-sidecar/releases/tag/1.26.2) | [60.4.0](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/Chart.yaml#L26) |
| quay.io/oauth2-proxy/oauth2-proxy:v7.6.0 | 🟢 | v7.6.0 | [v7.6.0](https://github.com/oauth2-proxy/oauth2-proxy/releases/tag/v7.6.0) | [7.7.7](https://github.com/oauth2-proxy/manifests/releases/tag/oauth2-proxy-7.7.7) |
| quay.io/prometheus-operator/prometheus-config-reloader:v0.72.0 | 🟢 | v0.75.0 | [v0.75.0](https://github.com/prometheus-operator/prometheus-operator/releases/tag/v0.75.0) | [60.4.0](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/Chart.yaml#L26) |
| quay.io/prometheus-operator/prometheus-operator:v0.72.0 | 🟢 | v0.75.0 | [v0.75.0](https://github.com/prometheus-operator/prometheus-operator/releases/tag/v0.75.0) | [60.4.0](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/Chart.yaml#L26) |
4 changes: 2 additions & 2 deletions runbooks/source/creating-a-live-like.html.md.erb
@@ -26,7 +26,7 @@ to the configuration similar to the live cluster.

## Installing live components and test applications

1. In [terraform/aws-accounts/cloud-platform-aws/vpc/eks/components] enable the following components:
1. In [terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components] enable the following components:
    * cluster_autoscaler
    * large_nodegroup
    * kibana_proxy
@@ -80,4 +80,4 @@ See documentation for upgrading a [cluster](upgrade-eks-cluster.html).

[cluster build pipeline]: https://concourse.cloud-platform.service.justice.gov.uk/teams/main/pipelines/create-cluster
[terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf]: https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf
[terraform/aws-accounts/cloud-platform-aws/vpc/eks/components]: https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components
[terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components]: https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components
2 changes: 1 addition & 1 deletion runbooks/source/delete-cluster.html.md.erb
@@ -79,7 +79,7 @@ Then, from the root of a checkout of the `cloud-platform-infrastructure` reposit
these commands to destroy all cluster components, and delete the terraform workspace:

```
$ cd terraform/aws-accounts/cloud-platform-aws/vpc/eks/components
$ cd terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components
$ terraform init
$ terraform workspace select ${cluster}
$ terraform destroy