Link checker #6215
Merged: 3 commits, Oct 1, 2024

86 changes: 43 additions & 43 deletions architecture-decision-record/026-Managed-Prometheus.md
@@ -16,12 +16,12 @@ Use [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/)

It's good operational practice to have good 'observability'. This includes monitoring, achieved by regularly checking the metrics, or health numbers, of the running containers. The time-series data collected can be shown as graphs or other indicators in a dashboard, and evaluated against rules which trigger alerts to the operators. Typical uses by operators include:

* to become familiar with the typical quantity of resources consumed by their software
* to be alerted to deteriorating health, so that they can fix it, before it becomes an incident
* being alerted to an incident, to be able to react quickly, not just when users flag it
* during an incident getting an at-a-glance overview of where problems exist
* after an incident to understand what went wrong, and help review the actions taken during the response
* reviewing long-term patterns of health
- becoming familiar with the typical quantity of resources their software consumes
- being alerted to deteriorating health, so they can fix it before it becomes an incident
- being alerted to an incident, so they can react quickly rather than waiting for users to flag it
- getting an at-a-glance overview of where problems exist during an incident
- understanding what went wrong after an incident, and helping review the actions taken during the response
- reviewing long-term patterns of health

### Choice of Prometheus

@@ -35,9 +35,9 @@ So overall we are happy to stick with Prometheus.

Prometheus is set up to monitor the whole of Cloud Platform, including:

* Tenant containers
* Tenant AWS resources
* Kubernetes cluster. kube-prometheus
- Tenant containers
- Tenant AWS resources
- The Kubernetes cluster itself, via kube-prometheus

Prometheus is configured to store 24 hours of data, which is enough to support most use cases. The data is also sent on to Thanos, which efficiently stores 1 year of metrics data and makes it available for queries using the same PromQL syntax.

@@ -47,18 +47,19 @@ Alertmanager uses the Prometheus data when evaluating its alert rules.

The Prometheus container has not run smoothly in recent months:

* **Performance (resolved)** - There were some serious performance issues - alert rules were taking too long to evaluate against the Prometheus data, however this was successfully alleviated by increasing the disk iops, so is not a remaining concern.
- **Performance (resolved)** - There were some serious performance issues: alert rules were taking too long to evaluate against the Prometheus data. However, this was successfully alleviated by increasing the disk IOPS, so it is not a remaining concern.

* **Custom node group** - Being a single Prometheus instance for monitoring the entire platform, it consumes a lot of resources. We've put it on a dedicated node, so it has the full resources. And it needs more memory than other nodes, which means it needs a custom node group, which is a bit of extra management overhead.
- **Custom node group** - Because it is a single Prometheus instance monitoring the entire platform, it consumes a lot of resources. We've put it on a dedicated node so it has that node's full resources, and it needs more memory than other nodes, which means it needs a custom node group and a bit of extra management overhead.

* **Scalability** - Scaling in this vertical way is not ideal - scaling up is not smooth and eventually we'll hit a limit of CPU/memory/iops. There are options to shard - see below.
- **Scalability** - Scaling vertically like this is not ideal: scaling up is not smooth, and eventually we'll hit a limit of CPU/memory/IOPS. There are options to shard - see below.

We also need to address:

* **Management overhead** - Managed cloud services are generally preferred to self-managed because the cost tends to be amortized over a large customer base and be far cheaper than in-house staff. And people with ops skills are at a premium. The management overhead is:
* for each of Prometheus, kube-prometheus
- **Management overhead** - Managed cloud services are generally preferred to self-managed because the cost tends to be amortized over a large customer base, making them far cheaper than in-house staff, and people with ops skills are at a premium. The management overhead is:

* **High availability** - We have a single instance of Prometheus, simply because we've not got round to choosing and implementing a HA arrangement yet. This risks periods of outage where we don't collect metrics data. Although the impact on the use cases is not likely to be very disruptive, there is some value in fixing this up.
- for each of Prometheus, kube-prometheus

- **High availability** - We have a single instance of Prometheus, simply because we've not yet got round to choosing and implementing an HA arrangement. This risks periods of outage during which we don't collect metrics data. Although the impact on the use cases is not likely to be very disruptive, there is some value in fixing this.

### Options for addressing the concerns

@@ -82,21 +83,20 @@ Resilience: AMP is relatively isolated against cluster issues. The data kept in

Lock-in: the configuration syntax and other interfaces are the same or similar to our existing self-hosted Prometheus, so we maintain low lock-in / migration cost.


### Existing install

The 'monitoring' namespace is configured in [components terraform](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/components.tf#L115-L138) calling the [cloud-platform-terraform-monitoring module](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring). This [installs](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring/blob/main/prometheus.tf#L88) the [kube-prometheus-stack Helm chart](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/README.md) / [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) (among other things).
The 'monitoring' namespace is configured in [components terraform](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components/components.tf#L115-L138) calling the [cloud-platform-terraform-monitoring module](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring). This [installs](https://github.com/ministryofjustice/cloud-platform-terraform-monitoring/blob/main/prometheus.tf#L88) the [kube-prometheus-stack Helm chart](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/README.md) / [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) (among other things).
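For orientation, a minimal sketch of what that module call might look like is below. This is a hedged illustration: the version ref, input names and values are assumptions for this sketch, not copied from the real components.tf.

```
# Illustrative sketch only: a monitoring module call of the kind made in
# components.tf. The ref, inputs and values here are assumed, not real.
module "monitoring" {
  source = "github.com/ministryofjustice/cloud-platform-terraform-monitoring?ref=x.y.z"

  alertmanager_slack_receivers = var.alertmanager_slack_receivers
  pagerduty_config             = var.pagerduty_config
  enable_thanos_helm_chart     = true
  cluster_domain_name          = local.cluster_base_domain_name
}
```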

[kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) contains a number of things:

* [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) - adds kubernetes-native wrappers for managing Prometheus
* CRDs for install: Prometheus, Alertmanager, Grafana, ThanosRuler
* CRDs for configuring: ServiceMonitor, PodMonitor, Probe, PrometheusRule, AlertmanagerConfig
- allows specifying monitoring targets using kubernetes labels
* Kubernetes manifests
* Grafana dashboards
* Prometheus rules
* example configs for: node_exporter, scrape targets, alerting rules for cluster issues
- [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) - adds kubernetes-native wrappers for managing Prometheus
  - CRDs for install: Prometheus, Alertmanager, Grafana, ThanosRuler
  - CRDs for configuring: ServiceMonitor, PodMonitor, Probe, PrometheusRule, AlertmanagerConfig
    - allows specifying monitoring targets using kubernetes labels
- Kubernetes manifests
- Grafana dashboards
- Prometheus rules
- example configs for: node_exporter, scrape targets, alerting rules for cluster issues

High Availability - not implemented (yet).

@@ -105,8 +105,8 @@ https://github.com/ministryofjustice/cloud-platform/issues/1749#issue-587058014

Prometheus config is held in k8s resources:

* ServiceMonitor
* PrometheusRule - alerting
- ServiceMonitor
- PrometheusRule - alerting rules (an illustrative example follows)
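For illustration, here is roughly what one of these resources looks like, written as a Terraform `kubernetes_manifest` resource to keep the examples in one language (tenants usually apply the equivalent YAML directly). All names, namespaces and labels below are made up.

```
# Hedged example of a ServiceMonitor wrapped in Terraform's kubernetes_manifest
# resource; names, namespace and labels are illustrative only.
resource "kubernetes_manifest" "example_service_monitor" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "ServiceMonitor"
    metadata = {
      name      = "my-app"
      namespace = "my-namespace"
    }
    spec = {
      selector = {
        matchLabels = { app = "my-app" }
      }
      endpoints = [
        { port = "metrics", interval = "30s" }
      ]
    }
  }
}
```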

## How it would work with AMP

@@ -122,23 +122,23 @@ Storage: - you can throw as much data at it. Instead there is a days limit of 15

Alertmanager:

* AMP has an Alertmanager-compatible option, which we'd use with the same rules
* Sending alerts would need to us to configure: create SNS topic that forwards to user Slack channels
- AMP has an Alertmanager-compatible option, which we'd use with the same rules
- Sending alerts would need us to configure an SNS topic that forwards to user Slack channels (sketched below)
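A hedged sketch of that wiring, assuming the AWS provider's `aws_prometheus_*` resources and an existing AMP workspace resource named `aws_prometheus_workspace.this`; the topic name, region and routing are illustrative, and the SNS-to-Slack forwarding itself is not shown.

```
# Illustrative only: an SNS topic for alerts plus an AMP Alertmanager
# definition that routes every alert to it. Names and region are assumed.
resource "aws_sns_topic" "prometheus_alerts" {
  name = "cloud-platform-prometheus-alerts"
}

resource "aws_prometheus_alert_manager_definition" "this" {
  workspace_id = aws_prometheus_workspace.this.id

  definition = <<-EOT
    alertmanager_config: |
      route:
        receiver: default
      receivers:
        - name: default
          sns_configs:
            - topic_arn: ${aws_sns_topic.prometheus_alerts.arn}
              sigv4:
                region: eu-west-1
  EOT
}
```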

Grafana:

* Amazon Managed Grafana has no terraform support yet so just setup in AWS console. So in the meantime we stick with self-managed Grafana, which works fine.
- Amazon Managed Grafana has no Terraform support yet, so it can only be set up in the AWS console. In the meantime we stick with self-managed Grafana, which works fine.

Prometheus web interface - previously AMP was headless, but it now comes with a web interface.

Prometheus Rules and Alerts:

* In our existing cluster:
* we get ~3500 Prometheus rules from: https://github.com/kubernetes-monitoring/kubernetes-mixin
* kube-prometheus compiles it to JSON and applies it to the cluster
* So for our new cluster:
* we need to do the same thing for our new cluster. But let's avoid using kube-prometheus. Just copy what it does.
* when we upgrade the prometheus version, we'll manually [run the jsonnet config generation](https://github.com/kubernetes-monitoring/kubernetes-mixin#generate-config-files), and paste the resulting rules into our terraform module e.g.: https://github.com/ministryofjustice/cloud-platform-terraform-amp/blob/main/example/rules.tf
- In our existing cluster:
  - we get ~3500 Prometheus rules from: https://github.com/kubernetes-monitoring/kubernetes-mixin
  - kube-prometheus compiles them to JSON and applies them to the cluster
- So for our new cluster:
  - we need to do the same thing, but let's avoid using kube-prometheus and just copy what it does
  - when we upgrade the Prometheus version, we'll manually [run the jsonnet config generation](https://github.com/kubernetes-monitoring/kubernetes-mixin#generate-config-files) and paste the resulting rules into our terraform module (sketched below), e.g.: https://github.com/ministryofjustice/cloud-platform-terraform-amp/blob/main/example/rules.tf
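A hedged sketch of how those pasted rules could then be loaded into AMP. The `aws_prometheus_rule_group_namespace` resource exists in the AWS provider, but the file path and workspace reference here are assumptions for this sketch.

```
# Illustrative only: loading generated kubernetes-mixin rules into an AMP
# rule group namespace. File name and workspace reference are assumed.
resource "aws_prometheus_rule_group_namespace" "kubernetes_mixin" {
  name         = "kubernetes-mixin"
  workspace_id = aws_prometheus_workspace.this.id
  data         = file("${path.module}/resources/prometheus-rules.yaml")
}
```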

### Still to figure out

@@ -152,8 +152,8 @@ Look at scale and costs. Ingestion: $1 for 10m samples

Prices (Ireland):

* EU-AMP:MetricSampleCount - $0.35 per 10M metric samples for the next 250B metric samples
* EU-AMP:MetricStorageByteHrs - $0.03 per GB-Mo for storage above 10GB
- EU-AMP:MetricSampleCount - $0.35 per 10M metric samples for the next 250B metric samples
- EU-AMP:MetricStorageByteHrs - $0.03 per GB-Mo for storage above 10GB
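As a rough, illustrative calculation (an assumed workload, not a quote): scraping 1 million active series every 30 seconds is roughly 86 billion samples a month, which on the $0.35 per 10M tier works out to around $3,000 a month for ingestion, before storage.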

#### Region

@@ -163,10 +163,10 @@ AMP is not released in the London region yet (at the time of writing, 3/11/21).

We should check our usage of these related components, and whether we still need them in the new cluster:

* CloudWatch exporter
* Node exporter
* ECR exporter
* Pushgateway
- CloudWatch exporter
- Node exporter
- ECR exporter
- Pushgateway

#### Showing alerts

@@ -178,4 +178,4 @@ Or maybe we can give users read-only access to the console, for their team's SNS

#### Workspace as a service?

We could offer users a Prometheus workspace to themselves - a full monitoring stack that they fully control. Just a terraform module they can run. Maybe this is better for everyone, than a centralized one, or just for some specialized users - do some comparison?
We could offer users a Prometheus workspace of their own - a full monitoring stack that they fully control, delivered as a Terraform module they can run (a hedged sketch follows). Maybe this is better than a centralized one for everyone, or just for some specialized users - do some comparison?
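A hedged sketch of what that could look like as a single module call a team runs in their own namespace's terraform; the module source, ref and inputs are assumptions, not a settled interface.

```
# Illustrative only: a per-team AMP workspace as one module call.
# The module source, ref and inputs are assumed for this sketch.
module "team_prometheus" {
  source = "github.com/ministryofjustice/cloud-platform-terraform-amp?ref=x.y.z"

  team_name            = "my-team"
  namespace            = "my-namespace"
  alert_sns_topic_name = "my-team-prometheus-alerts"
}
```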
4 changes: 2 additions & 2 deletions runbooks/source/add-concourse-to-cluster.html.md.erb
@@ -40,8 +40,8 @@ terraform plan -var "enable_oidc_associate=false"
terraform apply -var "enable_oidc_associate=false"
```

- Go to [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components).
Amend the following file and remove the count line from the [concourse module](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/components.tf#L2).
- Go to [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components).
Amend the following file and remove the count line from the [concourse module](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components/components.tf#L2).
- Apply the terraform module to your test cluster

```
2 changes: 1 addition & 1 deletion runbooks/source/add-new-receiver-alert-manager.html.md.erb
@@ -22,7 +22,7 @@ You must have the below details from the development team.

## Creating a new receiver set

1. Fill in the template with the details provided from development team and add the array to [`terraform.tfvars`](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/terraform.tfvars) file.
1. Fill in the template with the details provided by the development team and add the array (an illustrative entry is shown below) to the [`terraform.tfvars`](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components/terraform.tfvars) file.
The `terraform.tfvars` file is encrypted, so you have to run `git-crypt unlock` to view its contents.
Check the [git-crypt documentation in the user guide](https://user-guide.cloud-platform.service.justice.gov.uk/documentation/other-topics/git-crypt-setup.html#git-crypt) for more information on how to set up git-crypt.
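An illustrative shape for such an entry (the variable name and fields below are examples only; use the template and the details provided by the development team):

```
# Illustrative receiver entry in terraform.tfvars; the variable name and
# fields are examples only.
alertmanager_receivers = [
  {
    severity = "my-team-alerts"
    webhook  = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
    channel  = "#my-team-alerts"
  },
]
```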

6 changes: 3 additions & 3 deletions runbooks/source/auth0-rotation.html.md.erb
@@ -34,7 +34,7 @@ $ terraform apply

## 2) Apply changes within components (terraform)

Execute `terraform plan` inside [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components)
Execute `terraform plan` inside [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components)
to ensure the changes match the resources below; if they do, apply them:

```
@@ -63,9 +63,9 @@ In order to verify that the changes were successfully applied, follow the checkl
## 4) Update Manager cluster within components (terraform)

Our pipelines read auth0 credentials from a K8S secret inside the manager cluster. This secret is updated through concourse's TF module variables `tf_provider_auth0_client_secret` and `tf_provider_auth0_client_id` in
[cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/terraform.tfvars](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/terraform.tfvars)
[cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components/terraform.tfvars](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components/terraform.tfvars)

Switch to manager cluster and Execute `terraform plan` inside [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components)
Switch to the manager cluster and execute `terraform plan` inside the [`cloud-platform-infrastructure/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components` directory](https://github.com/ministryofjustice/cloud-platform-infrastructure/tree/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components)
to ensure the changes match the resources below; if they do, apply them:

```
2 changes: 1 addition & 1 deletion runbooks/source/container-images.html.md.erb
@@ -129,7 +129,7 @@ This depends on several factors; some of them are:
| docker.io/grafana/grafana:10.4.0 | 🟠 | v11.1.0| [v11.1.0](https://github.com/grafana/grafana/releases/tag/v11.1.0) | [60.4.0](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/Chart.yaml#L26) |
| ministryofjustice/prometheus-ecr-exporter:0.2.0 | 🟢 | managed by us | n/a | [0.4.0](https://github.com/ministryofjustice/cloud-platform-helm-charts/blob/main/prometheus-ecr-exporter/Chart.yaml#L5) |
| ghcr.io/nerdswords/yet-another-cloudwatch-exporter:v0.61.2 | 🟢 | v0.61.2 | [v0.61.2](https://github.com/nerdswords/yet-another-cloudwatch-exporter/releases) | [0.38.0](https://github.com/nerdswords/helm-charts/releases) |
| quay.io/kiwigrid/k8s-sidecar:1.26.1 | 🟢 | v1.26.4 | [v1.26.4](https://github.com/kiwigrid/k8s-sidecar/releases/tag/1.26.4) | [60.4.0](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/Chart.yaml#L26) |
| quay.io/kiwigrid/k8s-sidecar:1.26.1 | 🟢 | v1.26.2 | [v1.26.2](https://github.com/kiwigrid/k8s-sidecar/releases/tag/1.26.2) | [60.4.0](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/Chart.yaml#L26) |
| quay.io/oauth2-proxy/oauth2-proxy:v7.6.0 | 🟢 | v7.6.0 | [v7.6.0](https://github.com/oauth2-proxy/oauth2-proxy/releases/tag/v7.6.0) | [7.7.7](https://github.com/oauth2-proxy/manifests/releases/tag/oauth2-proxy-7.7.7) |
| quay.io/prometheus-operator/prometheus-config-reloader:v0.72.0 | 🟢 | v0.75.0 | [v0.75.0](https://github.com/prometheus-operator/prometheus-operator/releases/tag/v0.75.0) | [60.4.0](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/Chart.yaml#L26) |
| quay.io/prometheus-operator/prometheus-operator:v0.72.0 | 🟢 | v0.75.0 | [v0.75.0](https://github.com/prometheus-operator/prometheus-operator/releases/tag/v0.75.0) | [60.4.0](https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/Chart.yaml#L26) |
4 changes: 2 additions & 2 deletions runbooks/source/creating-a-live-like.html.md.erb
@@ -26,7 +26,7 @@ to the configuration similar to the live cluster.

## Installing live components and test applications

1. In [terraform/aws-accounts/cloud-platform-aws/vpc/eks/components] enable the following components:
1. In [terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components] enable the following components:
    * cluster_autoscaler
    * large_nodegroup
    * kibana_proxy
@@ -80,4 +80,4 @@ See documentation for upgrading a [cluster](upgrade-eks-cluster.html).

[cluster build pipeline]: https://concourse.cloud-platform.service.justice.gov.uk/teams/main/pipelines/create-cluster
[terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf]: https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf
[terraform/aws-accounts/cloud-platform-aws/vpc/eks/components]: https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components
[terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components]: https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components
2 changes: 1 addition & 1 deletion runbooks/source/delete-cluster.html.md.erb
@@ -79,7 +79,7 @@ Then, from the root of a checkout of the `cloud-platform-infrastructure` reposit
these commands to destroy all cluster components, and delete the terraform workspace:

```
$ cd terraform/aws-accounts/cloud-platform-aws/vpc/eks/components
$ cd terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components
$ terraform init
$ terraform workspace select ${cluster}
$ terraform destroy