Add alertmanager and alert rules for prometheus #1468

Merged
merged 1 commit into from
Aug 30, 2024
1 change: 1 addition & 0 deletions docker/batch-test.env
@@ -43,6 +43,7 @@ IPV4_IP_GRAFANA_INTERNAL=192.168.43.110
IPV4_IP_PROMETHEUS_INTERNAL=192.168.43.111
IPV4_IP_RESOLVER_INTERNAL_VALIDATING=192.168.43.112
IPV4_IP_RESOLVER_INTERNAL_PERMISSIVE=192.168.43.113
IPV4_IP_ALERTMANAGER_INTERNAL=192.168.43.114

IPV4_IP_MOCK_RESOLVER_PUBLIC=172.43.0.114
IPV6_IP_MOCK_RESOLVER_PUBLIC=fd00:43:1::114
17 changes: 17 additions & 0 deletions docker/defaults.env
@@ -100,6 +100,22 @@ INTERNET_NL_CHECK_SUPPORT_RPKI=True
# list of domainnames that can have retry timer be reset via API
INTERNETNL_CACHE_RESET_ALLOWLIST=

# settings for alertmanager, enable it by adding 'alertmanager' to COMPOSE_PROFILES
# sending email address used for alerts
ALERTMANAGER_MAIL_FROM=

# SMTP configuration for sending emails
ALERTMANAGER_SMTP_HOST=
ALERTMANAGER_SMTP_USER=
ALERTMANAGER_SMTP_PASSWORD=
ALERTMANAGER_SMTP_PORT=587

# comma separated list of email addresses to send alert emails to
ALERTMANAGER_MAIL_TO=

# set subject for alert mails to be sent, see: https://prometheus.io/docs/alerting/latest/notifications/
ALERTMANAGER_SUBJECT=Alert on host '{{ .CommonAnnotations.host }}', caused by '{{ .CommonAnnotations.summary }}'

## Settings below _may_ be changed but are best _left_ as is

# Docker Compose project name to use in case of multiple instances running on the same host
@@ -209,6 +225,7 @@ IPV4_IP_GRAFANA_INTERNAL=192.168.42.110
IPV4_IP_PROMETHEUS_INTERNAL=192.168.42.111
IPV4_IP_RESOLVER_INTERNAL_VALIDATING=192.168.42.112
IPV4_IP_RESOLVER_INTERNAL_PERMISSIVE=192.168.42.113
IPV4_IP_ALERTMANAGER_INTERNAL=192.168.42.114

IPV4_IP_MOCK_RESOLVER_PUBLIC=172.42.0.114
IPV6_IP_MOCK_RESOLVER_PUBLIC=fd00:42:1::114
65 changes: 65 additions & 0 deletions docker/docker-compose.yml
@@ -37,6 +37,7 @@ services:
      - IPV4_IP_APP_INTERNAL
      - IPV4_IP_GRAFANA_INTERNAL
      - IPV4_IP_PROMETHEUS_INTERNAL
      - IPV4_IP_ALERTMANAGER_INTERNAL
      - ENABLE_BATCH
      - LETSENCRYPT_STAGING
      - LETSENCRYPT_EMAIL
@@ -790,6 +791,8 @@ services:
    configs:
      - source: prometheus_config
        target: /prometheus.yaml
      - source: prometheus_rules_config
        target: /prometheus-rules.yaml

    restart: unless-stopped
    logging:
@@ -806,6 +809,31 @@ services:
    volumes:
      - prometheus-data:/prometheus

  alertmanager:
    image: ${DOCKER_IMAGE_PROMETHEUS:-prom/alertmanager:v0.27.0}

    command:
      - --config.file=/alertmanager.yaml
      - --web.external-url=https://$INTERNETNL_DOMAINNAME/alertmanager/
      - --cluster.listen-address=

    configs:
      - source: alertmanager_config
        target: /alertmanager.yaml

    restart: unless-stopped
    logging:
      driver: $LOGGING_DRIVER
      options:
        tag: '{{.Name}}'
    networks:
      internal:
        ipv4_address: $IPV4_IP_ALERTMANAGER_INTERNAL
      public-internet: {}

    profiles:
      - alertmanager

  postgresql-exporter:
    image: ${DOCKER_IMAGE_POSTGRESQL_EXPORTER:-prometheuscommunity/postgres-exporter:v0.12.0}

@@ -999,6 +1027,14 @@ configs:
      global:
        scrape_interval: 10s
        scrape_timeout: 5s
      rule_files:
        - /prometheus-rules.yaml
      alerting:
        alertmanagers:
          - path_prefix: /alertmanager
            static_configs:
              - targets:
                  - $IPV4_IP_ALERTMANAGER_INTERNAL:9093
      scrape_configs:
        - &scrape_config
          scheme: http
@@ -1031,6 +1067,35 @@ configs:
        - <<: *scrape_config
          job_name: nginx_logs_exporter
          static_configs: [{targets: ["nginx_logs_exporter:4040"]}]
  prometheus_rules_config:
    content: |
      groups:
        - name: End to end monitoring
          rules:
            - alert: HighTestRuntime
              expr: min(tests_test_runtime_seconds{test="site"})>=10 and max(tests_test_runtime_seconds{test="site"})>=30
              annotations:
                host: $INTERNETNL_DOMAINNAME
                summary: Tests/probes take longer to complete than expected
                dashboard: 'https://$INTERNETNL_DOMAINNAME/grafana/d/af7d1d82-c0f9-4d8d-bc03-542c4c4c75c0/periodic-tests'
  alertmanager_config:
    content: |
      global:
        smtp_from: $ALERTMANAGER_MAIL_FROM
        smtp_smarthost: $ALERTMANAGER_SMTP_HOST:$ALERTMANAGER_SMTP_PORT
        smtp_require_tls: true
        smtp_auth_username: $ALERTMANAGER_SMTP_USER
        smtp_auth_password: $ALERTMANAGER_SMTP_PASSWORD
      route:
        receiver: alerts
        routes:
          - receiver: alerts
      receivers:
        - name: alerts
          email_configs:
            - to: $ALERTMANAGER_MAIL_TO
              headers:
                subject: $ALERTMANAGER_SUBJECT

  restart_worker_cron:
    content: |
40 changes: 36 additions & 4 deletions docker/grafana/dashboards/home.json
@@ -58,7 +58,7 @@
"uid": "oUXCLhCMk"
},
"gridPos": {
"h": 15,
"h": 19,
"w": 12,
"x": 0,
"y": 1
@@ -94,7 +94,7 @@
"uid": "oUXCLhCMk"
},
"gridPos": {
"h": 15,
"h": 8,
"w": 12,
"x": 12,
"y": 1
@@ -106,7 +106,7 @@
"showLineNumbers": false,
"showMiniMap": false
},
"content": "<ul>\n<li><a target=\"_blank\" href=\"/prometheus\">/prometheus</a>\n<li><a target=\"_blank\" href=\"/prometheus/targets\">/prometheus/targets</a>\n",
"content": "<ul>\n<li><a target=\"_blank\" href=\"/prometheus\">/prometheus</a>\n<li><a target=\"_blank\" href=\"/prometheus/targets\">/prometheus/targets</a>\n<li><a target=\"_blank\" href=\"/prometheus/alerts\">/prometheus/alerts</a>\n<li><a target=\"_blank\" href=\"/prometheus/rules\">/prometheus/rules</a>\n<li><a target=\"_blank\" href=\"/alertmanager\">/alertmanager</a>\n",
"mode": "html"
},
"pluginVersion": "9.5.2",
@@ -121,6 +121,38 @@
],
"title": "Links",
"type": "text"
},
{
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
},
"gridPos": {
"h": 11,
"w": 12,
"x": 12,
"y": 9
},
"id": 10,
"options": {
"alertInstanceLabelFilter": "",
"alertName": "",
"dashboardAlerts": false,
"groupBy": [],
"groupMode": "default",
"maxItems": 20,
"sortOrder": 1,
"stateFilter": {
"error": true,
"firing": true,
"noData": false,
"normal": false,
"pending": true
},
"viewMode": "list"
},
"title": "Panel Title",
"type": "alertlist"
}
],
"refresh": "30s",
@@ -168,4 +200,4 @@
"uid": "NES71yrGz",
"version": 15,
"weekStart": ""
}
}
1 change: 1 addition & 0 deletions docker/test.env
@@ -40,6 +40,7 @@ IPV4_IP_GRAFANA_INTERNAL=192.168.43.110
IPV4_IP_PROMETHEUS_INTERNAL=192.168.43.111
IPV4_IP_RESOLVER_INTERNAL_VALIDATING=192.168.43.112
IPV4_IP_RESOLVER_INTERNAL_PERMISSIVE=192.168.43.113
IPV4_IP_ALERTMANAGER_INTERNAL=192.168.43.114

IPV4_IP_MOCK_RESOLVER_PUBLIC=172.43.0.114
IPV6_IP_MOCK_RESOLVER_PUBLIC=fd00:43:1::114
5 changes: 5 additions & 0 deletions docker/webserver/nginx_templates/app.conf.template
@@ -305,6 +305,11 @@ server {
auth_basic_user_file /etc/nginx/htpasswd/monitoring.htpasswd;
proxy_pass http://${IPV4_IP_PROMETHEUS_INTERNAL}:9090;
}
location /alertmanager {
auth_basic "Please enter your monitoring username and password";
auth_basic_user_file /etc/nginx/htpasswd/monitoring.htpasswd;
proxy_pass http://${IPV4_IP_ALERTMANAGER_INTERNAL}:9093;
}
}
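
The `/alertmanager` prefix shows up in three places in this change: the `--web.external-url` flag, the `path_prefix` in the Prometheus `alerting` section, and the nginx `location`. The sketch below illustrates how they line up, assuming Alertmanager derives its route prefix from the path of `--web.external-url` (its default when `--web.route-prefix` is not set) and that Prometheus uses the v2 alert API. The function names and the example domain are hypothetical.

```python
from urllib.parse import urlsplit


def route_prefix(external_url: str) -> str:
    """Path of --web.external-url, used by Alertmanager as its route prefix
    when --web.route-prefix is not given."""
    return "/" + urlsplit(external_url).path.strip("/")


def alert_push_url(target: str, path_prefix: str,
                   scheme: str = "http", api_version: str = "v2") -> str:
    """URL Prometheus would post alerts to for one static target.

    Assumption: the endpoint is <path_prefix>/api/<api_version>/alerts.
    """
    parts = [p.strip("/") for p in (path_prefix, f"api/{api_version}/alerts")]
    return f"{scheme}://{target}/" + "/".join(p for p in parts if p)


if __name__ == "__main__":
    # example.com stands in for $INTERNETNL_DOMAINNAME (hypothetical value)
    prefix = route_prefix("https://example.com/alertmanager/")
    print(prefix)
    # 192.168.42.114 is the default IPV4_IP_ALERTMANAGER_INTERNAL
    print(alert_push_url("192.168.42.114:9093", prefix))
```

If the prefix in `path_prefix` and the path in `--web.external-url` diverge, Prometheus would post alerts to a path Alertmanager does not serve, so the two are best changed together.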

# Temporary (1 year) exception for conn. subdomain to disable HSTS and redirect back to HTTP for
25 changes: 25 additions & 0 deletions documentation/Docker-deployment.md
@@ -353,6 +353,31 @@ To verify the health status of the critical services use these commands:

The services `webserver`, `app`, `postgres` and `redis` are critical for the user-facing HTTP frontend; no page will show if these are not running. The services `worker`, `rabbitmq`, `routinator`, `unbound`, `resolver-permissive` and `resolver-validating` are additionally required for new tests to be performed. The `beat` service is required for updating the hall of fame. For Batch Deployment, however, it is a critical service because it schedules batch tests submitted via the API.

### Alerting emails/alertmanager

A Prometheus Alertmanager service is available but disabled by default. Enabling it lets you receive alert emails whenever the periodic tests fail to complete within a reasonable time, which indicates an issue with the application.
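
As a rough illustration of what "a reasonable time" means here, the sketch below mirrors the `HighTestRuntime` rule from `docker/docker-compose.yml` in Python. The function name and the sample values are hypothetical; the thresholds of 10 and 30 seconds come from the rule expression. The real evaluation is done by Prometheus, not by this code.

```python
from typing import Iterable

# Thresholds taken from the HighTestRuntime expression:
#   min(tests_test_runtime_seconds{test="site"}) >= 10
#   and max(tests_test_runtime_seconds{test="site"}) >= 30
MIN_THRESHOLD_SECONDS = 10.0
MAX_THRESHOLD_SECONDS = 30.0


def high_test_runtime(runtimes: Iterable[float]) -> bool:
    """Return True when both halves of the rule expression match.

    `runtimes` stands in for the current values of all
    tests_test_runtime_seconds{test="site"} series (hypothetical input).
    """
    values = list(runtimes)
    if not values:
        # Without any series the expression yields nothing, so no alert.
        return False
    return (min(values) >= MIN_THRESHOLD_SECONDS
            and max(values) >= MAX_THRESHOLD_SECONDS)


if __name__ == "__main__":
    print(high_test_runtime([4.2, 35.0]))   # only one slow test
    print(high_test_runtime([12.5, 41.0]))  # all tests slow, one very slow
```

Read this way, a single slow test is not enough: every site test has to take at least 10 seconds and at least one has to reach 30 seconds before the alert fires.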

To enable and configure the Alertmanager, add the following lines to `docker/local.env` and adjust the values for your environment:

    COMPOSE_PROFILES=default,alertmanager
    ALERTMANAGER_MAIL_TO=[email protected],[email protected]
    ALERTMANAGER_MAIL_FROM=[email protected]
    ALERTMANAGER_SMTP_HOST=smtp.example.com
    ALERTMANAGER_SMTP_USER=example
    ALERTMANAGER_SMTP_PASSWORD=example

If there already is a `COMPOSE_PROFILES` entry in the configuration file, add `alertmanager` to that instead.
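
A minimal sketch of that merge, assuming `COMPOSE_PROFILES` holds a comma-separated list as Docker Compose expects. The helper name `with_profile` is hypothetical and not part of the repository.

```python
def with_profile(compose_profiles: str, profile: str = "alertmanager") -> str:
    """Return a COMPOSE_PROFILES value that contains `profile` exactly once.

    Existing entries keep their order; surrounding whitespace and empty
    entries are dropped.
    """
    profiles = [p.strip() for p in compose_profiles.split(",") if p.strip()]
    if profile not in profiles:
        profiles.append(profile)
    return ",".join(profiles)


if __name__ == "__main__":
    print(with_profile("default"))
    print(with_profile("default,alertmanager"))
```

Appending rather than replacing keeps any profiles that are already enabled, and running the helper twice leaves the value unchanged.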

The SMTP server is expected to use TLS; there is no way to disable this setting. The default port is `587`, which can be changed using the `ALERTMANAGER_SMTP_PORT` variable.
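
Before enabling the profile, it can help to confirm that the SMTP credentials work with TLS. The sketch below uses Python's standard `smtplib` and assumes the server offers STARTTLS on the configured port. The function name `check_smtp` and the `smtp_factory` parameter are hypothetical and exist only to make the check easy to exercise without a real server.

```python
import smtplib
from typing import Callable


def check_smtp(host: str, user: str, password: str, port: int = 587,
               smtp_factory: Callable[..., smtplib.SMTP] = smtplib.SMTP) -> bool:
    """Connect, upgrade the connection with STARTTLS, and log in.

    Returns True on success; smtplib raises an exception otherwise.
    """
    with smtp_factory(host, port, timeout=10) as smtp:
        smtp.starttls()             # corresponds to smtp_require_tls: true
        smtp.login(user, password)  # ALERTMANAGER_SMTP_USER / _PASSWORD
    return True


# Example call with the placeholder values from the documentation:
#   check_smtp("smtp.example.com", "example", "example")
```

This only verifies connectivity and authentication from the machine it runs on; Alertmanager itself may still be blocked by container networking or firewall rules.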

The email subject can be customized using the `ALERTMANAGER_SUBJECT` variable, see `docker/defaults.env` for details.
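
To preview what the default subject looks like, the sketch below substitutes the two annotations used in `docker/defaults.env`. It is only an approximation: Alertmanager renders subjects with Go templates, while this hypothetical `render_subject` helper handles nothing beyond `{{ .CommonAnnotations.<name> }}` placeholders, and the host value is made up.

```python
import re
from typing import Mapping

# Default value of ALERTMANAGER_SUBJECT from docker/defaults.env
DEFAULT_SUBJECT = ("Alert on host '{{ .CommonAnnotations.host }}', "
                   "caused by '{{ .CommonAnnotations.summary }}'")

_PLACEHOLDER = re.compile(r"\{\{\s*\.CommonAnnotations\.(\w+)\s*\}\}")


def render_subject(template: str, common_annotations: Mapping[str, str]) -> str:
    """Replace {{ .CommonAnnotations.<name> }} with the annotation value.

    Assumption of this sketch: unknown annotations render as an empty string.
    """
    return _PLACEHOLDER.sub(
        lambda match: common_annotations.get(match.group(1), ""), template)


if __name__ == "__main__":
    annotations = {
        "host": "example.com",  # hypothetical $INTERNETNL_DOMAINNAME
        "summary": "Tests/probes take longer to complete than expected",
    }
    print(render_subject(DEFAULT_SUBJECT, annotations))
```

The `host` and `summary` names match the annotations set on the `HighTestRuntime` rule, so a custom subject can rely on the same two fields.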

The current alert status can be seen at https://example.com/prometheus/alerts or https://example.com/alertmanager

If notification emails are not being sent even though the alert status shows red, check the Alertmanager logs for debugging:

    docker compose --project-name=internetnl-prod logs --follow alertmanager

## Restricting access

By default the installation is open to everyone. If you would like to restrict access, you can do so by using either HTTP Basic Authentication or IP allow/deny lists.