Merge pull request #7 from rhythmictech/ENGB360-22
Monitor improvements for multi-env
kmackowick authored Oct 23, 2024
2 parents 96642b7 + 5b1cbf0 commit 4aa8a64
Showing 45 changed files with 929 additions and 406 deletions.
26 changes: 26 additions & 0 deletions README.md
@@ -27,3 +27,29 @@ module "monitor" {
```

## About

<!-- BEGIN_TF_DOCS -->
## Requirements

No requirements.

## Providers

No providers.

## Modules

No modules.

## Resources

No resources.

## Inputs

No inputs.

## Outputs

No outputs.
<!-- END_TF_DOCS -->
11 changes: 9 additions & 2 deletions aws/alb/README.md
@@ -20,7 +20,7 @@ Configures the following for ALBs based on tags matches:

| Name | Version |
|------|---------|
- | <a name="provider_datadog"></a> [datadog](#provider\_datadog) | >= 3.37 |
+ | <a name="provider_datadog"></a> [datadog](#provider\_datadog) | 3.37.0 |

## Modules

@@ -46,23 +46,26 @@ No modules.
| <a name="input_base_tags"></a> [base\_tags](#input\_base\_tags) | Base tags (key:value format) to add to this type of check (combined with `local.tags` and `var.additional_tags`, generally you should not change this) | `list(string)` | <pre>[<br> "resource:alb"<br>]</pre> | no |
| <a name="input_cost_center"></a> [cost\_center](#input\_cost\_center) | Cost Center of the monitored resource (leave blank to omit tag) | `string` | `null` | no |
| <a name="input_dashboard_link"></a> [dashboard\_link](#input\_dashboard\_link) | Dashboard link to include in message | `string` | `null` | no |
- | <a name="input_env"></a> [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | n/a | yes |
+ | <a name="input_env"></a> [env](#input\_env) | Environment the monitored resource is in (leave blank to omit tag) | `string` | `null` | no |
| <a name="input_evaluation_delay"></a> [evaluation\_delay](#input\_evaluation\_delay) | Monitor evaluation delay (see [https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#set-alert-conditions](Datadog Docs)) | `number` | `900` | no |
| <a name="input_http_5xx_responses_enabled"></a> [http\_5xx\_responses\_enabled](#input\_http\_5xx\_responses\_enabled) | Enable HTTP 5xx response monitor | `bool` | `false` | no |
| <a name="input_http_5xx_responses_evaluation_window"></a> [http\_5xx\_responses\_evaluation\_window](#input\_http\_5xx\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| <a name="input_http_5xx_responses_no_data_window"></a> [http\_5xx\_responses\_no\_data\_window](#input\_http\_5xx\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| <a name="input_http_5xx_responses_threshold_critical"></a> [http\_5xx\_responses\_threshold\_critical](#input\_http\_5xx\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
| <a name="input_http_5xx_responses_threshold_warning"></a> [http\_5xx\_responses\_threshold\_warning](#input\_http\_5xx\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
| <a name="input_http_5xx_responses_use_message"></a> [http\_5xx\_responses\_use\_message](#input\_http\_5xx\_responses\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
| <a name="input_http_5xx_tg_responses_enabled"></a> [http\_5xx\_tg\_responses\_enabled](#input\_http\_5xx\_tg\_responses\_enabled) | Enable HTTP 5xx response monitor (target group) | `bool` | `false` | no |
| <a name="input_http_5xx_tg_responses_evaluation_window"></a> [http\_5xx\_tg\_responses\_evaluation\_window](#input\_http\_5xx\_tg\_responses\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| <a name="input_http_5xx_tg_responses_no_data_window"></a> [http\_5xx\_tg\_responses\_no\_data\_window](#input\_http\_5xx\_tg\_responses\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| <a name="input_http_5xx_tg_responses_threshold_critical"></a> [http\_5xx\_tg\_responses\_threshold\_critical](#input\_http\_5xx\_tg\_responses\_threshold\_critical) | Critical threshold (percentage, 0-100) | `number` | `75` | no |
| <a name="input_http_5xx_tg_responses_threshold_warning"></a> [http\_5xx\_tg\_responses\_threshold\_warning](#input\_http\_5xx\_tg\_responses\_threshold\_warning) | Warning threshold (percentage, 0-100) | `number` | `25` | no |
| <a name="input_http_5xx_tg_responses_use_message"></a> [http\_5xx\_tg\_responses\_use\_message](#input\_http\_5xx\_tg\_responses\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
| <a name="input_latency_enabled"></a> [latency\_enabled](#input\_latency\_enabled) | Enable latency monitor | `bool` | `false` | no |
| <a name="input_latency_evaluation_window"></a> [latency\_evaluation\_window](#input\_latency\_evaluation\_window) | Evaluation window for monitor (`last_?m` (1, 5, 10, 15, or 30), `last_?h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| <a name="input_latency_no_data_window"></a> [latency\_no\_data\_window](#input\_latency\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| <a name="input_latency_threshold_critical"></a> [latency\_threshold\_critical](#input\_latency\_threshold\_critical) | Critical threshold (seconds) | `number` | `null` | no |
| <a name="input_latency_threshold_warning"></a> [latency\_threshold\_warning](#input\_latency\_threshold\_warning) | Warning threshold (seconds) | `number` | `null` | no |
| <a name="input_latency_use_message"></a> [latency\_use\_message](#input\_latency\_use\_message) | Whether to use the query alert base message | `bool` | `false` | no |
| <a name="input_monitor_exclude_tags"></a> [monitor\_exclude\_tags](#input\_monitor\_exclude\_tags) | Tags to be excluded in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| <a name="input_monitor_include_tags"></a> [monitor\_include\_tags](#input\_monitor\_include\_tags) | Tags to be included in the monitoring query. Specify in key:value format | `list(string)` | `[]` | no |
| <a name="input_new_group_delay"></a> [new\_group\_delay](#input\_new\_group\_delay) | Delay in seconds before generating alerts for a new resource | `number` | `300` | no |
@@ -71,10 +74,14 @@ No modules.
| <a name="input_no_healthy_instances_no_data_window"></a> [no\_healthy\_instances\_no\_data\_window](#input\_no\_healthy\_instances\_no\_data\_window) | No data threshold (in minutes, 0 to disable) | `number` | `10` | no |
| <a name="input_no_healthy_instances_threshold_critical"></a> [no\_healthy\_instances\_threshold\_critical](#input\_no\_healthy\_instances\_threshold\_critical) | Critical threshold (percentage) | `number` | `0` | no |
| <a name="input_no_healthy_instances_threshold_warning"></a> [no\_healthy\_instances\_threshold\_warning](#input\_no\_healthy\_instances\_threshold\_warning) | Warning threshold (percentage) | `number` | `null` | no |
| <a name="input_no_healthy_instances_use_message"></a> [no\_healthy\_instances\_use\_message](#input\_no\_healthy\_instances\_use\_message) | Whether to use the query alert base message | `bool` | `true` | no |
| <a name="input_notify_alert_override"></a> [notify\_alert\_override](#input\_notify\_alert\_override) | List of notifications for alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_crit_override"></a> [notify\_crit\_override](#input\_notify\_crit\_override) | List of notifications for 24x7 alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_default"></a> [notify\_default](#input\_notify\_default) | List of alert notifications (can be overridden based on alert type) | `list(string)` | n/a | yes |
| <a name="input_notify_no_data"></a> [notify\_no\_data](#input\_notify\_no\_data) | Alert if no matching data is found | `bool` | `false` | no |
| <a name="input_notify_nodata_override"></a> [notify\_nodata\_override](#input\_notify\_nodata\_override) | List of notifications for no data (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_nonprod_override"></a> [notify\_nonprod\_override](#input\_notify\_nonprod\_override) | List of notifications for non-prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_prod_override"></a> [notify\_prod\_override](#input\_notify\_prod\_override) | List of notifications for 12x5 prod alerts in critical threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_recovery_override"></a> [notify\_recovery\_override](#input\_notify\_recovery\_override) | List of notifications for alert recovery (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_notify_warn_override"></a> [notify\_warn\_override](#input\_notify\_warn\_override) | List of notifications for alerts in warning threshold (uses `notify_default` otherwise) | `list(string)` | `[]` | no |
| <a name="input_renotify_interval"></a> [renotify\_interval](#input\_renotify\_interval) | Interval in minutes to re-send notifications about an alert | `number` | `0` | no |
38 changes: 19 additions & 19 deletions aws/alb/main.tf
@@ -4,16 +4,16 @@ locals {
monitor_warn_default_priority = null
monitor_nodata_default_priority = null

- title_prefix = "${var.title_prefix == null ? "" : "[${var.title_prefix}]"}"
+ title_prefix = var.title_prefix == null ? "" : "[${var.title_prefix}]"
title_suffix = var.title_suffix == null ? "" : " (${var.title_suffix})"
}

resource "datadog_monitor" "http_5xx_responses" {
count = var.http_5xx_responses_enabled ? 1 : 0

name = join("", [local.title_prefix, "ALB 5xx Responses - {{loadbalancer.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.http_5xx_responses_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"

@@ -27,8 +27,8 @@ resource "datadog_monitor" "http_5xx_responses" {

query = <<END
min(${var.http_5xx_responses_evaluation_window}):
- default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {aws_account,env,loadbalancer,region}.as_rate(), 0) / (
- default(avg:aws.applicationelb.request_count${local.query_filter} by {aws_account,env,loadbalancer,region}.as_rate(), 1)
+ default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {aws_account,env,datadog_managed,loadbalancer,region}.as_rate(), 0) / (
+ default(avg:aws.applicationelb.request_count${local.query_filter} by {aws_account,env,datadog_managed,loadbalancer,region}.as_rate(), 1)
) * 100 > ${var.http_5xx_responses_threshold_critical}
END

@@ -42,8 +42,8 @@ resource "datadog_monitor" "http_5xx_tg_responses" {
count = var.http_5xx_tg_responses_enabled ? 1 : 0

name = join("", [local.title_prefix, "ALB Target Group 5xx Responses - {{loadbalancer.name}}", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ include_tags = false
+ message = var.http_5xx_tg_responses_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"

@@ -57,8 +57,8 @@ resource "datadog_monitor" "http_5xx_tg_responses" {

query = <<END
min(${var.http_5xx_tg_responses_evaluation_window}):
- default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env}.as_rate(), 0) / (
- default(avg:aws.applicationelb.request_count${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env}.as_rate(), 1)
+ default(avg:aws.applicationelb.httpcode_elb_5xx${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env,datadog_managed}.as_rate(), 0) / (
+ default(avg:aws.applicationelb.request_count${local.query_filter} by {loadbalancer,region,aws_account,targetgroup,env,datadog_managed}.as_rate(), 1)
) * 100 > ${var.http_5xx_tg_responses_threshold_critical}
END

@@ -72,9 +72,9 @@ END
resource "datadog_monitor" "latency" {
count = var.latency_enabled ? 1 : 0

- name = join("", [local.title_prefix, "{{loadbalancer.name}} ALB latency - {{value}}s ", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ALB latency - {{loadbalancer.name}} {{value}}s", local.title_suffix])
+ include_tags = false
+ message = var.latency_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"

@@ -88,7 +88,7 @@ resource "datadog_monitor" "latency" {

query = <<END
avg(${var.latency_evaluation_window}):
- default(avg:aws.applicationelb.target_response_time.average${local.query_filter} by {aws_account,env,loadbalancer,region}, 0
+ default(avg:aws.applicationelb.target_response_time.average${local.query_filter} by {aws_account,env,datadog_managed,loadbalancer,region}, 0
) > ${var.latency_threshold_critical}
END

@@ -101,9 +101,9 @@ END
resource "datadog_monitor" "no_healthy_instances" {
count = var.no_healthy_instances_enabled ? 1 : 0

- name = join("", [local.title_prefix, "{{loadbalancer.name}} ALB healthy instances is at {{value}}%", local.title_suffix])
- include_tags = true
- message = local.query_alert_base_message
+ name = join("", [local.title_prefix, "ALB available healthy instances - {{loadbalancer.name}} {{value}}%", local.title_suffix])
+ include_tags = false
+ message = var.no_healthy_instances_use_message ? local.query_alert_base_message : ""
tags = concat(local.common_tags, var.base_tags, var.additional_tags)
type = "query alert"

@@ -117,9 +117,9 @@ resource "datadog_monitor" "no_healthy_instances" {

query = <<END
min(${var.no_healthy_instances_evaluation_window}): (
- sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,region,loadbalancer} / (
- sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,region,loadbalancer} +
- sum:aws.applicationelb.un_healthy_host_count.maximum${local.query_filter} by {aws_account,env,region,loadbalancer} )
+ sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,datadog_managed,region,loadbalancer} / (
+ sum:aws.applicationelb.healthy_host_count.minimum${local.query_filter} by {aws_account,env,datadog_managed,region,loadbalancer} +
+ sum:aws.applicationelb.un_healthy_host_count.maximum${local.query_filter} by {aws_account,env,datadog_managed,region,loadbalancer} )
) * 100 <= ${var.no_healthy_instances_threshold_critical}
END
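
Each monitor in `aws/alb/main.tf` now sets its `message` through the same conditional: a boolean `*_use_message` input selects either the shared base message or an empty string. The pattern can be sketched in isolation as follows. This is a minimal illustration, not the module's code: the variable name `use_message` and the message text are hypothetical stand-ins, since the real `local.query_alert_base_message` is defined outside this diff.

```hcl
# Hypothetical stand-in for the per-monitor *_use_message inputs.
variable "use_message" {
  description = "Whether to use the query alert base message"
  type        = bool
  default     = false
}

locals {
  # Placeholder body; the module's actual base message is not shown in this diff.
  query_alert_base_message = "Example alert body"

  # Same conditional shape as the monitors above: base message when enabled, empty string otherwise.
  message = var.use_message ? local.query_alert_base_message : ""
}

output "message" {
  value = local.message
}
```

With the default of `false`, `local.message` evaluates to the empty string; passing `-var use_message=true` switches it to the placeholder body.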

32 changes: 28 additions & 4 deletions aws/alb/variables.tf
@@ -17,7 +17,7 @@ variable "base_tags" {
# HTTP 5xx Response Codes (ALB)
########################################
variable "http_5xx_responses_enabled" {
- default = false
+ default = true
description = "Enable HTTP 5xx response monitor"
type = bool
}
@@ -46,11 +46,17 @@ variable "http_5xx_responses_threshold_warning" {
type = number
}

variable "http_5xx_responses_use_message" {
description = "Whether to use the query alert base message"
type = bool
default = false
}

########################################
# HTTP 5xx Response Codes (Target Group)
########################################
variable "http_5xx_tg_responses_enabled" {
- default = false
+ default = true
description = "Enable HTTP 5xx response monitor (target group)"
type = bool
}
@@ -79,11 +85,17 @@ variable "http_5xx_tg_responses_threshold_warning" {
type = number
}

variable "http_5xx_tg_responses_use_message" {
description = "Whether to use the query alert base message"
type = bool
default = false
}

########################################
# Latency Instances
########################################
variable "latency_enabled" {
- default = false
+ default = true
description = "Enable latency monitor"
type = bool
}
@@ -101,7 +113,7 @@ variable "latency_no_data_window" {
}

variable "latency_threshold_critical" {
- default = null
+ default = 3
description = "Critical threshold (seconds)"
type = number
}
@@ -112,6 +124,12 @@ variable "latency_threshold_warning" {
type = number
}

variable "latency_use_message" {
description = "Whether to use the query alert base message"
type = bool
default = false
}

########################################
# No Healthy Instances
########################################
@@ -144,3 +162,9 @@ variable "no_healthy_instances_threshold_warning" {
description = "Warning threshold (percentage)"
type = number
}

variable "no_healthy_instances_use_message" {
description = "Whether to use the query alert base message"
type = bool
default = true
}
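
For callers, the inputs touched by this commit might be set along these lines. This is a sketch only: the module label, the relative `source` path, the `env` value, and the notification handle are hypothetical, and the module may require other inputs that are not visible in this truncated diff. The input names themselves come from the README table and `variables.tf` above.

```hcl
module "alb_monitors" {
  # Hypothetical path; adjust to wherever the aws/alb module lives relative to the caller.
  source = "../aws/alb"

  # Listed as required in the README inputs table; the handle is a placeholder.
  notify_default = ["@example-team"]

  # Optional after this commit (README shows a default of null).
  env = "prod"

  # Opt individual monitors in to the shared base message.
  http_5xx_responses_use_message    = true
  http_5xx_tg_responses_use_message = true
  latency_use_message               = true

  # Per variables.tf this defaults to true; shown here only to make the choice explicit.
  no_healthy_instances_use_message = true

  # Override the new latency default of 3 seconds if needed.
  latency_threshold_critical = 5
}
```

Because the `*_enabled` defaults for the 5xx and latency monitors change from `false` to `true` in `variables.tf`, existing callers that relied on the old defaults may want to set those flags explicitly before upgrading.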