feat(probe): add startup probe (#1420)
Signed-off-by: Wassim DHIF <[email protected]>
wdhif authored Jun 28, 2024
1 parent 4af0ec6 commit 7f4da79
Showing 14 changed files with 175 additions and 2 deletions.
4 changes: 4 additions & 0 deletions charts/datadog/CHANGELOG.md
@@ -1,5 +1,9 @@
# Datadog changelog

## 3.67.0

* Add startup probe for `Agent`, `Cluster-Agent` and `Cluster-Check-Runner`.

## 3.66.1

* Add 'datadog.namespaceAnnotationsAsTags' to assign namespace annotations as tags on pod entities in the tagger.
2 changes: 1 addition & 1 deletion charts/datadog/Chart.yaml
@@ -1,6 +1,6 @@
apiVersion: v1
name: datadog
-version: 3.66.1
+version: 3.67.0
appVersion: "7"
description: Datadog Agent
keywords:
5 changes: 4 additions & 1 deletion charts/datadog/README.md
@@ -1,6 +1,6 @@
# Datadog

-![Version: 3.66.1](https://img.shields.io/badge/Version-3.66.1-informational?style=flat-square) ![AppVersion: 7](https://img.shields.io/badge/AppVersion-7-informational?style=flat-square)
+![Version: 3.67.0](https://img.shields.io/badge/Version-3.67.0-informational?style=flat-square) ![AppVersion: 7](https://img.shields.io/badge/AppVersion-7-informational?style=flat-square)

[Datadog](https://www.datadoghq.com/) is a hosted infrastructure monitoring platform. This chart adds the Datadog Agent to all nodes in your cluster via a DaemonSet. It also optionally depends on the [kube-state-metrics chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-state-metrics). For more information about monitoring Kubernetes with Datadog, please refer to the [Datadog documentation website](https://docs.datadoghq.com/agent/basic_agent_usage/kubernetes/).

@@ -467,6 +467,7 @@ helm install <RELEASE_NAME> \
| agents.containers.agent.readinessProbe | object | Every 15s / 6 KO / 1 OK | Override default agent readiness probe settings |
| agents.containers.agent.resources | object | `{}` | Resource requests and limits for the agent container. |
| agents.containers.agent.securityContext | object | `{}` | Allows you to overwrite the default container SecurityContext for the agent container. |
| agents.containers.agent.startupProbe | object | Every 15s / 6 KO / 1 OK | Override default agent startup probe settings |
| agents.containers.initContainers.resources | object | `{}` | Resource requests and limits for the init containers |
| agents.containers.initContainers.securityContext | object | `{}` | Allows you to overwrite the default container SecurityContext for the init containers. |
| agents.containers.initContainers.volumeMounts | list | `[]` | Specify additional volumes to mount for the init containers |
@@ -611,6 +612,7 @@ helm install <RELEASE_NAME> \
| clusterAgent.revisionHistoryLimit | int | `10` | The number of old ReplicaSets to keep in this Deployment. |
| clusterAgent.securityContext | object | `{}` | Allows you to overwrite the default PodSecurityContext on the cluster-agent pods. |
| clusterAgent.shareProcessNamespace | bool | `false` | Set the process namespace sharing on the Datadog Cluster Agent |
| clusterAgent.startupProbe | object | Every 15s / 6 KO / 1 OK | Override default Cluster Agent startup probe settings |
| clusterAgent.strategy | object | `{"rollingUpdate":{"maxSurge":1,"maxUnavailable":0},"type":"RollingUpdate"}` | Allow the Cluster Agent deployment to perform a rolling update on helm update |
| clusterAgent.token | string | `""` | Cluster Agent token is a preshared key between node agents and cluster agent (autogenerated if empty, needs to be at least 32 characters a-zA-z) |
| clusterAgent.tokenExistingSecret | string | `""` | Existing secret name to use for Cluster Agent token. Put the Cluster Agent token in a key named `token` inside the Secret |
@@ -652,6 +654,7 @@ helm install <RELEASE_NAME> \
| clusterChecksRunner.resources | object | `{}` | Datadog clusterchecks-agent resource requests and limits. |
| clusterChecksRunner.revisionHistoryLimit | int | `10` | The number of old ReplicaSets to keep in this Deployment. |
| clusterChecksRunner.securityContext | object | `{}` | Allows you to overwrite the default PodSecurityContext on the clusterchecks pods. |
| clusterChecksRunner.startupProbe | object | Every 15s / 6 KO / 1 OK | Override default agent startup probe settings |
| clusterChecksRunner.strategy | object | `{"rollingUpdate":{"maxSurge":1,"maxUnavailable":0},"type":"RollingUpdate"}` | Allow the ClusterChecks deployment to perform a rolling update on helm update |
| clusterChecksRunner.tolerations | list | `[]` | Tolerations for pod assignment |
| clusterChecksRunner.topologySpreadConstraints | list | `[]` | Allow the ClusterChecks Deployment to schedule using pod topology spreading |
42 changes: 42 additions & 0 deletions charts/datadog/templates/NOTES.txt
@@ -27,6 +27,7 @@ Then run:
{{- end }}

{{- $healthPort := .Values.agents.containers.agent.healthPort }}

{{- with $liveness := .Values.agents.containers.agent.livenessProbe.httpGet }}
{{- if and $liveness.port (ne $healthPort $liveness.port) }}

@@ -37,6 +38,7 @@ Then run:
Node Agent liveness probe port ({{ $liveness.port }}) is different from the configured health port ({{ $healthPort }}).
{{- end }}
{{- end }}

{{- with $readiness := .Values.agents.containers.agent.readinessProbe.httpGet }}
{{- if and $readiness.port (ne $healthPort $readiness.port) }}

@@ -47,6 +49,18 @@ Node Agent liveness probe port ({{ $liveness.port }}) is different from the conf
Node Agent readiness probe port ({{ $readiness.port }}) is different from the configured health port ({{ $healthPort }}).
{{- end }}
{{- end }}

{{- with $startup := .Values.agents.containers.agent.startupProbe.httpGet }}
{{- if and $startup.port (ne $healthPort $startup.port) }}

##############################################################################
#### ERROR: Node Agent startup probe misconfiguration ####
##############################################################################

Node Agent startup probe port ({{ $startup.port }}) is different from the configured health port ({{ $healthPort }}).
{{- end }}
{{- end }}

{{- if eq (include "should-deploy-cluster-agent" .) "true" }}

{{- if .Values.clusterAgent.metricsProvider.enabled }}
@@ -65,6 +79,7 @@ Create an application key at https://app.datadoghq.com/account/settings#api
{{- end }}
{{- end }}
{{- $healthPort := .Values.clusterAgent.healthPort }}

{{- with $liveness := .Values.clusterAgent.livenessProbe.httpGet }}
{{- if and $liveness.port (ne $healthPort $liveness.port) }}

@@ -75,6 +90,7 @@ Create an application key at https://app.datadoghq.com/account/settings#api
Cluster Agent liveness probe port ({{ $liveness.port }}) is different from the configured health port ({{ $healthPort }}).
{{- end }}
{{- end }}

{{- with $readiness := .Values.clusterAgent.readinessProbe.httpGet }}
{{- if and $readiness.port (ne $healthPort $readiness.port) }}

@@ -85,8 +101,21 @@ Cluster Agent liveness probe port ({{ $liveness.port }}) is different from the c
Cluster Agent readiness probe port ({{ $readiness.port }}) is different from the configured health port ({{ $healthPort }}).
{{- end }}
{{- end }}

{{- with $startup := .Values.clusterAgent.startupProbe.httpGet }}
{{- if and $startup.port (ne $healthPort $startup.port) }}

##############################################################################
#### ERROR: Cluster Agent startup probe misconfiguration ####
##############################################################################

Cluster Agent startup probe port ({{ $startup.port }}) is different from the configured health port ({{ $healthPort }}).
{{- end }}
{{- end }}

{{- if (eq (include "should-enable-cluster-check-workers" .) "true") }}
{{- $healthPort := .Values.clusterChecksRunner.healthPort }}

{{- with $liveness := .Values.clusterChecksRunner.livenessProbe.httpGet }}
{{- if and $liveness.port (ne $healthPort $liveness.port) }}

@@ -97,6 +126,7 @@ Cluster Agent readiness probe port ({{ $readiness.port }}) is different from the
Cluster Checks Runner liveness probe port ({{ $liveness.port }}) is different from the configured health port ({{ $healthPort }}).
{{- end }}
{{- end }}

{{- with $readiness := .Values.clusterChecksRunner.readinessProbe.httpGet }}
{{- if and $readiness.port (ne $healthPort $readiness.port) }}

@@ -107,6 +137,18 @@ Cluster Checks Runner liveness probe port ({{ $liveness.port }}) is different fr
Cluster Checks Runner readiness probe port ({{ $readiness.port }}) is different from the configured health port ({{ $healthPort }}).
{{- end }}
{{- end }}

{{- with $startup := .Values.clusterChecksRunner.startupProbe.httpGet }}
{{- if and $startup.port (ne $healthPort $startup.port) }}

#####################################################################################
#### ERROR: Cluster Checks Runner startup probe misconfiguration ####
#####################################################################################

Cluster Checks Runner startup probe port ({{ $startup.port }}) is different from the configured health port ({{ $healthPort }}).
{{- end }}
{{- end }}

{{- end }}
{{- end }}
{{- if or .Values.datadog.apm.enabled .Values.datadog.apm.portEnabled }}
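As an illustration of the new NOTES.txt check, a values override along these lines would be expected to trigger the Node Agent startup probe message. This is a sketch and not part of the commit: the file name and the mismatched port are assumptions chosen for the example, while 5555 is the agent health port that appears in the rendered test baselines.

```yaml
# mismatch-values.yaml (hypothetical file name, for illustration only)
agents:
  containers:
    agent:
      healthPort: 5555
      startupProbe:
        httpGet:
          # Differs from healthPort, so the NOTES.txt condition
          # `ne $healthPort $startup.port` evaluates to true.
          port: 5556
```

Dropping the `httpGet` block, or setting its `port` to the same value as `healthPort`, should avoid the message, since the check only fires when a port is set and differs.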
3 changes: 3 additions & 0 deletions charts/datadog/templates/_container-agent.yaml
@@ -330,4 +330,7 @@
readinessProbe:
{{- $ready := .Values.agents.containers.agent.readinessProbe }}
{{ include "probe.http" (dict "path" "/ready" "port" $healthPort "settings" $ready) | indent 4 }}
startupProbe:
{{- $startup := .Values.agents.containers.agent.startupProbe }}
{{ include "probe.http" (dict "path" "/startup" "port" $healthPort "settings" $startup) | indent 4 }}
{{- end -}}
3 changes: 3 additions & 0 deletions charts/datadog/templates/agent-clusterchecks-deployment.yaml
@@ -227,6 +227,9 @@ spec:
readinessProbe:
{{- $ready := .Values.clusterChecksRunner.readinessProbe }}
{{ include "probe.http" (dict "settings" $ready "path" "/ready" "port" $healthPort) | indent 10 }}
startupProbe:
{{- $startup := .Values.clusterChecksRunner.startupProbe }}
{{ include "probe.http" (dict "settings" $startup "path" "/startup" "port" $healthPort) | indent 10 }}
volumes:
- name: installinfo
configMap:
3 changes: 3 additions & 0 deletions charts/datadog/templates/cluster-agent-deployment.yaml
@@ -368,6 +368,9 @@ spec:
readinessProbe:
{{- $ready := .Values.clusterAgent.readinessProbe }}
{{ include "probe.http" (dict "path" "/ready" "port" $healthPort "settings" $ready) | indent 10 }}
startupProbe:
{{- $startup := .Values.clusterAgent.startupProbe }}
{{ include "probe.http" (dict "path" "/startup" "port" $healthPort "settings" $startup) | indent 10 }}
{{- if .Values.clusterAgent.containers.clusterAgent.securityContext }}
securityContext:
{{ toYaml .Values.clusterAgent.containers.clusterAgent.securityContext | indent 10 }}
35 changes: 35 additions & 0 deletions charts/datadog/values.yaml
@@ -1249,6 +1249,15 @@ clusterAgent:
    successThreshold: 1
    failureThreshold: 6

  # clusterAgent.startupProbe -- Override default Cluster Agent startup probe settings
  # @default -- Every 15s / 6 KO / 1 OK
  startupProbe:
    initialDelaySeconds: 15
    periodSeconds: 15
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6

  # clusterAgent.strategy -- Allow the Cluster Agent deployment to perform a rolling update on helm update

  ## ref: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
@@ -1585,6 +1594,15 @@ agents:
        successThreshold: 1
        failureThreshold: 6

      # agents.containers.agent.startupProbe -- Override default agent startup probe settings
      # @default -- Every 15s / 6 KO / 1 OK
      startupProbe:
        initialDelaySeconds: 15
        periodSeconds: 15
        timeoutSeconds: 5
        successThreshold: 1
        failureThreshold: 6

      # agents.containers.agent.securityContext -- Allows you to overwrite the default container SecurityContext for the agent container.
      securityContext: {}

@@ -2028,6 +2046,23 @@ clusterChecksRunner:
    successThreshold: 1
    failureThreshold: 6

  # clusterChecksRunner.startupProbe -- Override default agent startup probe settings
  # @default -- Every 15s / 6 KO / 1 OK

  ## In case of issues with the probe, you can disable it with the
  ## following values, to allow easier investigating:
  #
  # startupProbe:
  #   exec:
  #     command: ["/bin/true"]
  #
  startupProbe:
    initialDelaySeconds: 15
    periodSeconds: 15
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6

  # clusterChecksRunner.deploymentAnnotations -- Annotations to add to the cluster-checks-runner's Deployment
  deploymentAnnotations: {}
  #   key: "value"
10 changes: 10 additions & 0 deletions test/datadog/baseline/agent-clusterchecks-deployment_default.yaml
@@ -160,6 +160,16 @@ spec:
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        startupProbe:
          failureThreshold: 6
          httpGet:
            path: /startup
            port: 5557
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
      volumes:
        - name: installinfo
          configMap:
10 changes: 10 additions & 0 deletions test/datadog/baseline/cluster-agent-deployment_default.yaml
@@ -189,6 +189,16 @@ spec:
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        startupProbe:
          failureThreshold: 6
          httpGet:
            path: /startup
            port: 5556
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
@@ -203,6 +203,16 @@ spec:
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        startupProbe:
          failureThreshold: 6
          httpGet:
            path: /startup
            port: 5556
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
@@ -199,6 +199,16 @@ spec:
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        startupProbe:
          failureThreshold: 6
          httpGet:
            path: /startup
            port: 5556
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
10 changes: 10 additions & 0 deletions test/datadog/baseline/daemonset_default.yaml
@@ -181,6 +181,16 @@ spec:
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        startupProbe:
          failureThreshold: 6
          httpGet:
            path: /startup
            port: 5555
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
      - name: trace-agent
        image: "gcr.io/datadoghq/agent:7.54.0"
        imagePullPolicy: IfNotPresent
30 changes: 30 additions & 0 deletions test/datadog/baseline/other_default.yaml
@@ -969,6 +969,16 @@ spec:
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        startupProbe:
          failureThreshold: 6
          httpGet:
            path: /startup
            port: 5555
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
      - name: trace-agent
        image: "gcr.io/datadoghq/agent:7.54.0"
        imagePullPolicy: IfNotPresent
@@ -1437,6 +1447,16 @@ spec:
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        startupProbe:
          failureThreshold: 6
          httpGet:
            path: /startup
            port: 5557
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
      volumes:
        - name: installinfo
          configMap:
@@ -1647,6 +1667,16 @@ spec:
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        startupProbe:
          failureThreshold: 6
          httpGet:
            path: /startup
            port: 5556
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
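
With the defaults added in this commit, a container has roughly `initialDelaySeconds + failureThreshold × periodSeconds` = 15 + 6 × 15 = 105 seconds to pass its startup probe before the kubelet restarts it. For a slow-starting Cluster Agent, an override such as the following sketch may help. The file name and the chosen threshold are assumptions for illustration; only keys that appear in the `values.yaml` diff are used.

```yaml
# slow-start-values.yaml (hypothetical file name, for illustration only)
clusterAgent:
  startupProbe:
    initialDelaySeconds: 15
    periodSeconds: 15
    timeoutSeconds: 5
    successThreshold: 1    # Kubernetes requires 1 for startup probes
    failureThreshold: 20   # roughly 15 + 20 * 15 = 315 seconds before a restart
```

The same keys are exposed under `agents.containers.agent.startupProbe` and `clusterChecksRunner.startupProbe`, so the equivalent override should apply to the other two components.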