Enabling Active HealthCheck causes ClusterLoadAssignment to be empty. #5792

Open · alanprot opened this issue Oct 7, 2024 · 2 comments
Labels: t:bug Something isn't working

alanprot commented Oct 7, 2024

Describe the bug
Hi,

I'm trying to enable active health checks on a specific Mapping that uses a KubernetesEndpointResolver. After configuring the health check, some Emissary pods lose all upstream hosts for the cluster associated with the Mapping, and the hosts stay missing until something else changes in the cluster (a pod scale-up, for instance). Those pods then return 503 errors, since no upstream targets are found.

Here's the configuration I'm trying to add to the Mapping:

  health_checks:
  - unhealthy_threshold: 50
    healthy_threshold: 1
    interval: "15s"
    timeout: "10s"
    health_check:
      http:
        path: /ready
        expected_statuses:
          - max: 300
            min: 199

This issue only affects some Emissary pods. When comparing pods that experience the issue with those that work correctly, the Envoy config dump (with EDS info) reveals that the list of hosts is empty for the faulty pods:

Bad:

    {
     "endpoint_config": {
      "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
      "cluster_name": "k8s/namespace-1/service-a/443",
      "policy": {
       "overprovisioning_factor": 140
      }
     }
    },

Good:

endpoint_config": {
      "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
      "cluster_name": "k8s/namespace-1/service-a/443",
      "endpoints": [
       {
        "locality": {},
        "lb_endpoints": [
         {
          "endpoint": {
           "address": {
            "socket_address": {
             "address": "10.1.130.52",
             "port_value": 8080
            }
           },
           "health_check_config": {}
          },
          "health_status": "HEALTHY",
          "load_balancing_weight": 1
         },
         {
          "endpoint": {
           "address": {
            "socket_address": {
             "address": "10.1.165.169",
             "port_value": 8080
            }
           },
           "health_check_config": {}
          },
          "health_status": "HEALTHY",
          "load_balancing_weight": 1
         },
         {
          "endpoint": {
           "address": {
            "socket_address": {
             "address": "10.1.196.153",
             "port_value": 8080
            }
           },
           "health_check_config": {}
          },
          "health_status": "HEALTHY",
          "load_balancing_weight": 1
         }
        ]
       }
      ],
      "policy": {
       "overprovisioning_factor": 140
      }
     }
    }

I haven't been able to pinpoint why some pods end up with an empty ClusterLoadAssignment, but it looks like a race condition, possibly in the code that populates the assignment via EDS. The issue occurs randomly in different pods if I keep removing and re-adding the health-check config.

To Reproduce
Steps to reproduce the behavior:

  1. Create a Mapping that uses the KubernetesEndpointResolver, with service pointing to a k8s service. For example:
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: mapping-1
spec:
  hostname: "*"
  ambassador_id: [ emissary ]
  load_balancer:
    policy: round_robin
  prefix: /prefix
  service: https://service-a.default:443
  2. Modify the Mapping to add the health check:
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: mapping-1
spec:
  hostname: "*"
  ambassador_id: [ emissary ]
  load_balancer:
    policy: round_robin
  prefix: /prefix
  service: https://service-a.default:443
  health_checks:
  - unhealthy_threshold: 50
    healthy_threshold: 1
    interval: "15s"
    timeout: "10s"
    health_check:
      http:
        path: /ready
        expected_statuses:
          - max: 300
            min: 199
  3. Port-forward to the Emissary pod and check that the instances of the service-a.default cluster are not registered in the ClusterLoadAssignment (http://localhost:8001/config_dump?resource=&mask=&name_regex=&include_eds=on). A scripted version of this check is sketched below.
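
A minimal sketch of that check in Python, assuming the Envoy admin port has been forwarded locally (e.g. kubectl port-forward <emissary-pod> 8001:8001, where the pod name is a placeholder); it flags any EDS ClusterLoadAssignment that carries no endpoints:

import json
import urllib.request

# Envoy admin config dump, with EDS state included (assumes the
# admin port 8001 is port-forwarded to localhost).
URL = "http://localhost:8001/config_dump?include_eds=on"

with urllib.request.urlopen(URL) as resp:
    dump = json.load(resp)

# Walk the EndpointsConfigDump section and flag every
# ClusterLoadAssignment that has no "endpoints" list at all.
for section in dump.get("configs", []):
    if section.get("@type", "").endswith("EndpointsConfigDump"):
        for entry in section.get("dynamic_endpoint_configs", []):
            cla = entry.get("endpoint_config", {})
            if not cla.get("endpoints"):
                print("EMPTY assignment:", cla.get("cluster_name"))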

Expected behavior
Configuring the health check should not wipe the instances of the cluster associated with the Mapping resource.

Versions:

  • Ambassador: 3.8.3
  • Kubernetes: 1.28
@dosubot dosubot bot added the t:bug Something isn't working label Oct 7, 2024
alanprot (Author) commented Oct 8, 2024

It seems that this may have the same root cause as #4447.

When we add the health check, a new cluster object is created with the same name, which can trigger this bug...

alanprot (Author) commented Oct 8, 2024

OK, this does indeed seem to have the same root cause as #4447.

It looks like anything that changes the cluster object can trigger this bug: I kept changing connect_timeout_ms or cluster_idle_timeout_ms on the Mapping and could reproduce the problem as well, with no health check configured at all (see the example Mapping below).
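
For example, changing only the timeouts on the otherwise unchanged, health-check-free Mapping was enough; the exact values below are illustrative, chosen to match the "2s" connect timeout and "90s" idle timeout in the dump:

apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: mapping-1
spec:
  hostname: "*"
  ambassador_id: [ emissary ]
  prefix: /prefix
  service: https://service-a.default:443
  connect_timeout_ms: 2000
  cluster_idle_timeout_ms: 90000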

Cluster Object:

"cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "cluster_https___servoce-1_namespace-a_443_o-32E2014365DD7432-0",
      "type": "EDS",
      "eds_cluster_config": {
       "eds_config": {
        "ads": {},
        "resource_api_version": "V3"
       },
       "service_name": "k8s/servoce-1/namespace-a/443"
      },
      "connect_timeout": "2s",
      "dns_lookup_family": "V4_ONLY",
      "transport_socket": {
       "name": "envoy.transport_sockets.tls",
       "typed_config": {
        "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
        "common_tls_context": {}
       }
      },
      "alt_stat_name": "distributor_cortex_443",
      "common_http_protocol_options": {
       "idle_timeout": "90s"
      }
     },
     "last_updated": "2024-10-08T00:52:01.553Z"
    },
