Enabling Active HealthCheck causes ClusterLoadAssignment to be empty. #5792

Open · alanprot opened this issue Oct 7, 2024 · 2 comments
Labels: t:bug Something isn't working

alanprot commented Oct 7, 2024

Describe the bug
Hi,

I'm trying to enable active health checks on a specific Mapping that uses a KubernetesEndpointResolver. After configuring the health check, some Emissary pods lose all upstream hosts for the cluster associated with the Mapping, and the hosts stay missing until something else changes in the cluster (a pod scale-up, for instance). Those pods then return 503 errors, since no upstream targets are found.

Here's the configuration I'm trying to add to the Mapping:

  health_checks:
  - unhealthy_threshold: 50
    healthy_threshold: 1
    interval: "15s"
    timeout: "10s"
    health_check:
      http:
        path: /ready
        expected_statuses:
          - max: 300
            min: 199

This issue only affects some Emissary pods. When comparing pods that experience the issue with those that work correctly, the Envoy config dump (with EDS info) reveals that the list of hosts is empty for the faulty pods:

Bad:

    {
     "endpoint_config": {
      "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
      "cluster_name": "k8s/namespace-1/service-a/443",
      "policy": {
       "overprovisioning_factor": 140
      }
     }
    },

Good:

endpoint_config": {
      "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
      "cluster_name": "k8s/namespace-1/service-a/443",
      "endpoints": [
       {
        "locality": {},
        "lb_endpoints": [
         {
          "endpoint": {
           "address": {
            "socket_address": {
             "address": "10.1.130.52",
             "port_value": 8080
            }
           },
           "health_check_config": {}
          },
          "health_status": "HEALTHY",
          "load_balancing_weight": 1
         },
         {
          "endpoint": {
           "address": {
            "socket_address": {
             "address": "10.1.165.169",
             "port_value": 8080
            }
           },
           "health_check_config": {}
          },
          "health_status": "HEALTHY",
          "load_balancing_weight": 1
         },
         {
          "endpoint": {
           "address": {
            "socket_address": {
             "address": "10.1.196.153",
             "port_value": 8080
            }
           },
           "health_check_config": {}
          },
          "health_status": "HEALTHY",
          "load_balancing_weight": 1
         }
        ]
       }
      ],
      "policy": {
       "overprovisioning_factor": 140
      }
     }
    }

I haven't been able to pinpoint why some pods end up with an empty ClusterLoadAssignment, but it looks like a race condition, possibly in the code that populates the assignment via EDS. The issue occurs randomly in different pods if I keep removing and re-adding the health-check config.

To Reproduce
Steps to reproduce the behavior:

  1. Create a Mapping that uses the KubernetesEndpointResolver, with service pointing to a k8s service. For example:
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: mapping-1
spec:
  hostname: "*"
  ambassador_id: [ emissary ]
  load_balancer:
    policy: round_robin
  prefix: /prefix
  service: https://service-a.default:443
  2. Modify the Mapping to add the health check:
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: mapping-1
spec:
  hostname: "*"
  ambassador_id: [ emissary ]
  load_balancer:
    policy: round_robin
  prefix: /prefix
  service: https://service-a.default:443
  health_checks:
  - unhealthy_threshold: 50
    healthy_threshold: 1
    interval: "15s"
    timeout: "10s"
    health_check:
      http:
        path: /ready
        expected_statuses:
          - max: 300
            min: 199
  3. Port-forward to the Emissary pod and check that the instances of the service-a.default cluster are not registered in the ClusterLoadAssignment (http://localhost:8001/config_dump?resource=&mask=&name_regex=&include_eds=on). A scripted version of this check is sketched below.
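
A minimal sketch of that check in Python, assuming the Envoy admin port has been forwarded locally (e.g. kubectl port-forward <emissary-pod> 8001:8001, where the pod name is a placeholder); it flags any EDS ClusterLoadAssignment that carries no endpoints:

import json
import urllib.request

# Envoy admin config dump, with EDS state included (assumes the
# admin port 8001 is port-forwarded to localhost).
URL = "http://localhost:8001/config_dump?include_eds=on"

with urllib.request.urlopen(URL) as resp:
    dump = json.load(resp)

# Walk the EndpointsConfigDump section and flag every
# ClusterLoadAssignment that has no "endpoints" list at all.
for section in dump.get("configs", []):
    if section.get("@type", "").endswith("EndpointsConfigDump"):
        for entry in section.get("dynamic_endpoint_configs", []):
            cla = entry.get("endpoint_config", {})
            if not cla.get("endpoints"):
                print("EMPTY assignment:", cla.get("cluster_name"))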

Expected behavior
Configuring the health check should not wipe the instances of the cluster associated with the Mapping resource.

Versions:

  • Ambassador: 3.8.3
  • Kubernetes: 1.28
@dosubot dosubot bot added the t:bug Something isn't working label Oct 7, 2024
alanprot (Author) commented Oct 8, 2024

It seems that this may have the same root cause as #4447.

When we add the health check, a new cluster object is created with the same name, which can trigger this bug...

alanprot (Author) commented Oct 8, 2024

OK, this does indeed seem to have the same root cause as #4447.

It looks like anything that changes the cluster object can trigger this bug: I kept changing connect_timeout_ms or cluster_idle_timeout_ms on the Mapping and could reproduce the problem as well, with no health check configured at all (see the example Mapping below).
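
For example, changing only the timeouts on the otherwise unchanged, health-check-free Mapping was enough; the exact values below are illustrative, chosen to match the "2s" connect timeout and "90s" idle timeout in the dump:

apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: mapping-1
spec:
  hostname: "*"
  ambassador_id: [ emissary ]
  prefix: /prefix
  service: https://service-a.default:443
  connect_timeout_ms: 2000
  cluster_idle_timeout_ms: 90000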

Cluster Object:

"cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "cluster_https___servoce-1_namespace-a_443_o-32E2014365DD7432-0",
      "type": "EDS",
      "eds_cluster_config": {
       "eds_config": {
        "ads": {},
        "resource_api_version": "V3"
       },
       "service_name": "k8s/servoce-1/namespace-a/443"
      },
      "connect_timeout": "2s",
      "dns_lookup_family": "V4_ONLY",
      "transport_socket": {
       "name": "envoy.transport_sockets.tls",
       "typed_config": {
        "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
        "common_tls_context": {}
       }
      },
      "alt_stat_name": "distributor_cortex_443",
      "common_http_protocol_options": {
       "idle_timeout": "90s"
      }
     },
     "last_updated": "2024-10-08T00:52:01.553Z"
    },
