read: connection reset by peer #236

Closed
MansurEsm opened this issue Nov 10, 2022 · 12 comments

@MansurEsm

MansurEsm commented Nov 10, 2022

Hello,

I see errors in the controller's log, but the functionality still seems to work.
So my problem is mainly that errors are being written to the logs.

Details:
Kubernetes EKS 1.23
At the moment we run 75 namespaces in total (with two RoleBindings per namespace).
The deployment is configured as follows:

    - --webhook-server-port=9443
    - --metrics-addr=:8080
    - --max-reconciles=30
    - --apiserver-qps-throttle=30
    - --excluded-namespace=flux-system
    - --excluded-namespace=kube-system
    - --excluded-namespace=kube-public
    - --excluded-namespace=hnc-system
    - --excluded-namespace=kube-node-lease
    - --excluded-namespace=ingress-controller
    - --excluded-namespace=observability
    - --excluded-namespace=postgres
    - --excluded-namespace=rabbitmq
    - --enable-internal-cert-management
    - --cert-restart-on-secret-refresh
    - --included-namespace-regex=app-.*

(I already tried adjusting the throttle and reconcile settings, with no success.)

The log outputs a huge amount of entries like these:
2022-11-10T11:52:10.356933556Z {"level":"error","ts":1668081130.3564637,"msg":"http: TLS handshake error from XX.XXX.39.206:60406: EOF"}
2022-11-10T11:54:55.332004115Z {"level":"error","ts":1668081295.331847,"msg":"http: TLS handshake error from XX.XXX.39.206:46584: EOF"}
2022-11-10T11:55:33.518983284Z {"level":"error","ts":1668081333.5188336,"msg":"http: TLS handshake error from XX.XXX.39.206:52420: EOF"}
2022-11-10T11:56:07.074519506Z {"level":"error","ts":1668081367.0743582,"msg":"http: TLS handshake error from XX.XXX.39.206:34722: EOF"}
2022-11-10T11:58:05.740595523Z {"level":"error","ts":1668081485.7404268,"msg":"http: TLS handshake error from XX.XXX.39.206:49758: EOF"}
2022-11-10T11:58:05.795854044Z {"level":"error","ts":1668081485.7956684,"msg":"http: TLS handshake error from XX.XXX.39.206:49764: EOF"}
2022-11-10T11:58:06.018933577Z {"level":"error","ts":1668081486.0187643,"msg":"http: TLS handshake error from XX.XXX.39.206:49786: EOF"}
2022-11-10T11:58:06.118885718Z {"level":"error","ts":1668081486.118721,"msg":"http: TLS handshake error from XX.XXX.39.206:49792: EOF"}
2022-11-10T11:58:06.216421670Z {"level":"error","ts":1668081486.2161145,"msg":"http: TLS handshake error from XX.XXX.39.206:49818: EOF"}
2022-11-10T11:58:06.216464114Z {"level":"error","ts":1668081486.2161825,"msg":"http: TLS handshake error from XX.XXX.39.206:49806: read tcp XX.XXX.181.107:9443->XX.XXX.39.206:49806: read: connection reset by peer"}
2022-11-10T11:59:21.996177659Z {"level":"error","ts":1668081561.9958243,"msg":"http: TLS handshake error from XX.XXX.39.206:37166: EOF"}
2022-11-10T12:00:00.218649715Z {"level":"error","ts":1668081600.218468,"msg":"http: TLS handshake error from XX.XXX.140.169:33712: EOF"}
2022-11-10T12:00:41.321180183Z {"level":"error","ts":1668081641.320983,"msg":"http: TLS handshake error from XX.XXX.39.206:48920: EOF"}
2022-11-10T12:01:39.304614334Z {"level":"error","ts":1668081699.3044322,"msg":"http: TLS handshake error from XX.XXX.39.206:52170: EOF"}
2022-11-10T12:05:48.388023648Z {"level":"error","ts":1668081948.3877704,"msg":"http: TLS handshake error from XX.XXX.39.206:46100: EOF"}
2022-11-10T12:06:47.723495059Z {"level":"error","ts":1668082007.7233665,"msg":"http: TLS handshake error from XX.XXX.39.206:60902: EOF"}
2022-11-10T12:07:37.665469165Z {"level":"error","ts":1668082057.6652558,"msg":"http: TLS handshake error from XX.XXX.39.206:55668: EOF"}

I checked that HNC succeeded as follows:
k get rolebindings -n myParentNS
k get rolebindings -n oneOfTheChildNamespaces

I can see the desired RoleBindings in all namespaces, so there is actually no problem with the result.
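A slightly stricter check, in case it helps: as far as I understand HNC, propagated copies carry an inherited-from label pointing back to the source namespace, so listing that label should distinguish propagated from locally created RoleBindings (the namespace name is a placeholder):

    # Show the inherited-from label as an extra column; propagated copies should
    # list the parent namespace there, local RoleBindings should show nothing.
    kubectl get rolebindings -n oneOfTheChildNamespaces -L hnc.x-k8s.io/inherited-from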

As I understand it, the error describes a timeout somewhere, so I fear that when additional namespaces appear, those new namespaces may not get the changes HNC should apply.

Question:
How can I solve this timeout? Can I increase the timeout somewhere?

Thanks in advance for any suggestions or solutions.

@adrianludwin
Contributor

I've seen the occasional error like this reported but I've never been able to reproduce it. Do you know what's at XX.XXX.39.206?

@MansurEsm
Author

MansurEsm commented Nov 14, 2022

Yes, I have just redacted the IP address.
Sample:
{"level":"error","ts":1668430451.4085166,"msg":"http: TLS handshake error from 10.160.39.206:56810: EOF"}
{"level":"error","ts":1668430504.1096373,"msg":"http: TLS handshake error from 10.160.39.206:34218: EOF"}
2022-11-14T14:11:26.594074106Z {"level":"error","ts":1668435086.5939658,"msg":"http: TLS handshake error from 10.160.140.169:52866: EOF"}

The error appears often, roughly every 2-5 minutes.
I don't understand what IP address it is. It's not one of the nodes, and it's not any pod.
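For reference, a few ways to try to pin down such an address (using the sample address above; the aws command assumes CLI access to the cluster's VPC):

    # Does any pod or node own the address?
    kubectl get pods -A -o wide | grep 10.160.39.206
    kubectl get nodes -o wide | grep 10.160.39.206
    # On EKS the managed control plane reaches webhooks through ENIs it creates in
    # the cluster VPC, so the owning network interface is worth a look; its
    # Description field usually hints at who created it.
    aws ec2 describe-network-interfaces \
      --filters Name=addresses.private-ip-address,Values=10.160.39.206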

@adrianludwin
Contributor

Could it be the control plane (i.e. masters / apiserver)?

@adrianludwin
Contributor

The previous report of this (#49) was also on EKS.

A similar problem on a different project is on Azure: kubernetes-sigs/cluster-api-provider-azure#428

Have you seen any log messages like x509: certificate signed by unknown authority? Maybe in the apiserver logs (see that second issue for details)?

@adrianludwin
Contributor

I wonder if it has something to do with the webhooks. If you try to do something illegal - e.g., deleting a propagated object - does it go through or do you get an error? The webhooks are only there to stop you from doing the wrong thing, so if you only use HNC correctly, everything will appear to work.
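For example, something like this should normally be rejected by HNC's object webhook (the names here are made up):

    # Try to delete a RoleBinding that HNC propagated into a child namespace;
    # a healthy object webhook should deny this with an error message.
    kubectl delete rolebinding some-propagated-rolebinding -n some-child-namespace
    # If the webhook is being bypassed, the delete succeeds silently and HNC just
    # re-creates the propagated copy on its next reconcile.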

Are you using internal cert management (the default) or something like cert-manager?

@MansurEsm
Author

Hi,

Have you seen any log messages like x509: certificate signed by unknown authority? Maybe in the apiserver logs (see that second issue for details)?
Are you using internal cert management (the default) or something like cert-manager?

No certificate errors. No cert-manager.
I use this certificate configuration:

  - --enable-internal-cert-management
  - --cert-restart-on-secret-refresh

I also suspect the webhooks. I will remove them first and see what happens.
Actually, I didn't see an error when doing something illegal, but it's hard to test.

Another suspect is the AWS security group settings. I will check this as well.
I'll come back.

@MansurEsm
Author

MansurEsm commented Nov 15, 2022

Feedback:

Deleting the webhooks:

kind: MutatingWebhookConfiguration
name: namespacelabel.hnc.x-k8s.io
and
kind: ValidatingWebhookConfiguration
name: subnamespaceanchors.hnc.x-k8s.io

... made the TLS handshake errors disappear.
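In case someone wants to try the same, one way to remove a single webhook entry is a JSON patch (the index below is an assumption - inspect the live object first):

    # Find the index of the webhook entry you want to drop
    kubectl get validatingwebhookconfiguration hnc-validating-webhook-configuration -o yaml
    # Remove that entry by index (0 is only an example); the same works for the
    # mutating configuration. Re-applying the HNC manifest restores the entries.
    kubectl patch validatingwebhookconfiguration hnc-validating-webhook-configuration \
      --type=json -p='[{"op":"remove","path":"/webhooks/0"}]'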

Their configuration was as follows:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  creationTimestamp: null
  name: hnc-mutating-webhook-configuration
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: hnc-webhook-service
      namespace: hnc-system
      path: /mutate-namespace
  failurePolicy: Ignore
  name: namespacelabel.hnc.x-k8s.io
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - namespaces
  sideEffects: None
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: hnc-validating-webhook-configuration
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: hnc-webhook-service
      namespace: hnc-system
      path: /validate-hnc-x-k8s-io-v1alpha2-subnamespaceanchors
  failurePolicy: Fail
  name: subnamespaceanchors.hnc.x-k8s.io
  rules:
  - apiGroups:
    - hnc.x-k8s.io
    apiVersions:
    - v1alpha2
    operations:
    - CREATE
    - DELETE
    resources:
    - subnamespaceanchors
  sideEffects: None
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: hnc-webhook-service
      namespace: hnc-system
      path: /validate-hnc-x-k8s-io-v1alpha2-hierarchyconfigurations
  failurePolicy: Fail
  name: hierarchyconfigurations.hnc.x-k8s.io
  rules:
  - apiGroups:
    - hnc.x-k8s.io
    apiVersions:
    - v1alpha2
    operations:
    - CREATE
    - UPDATE
    resources:
    - hierarchyconfigurations
  sideEffects: None
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: hnc-webhook-service
      namespace: hnc-system
      path: /validate-objects
  failurePolicy: Fail
  name: objects.hnc.x-k8s.io
  namespaceSelector:
    matchLabels:
      hnc.x-k8s.io/included-namespace: "true"
  rules:
  - apiGroups:
    - "*"
    apiVersions:
    - "*"
    operations:
    - CREATE
    - UPDATE
    - DELETE
    resources:
    - "*"
    scope: Namespaced
  sideEffects: None
  timeoutSeconds: 4
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: hnc-webhook-service
      namespace: hnc-system
      path: /validate-hnc-x-k8s-io-v1alpha2-hncconfigurations
  failurePolicy: Fail
  name: hncconfigurations.hnc.x-k8s.io
  rules:
  - apiGroups:
    - hnc.x-k8s.io
    apiVersions:
    - v1alpha2
    operations:
    - CREATE
    - UPDATE
    - DELETE
    resources:
    - hncconfigurations
  sideEffects: None
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: hnc-webhook-service
      namespace: hnc-system
      path: /validate-v1-namespace
  failurePolicy: Fail
  name: namespaces.hnc.x-k8s.io
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - DELETE
    - CREATE
    - UPDATE
    resources:
    - namespaces
  sideEffects: None

Do you have a hint as to what could cause the issue, so I can investigate further?
I would actually prefer to keep them.

What I do in the background:

  1. I create namespaces (in another repo, via Terraform).
  2. For each namespace (via for_each), I assign k8s role_bindings (and a cluster_role).

@erikgb
Contributor

erikgb commented Nov 15, 2022

One option is to try switching to cert-manager. I've had numerous issues with the cert-rotator.

@adrianludwin
Contributor

adrianludwin commented Nov 15, 2022 via email

@mochizuki875
Member

mochizuki875 commented Jan 5, 2023

I've come across the same issue on kind.
I'm not sure, but it seems to be related to net/http.

Similar issues can be found in other projects:
kubernetes/kubernetes#109022
open-policy-agent/gatekeeper#2142

@adrianludwin
Contributor

Thanks for that, @mochizuki875. Unfortunately the problem seems to be coming from K8s itself (kubernetes/kubernetes#109022), so there's nothing we can do here.

/close

@k8s-ci-robot
Contributor

@adrianludwin: Closing this issue.

In response to this:

Thanks for that, @mochizuki875. Unfortunately the problem seems to be coming from K8s itself (kubernetes/kubernetes#109022), so there's nothing we can do here.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
