Error draining node #302

Open

preflightsiren opened this issue Aug 25, 2021 · 8 comments

@preflightsiren

Is this a BUG REPORT or FEATURE REQUEST?: Bug report

What happened: Something very similar to #225.
During a change to an instanceGroup, a RollingUpgrade is created. The RollingUpgrade correctly identifies the affected EC2 instances, attempts to drain the node, and fails with an error. Deleting the RollingUpgrade and restarting the process consistently fails in the same way.

Executing the same drain, e.g. kubectl drain <node> --delete-local-data --ignore-daemonsets, works as expected from a local shell.

What you expected to happen: kubectl drain evicts the pods as normal.

How to reproduce it (as minimally and precisely as possible): Not sure how to reproduce it outside of my environment, but seems to affect certain instance groups consistently.

Anything else we need to know?: The error appears to be produced immediately (about 1 second after the drain messages).
There are no PDBs in use on this cluster; all pods have terminationGracePeriodSeconds=30.

Environment:

  • rolling-upgrade-controller version 1.0.2
  • Kubernetes version: 1.19.12

Other debugging information (if applicable):

  • RollingUpgrade status:
$ kubectl describe rollingupgrade <rollingupgrade-name>
  • controller logs:
$ kubectl logs <rolling-upgrade-controller pod>
2021-08-25T05:20:08.849Z    INFO    controllers.RollingUpgrade    scaling group details    {"scalingGroup": "ew1p-fcp-k2-xxxxxxxx-sh-r5-2xlarge-eu-west-1a", "desiredInstances": 17, "launchConfig": "", "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z    INFO    controllers.RollingUpgrade    checking if rolling upgrade is completed    {"name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z    INFO    controllers.RollingUpgrade    drift detected in scaling group    {"driftedInstancesCount/DesiredInstancesCount": "(10/17)", "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z    INFO    controllers.RollingUpgrade    selecting batch for rotation    {"batch size": 1, "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z    INFO    controllers.RollingUpgrade    found in-progress instances    {"instances": ["i-039af737f93705937"]}
2021-08-25T05:20:08.849Z    INFO    controllers.RollingUpgrade    rotating batch    {"instances": ["i-039af737f93705937"], "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z    INFO    controllers.RollingUpgrade    no InService instances in the batch    {"batch": ["i-039af737f93705937"], "instances(InService)": [], "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z    INFO    controllers.RollingUpgrade    waiting for desired nodes    {"name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z    INFO    controllers.RollingUpgrade    desired nodes are ready    {"name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z    INFO    controllers.RollingUpgrade    draining the node    {"instance": "i-039af737f93705937", "node name": "ip-172-24-64-136.eu-west-1.compute.internal", "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
time="2021-08-25T05:20:09Z" level=debug msg="nodeEventsHandler[update] nodeObj updated, updated in sync mapnodeNameip-172-24-69-198.eu-west-1.compute.internal"
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-tkf65, kube-system/kiam-agent-kwrjp, kube-system/kube-proxy-985s5, kube-system/node-local-dns-8tdlk, logging/default-fluentbit-lnlj6, monitoring/node-exporter-fhtx5
2021-08-25T05:20:09.624Z    ERROR    controllers.RollingUpgrade    failed to rotate the node    {"name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc", "error": "DrainNode failed: instanceID - i-039af737f93705937, error draining node: [error when evicting pods/\"xxxxxxxx-5859f68b5d-npkq9\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-5c5f9478fd-x7kqj\" -n \"datalink-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-b9658cf44-q85r4\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"al-bisamericas-manager-7b6f9964df-9qzkq\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6bc5d48f5-jjn5l\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6f7c9d9df-jfbj5\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-778fff6dd6-rhc4k\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6599f96bd8-xf2vl\" -n \"xxxxxxxx-customer\": global timeout reached: -1s]", "errorVerbose": "DrainNode failed: instanceID - i-039af737f93705937, error draining node: [error when evicting pods/\"thameswater-agent-5859f68b5d-npkq9\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-5c5f9478fd-x7kqj\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-b9658cf44-q85r4\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-7b6f9964df-9qzkq\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6bc5d48f5-jjn5l\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6f7c9d9df-jfbj5\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-778fff6dd6-rhc4k\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6599f96bd8-xf2vl\" -n \"xxxxxxxx-customer\": global timeout reached: -1s]\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeContext).ReplaceNodeBatch.func1\n\t/workspace/controllers/upgrade.go:272\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132
github.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeContext).ReplaceNodeBatch
    /workspace/controllers/upgrade.go:307
github.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeContext).RotateNodes
    /workspace/controllers/upgrade.go:108
github.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).Reconcile
     /workspace/controllers/rollingupgrade_controller.go:172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99
evicting pod xxxxxxxx-customer/xxxxxxxx-5859f68b5d-npkq9
evicting pod xxxxxxxx-customer/xxxxxxxx-5c5f9478fd-x7kqj
evicting pod xxxxxxxx-customer/xxxxxxxx-b9658cf44-q85r4
evicting pod xxxxxxxx-customer/xxxxxxxx-7b6f9964df-9qzkq
evicting pod xxxxxxxx-customer/xxxxxxxx-6bc5d48f5-jjn5l
evicting pod xxxxxxxx-customer/xxxxxxxx-6f7c9d9df-jfbj5
evicting pod xxxxxxxx-customer/xxxxxxxx-778fff6dd6-rhc4k
evicting pod xxxxxxxx-customer/xxxxxxxx-6599f96bd8-xf2vl
2021-08-25T05:20:09.633Z    ERROR    controller-runtime.manager.controller.rollingupgrade    Reconciler error    {"reconciler group": "upgrademgr.keikoproj.io", "reconciler kind": "RollingUpgrade", "name": "rollup-nodes-0cc26b4a29182e2bc", "namespace": "instance-manager", "error": "DrainNode failed: instanceID - i-039af737f93705937, error draining node: [error when evicting pods/\"xxxxxxxx-5859f68b5d-npkq9\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-5c5f9478fd-x7kqj\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-b9658cf44-q85r4\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-7b6f9964df-9qzkq\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"barclays-costallocb-agent-6bc5d48f5-jjn5l\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6f7c9d9df-jfbj5\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-778fff6dd6-rhc4k\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6599f96bd8-xf2vl\" -n \"xxxxxxxx-customer\": global timeout reached: -1s]", "errorVerbose": "DrainNode failed: instanceID - i-039af737f93705937, error draining node: [error when evicting pods/\"xxxxxxxx-5859f68b5d-npkq9\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-5c5f9478fd-x7kqj\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-b9658cf44-q85r4\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-7b6f9964df-9qzkq\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6bc5d48f5-jjn5l\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6f7c9d9df-jfbj5\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-778fff6dd6-rhc4k\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6599f96bd8-xf2vl\" -n \"xxxxxxxx-customer\": global timeout reached: -1s]\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeContext).ReplaceNodeBatch.func1\n\t/workspace/controllers/upgrade.go:272\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:267
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99
2021-08-25T05:20:09.633Z    INFO    controllers.RollingUpgrade    ***Reconciling***
2021-08-25T05:20:09.633Z    INFO    controllers.RollingUpgrade    rolling upgrade ended    {"name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc", "status": "error"}

@preflightsiren
Author

I manually kubectl apply'd a RollingUpgrade, this time setting drainTimeout: 600, and it succeeded. Doing the same thing with drainTimeout: -1 also errored.

apiVersion: upgrademgr.keikoproj.io/v1alpha1
kind: RollingUpgrade
metadata:
  annotations:
    app.kubernetes.io/managed-by: instance-manager
    instancemgr.keikoproj.io/upgrade-scope: ew1p-fcp-k2-xxxxxxxx-sh-r5-2xlarge-eu-west-1c
  name: rollup-nodes-0098ee98d43054385-seb
  namespace: instance-manager
spec:
  asgName: ew1p-fcp-k2-xxxxxxxx-sh-r5-2xlarge-eu-west-1c
  nodeIntervalSeconds: 10
  postDrain:
    waitSeconds: 90
  postDrainDelaySeconds: 45
  region: eu-west-1
  strategy:
    drainTimeout: 600
    mode: eager

@preflightsiren
Author

OK, I think I've figured out the chain of events that replicates the failure.

  1. The upgrade-manager README specifies that strategy.drainTimeout has a default value of -1 (I'm assuming to represent infinity):
    https://github.com/keikoproj/upgrade-manager/blob/master/README.md
  2. When draining a node, the timeout is set here: https://github.com/keikoproj/upgrade-manager/blob/controller-v2/controllers/providers/kubernetes/nodes.go#L60 which is just strategy.drainTimeout * time.Second
  3. In the kubectl drain package, d.Timeout is only special-cased when it == 0 (https://github.com/kubernetes/kubectl/blob/master/pkg/drain/drain.go#L277); otherwise the global timeout is set to the value passed in (-1 by default)
  4. context.WithTimeout(d.getContext(), globalTimeout) is then called (https://github.com/kubernetes/kubectl/blob/master/pkg/drain/drain.go#L282), and the resulting context expires immediately for timeouts <= 0
    I've created a playground https://play.golang.org/p/K9KmZ0vkX4f to demonstrate:
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	fmt.Println("Hello world.")
	errChan := make(chan error)
	// Mirrors kubectl drain's globalTimeout: with a value <= 0 the context's
	// deadline is already in the past, so ctx.Done() fires immediately.
	globalTimeout := 0
	ctx, cancel := context.WithTimeout(context.TODO(), time.Duration(globalTimeout)*time.Second)
	defer cancel()

	go errFunc(errChan)

	// "done" is printed straight away; the 5-second worker never gets to report.
	select {
	case <-ctx.Done():
		fmt.Println("done")
	case err := <-errChan:
		fmt.Println(err)
	}
	time.Sleep(100 * time.Second)
}

// errFunc stands in for a drain that takes 5 seconds before reporting back.
func errFunc(c chan error) {
	time.Sleep(5 * time.Second)
	c <- fmt.Errorf("timer called")
}
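
To show the asymmetry from steps 3 and 4 side by side, here is a second self-contained sketch (my own code, not kubectl's; the globalDrainTimeout helper name is made up). Only an exact 0 is mapped to "effectively infinite", while -1s produces a context that is already expired and a positive value behaves normally:

package main

import (
	"context"
	"fmt"
	"math"
	"time"
)

// globalDrainTimeout mimics the check described in step 3: only 0 is mapped
// to "effectively infinite"; every other value is used as-is.
func globalDrainTimeout(timeout time.Duration) time.Duration {
	if timeout == 0 {
		return time.Duration(math.MaxInt64)
	}
	return timeout
}

func main() {
	for _, t := range []time.Duration{0, -1 * time.Second, 600 * time.Second} {
		ctx, cancel := context.WithTimeout(context.Background(), globalDrainTimeout(t))
		// A context built from a negative timeout is expired before any pod is
		// evicted, which is the "global timeout reached: -1s" failure above.
		fmt.Printf("drainTimeout=%v expired immediately: %v\n", t, ctx.Err() != nil)
		cancel()
	}
}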

@preflightsiren
Author

preflightsiren commented Aug 26, 2021

some potential solutions:

@preflightsiren
Author

Looks like in the latest code this has been partially addressed by https://github.com/keikoproj/upgrade-manager/blob/controller-v2/controllers/upgrade.go#L121 which works in conjunction with https://github.com/keikoproj/upgrade-manager/blame/controller-v2/main.go#L93 to set a default drainTimeout of 900s.
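
For illustration, a minimal sketch of that kind of flag-based defaulting (hypothetical flag and function names, not the project's actual identifiers):

package main

import (
	"flag"
	"fmt"
)

// Hypothetical sketch: the controller binary carries a default drain timeout,
// and any RollingUpgrade that leaves strategy.drainTimeout unset (0) inherits
// that default instead of being rewritten to -1.
var defaultDrainTimeout = flag.Int("drain-timeout", 900,
	"default drain timeout in seconds when the CR does not set one")

func effectiveDrainTimeout(specified int) int {
	if specified == 0 { // 0 means "not set" in the CR
		return *defaultDrainTimeout
	}
	return specified
}

func main() {
	flag.Parse()
	fmt.Println(effectiveDrainTimeout(0))   // 900 (the flag default)
	fmt.Println(effectiveDrainTimeout(600)) // 600 (explicitly set)
}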

@preflightsiren
Author

Setting drainTimeout to 0 also fails with the same "global timeout: -1s" error (no idea why yet). Worked around the issue by setting drainTimeout to maxInt32 (2147483647).
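
For anyone hitting the same thing, the workaround boils down to pinning the strategy to a large positive timeout, using the same fields as the manifest above (value taken from this comment, not an upstream recommendation):

spec:
  strategy:
    drainTimeout: 2147483647  # max int32, effectively "never time out"
    mode: eager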

@preflightsiren
Author

// validating the DrainTimeout value
if strategy.DrainTimeout == 0 {
	r.Spec.Strategy.DrainTimeout = -1
} else if strategy.DrainTimeout < -1 {
	err := fmt.Errorf("%s: Invalid value for startegy DrainTimeout - %d", r.Name, strategy.MaxUnavailable.IntVal)
	return false, err
that'll do it....
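
For contrast, a hedged sketch of what a safer check might look like (illustrative only, not the project's code; it leaves 0 alone for a later defaulting step, keeps -1 as the only "never time out" sentinel, and reports the DrainTimeout value itself in the error):

package validation

import "fmt"

// validateDrainTimeout rejects anything below -1 and does not rewrite 0,
// so a defaulting step can still fill in a sensible positive value later.
func validateDrainTimeout(name string, drainTimeout int) (bool, error) {
	if drainTimeout < -1 {
		return false, fmt.Errorf("%s: invalid value for strategy DrainTimeout - %d", name, drainTimeout)
	}
	return true, nil
}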

@eytan-avisror
Contributor

@shreyas-badiger
We should handle the case of a 0 drain timeout; drain timeout should never be 0. It should either be greater than 0 (e.g. explicitly set to a number of seconds) or -1, which would mean never time out.
Seems we are not setting a default value when drainTimeout is not referenced (which defaults to 0).

@preflightsiren
Author

@shreyas-badiger

We should handle the case of a 0 drain timeout; drain timeout should never be 0. It should either be greater than 0 (e.g. explicitly set to a number of seconds) or -1, which would mean never time out.

Seems we are not setting a default value when drainTimeout is not referenced (which defaults to 0).

I think it's more that the kubectl drain package you're using uses 0 for infinity instead of -1; negative values essentially mean "time out immediately".
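
Put differently, the CRD and the kubectl drain package use different sentinels for "never time out", so the controller needs a translation step before building the drain helper. A minimal sketch under that assumption (hypothetical helper name, not the project's actual code):

package drainutil

import "time"

// toKubectlDrainTimeout maps the CRD convention (-1, or any negative value,
// means "never time out") onto the kubectl drain convention (0 means "never
// time out"; negative values expire immediately).
func toKubectlDrainTimeout(specSeconds int) time.Duration {
	if specSeconds < 0 {
		return 0 // the drain package treats 0 as "wait forever"
	}
	return time.Duration(specSeconds) * time.Second
}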
