We are running node-drainer (sha-309d7dc) together with palantir/bouncer in canary mode, for an ASG with 6 desired nodes, 3 of which were still on the old launch template.
So palantir/bouncer set the ASG desired capacity to 9 (launching 3 new instances) and then issued TerminateInstanceInAutoScalingGroup (with ShouldDecrementDesiredCapacity set, so the desired count drops back as each instance goes) for the 3 instances on the old launch template.
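For context, that terminate step corresponds to the TerminateInstanceInAutoScalingGroup API. A rough boto3 equivalent (just an illustration of the call bouncer makes, not its actual code, which is Go):

import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-southeast-1")

# Terminate one instance still on the old launch template and let the ASG
# desired count drop back by one; this is what fires the termination
# lifecycle hook that node-drainer then has to complete.
autoscaling.terminate_instance_in_auto_scaling_group(
    InstanceId="i-0c22f8c656c62a282",  # instance ID taken from the logs below
    ShouldDecrementDesiredCapacity=True,
)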
Some of the instances were drained properly and node-drainer completed the lifecycle hook, but others appear to get stuck in an infinite loop. I manually checked those nodes and confirmed that the only remaining Pods belonged to DaemonSets (some of those Pods carry taint tolerations so they only run on certain nodes and are never rescheduled elsewhere).
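The manual check amounts to something like this sketch (kubernetes Python client; node-drainer itself is Go, so this is only illustrative). DaemonSet pods can never be evicted away for good, since the DaemonSet controller recreates them on the same node, which is why `kubectl drain` has `--ignore-daemonsets`:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

node = "ip-10-51-61-168.ap-southeast-1.compute.internal"  # node from the logs below
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node}")

# Count only pods NOT owned by a DaemonSet; if this is zero, the drain
# should be treated as complete.
blocking = [
    p for p in pods.items
    if not any(ref.kind == "DaemonSet"
               for ref in (p.metadata.owner_references or []))
]
print(f"{len(blocking)} non-DaemonSet pods still on {node}")

Logs from one of the stuck nodes: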
time="2019-08-16T09:19:03Z" level=info msg="Resolved Instance ID i-0c22f8c656c62a282 to Node Name ip-10-51-61-168.ap-southeast-1.compute.internal"
time="2019-08-16T09:19:03Z" level=info msg="Sending ASG heartbeat for instance i-0c22f8c656c62a282"
time="2019-08-16T09:19:03Z" level=info msg="Adding node ip-10-51-61-168.ap-southeast-1.compute.internal to the backlog"
...
# forever (waited 1 hour)
...
# manually ran:
aws autoscaling complete-lifecycle-action --instance-id i-0c22f8c656c62a282 --lifecycle-hook-name swat-stage-bohr-compute-workers-nodedrainerLCH --auto-scaling-group-name swat-stage-bohr-compute-workers --lifecycle-action-result CONTINUE
...
time="2019-08-16T09:25:42Z" level=info msg="Draining next node ip-10-51-57-12.ap-southeast-1.compute.internal from backlog"
time="2019-08-16T09:25:42Z" level=warning msg="nodes \"ip-10-51-57-12.ap-southeast-1.compute.internal\" not found"