We are running node-drainer (sha-309d7dc) together with palantir/bouncer in canary mode, for an ASG with 6 desired nodes, 3 of which were still on the old launch template.
So palantir/bouncer set the ASG desired capacity to 9 (launching 3 new instances) and then issued TerminateInstanceInAutoScalingGroup (with ShouldDecrementDesiredCapacity set, so the desired count drops back as each instance goes) for the 3 instances on the old launch template.
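For context, that terminate step corresponds to the TerminateInstanceInAutoScalingGroup API. A rough boto3 equivalent (just an illustration of the call bouncer makes, not its actual code, which is Go):

import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-southeast-1")

# Terminate one instance still on the old launch template and let the ASG
# desired count drop back by one; this is what fires the termination
# lifecycle hook that node-drainer then has to complete.
autoscaling.terminate_instance_in_auto_scaling_group(
    InstanceId="i-0c22f8c656c62a282",  # instance ID taken from the logs below
    ShouldDecrementDesiredCapacity=True,
)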
Some of the instances were drained properly and node-drainer completed the lifecycle hook, but others appear to get stuck in an infinite loop. I manually checked those nodes and confirmed that the only remaining Pods belonged to DaemonSets (some of those Pods carry taint tolerations so they only run on certain nodes and are never rescheduled elsewhere).
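The manual check amounts to something like this sketch (kubernetes Python client; node-drainer itself is Go, so this is only illustrative). DaemonSet pods can never be evicted away for good, since the DaemonSet controller recreates them on the same node, which is why `kubectl drain` has `--ignore-daemonsets`:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

node = "ip-10-51-61-168.ap-southeast-1.compute.internal"  # node from the logs below
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node}")

# Count only pods NOT owned by a DaemonSet; if this is zero, the drain
# should be treated as complete.
blocking = [
    p for p in pods.items
    if not any(ref.kind == "DaemonSet"
               for ref in (p.metadata.owner_references or []))
]
print(f"{len(blocking)} non-DaemonSet pods still on {node}")

Logs from one of the stuck nodes: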
time="2019-08-16T09:19:03Z" level=info msg="Resolved Instance ID i-0c22f8c656c62a282 to Node Name ip-10-51-61-168.ap-southeast-1.compute.internal"
time="2019-08-16T09:19:03Z" level=info msg="Sending ASG heartbeat for instance i-0c22f8c656c62a282"
time="2019-08-16T09:19:03Z" level=info msg="Adding node ip-10-51-61-168.ap-southeast-1.compute.internal to the backlog"
...
# forever (waited 1 hour)
...
# manually ran:
aws autoscaling complete-lifecycle-action --instance-id i-0c22f8c656c62a282 --lifecycle-hook-name swat-stage-bohr-compute-workers-nodedrainerLCH --auto-scaling-group-name swat-stage-bohr-compute-workers --lifecycle-action-result CONTINUE
...
time="2019-08-16T09:25:42Z" level=info msg="Draining next node ip-10-51-57-12.ap-southeast-1.compute.internal from backlog"
time="2019-08-16T09:25:42Z" level=warning msg="nodes \"ip-10-51-57-12.ap-southeast-1.compute.internal\" not found"