Describe the bug
If I kill a worker pod and then launch a subsequent task, hoping that the system will start up a new worker pod, my task instead progresses only as far as `Task is pending due to waiting-for-nodes` and no new worker pod is launched. This looks like it's because the fork of the Kubernetes provider in funcX does not check Kubernetes for pod status, and continues to claim that the worker pod exists -- this was fixed in the fork of the Kubernetes provider in Parsl in early 2021; see Parsl/parsl#1740.
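For illustration only, here is a rough sketch of the kind of liveness check I mean, using the kubernetes Python client. The function name, namespace, and label selector are placeholders of mine, not code from either the funcX or Parsl provider:

```python
# Sketch only: ask the Kubernetes API which worker pods actually exist,
# rather than trusting the provider's in-memory list of launched pods.
from kubernetes import client, config

def worker_pod_phases(namespace="default", label_selector="app=funcx-worker"):
    """Return {pod_name: phase} for the worker pods Kubernetes knows about.
    A pod that has been deleted simply won't appear here, so the caller can
    mark it as gone and allow a replacement to be scheduled."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    return {p.metadata.name: p.status.phase for p in pods.items}
```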
I thought I'd already opened a funcX GitHub issue about this, but I can't find it.
Restarting the endpoint clears the list of disappeared pods.
This second issue is somewhat disguised by scaling: once I have blocked the first missing container with enough hung tasks, the endpoint scales out a new pod to take on the excess work, which then succeeds in executing any new work. So a user who accepts that "often funcX doesn't run very well, I should just keep retrying and not report a problem" will trigger that scale-out without ever reporting the problem.
To Reproduce
Delete a worker pod
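For concreteness, a sketch of the reproduction step using the kubernetes Python client; the namespace and label selector are assumptions about my dev environment, and `kubectl delete pod <name>` does the same thing:

```python
# Sketch: delete one funcX worker pod to reproduce the problem.
# "default" namespace and the label selector are assumptions about my setup.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
workers = v1.list_namespaced_pod("default", label_selector="app=funcx-worker")
victim = workers.items[0].metadata.name
v1.delete_namespaced_pod(victim, "default")
print(f"Deleted worker pod {victim}")
```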
Expected behavior
Something more like the Parsl fork of the Kubernetes provider; see Parsl/parsl#1740.
Environment
My Kubernetes dev environment, main branches as of 2022-02-28.