should instance-stop *really* move instances to failed when they are discovered to already be gone? #6809
Labels: nexus (Related to nexus)
The `PUT /instances/{instance}/stop` API will send a request to sled-agent asking it to terminate a running instance. If sled-agent responds with something indicating that it actually didn't know about that instance in the first place, Nexus will then transition it to `Failed`:

omicron/nexus/src/app/instance.rs
Lines 755 to 775 in 0640bb2
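
To make the behavior under discussion concrete, here is a minimal sketch of the decision described above. It is not the code from `instance.rs`: the `StopResponse` and `VmmState` enums and the `vmm_state_after_stop` function are hypothetical names invented for illustration, and the real sled-agent response and Nexus state types are richer than this.

```rust
/// Hypothetical, simplified stand-in for what sled-agent reports back
/// to Nexus after a stop request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StopResponse {
    /// sled-agent knew about the VMM and is shutting it down.
    Stopping,
    /// sled-agent has no record of the VMM at all.
    NoSuchVmm,
}

/// Hypothetical subset of VMM states relevant to this discussion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmmState {
    Stopping,
    Failed,
}

/// Models the behavior described in this issue: when a stop request
/// finds that sled-agent does not know the VMM, the VMM is marked Failed.
fn vmm_state_after_stop(resp: StopResponse) -> VmmState {
    match resp {
        StopResponse::Stopping => VmmState::Stopping,
        StopResponse::NoSuchVmm => VmmState::Failed,
    }
}
```

In this model, a stop request that discovers a missing VMM yields `VmmState::Failed` even though the caller only asked for the instance to stop running.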
The transition to `Failed` will not happen when the instance is gone because another concurrent attempt to stop it has succeeded: the racing stop attempt will have advanced the VMM's generation number whilst moving it to `Destroyed`, so we won't mark it as `Failed`. However, in the event of a sled-agent crash, we may encounter an already-gone VMM here, and may move it to `Failed`.

This seems a bit wacky to me, since
`Failed` instances (which is what the instance will eventually become as a result of its VMM being marked `Failed`) are eligible to be auto-restarted, while `Stopped` instances are not --- because the user actually wanted them to be stopped. And, in this case, the user is expressing intent to have an instance stop running, and we just happened to discover that we had already anticipated their desire to stop it and went ahead and stopped it for them before they even asked us to. Admittedly, we weren't supposed to have done that! But, in this case, the requested state is "instance is not running", and it's not running, so it seems a bit unfortunate to go "oh no, i was supposed to make the instance not be running, and when i tried to do that, i discovered that it was not running because we made a mistake, so now i'm actually going to...make it be running again?"

Imagine a scenario where a user goes to stop an instance so that they can change its boot disk or something, and while doing so, we discover that sled-agent has crashed and the instance isn't there. Moving it to `Failed` results in the instance being restarted, so now the user has to stop the instance a second time before they can actually do what they were trying to do originally.

Maybe we should just always move the instance to `Stopped` when such an error is encountered by an instance-stop attempt. Obviously we would still go to `Failed` when attempting to do other things to the instance.
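
A rough sketch of what that proposal might look like, assuming the outcome can be chosen from the operation that discovered the missing VMM. Everything here is hypothetical and invented for illustration (`Operation`, `InstanceState`, `state_for_missing_vmm`, `eligible_for_auto_restart`); it ignores the generation-number check and any auto-restart policy settings, and is not how Nexus is actually structured.

```rust
/// Hypothetical classification of the API operation that discovered
/// the VMM was already gone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Operation {
    /// The user asked for the instance to stop.
    Stop,
    /// Any other operation on the instance (reboot, etc.).
    Other,
}

/// Hypothetical subset of instance states relevant to this proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Stopped,
    Failed,
}

/// Proposed policy: a missing VMM discovered by a stop request already
/// satisfies the user's intent, so record Stopped; any other operation
/// still records Failed.
fn state_for_missing_vmm(op: Operation) -> InstanceState {
    match op {
        Operation::Stop => InstanceState::Stopped,
        Operation::Other => InstanceState::Failed,
    }
}

/// As described in this issue: Failed instances are eligible for
/// auto-restart, Stopped instances are not.
fn eligible_for_auto_restart(state: InstanceState) -> bool {
    matches!(state, InstanceState::Failed)
}
```

Under this sketch, `state_for_missing_vmm(Operation::Stop)` gives `Stopped`, which is not eligible for auto-restart, so the user in the boot-disk scenario would not have to stop the instance a second time.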