should `instance-stop` really move instances to failed when they are discovered to already be gone? #6809

hawkw · 2024-10-09T20:28:42Z

The PUT /instances/{instance}/stop API will send a request to sled-agent asking it to terminate a running instance. If sled-agent responds with something indicating that it actually didn't know about that instance in the first place, Nexus will then transition it to Failed:

omicron/nexus/src/app/instance.rs

Lines 755 to 775 in 0640bb2

    
           if let Err(e) = self 
        
               .instance_request_state( 
        
                   opctx, 
        
                   state.instance(), 
        
                   state.vmm(), 
        
                   InstanceStateChangeRequest::Stop, 
        
               ) 
        
               .await 
        
           { 
        
               if let (InstanceStateChangeError::SledAgent(inner), Some(vmm)) = 
        
                   (&e, state.vmm()) 
        
               { 
        
                   if inner.vmm_gone() { 
        
                       let _ = self 
        
                           .mark_vmm_failed(opctx, authz_instance, vmm, inner) 
        
                           .await; 
        
                   } 
        
               } 
        
               return Err(e); 
        
           }

This will not happen when the reason the instance is gone is because another concurrent attempt to stop it has succeeded, because the racing stop attempt will have advanced the VMM's generation number whilst moving it to Destroyed, so we won't mark it as Failed. However, in the event of a sled-agent crash, we may encounter an already-gone VMM here, and may move it to Failed.

This seems a bit wacky to me, since Failed instances (which is what the instance will eventually become as a result of its VMM being marked Failed) are eligible to be auto-restarted, while Stopped instances are not --- because the user actually wanted them to be stopped. And, in this case, the user is expressing intent to have an instance stop running, and we just happened to discover that we had already anticipated their desire to stop it and went ahead and stopped it for them before they even asked us to. Admittedly, we weren't supposed to have done that! But, in this case, the requested state is "instance is not running", and it's not running, so it seems a bit unfortunate to go "oh no, i was supposed to make the instance not be running, and when i tried to do that, i discovered that it was not running because we made a mistake, so now i'm actually going to...make it be running again?"

Imagine a scenario where a user goes to stop an instance so that they can change its boot disk or something, and while doing so, we discover that sled-agent has crashed and the instance isn't there. Moving it to Failed results in the instance being restarted, so now the user has to stop the instance a second time before they can actually do what they were trying to do originally.

Maybe we should just always move the instance to stopped when such an error is encountered by an instance-stop attempt. Obviously we would still go to Failed when attempting to do other things to the instance.

The text was updated successfully, but these errors were encountered:

faithanalog · 2024-10-09T20:36:14Z

I don't think the instance should ever turn on if i told it to be stopped. If "failed" is useful information for the user in this case, perhaps a "failed but don't auto restart" would be a useful backend state to create.

But I'm not sure if it is. What would I do if I told my machine to stop, and it said "failed"?

I might think "maybe it failed to stop. is it still running? well I can't get to it on the network, so at least services don't seem to be running, or maybe it tore down networking. ???"

Or I might think "well it's not running. but did I break something in the backend I need to worry about?"

gjcolombo · 2024-10-10T23:09:02Z

I agree with pretty much all of the above.

The piece of information that gets lost by transitioning to Stopped (and not Failed) when a VMM is gone is the unexpectedness of the VMM's absence. The tradeoff is that (by default) a Failed instance is eligible to be restarted automatically, and I agree that this doesn't make very much sense if we know for certain that the user intends for the VM not to be running, which is the case in the context of a Stop API call. It seems way more sensible to drop the "oops it didn't reach the end of the ride" bit than to force a user to do more work to keep an instance from restarting.¹

In terms of what would need to change in Nexus to do this: I think some care is required because an instance-stop API call can race with the instance watcher to determine the final state of a missing VMM:

Stop API observes VMM record for missing VMM at generation 10
Instance watcher task reads VMM record for missing VMM at generation 10
Watcher calls sled agent, gets back "not here"
Watcher advances VMM to Failed at generation 11
Stop API tries to advance VMM to Destroyed, can't do so because the VMM is already at generation 11

I'd have to think about what to do in this case--does the Stop API call fail (with a 409 or something like that), or does it try to advance the VMM record to Destroyed at generation 12 and then succeed? If the latter, what if the reincarnation task gets to the instance in the meantime and restarts it? Then you have a Stop request that returns a Running instance... It might be simplest just to fail if the stop API was not able to take some meaningful action to drive the instance's state to Stopped.

Assuming these practical details sound OK, I'm on board with this conception of the purpose of instance_stop: the point is to drive the system toward a state where the instance has no active VMM and is not eligible to be automatically started, and observing that there's no active VMM (and successfully commandeering its state machine if needed) achieves that goal. If we really want we can always add a separate mechanism later to report that the instance didn't keep its limbs inside the state machine before it came to a complete stop.

This is particularly true today because the stop API pulls power from the guest. The distinction is maybe more meaningful for VMs that users shut down from within the guest--in that case Failed is more indicative of a guest that may not have gone through a clean shutdown sequence. ↩

hawkw added the nexus Related to nexus label Oct 9, 2024

hawkw assigned gjcolombo, hawkw and iximeow Oct 9, 2024

augustuswm mentioned this issue Oct 10, 2024

Requesting a "Stopping" instance to stop results in eventual running state #6819

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

should `instance-stop` really move instances to failed when they are discovered to already be gone? #6809

should `instance-stop` really move instances to failed when they are discovered to already be gone? #6809

hawkw commented Oct 9, 2024

faithanalog commented Oct 9, 2024

gjcolombo commented Oct 10, 2024

should instance-stop *really* move instances to failed when they are discovered to already be gone? #6809

should instance-stop *really* move instances to failed when they are discovered to already be gone? #6809

Comments

hawkw commented Oct 9, 2024

faithanalog commented Oct 9, 2024

gjcolombo commented Oct 10, 2024

Footnotes

should `instance-stop` really move instances to failed when they are discovered to already be gone? #6809

should `instance-stop` really move instances to failed when they are discovered to already be gone? #6809