fix: return bound endpoints after server started #113

Open. Wants to merge 4 commits into base: main
Conversation

kriswuollett (Contributor)

The impetus behind the change is to support adding monitoring endpoints with an optional additional TCP listener. While refactoring the server, it was discovered that the run method moves the Server's self; this PR therefore also fixes #112 by ensuring that the API endpoint is running before tests discover the listener's locally bound port.

In addition, a common cancellation token is added to the server to propagate shutdown to both the core service and the API server endpoint. It can later be used to support additional services, such as monitoring endpoint(s).
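The bound-port discovery that motivated #112 can be sketched with the standard library alone (the real server uses an async listener; `bind_ephemeral` is a hypothetical helper for illustration, not part of this PR):

```rust
use std::net::TcpListener;

// Binding to port 0 asks the OS for an ephemeral port; the actual port is
// only knowable after the listener exists, which is why the server must be
// started before tests can discover the bound endpoint.
fn bind_ephemeral() -> std::io::Result<TcpListener> {
    TcpListener::bind("127.0.0.1:0")
}

fn main() -> std::io::Result<()> {
    let listener = bind_ephemeral()?;
    let addr = listener.local_addr()?; // the endpoint to report back to callers
    assert_ne!(addr.port(), 0);
    println!("bound to {addr}");
    Ok(())
}
```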

kriswuollett (Contributor, Author)

Manually tested that Ctrl-C still works:

[Screenshot, 2023-05-12: terminal output showing graceful shutdown on Ctrl-C]

peterhuene (Member) left a comment

@kriswuollett Thanks for refactoring this!

Overall, I think this makes a lot of sense and will make things cleaner once we have additional endpoints we want to run as part of the same "warg server".

I just have some comments to go over with you.

crates/server/src/lib.rs (outdated; resolved)
Comment on lines +206 to +211
self.tasks.spawn(async move {
token.cancelled().await;
tracing::info!("waiting for core service to stop");
handle.stop().await;
Ok(())
});
peterhuene (Member) commented on May 16, 2023

Could this task be merged with the next task? That is, after the server future completes in the next task, we simply wait on the stop handle for the core service.

Given this design, I would expect a 1:1 correspondence between a task in the task set and a running Hyper server (the API for now, but soon a monitoring server, etc.). Each would shut down in its own way.

kriswuollett (Contributor, Author)

I'd think that, due to the nature of fanning out a cancellation signal chain, a cancellation token should be cloned for each owned dependency, somewhat similar to how Go's Context works. Otherwise there will inevitably be for-loop boilerplate written to cancel the "children" separately as the server grows. Cloned tokens also allow things to shut down in parallel easily if needed.

But yes, we could keep things simple for now and merge the shutdown tasks instead of chaining them through the tokens themselves. However, I'd either want to use a broadcast channel instead, or an additional token for draining in #111.
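The fan-out being described can be sketched with the standard library (the PR itself uses tokio_util's CancellationToken; `CancelFlag` here is an illustrative stand-in, not the actual type):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// A minimal stand-in for a cancellation token: each child owns a clone of a
// shared flag, so one `cancel` call fans out to every subsystem without
// per-child boilerplate at the call site.
#[derive(Clone)]
struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    fn new() -> Self {
        CancelFlag(Arc::new(AtomicBool::new(false)))
    }
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

fn spawn_child(flag: CancelFlag, name: &'static str) -> thread::JoinHandle<&'static str> {
    thread::spawn(move || {
        // Each child observes the shared signal and shuts down on its own.
        while !flag.is_cancelled() {
            thread::sleep(Duration::from_millis(1));
        }
        name
    })
}

fn main() {
    let root = CancelFlag::new();
    let children: Vec<_> = ["api", "monitoring"]
        .into_iter()
        .map(|name| spawn_child(root.clone(), name))
        .collect();
    root.cancel(); // one signal, fanned out to every clone
    for child in children {
        child.join().unwrap();
    }
}
```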

peterhuene (Member)

I don't think we want the core service to be stopped in parallel to the hyper server shutting down as it might race.

If we call stop on the core service, it will break out of its loop and the task will complete; note that the CoreService instance itself won't be dropped as a result of this as there are references to it from the hyper server.

If this is done in parallel to the hyper server shutting down, then it's possible a request being drained attempts to enqueue work for the core service to do; if it so happens that the core service already stopped and the send limit for the message queue is reached, it'll block waiting for the core service to remove a message which will never occur since its task has ended.

We also don't want to close the message queue's sender as part of stopping the core service as this would result in a panic (even a graceful error back to the client here is not ideal) and a reset of the connection for any request attempting to enqueue work for the core service to do after it has stopped.

I believe the correct order of operations here is to shut down the hyper server and then join on the core service, like we had before.
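That ordering can be sketched with plain threads and a channel standing in for the core service's message queue (names here are illustrative, not the actual warg-server types): the front end finishes draining before the queue consumer is stopped, so no drained request can block on a dead queue.

```rust
use std::sync::mpsc;
use std::thread;

// Model of the ordering described above: the "API server" drains first,
// then the "core service" is stopped and joined.
fn run_shutdown() -> Vec<&'static str> {
    let mut order = Vec::new();
    let (work_tx, work_rx) = mpsc::channel::<&'static str>();

    // "Core service": consumes work until its sender side is dropped.
    let core = thread::spawn(move || {
        let mut handled = Vec::new();
        while let Ok(item) = work_rx.recv() {
            handled.push(item);
        }
        handled
    });

    // "API server": a final in-flight request is still enqueued while
    // draining; this is safe because the core service is still running.
    work_tx.send("drained-request").unwrap();
    order.push("api-shutdown");

    // Only now stop the core service (dropping the sender closes the queue),
    // then join it.
    drop(work_tx);
    let handled = core.join().unwrap();
    assert_eq!(handled, ["drained-request"]);
    order.push("core-stopped");
    order
}

fn main() {
    assert_eq!(run_shutdown(), ["api-shutdown", "core-stopped"]);
}
```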

peterhuene (Member) commented on May 18, 2023

Note that eventually the core service won't be part of warg-server, but instead another backend service that it talks to via IPC (i.e., a real distributed and reliable message queue) for processing new record entries; it's the limiting factor in why we can't spin up multiple warg-server instances right now.

kriswuollett (Contributor, Author)

I was merely suggesting making parallel shutdown possible; I did not point out a use case where it applies now. I'm well aware of the need to keep dependencies going, like the core service, while a user of them drains, like the API server.

We share the same view of the shutdown order, but what are your thoughts on using a CancellationToken to signal things to shut down, versus an explicit function call on a dependency? Note that I'm not suggesting a change in this PR to inject a token into CoreService itself; the handle.stop().await would remain. This helps a bit when/if draining needs to be handled before an actual shutdown.

crates/server/src/lib.rs (resolved)
crates/server/src/lib.rs (outdated; resolved)
Comment on lines +151 to +154
assert!(
self.endpoints.is_none(),
"cannot start server multiple times"
);
peterhuene (Member) commented on May 16, 2023

So I'm personally not a big fan of taking &mut self for this, as it leads to confusing internal state and to wrapping things in Option just so this method can take their contents.

What if we renamed this start method to spawn and in addition to returning the endpoints, we return a type wrapping both the JoinSet representing the spawned server's tasks and the CancellationToken that can be used to stop the server. Thus, the type could have a stop method on it to stop the server and also implement Future that a caller can await on to join the tasks in the set?
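One possible shape for such a handle, sketched with std threads standing in for the JoinSet and CancellationToken (all names here are hypothetical; the real type would also implement Future so callers can await it):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical handle returned by `spawn`: it owns the server task and the
// stop signal, so `Server` itself needs no `&mut self` bookkeeping.
struct ServerHandle {
    stop: Arc<AtomicBool>,
    task: thread::JoinHandle<&'static str>,
}

impl ServerHandle {
    // Signal shutdown and wait for the task to finish (analogous to
    // cancelling the token and then joining the task set).
    fn stop(self) -> &'static str {
        self.stop.store(true, Ordering::SeqCst);
        self.task.join().unwrap()
    }
}

fn spawn_server() -> ServerHandle {
    let stop = Arc::new(AtomicBool::new(false));
    let flag = stop.clone();
    let task = thread::spawn(move || {
        while !flag.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(1)); // "serve requests"
        }
        "stopped"
    });
    ServerHandle { stop, task }
}

fn main() {
    let handle = spawn_server();
    assert_eq!(handle.stop(), "stopped");
}
```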

Collaborator

Agree, I was just about to ask why required fields were becoming optional. I'd much rather have spawning produce a dedicated type that represents the info for a started service.

kriswuollett (Contributor, Author)

I was just trying to minimize the patch, since the existing code was &mut self, and I encountered issues like #112 while working on #111. I also wasn't sure why there was a struct called Server if it wasn't meant to be used after calling a start-like method.

Fields became Options in the config so that self wasn't partially moved. If I had continued in the same direction the PR was going, I'd have flattened the Config's members into Server and not kept a config member in the Server struct.

See other comment about providing a CancellationToken to chain a shutdown procedure rather than explicitly telling dependencies to shut down.

peterhuene (Member) commented on May 18, 2023

The existing bind method takes &mut self, but the former run method took self; both are now combined into your start method.

That said, I think we could better model our Server on hyper's Server: theirs is already a bound and listening server upon construction, so there doesn't need to be a start (or spawn) method.

That is, you construct/build a server and it internally binds based on its configuration; the binding can be inspected via a method. In hyper's case there is only a single bound address per server, but our implementation might have multiple bound addresses, as there might be multiple hyper servers internally (e.g. API, monitoring, etc.).

Finally, it implements Future, so that you simply await it for it to run to graceful completion.
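A minimal sketch of that shape using std::net (the actual implementation would hold tokio/hyper listeners and implement Future; the `api` field and `api_addr` method are illustrative names only):

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};

// Construction binds immediately, so there is no separate `start` step; the
// bound address can be inspected at any time afterwards. A real server might
// bind several internal listeners (API, monitoring, etc.).
struct Server {
    api: TcpListener,
}

impl Server {
    fn new() -> io::Result<Self> {
        Ok(Server {
            api: TcpListener::bind("127.0.0.1:0")?, // already listening
        })
    }

    fn api_addr(&self) -> io::Result<SocketAddr> {
        self.api.local_addr()
    }
}

fn main() -> io::Result<()> {
    let server = Server::new()?; // bound and listening upon construction
    println!("api listening on {}", server.api_addr()?);
    Ok(())
}
```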

kriswuollett (Contributor, Author)

Cool, yes. I'd follow patterns already used by existing servers, like hyper is here.

kriswuollett and others added 2 commits May 18, 2023 12:10
docs: fix code comments on server

Co-authored-by: Peter Huene <[email protected]>
Signed-off-by: Kristopher Wuollett <[email protected]>
Co-authored-by: Peter Huene <[email protected]>
Signed-off-by: Kristopher Wuollett <[email protected]>
kriswuollett (Contributor, Author) left a comment

Reviewed the comments, thanks! My primary pushback is on the method of controlling shutdown.


Successfully merging this pull request may close these issues:

- Cleanup: remove server partial move of self in run

3 participants