fix: return bound endpoints after server started #113

Open. Wants to merge 4 commits into base: main
Conversation

kriswuollett (Contributor)

The impetus behind the change is to support adding monitoring endpoints with an optional additional TCP listener. While refactoring the server, it was discovered that the run method moves the Server's self; this PR therefore also fixes #112 by ensuring that the API endpoint is running before tests discover the listener's locally bound port.

In addition, a common cancellation token is added to the server to propagate shutdown to both the core service and the API server endpoint. It can later be used to support additional services, such as monitoring endpoint(s).
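The bound-port discovery that motivated #112 can be sketched with the standard library alone (the real server uses an async listener; `bind_ephemeral` is a hypothetical helper for illustration, not part of this PR):

```rust
use std::net::TcpListener;

// Binding to port 0 asks the OS for an ephemeral port; the actual port is
// only knowable after the listener exists, which is why the server must be
// started before tests can discover the bound endpoint.
fn bind_ephemeral() -> std::io::Result<TcpListener> {
    TcpListener::bind("127.0.0.1:0")
}

fn main() -> std::io::Result<()> {
    let listener = bind_ephemeral()?;
    let addr = listener.local_addr()?; // the endpoint to report back to callers
    assert_ne!(addr.port(), 0);
    println!("bound to {addr}");
    Ok(())
}
```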

kriswuollett (Contributor, Author)

Manually tested that Ctrl-C still works:

[Screenshot, 2023-05-12: terminal output showing graceful shutdown on Ctrl-C]

peterhuene (Member) left a comment

@kriswuollett Thanks for refactoring this!

Overall, I think this makes a lot of sense and will make things cleaner once we have additional endpoints we want to run as part of the same "warg server".

I just have some comments to go over with you.

crates/server/src/lib.rs (outdated; resolved)
Comment on lines +206 to +211
self.tasks.spawn(async move {
token.cancelled().await;
tracing::info!("waiting for core service to stop");
handle.stop().await;
Ok(())
});
peterhuene (Member) commented on May 16, 2023

Could this task be merged with the next task? That is, after the server future completes in the next task, we simply wait on the stop handle for the core service.

Given this design, I would expect a 1:1 correspondence between a task in the task set and a running Hyper server (the API for now, but soon a monitoring server, etc.). Each would shut down in its own way.

kriswuollett (Contributor, Author)

I'd think that, due to the nature of fanning out a cancellation signal chain, a cancellation token should be cloned for each owned dependency, somewhat similar to how Go's Context works. Otherwise there will inevitably be for-loop boilerplate written to cancel the "children" separately as the server grows. Cloned tokens also allow things to shut down in parallel easily if needed.

But yes, we could keep things simple for now and merge the shutdown tasks instead of chaining them through the tokens themselves. However, I'd either want to use a broadcast channel instead, or an additional token for draining in #111.
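The fan-out being described can be sketched with the standard library (the PR itself uses tokio_util's CancellationToken; `CancelFlag` here is an illustrative stand-in, not the actual type):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// A minimal stand-in for a cancellation token: each child owns a clone of a
// shared flag, so one `cancel` call fans out to every subsystem without
// per-child boilerplate at the call site.
#[derive(Clone)]
struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    fn new() -> Self {
        CancelFlag(Arc::new(AtomicBool::new(false)))
    }
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

fn spawn_child(flag: CancelFlag, name: &'static str) -> thread::JoinHandle<&'static str> {
    thread::spawn(move || {
        // Each child observes the shared signal and shuts down on its own.
        while !flag.is_cancelled() {
            thread::sleep(Duration::from_millis(1));
        }
        name
    })
}

fn main() {
    let root = CancelFlag::new();
    let children: Vec<_> = ["api", "monitoring"]
        .into_iter()
        .map(|name| spawn_child(root.clone(), name))
        .collect();
    root.cancel(); // one signal, fanned out to every clone
    for child in children {
        child.join().unwrap();
    }
}
```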

peterhuene (Member)

I don't think we want the core service to be stopped in parallel to the hyper server shutting down as it might race.

If we call stop on the core service, it will break out of its loop and the task will complete; note that the CoreService instance itself won't be dropped as a result of this as there are references to it from the hyper server.

If this is done in parallel to the hyper server shutting down, then it's possible a request being drained attempts to enqueue work for the core service to do; if it so happens that the core service already stopped and the send limit for the message queue is reached, it'll block waiting for the core service to remove a message which will never occur since its task has ended.

We also don't want to close the message queue's sender as part of stopping the core service as this would result in a panic (even a graceful error back to the client here is not ideal) and a reset of the connection for any request attempting to enqueue work for the core service to do after it has stopped.

I believe the correct order of operations here is to shut down the hyper server and then join on the core service, like we had before.
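That ordering can be sketched with plain threads and a channel standing in for the core service's message queue (names here are illustrative, not the actual warg-server types): the front end finishes draining before the queue consumer is stopped, so no drained request can block on a dead queue.

```rust
use std::sync::mpsc;
use std::thread;

// Model of the ordering described above: the "API server" drains first,
// then the "core service" is stopped and joined.
fn run_shutdown() -> Vec<&'static str> {
    let mut order = Vec::new();
    let (work_tx, work_rx) = mpsc::channel::<&'static str>();

    // "Core service": consumes work until its sender side is dropped.
    let core = thread::spawn(move || {
        let mut handled = Vec::new();
        while let Ok(item) = work_rx.recv() {
            handled.push(item);
        }
        handled
    });

    // "API server": a final in-flight request is still enqueued while
    // draining; this is safe because the core service is still running.
    work_tx.send("drained-request").unwrap();
    order.push("api-shutdown");

    // Only now stop the core service (dropping the sender closes the queue),
    // then join it.
    drop(work_tx);
    let handled = core.join().unwrap();
    assert_eq!(handled, ["drained-request"]);
    order.push("core-stopped");
    order
}

fn main() {
    assert_eq!(run_shutdown(), ["api-shutdown", "core-stopped"]);
}
```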

peterhuene (Member) commented on May 18, 2023

Note that eventually the core service won't be part of warg-server, but instead another backend service that it talks to via IPC (i.e., a real distributed and reliable message queue) for processing new record entries; it's the limiting factor in why we can't spin up multiple warg-server instances right now.

kriswuollett (Contributor, Author)

I was merely suggesting making parallel shutdown possible; I did not point out a use case where it applies now. I'm well aware of the need to keep dependencies going, like the core service, while a user of them drains, like the API server.

We share the same view of the shutdown order, but what are your thoughts on using a CancellationToken to signal things to shut down, versus an explicit function call on a dependency? Note that I'm not suggesting a change in this PR to inject a token into CoreService itself; the handle.stop().await would remain. This helps a bit when/if draining needs to be handled before an actual shutdown.

crates/server/src/lib.rs (resolved)
crates/server/src/lib.rs (outdated; resolved)
Comment on lines +151 to +154
assert!(
self.endpoints.is_none(),
"cannot start server multiple times"
);
peterhuene (Member) commented on May 16, 2023

So I'm personally not a big fan of taking &mut self for this, as it leads to confusing internal state and to wrapping things in Option just so this method can take their contents.

What if we renamed this start method to spawn and in addition to returning the endpoints, we return a type wrapping both the JoinSet representing the spawned server's tasks and the CancellationToken that can be used to stop the server. Thus, the type could have a stop method on it to stop the server and also implement Future that a caller can await on to join the tasks in the set?
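One possible shape for such a handle, sketched with std threads standing in for the JoinSet and CancellationToken (all names here are hypothetical; the real type would also implement Future so callers can await it):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical handle returned by `spawn`: it owns the server task and the
// stop signal, so `Server` itself needs no `&mut self` bookkeeping.
struct ServerHandle {
    stop: Arc<AtomicBool>,
    task: thread::JoinHandle<&'static str>,
}

impl ServerHandle {
    // Signal shutdown and wait for the task to finish (analogous to
    // cancelling the token and then joining the task set).
    fn stop(self) -> &'static str {
        self.stop.store(true, Ordering::SeqCst);
        self.task.join().unwrap()
    }
}

fn spawn_server() -> ServerHandle {
    let stop = Arc::new(AtomicBool::new(false));
    let flag = stop.clone();
    let task = thread::spawn(move || {
        while !flag.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(1)); // "serve requests"
        }
        "stopped"
    });
    ServerHandle { stop, task }
}

fn main() {
    let handle = spawn_server();
    assert_eq!(handle.stop(), "stopped");
}
```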

Collaborator

Agree, I was just about to ask why required fields were becoming optional. I'd much rather have spawning produce a dedicated type that represents the info for a started service.

kriswuollett (Contributor, Author)

I was just trying to minimize the patch, since the existing code was &mut self, and I encountered issues like #112 while working on #111. I also wasn't sure why there was a struct called Server if it wasn't meant to be used after calling a start-like method.

Fields became Options in the config so that self wasn't partially moved. If I had continued in the same direction the PR was going, I'd have flattened the Config's members into Server and not kept a config member in the Server struct.

See other comment about providing a CancellationToken to chain a shutdown procedure rather than explicitly telling dependencies to shut down.

peterhuene (Member) commented on May 18, 2023

The existing bind method takes &mut self, but the former run method took self; both are now combined into your start method.

That said, I think we could better model our Server on hyper's Server: theirs is already a bound and listening server upon construction, so there doesn't need to be a start (or spawn) method.

That is, you construct/build a server and it internally binds based on its configuration; the binding can be inspected via a method. In hyper's case there is only a single bound address per server, but our implementation might have multiple bound addresses, as there might be multiple hyper servers internally (e.g. API, monitoring, etc.).

Finally, it implements Future, so that you simply await it for it to run to graceful completion.
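A minimal sketch of that shape using std::net (the actual implementation would hold tokio/hyper listeners and implement Future; the `api` field and `api_addr` method are illustrative names only):

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};

// Construction binds immediately, so there is no separate `start` step; the
// bound address can be inspected at any time afterwards. A real server might
// bind several internal listeners (API, monitoring, etc.).
struct Server {
    api: TcpListener,
}

impl Server {
    fn new() -> io::Result<Self> {
        Ok(Server {
            api: TcpListener::bind("127.0.0.1:0")?, // already listening
        })
    }

    fn api_addr(&self) -> io::Result<SocketAddr> {
        self.api.local_addr()
    }
}

fn main() -> io::Result<()> {
    let server = Server::new()?; // bound and listening upon construction
    println!("api listening on {}", server.api_addr()?);
    Ok(())
}
```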

kriswuollett (Contributor, Author)

Cool, yes. I'd follow patterns already used by existing servers, like hyper is here.

kriswuollett and others added 2 commits May 18, 2023 12:10
docs: fix code comments on server

Co-authored-by: Peter Huene <[email protected]>
Signed-off-by: Kristopher Wuollett <[email protected]>
Co-authored-by: Peter Huene <[email protected]>
Signed-off-by: Kristopher Wuollett <[email protected]>
kriswuollett (Contributor, Author) left a comment

Reviewed the comments, thanks! My primary pushback is on the method of controlling shutdown.


Successfully merging this pull request may close these issues:

- Cleanup: remove server partial move of self in run

3 participants