
Fix halfway-stopping of listeners #348

Open · wants to merge 4 commits into master from fix_half-stopped_listeners

Conversation

@juhlig (Contributor) commented Jun 25, 2024

Fixes #347.

The changes in this PR actually do two things:

  • by executing the stopping procedure in a separate process, a crash of the process calling ranch:stop_listener at an unfortunate time (namely, between the supervisor:terminate_child and supervisor:delete_child calls involved) should no longer prevent the stopping procedure from running to completion
  • by separating the fetching of the transport and transport options from the actual stopping, fully stopping a listener that was (for whatever reason) terminated but not deleted should now complete
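For illustration, the first point could be sketched roughly as follows. This is a sketch only, not the actual PR code; the child id {ranch_listener_sup, Ref} under ranch_sup follows ranch's supervision layout, but the function body and names are illustrative assumptions:

```erlang
%% Sketch: run terminate_child/delete_child in a dedicated process, so a
%% crash of the caller between the two calls cannot leave the listener
%% half-stopped. The caller monitors that process and waits for the result.
stop_listener(Ref) ->
    {Pid, MRef} = spawn_monitor(
        fun() ->
            ok = supervisor:terminate_child(ranch_sup, {ranch_listener_sup, Ref}),
            ok = supervisor:delete_child(ranch_sup, {ranch_listener_sup, Ref})
        end),
    receive
        {'DOWN', MRef, process, Pid, normal} -> ok;
        {'DOWN', MRef, process, Pid, Reason} -> {error, Reason}
    end.
```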

There are no tests yet (and for the former case, I don't see how we could reliably create the scenario of the calling process crashing at just the right time). I want to hear what you think about this approach first before investing more time.

* if the process calling ranch:stop_listener crashes before finishing,
  the stopping is still executed completely
* if a listener is terminated but not deleted, calling ranch:stop_listener
  removes the remnant
@zuiderkwast (Contributor)

Hi Jan!

In #347, @JeppeTh outlined a test case as an Erlang shell session. It involves supervisor:terminate_child to simulate the problem and put ranch in the problematic state.
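For readers following along, a shell session of the kind described might look like this (the listener name, port and protocol module are made up for the example):

```erlang
%% Put ranch into the problematic half-stopped state by hand:
1> {ok, _} = ranch:start_listener(example, ranch_tcp,
       #{socket_opts => [{port, 5555}]}, echo_protocol, []).
2> supervisor:terminate_child(ranch_sup, {ranch_listener_sup, example}).
ok
%% The listener is now terminated but not deleted ("half-stopped").
3> ranch:stop_listener(example).
%% Per issue #347, stopping such a remnant may not complete cleanly.
```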

@juhlig (Contributor, Author) commented Jun 25, 2024

> Hi Jan!

Hi Viktor, long time no see 😅

> In #347, @JeppeTh outlined a test case as an Erlang shell session. It involves supervisor:terminate_child to simulate the problem and put ranch in the problematic state.

Yes, that would be a blueprint for a test case for my second point, when a listener is already in the halfway-stopped state.

The hard-to-test part revolves around my first point, which should make it impossible to put a listener in that state even by "normal means", i.e. by crashing at unfortunate times =^^=

@juhlig (Contributor, Author) commented Jun 27, 2024

@essen? WDYT of the approach I used and the PR in general?

@essen (Member) commented Jun 28, 2024

Why isn't this done in ranch_server instead?

@juhlig (Contributor, Author) commented Jul 2, 2024

> Why isn't this done in ranch_server instead?

Sure, could be done, not a bad idea :)
It would be a first, though: up to now, ranch_server only does some bookkeeping on the listeners, i.e. it doesn't really do anything itself.

I'll make a draft to show to you, maybe tomorrow.

@juhlig (Contributor, Author) commented Jul 2, 2024

> I'll make a draft to show to you, maybe tomorrow.

... or maybe today =^^= WDYT?

@essen (Member) commented Jul 2, 2024

I would not move the stop logic to ranch_server but would instead add a way for it to execute a callback and return whatever the result is. That way we don't need to worry about the details in ranch_server and it can be reused in the future if necessary. But mostly to keep the stop code all in one place in ranch really.

Something like gen_server:call({rpc, M, F, A}). To be honest I wonder if erpc or something else can be used instead.
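The suggested mechanism could look roughly like this inside ranch_server (a sketch of the idea, not actual ranch code; do_stop_listener is a placeholder name):

```erlang
%% Sketch: ranch_server executes an arbitrary MFA on behalf of the caller
%% and returns whatever the result is. Fragment of a gen_server callback.
handle_call({rpc, M, F, A}, _From, State) ->
    {reply, apply(M, F, A), State};

%% The ranch module would then keep the stop logic and delegate, e.g.:
%%   gen_server:call(ranch_server, {rpc, ranch, do_stop_listener, [Ref]}).
```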

@juhlig (Contributor, Author) commented Jul 3, 2024

Ah, ok, so you mean leave the stopping code in the ranch module, but run it in the ranch_server process, instead of in a process spawned from the process calling stop_listener, correct?

The way you propose would enable anyone to run arbitrary code in the ranch_server process. I'm not sure that is a good thing; it gives me an uneasy feeling 🤷‍♂️
At the least, it would enable crashing the ranch_server process (e.g. {rpc, erlang, exit, [foo]}). We could wrap the apply in a try...catch to defend against that.
The given MFArgs could also stall ranch_server (e.g. {rpc, timer, sleep, [infinity]}). We could run the given MFArgs in a spawned process to defend against that, but then we could just as well spawn from the process calling ranch:stop_listener.
(Back in the good old days of IRC, I remember you saying something like "if users want to shoot themselves in the foot, we let them", so I'm just pointing it out.)
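The try...catch defense mentioned here could look roughly like this (sketch only; it guards against crashes but, as noted, not against a stalling MFA):

```erlang
%% Sketch: protect ranch_server from a crashing MFA with try...catch.
%% Fragment of a gen_server callback; names are illustrative.
handle_call({rpc, M, F, A}, _From, State) ->
    Reply = try
                apply(M, F, A)
            catch
                Class:Reason ->
                    {error, {Class, Reason}}
            end,
    {reply, Reply, State};
```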

erpc, by which I think you mean specifically erpc:call, hm. In a nutshell, what it does is a fancy spawn, which would make it a viable alternative to the ranch_server rpc thing you proposed.
On closer inspection, however, erpc:call takes a shortcut when called with the local node as the first argument: in that case it does a plain apply, running the given MFArgs in the context of the calling process. That leaves us with the same problem we have now: if the calling process exits at an awkward point (i.e., between the applied MFArgs calling terminate_child and delete_child), we end up in the same halfway-stopped listener scenario.
(I think this behavior can actually be called a bug; I'll probably create an issue with OTP about it later.)

@essen (Member) commented Jul 3, 2024

> Ah, ok, so you mean leave the stopping code in the ranch module, but run it in the ranch_server process, instead of in a process spawned from the process calling stop_listener, correct?

Yes.

> The way you propose would enable anyone to run any code in the ranch_server process, not sure if that is a good thing, it gives me an uneasy feeling though 🤷‍♂️

We can restrict the M and F to ranch and do_stop_listener or whatever the name is, for now.
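Restricting the allowed M and F could be as simple as matching on them in the clause head (sketch; do_stop_listener is the placeholder name from the comment above):

```erlang
%% Sketch: only allow the one MFA that ranch needs; reject everything else.
%% Fragment of a gen_server callback.
handle_call({rpc, ranch, do_stop_listener, Args}, _From, State) ->
    {reply, apply(ranch, do_stop_listener, Args), State};
handle_call({rpc, _, _, _}, _From, State) ->
    {reply, {error, forbidden}, State};
```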

@zuiderkwast (Contributor)

@essen What's the point of this extra abstraction? To me it looks like over-engineering.

Why not just add the simple protection (or workaround, if we consider it an OTP bug) for the fact that supervisor:terminate_child and supervisor:delete_child can't be called together as an atomic unit?

@essen (Member) commented Jul 3, 2024

We need to do it in a process that won't be the caller. It can either be a random process that we spawn specifically for this, or ranch_server, which already exists. Better the latter: it's easy to find, trace, debug and whatever else, and we don't have to reimplement a call mechanism, monitors, etc. like we would with a plain spawn. It's not fire-and-forget; we have to wait for the listener to be stopped before returning.

Not sure which part is over engineering.

@essen (Member) commented Jul 3, 2024

If erpc applies its optimisation only for the infinity timeout, then it could be used? Maybe use a large timeout, larger than the calls stop_listener makes.
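That would amount to something like the following call from ranch (sketch; the 5-minute timeout is an arbitrary "large" value and do_stop_listener is the placeholder name used earlier):

```erlang
%% Sketch: a finite timeout avoids erpc's local-node shortcut (assuming,
%% as discussed, the shortcut only applies with the infinity timeout),
%% so the MFA runs in a separate process even on the local node.
erpc:call(node(), ranch, do_stop_listener, [Ref], 300000).
```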

@zuiderkwast (Contributor)

Thanks, makes sense.

@juhlig force-pushed the fix_half-stopped_listeners branch from 1c3b4f7 to de6d4b6 on July 4, 2024 07:45
@juhlig force-pushed the fix_half-stopped_listeners branch from de6d4b6 to a2c959e on July 4, 2024 08:35
@essen (Member) commented Jul 4, 2024

Looks good, thanks! I'm off for the next two weeks but I'll look into merging and releasing a new version when I get back.

@juhlig (Contributor, Author) commented Jul 4, 2024

> Looks good, thanks! I'm off for the next two weeks but I'll look into merging and releasing a new version when I get back.

Have a nice vacation then 👋

@juhlig (Contributor, Author) commented Oct 21, 2024

@essen what about this PR? 3+ months have passed 😅

Successfully merging this pull request may close these issues.

Restart issues found when upgrading 1.8 -> 2.1.0