Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

peer:start_link/1 times out when the given node already exists #8930

Open
xxdavid opened this issue Oct 11, 2024 · 6 comments
Open

peer:start_link/1 times out when the given node already exists #8930

xxdavid opened this issue Oct 11, 2024 · 6 comments
Assignees
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM

Comments

@xxdavid
Copy link

xxdavid commented Oct 11, 2024

Describe the bug

When I try to start a peer node which already exists, the call times out, instead of immediately returning a clear error.

To Reproduce

$ erl -sname foo
Erlang/OTP 27 [erts-15.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1]

Eshell V15.1 (press Ctrl+G to abort, type help(). for help)
(foo@ROFL)1> peer:start_link(#{name => bar}).
{ok,<0.96.0>,bar@ROFL}
(foo@ROFL)2> peer:start_link(#{name => bar}).
** exception exit: timeout
     in function  peer:start_it/2 (peer.erl, line 929)
=ERROR REPORT==== 11-Oct-2024::14:02:56.698739 ===
** Generic server <0.96.0> terminating
** Last message in was {'EXIT',<0.94.0>,
                           {timeout,
                               [{peer,start_it,2,
                                    [{file,"peer.erl"},{line,929}]},
                                {erl_eval,do_apply,7,
                                    [{file,"erl_eval.erl"},{line,904}]},
                                {shell,exprs,7,
                                    [{file,"shell.erl"},{line,893}]},
                                {shell,eval_exprs,7,
                                    [{file,"shell.erl"},{line,849}]},
                                {shell,eval_loop,4,
                                    [{file,"shell.erl"},{line,834}]}]}}
** When Server state == {peer_state,#{name => bar},
                                    bar@ROFL,
                                    "/Users/David/.asdf/installs/erlang/27.1/erts-15.1/bin/erl",
                                    ["-sname","bar","-detached",
                                     "-peer_detached","-user","peer",
                                     "-origin",
                                     "g1h3CGZvb0BST0ZMAAAAYAAAAABnCRJw"],
                                    undefined,undefined,<<>>,running,
                                    {<0.94.0>,
                                     #Ref<0.2057912559.480509953.125513>},
                                    0,#{}}
** Reason for termination ==
** {timeout,[{peer,start_it,2,[{file,"peer.erl"},{line,929}]},
             {erl_eval,do_apply,7,[{file,"erl_eval.erl"},{line,904}]},
             {shell,exprs,7,[{file,"shell.erl"},{line,893}]},
             {shell,eval_exprs,7,[{file,"shell.erl"},{line,849}]},
             {shell,eval_loop,4,[{file,"shell.erl"},{line,834}]}]}
=CRASH REPORT==== 11-Oct-2024::14:02:56.699768 ===
  crasher:
    initial call: peer:init/1
    pid: <0.96.0>
    registered_name: []
    exception exit: {timeout,
                        [{peer,start_it,2,[{file,"peer.erl"},{line,929}]},
                         {erl_eval,do_apply,7,
                             [{file,"erl_eval.erl"},{line,904}]},
                         {shell,exprs,7,[{file,"shell.erl"},{line,893}]},
                         {shell,eval_exprs,7,[{file,"shell.erl"},{line,849}]},
                         {shell,eval_loop,4,[{file,"shell.erl"},{line,834}]}]}
      in function  gen_server:decode_msg/9 (gen_server.erl, line 2299)
    ancestors: [<0.94.0>,<0.93.0>,<0.76.0>,<0.71.0>,<0.75.0>,<0.70.0>,
                  kernel_sup,<0.47.0>]
    message_queue_len: 2
    messages: [{'EXIT',<0.94.0>,normal},{nodedown,bar@ROFL}]
    links: []
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 10958
    stack_size: 29
    reductions: 14648
  neighbours:

Expected behavior

I would expect a clear error, similar to what slave:start_link/2 returns.

$ erl -sname foo
Erlang/OTP 27 [erts-15.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1]

Eshell V15.1 (press Ctrl+G to abort, type help(). for help)
(foo@ROFL)1> {ok, Host} = inet:gethostname().
{ok,"ROFL"}
(foo@ROFL)2> slave:start_link(Host, bar).
{ok,bar@ROFL}
(foo@ROFL)3> slave:start_link(Host, bar).
{error,{already_running,bar@ROFL}}

Affected versions

I guess all OTP versions since OTP 25. I tested it on OTP 27.1 and OTP 26.2.1.

@xxdavid xxdavid added the bug Issue is reported as a bug label Oct 11, 2024
@max-au
Copy link
Contributor

max-au commented Oct 14, 2024

This is, indeed, a bug - it should immediately crash with this Reason (as tested [url=https://github.com/erlang/otp/blob/master/lib/stdlib/test/peer_SUITE.erl#L503]here[/url]):

** Reason for termination ==
** {{boot_failed,{exit_status,1}},
    [{peer,start_it,2,[{file,"peer.erl"},{line,922}]},
     {erl_eval,do_apply,7,[{file,"erl_eval.erl"},{line,904}]},

Somehow it works for standard_io but not normal distribution.

I'll take a look at it.

@garazdawi garazdawi self-assigned this Oct 14, 2024
@garazdawi garazdawi added the team:VM Assigned to OTP team VM label Oct 14, 2024
@max-au
Copy link
Contributor

max-au commented Oct 20, 2024

That turned to be more complex than I anticipated.
When the child node starts with default arguments (using TCP control channel), it starts detached, disconnecting standard IO ports. Hence exit_code sent from the port from such a child is always 0.

But later down the road the child exits (with code -1). There is no way for parent to know it, because the child had no chance to report it to parent (control connection over TCP could not be open, - the attempt is made after distribution startup).

Ideas I thought of:

  1. Check erl_epmd and see if other nodes with the requested name is running. Sounds really fragile and race-condition-prone.
  2. Start child node attached. This will likely break a lot of existing logic, and I'm hesitant to do that.
  3. Document this behaviour and make a stronger recommendation against creating named nodes, as it is often too error-prone, hence ?CT_PEER macro to tackle the unique naming.

While at it, I found a few minor issues (incorrect documentation around register option that has been removed before peer was released), so a PR is needed regardless of the decision made.

@garazdawi
Copy link
Contributor

I've always thought that it was a mistake to detach from the peer. Do you know of which type of usages would break if peer stays attached to the child node?

@max-au
Copy link
Contributor

max-au commented Oct 22, 2024

It's been a while... I vaguely recall something about proxy processes (namely, ssh, when starting peers on remote nodes the same way as slave did, because I needed peer to replace slave completely).
Even if it's no longer applicable, I can't imagine what are the other types of failures that may happen if that's changed. So everything below is my speculation. What happens when the process starts detached, where does output go, and what happens if input is requested? What would be the difference if we start attached?

I personally very much like standard_io control connection (which keeps stdio attached), and hence the bug was left unnoticed for a few years.

@garazdawi
Copy link
Contributor

hmm, yeah, proxy processes could be a problem.

When testing the shell using tmux we start erlang using tmux new-window -d -- erl, which will create a background erlang session. So in that scenario we don't want to keep the stdin connection open, or at least not consider it a failure if the port is closed with exit reason 0.

Maybe that is a good compromise? That is, for dist/tcp connection types, we stay attached to the port and forward all output to the group_leader, if the port exits with 0 then we ignore that and wait for the connection to close via dist/tcp. If it exits with a non-zero we stop peer with that reason. At the same time we flip the default for the detached option to be false?

and what happens if input is requested? What would be the difference if we start attached?

If you do io:get_line/1 in detached mode it will hang, just as if you do it in any other scenario an no-one enters any input, so I think we are safe to keep it attached.

@max-au
Copy link
Contributor

max-au commented Oct 23, 2024

for dist/tcp connection types, we stay attached to the port and forward all output to the group_leader, if the port exits with 0 then we ignore that and wait for the connection to close via dist/tcp.

IIRC with dist/tcp control channel output is already forwarded through dist/tcp. Standard output may be just discarded (to keep existing behaviour), and zero exit code discarded too.

Now what I don't know is how many test suites depend on -detached true specified. Inside OTP, probably, none or just a few. I can try making these changes and running all tests to see what breaks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM
Projects
None yet
Development

No branches or pull requests

3 participants