high broker load from steady stream of disconnect messages aimed at down level client #6362

Open
garlick opened this issue Oct 11, 2024 · 0 comments

@garlick
Member

garlick commented Oct 11, 2024

Problem: @grondo investigated high load on the rank 0 broker of a test cluster. A flux overlay trace revealed a steady stream of disconnect messages:

[  +8.042307]  tx * c disconnect 0 [0]
[  +8.042326]  tx * c disconnect 0 [0]
[  +8.042524]  tx * c disconnect 0 [0]
[  +8.042551]  tx * c disconnect 0 [0]
[  +8.042578]  tx * c disconnect 0 [0]
[  +8.042761]  tx * c disconnect 0 [0]
[  +8.042788]  tx * c disconnect 0 [0]
[  +8.043487]  tx * c disconnect 0 [0]
[  +8.043515]  tx * c disconnect 0 [0]
[  +8.043542]  tx * c disconnect 0 [0]
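To put a number on "steady stream", the trace timestamps can be turned into a message rate. The following is a minimal Python sketch, not part of flux-core: the function name `disconnect_rate` is hypothetical, and the line pattern (timestamp, direction, two single-token fields, then the topic) is inferred from the sample above rather than from a documented format.

```python
import re

# Pattern inferred from the sample trace lines in this issue, e.g.
#   "[  +8.042307]  tx * c disconnect 0 [0]"
# This is an assumption, not a documented output format.
TRACE_RE = re.compile(
    r"^\[\s*\+(?P<t>\d+\.\d+)\]\s+(?P<dir>tx|rx)\s+\S+\s+\S+\s+(?P<topic>\S+)"
)

# The ten lines quoted in this issue.
SAMPLE = [
    "[  +8.042307]  tx * c disconnect 0 [0]",
    "[  +8.042326]  tx * c disconnect 0 [0]",
    "[  +8.042524]  tx * c disconnect 0 [0]",
    "[  +8.042551]  tx * c disconnect 0 [0]",
    "[  +8.042578]  tx * c disconnect 0 [0]",
    "[  +8.042761]  tx * c disconnect 0 [0]",
    "[  +8.042788]  tx * c disconnect 0 [0]",
    "[  +8.043487]  tx * c disconnect 0 [0]",
    "[  +8.043515]  tx * c disconnect 0 [0]",
    "[  +8.043542]  tx * c disconnect 0 [0]",
]


def disconnect_rate(lines):
    """Return (count, messages_per_second) for transmitted disconnect lines."""
    stamps = []
    for line in lines:
        m = TRACE_RE.match(line.strip())
        if m and m.group("dir") == "tx" and m.group("topic") == "disconnect":
            stamps.append(float(m.group("t")))
    if len(stamps) < 2:
        # Not enough samples to measure an interval.
        return len(stamps), 0.0
    span = stamps[-1] - stamps[0]
    # N messages cover N-1 intervals between the first and last timestamp.
    rate = (len(stamps) - 1) / span if span > 0 else float("inf")
    return len(stamps), rate


if __name__ == "__main__":
    count, rate = disconnect_rate(SAMPLE)
    print(f"{count} disconnects, about {rate:.0f} per second")
```

Applied to the ten lines quoted above, this comes out to roughly 7,000 disconnect messages per second, although a sample this short gives only a rough estimate.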

A stack trace of the spinning broker showed it in zmq_send(), sending a CONTROL_DISCONNECT to a child from child_cb():

(gdb) where
#0  __GI___libc_write (nbytes=8, buf=0xffffeda873e0, fd=<optimized out>)
    at ../sysdeps/unix/sysv/linux/write.c:26
#1  __GI___libc_write (fd=<optimized out>, buf=0xffffeda873e0, nbytes=8)
    at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x0000ffff895ae8dc in ?? () from /lib/aarch64-linux-gnu/libzmq.so.5
#3  0x0000ffff895a56ec in ?? () from /lib/aarch64-linux-gnu/libzmq.so.5
#4  0x0000ffff895aba8c in ?? () from /lib/aarch64-linux-gnu/libzmq.so.5
#5  0x0000ffff895b5730 in ?? () from /lib/aarch64-linux-gnu/libzmq.so.5
#6  0x0000ffff895cb3f4 in zmq_send () from /lib/aarch64-linux-gnu/libzmq.so.5
#7  0x0000aaaac5ac7008 in zmqutil_msg_send_ex (sock=0xaaaad2d42a80, 
    msg=0xaaaad2dfbe60, nonblock=<optimized out>)
    at ../common/libzmqutil/msg_zsock.c:52
#8  0x0000aaaac5ab5fe8 in overlay_sendmsg_child (ov=0xaaaad2d39180, 
    msg=0xaaaad2dfbe60) at ./src/broker/overlay.c:805
#9  0x0000aaaac5ae6ad8 in overlay_control_child.constprop.0 (
    ov=0xaaaad2d39180, 
    uuid=0xaaaad2e1dfc0 "f982f794-27d4-464b-88f0-f41976ffdf24", status=0, 
    type=CONTROL_DISCONNECT) at ./src/broker/overlay.c:568
#10 0x0000aaaac5ab77e8 in child_cb (r=<optimized out>, w=<optimized out>, 
    revents=<optimized out>, arg=0xaaaad2d39180) at ./src/broker/overlay.c:1041
#11 0x0000aaaac5ac54a8 in check_cb (loop=0xffff896b24d8 <default_loop_struct>, 
    w=0xaaaad2d43f08, revents=<optimized out>)
    at ../common/libzmqutil/ev_zmq.c:79
#12 0x0000ffff89676504 in ev_invoke_pending (
    loop=0xffff896b24d8 <default_loop_struct>) at libev/ev.c:3770
#13 0x0000ffff8964f044 in ev_run (flags=0, loop=<optimized out>)
    at libev/ev.c:4190
#14 ev_run (flags=0, loop=<optimized out>) at libev/ev.c:4021
#15 flux_reactor_run (r=0xaaaad2d30f10, flags=flags@entry=0)
    at libflux/reactor.c:124
#16 0x0000aaaac5aadb08 in main (argc=<optimized out>, argv=<optimized out>)
    at ./src/broker/broker.c:529

There was a downrev broker in the system:

 grondo@pi0:~$ flux version
commands:    		0.58.0-92-g8d24e946f
libflux-core:		0.58.0-92-g8d24e946f
libflux-security:	0.10.0
build-options:		+systemd+hwloc==2.4.0+zmq==4.3.4

That broker's logs were filled with:

 DROP upstream control topic - : message received before hello handshake

Stopping the downrev broker made the high load stop.

Restarting the broker did not make the high load return.
