better error message for rdma connection error #1474

Open
baallan opened this issue Oct 22, 2024 · 6 comments

@baallan
Collaborator

baallan commented Oct 22, 2024

Attempting to connect to an rdma ldmsd port as a user other than root needs a better error message. Currently:

ldms_ls -x rdma -h c1x1-ib0 -a munge -p 412
ZAP_RDMA [1729640099.834756914](63868) __rdma_buffer_pool_alloc:645 ibv_reg_mr() failed, errno: 12
Connection failed/rejected.

When handling the failed/rejected case, it should check the current UID and, if uid != 0, print a hint that the user must have privileges (or be root) to connect over rdma.

If there actually is a way for a non-root process to connect to a remote root-owned daemon via rdma, then we need to make it work by default. (I'm guessing there isn't.)

@morrone
Collaborator

morrone commented Oct 23, 2024

In the client, we wouldn't necessarily know that the other end only allows root, right?

I think if the connection fails for an authentication reason, what we really should have is something like an "Authentication failed" message, rather than assuming things about the reason for the authentication failure. Knowing that the connection was denied for authentication reasons would be much more helpful than "Connection failed/rejected", and would likely put most people on the right path to addressing the problem.

@baallan
Collaborator Author

baallan commented Oct 23, 2024

The root cause here is that the local device cannot be opened by the transport on the client; it is not a server-end issue.

@morrone
Collaborator

morrone commented Oct 23, 2024

Are you certain that the problem is entirely on the client side? Maybe it is on your system, I don't know.

But non-root processes absolutely can connect to remote root-owned daemons via rdma. And as long as your non-root user is in the uid or gid of the metric set, you'll be able to see the metric set. So since that does work, putting out a blanket statement about using root doesn't seem appropriate.

If you are correct that your issue is entirely client side, then I would agree that some better error message is in order, but maybe not the one you are suggesting.

@baallan
Collaborator Author

baallan commented Oct 24, 2024

@morrone If I elevate first on the client side, then all goes smoothly with the rdma connection. If I configure/use sock instead of rdma, then all goes smoothly. This is about not even being able to start the connection; auth and permissions on the server end aren't yet involved when it fails.

@tom95858
Collaborator

@morrone, @baallan I think the problem is that the ulimit set for the user isn't large enough for the requested ibv_reg_mr, which is going to pin the memory. When you changed to root, it changed the ulimit and it worked. The limit in question is "max locked memory". @baallan, perhaps you should confirm my thesis and then submit a pull request with a "better" error message.
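
For illustration only, here is a minimal sketch of what such a "better" message could look like. It assumes the failure path has the errno from ibv_reg_mr() and the requested size available. The function name reg_mr_failure_hint and the message wording are hypothetical and not part of LDMS or zap. errno 12 is ENOMEM on Linux, and the sketch only adds the locked-memory hint in that case.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not an LDMS/zap function): format a hint for a
 * failed memory registration. `err` is the errno seen after ibv_reg_mr()
 * failed; `req_bytes` is the size that was being registered.
 * Returns what snprintf() returns (the length of the formatted message).
 */
int reg_mr_failure_hint(int err, size_t req_bytes, char *buf, size_t len)
{
	struct rlimit rl;

	/* Only ENOMEM is plausibly related to the locked-memory limit. */
	if (err != ENOMEM)
		return snprintf(buf, len, "memory registration failed: %s",
				strerror(err));

	/* No hint if the limit cannot be read or is unlimited. */
	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0 ||
	    rl.rlim_cur == RLIM_INFINITY)
		return snprintf(buf, len,
				"memory registration of %zu bytes failed: %s",
				req_bytes, strerror(err));

	/* Report the soft limit in kB, the unit 'ulimit -l' uses in bash. */
	return snprintf(buf, len,
			"memory registration of %zu bytes failed: %s; "
			"the max locked memory limit is %llu kB "
			"(see 'ulimit -l'), which may be too small",
			req_bytes, strerror(err),
			(unsigned long long)(rl.rlim_cur / 1024));
}
```

The hint is worded as a possibility ("may be too small") because ENOMEM from a registration call can have other causes, and the sketch does not mention root at all, in line with the objection above to a blanket statement about using root.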

@baallan
Collaborator Author

baallan commented Oct 24, 2024

@tom95858 some testing results as a regular user:

# ulimit -l
64

# ldms_ls -m 64k $ibhost -x rdma -p 412 -a munge
ZAP_RDMA [1729779340.041355232](195278) __rdma_buffer_pool_alloc:645 ibv_reg_mr() failed, errno: 12
Connection failed/rejected.
# ldms_ls -m 32k $ibhost -x rdma -p 412 -a munge
ZAP_RDMA [1729779346.736221843](195379) __rdma_buffer_pool_alloc:645 ibv_reg_mr() failed, errno: 12
Connection failed/rejected.

# ulimit -l 256
-bash: ulimit: max locked memory: cannot modify limit: Operation not permitted

I'll see if I can better test your thesis when I can get in as root and lower the ulimit to provoke the same error.
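
As a rough way to see the "max locked memory" mechanism without RDMA hardware, the sketch below lowers the soft RLIMIT_MEMLOCK and then tries to pin a larger buffer with mlock(). This is an analogy under stated assumptions: mlock() stands in for the pinning that ibv_reg_mr() is described as doing above, and try_mlock_under_limit is a hypothetical name, not LDMS code. On Linux, an unprivileged process would be expected to get ENOMEM when the request exceeds the limit, while a process with CAP_IPC_LOCK (typically root) is exempt from the limit.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>

/*
 * Hypothetical test helper: set the soft RLIMIT_MEMLOCK to `limit_bytes`
 * (clamped to the hard limit), then try to mlock() `lock_bytes`.
 * Returns 0 if the lock succeeded, otherwise the errno of the failing call.
 * Note: the lowered soft limit stays in effect for the calling process.
 */
int try_mlock_under_limit(size_t limit_bytes, size_t lock_bytes)
{
	struct rlimit rl;
	void *buf;
	int rc = 0;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return errno;

	/* The soft limit may be lowered freely but cannot exceed the
	 * hard limit. */
	if (rl.rlim_max != RLIM_INFINITY && limit_bytes > rl.rlim_max)
		limit_bytes = rl.rlim_max;
	rl.rlim_cur = limit_bytes;
	if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return errno;

	buf = malloc(lock_bytes);
	if (buf == NULL)
		return ENOMEM;

	/* Pinning more than the limit allows is expected to fail here for
	 * an unprivileged process. */
	if (mlock(buf, lock_bytes) != 0)
		rc = errno;
	else
		munlock(buf, lock_bytes);

	free(buf);
	return rc;
}
```

Calling try_mlock_under_limit(64 * 1024, 1024 * 1024) mirrors the 64 kB limit shown above. A nonzero result from a regular user alongside a zero result from root would be consistent with the limit being the cause, though it does not exercise the RDMA path itself.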
