Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ldmsd aggregator dies when a hostname of a sampler node can't be resolved #1489

Open
johnstile opened this issue Oct 29, 2024 · 11 comments
Open
Milestone

Comments

@johnstile
Copy link

johnstile commented Oct 29, 2024

ldmsd aggregator dies when a hostname of a sampler node can't be resolved

ldmsd_prdcr_new: my_bad_hostname:6002 not resolved.
msg_no 4654: error 97: Error resolving hostname 'my_bad_hostname:'

I would like for ldmsd to continue working and simply kick the bad host out of the list.

Version: ovis-4.4.2

@baallan
Copy link
Collaborator

baallan commented Oct 30, 2024

@johnstile given that in some scenarios dns failures/network failures are temporary, why kick it out of the list instead of simply retrying later?

I agree that crashing is not an option.

@baallan baallan added this to the v4.4.5 milestone Oct 30, 2024
@evanmcc
Copy link

evanmcc commented Oct 30, 2024

Our failures tend to be permanent (at least over the span of the process lifetime), but I agree that it might be better to keep it in the list and keep trying.

We started working on a simple patch to skip the config line in the case of that particular error, but it's kind of unsatisfying as that error code might be returned for some related but not identical reason.

@johnstile
Copy link
Author

johnstile commented Oct 30, 2024

The evaluation of the config file is implemented in a way where the config file is not re-read after the initialization (unlike projects like samba).

This has the benefit that the vulnerability only exists when ldmsd starts, and a dns issue after initialization will leave ldmsd running, and it will try again, as recommend.

I would be satisfied if the ldmsd would simply skip the hostname check.

I have a dumb patch so that we can survive right now, but it is a hack.

From ldms/src/ldmsd/ldmsd_config.c, in __process_config_file, when xprt.rsp_err == 97, it calls goto next_req;

I'm not sure if this is the best error code to return from this function.

Error code 97 - Address family not supported

@jstile-lbl
Copy link

Created a pull request

#1491

@tom95858
Copy link
Collaborator

@jstile-lbl I don't think we want to skip the request. I think we want it to add the producer so that the normal producer retry logic will attempt to reconnect. The issue here is that the resolution of the host name is happening at config time, not at connect time. To really fix this, I think we need to move the resolution logic to the connect path. @nichamon, can you take a quick look at that?

@nichamon
Copy link
Collaborator

@nichamon, can you take a quick look at that?

Yes, I'm wrapping up my current item, and this will be next on my list.

@jstile-lbl
Copy link

Since we aren't going to exit when the hostname can't be resolved, would the solution be to not run the hostname check?

@nichamon
Copy link
Collaborator

@jstile-lbl We still want to validate hostnames, but we're going to move this check from config parsing time to the connection phase before connecting to the client. This means ldmsd won't fail during startup due to unresolvable hostnames. In the connection phase, ldmsd will print an error message if the hostname cannot be resolved and try to resolve the hostname again at the next reconnect interval.

@tom95858
Copy link
Collaborator

@jstile-lbl We have to convert (resolve) a "dns-name" to an ip-address. It is something that we have to do at some point or we can't connect.

So what we are discussing doing is moving that resolution (hostname --> Ip-address) from the configuration path to the connect path. This would delay the notice of silly hostname errors, but improve resilience to DNS failures. We might do both, i.e. log a warning at configuration time, and then log errors at connect time.

@tom95858
Copy link
Collaborator

@jstile-lbl, @nichamon just to be clear...it's not a "validation", it's a conversion/resolution. It has to be done or we can't connect.

@baallan
Copy link
Collaborator

baallan commented Oct 31, 2024

@tom95858 @nichamon can we leave a 'check and warn' of hostnames at parse time so that issues are discovered sooner at startup? I agree that rewarning at connect time, but please let's not have it warn on every retry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants