Bump up oximeter redundancy to 3 perhaps? #6900
Comments
I think this is trickier than it sounds; by design, starting more oximeters wouldn't really be redundant:
Multiple oximeters would help distribute load. I'm not sure what information we'd need to go from "oximeter is the heaviest consumer" to "oximeter's consumption is too heavy and needs to be split".
Despite the challenges @jgallagher mentioned, "high-availability metrics collection" is a reasonable product goal. If this becomes a priority, the first step may be a discussion about approaches and tradeoffs. For example, I could see having multiple collectors assigned to each producer, with a way of deduplicating data in ClickHouse (e.g., an implicit field on all metrics for "the oximeter that collected it"). It's a little tricky to deal with this on the querying side, but I think that problem may be intrinsic, and again we can work through various approaches and their tradeoffs. From what I understand, this isn't currently a priority.
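To make the dedup idea concrete, here is a minimal sketch of query-side deduplication, not how oximeter or ClickHouse handle it today. The `Sample` type, its field names, and the `collector_id` field are hypothetical, and the sketch assumes that redundant collectors record the same producer-side timestamp for the same sample, which may not hold in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timeseries: str    # hypothetical timeseries key
    timestamp: float   # assumed producer-side sample time, identical across collectors
    value: float
    collector_id: str  # hypothetical implicit field: the oximeter that collected it


def dedup(samples):
    """Keep exactly one sample per (timeseries, timestamp) pair."""
    chosen = {}
    for s in samples:
        key = (s.timeseries, s.timestamp)
        # Deterministic tie-break: prefer the smallest collector id, so the
        # result does not depend on the order rows are read back.
        if key not in chosen or s.collector_id < chosen[key].collector_id:
            chosen[key] = s
    return sorted(chosen.values(), key=lambda s: (s.timeseries, s.timestamp))
```

If the timestamps differ between collectors, an exact-match key like this no longer works, which is part of why the querying side is tricky.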
I agree with most of the points here. I think it's definitely important to have fault-tolerant metrics collection, and that we also haven't seen much reason to believe
I like this framing. Answering two questions seems useful: "How close is the single
We also want to be clear about whether the product goal is "horizontal scalability" or "high availability". These are different paths in general but especially with Oximeter.
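One way to see the difference between the two goals is a toy assignment function, sketched here with rendezvous hashing. This is an illustration only, not how producers are actually assigned to collectors; the `assign` function and its `replicas` parameter are hypothetical.

```python
import hashlib


def assign(producer_id, collectors, replicas=1):
    """Pick `replicas` collectors for a producer via rendezvous hashing."""
    def score(collector):
        # Hash the (collector, producer) pair; the ordering is stable for a
        # given producer and changes little when collectors come or go.
        return hashlib.sha256(f"{collector}:{producer_id}".encode()).hexdigest()

    return sorted(collectors, key=score)[:replicas]
```

With `replicas=1` and several collectors, each producer still has a single collector, so load is spread out but losing a collector loses its producers' data until reassignment (horizontal scalability). With `replicas` greater than 1, every producer is collected more than once, which tolerates a collector failure but multiplies collection traffic and requires deduplication (high availability).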
Oximeter is one of the few services that has no redundancy in the current provisioning policy. Metrics haven't been considered mission-critical so far because they weren't previously exposed to users, and they are still only available in experimental mode via OxQL. But as customers start to consume the data for monitoring purposes, service availability will become more important than before.
Besides redundancy, distributing the metrics collection across different sleds will also help balance the network traffic load across those sleds. The sled_data_link:bytes_sent|received metrics on rack2 show that oximeter is the heaviest consumer of network bandwidth among all the non-crucible control plane services.