Bump up oximeter redundancy to 3 perhaps? #6900

Open
askfongjojo opened this issue Oct 18, 2024 · 4 comments

@askfongjojo

askfongjojo commented Oct 18, 2024

Oximeter is one of the few services that has no redundancy in the current provisioning policy. Metrics haven't been considered mission-critical so far because they weren't previously exposed to users and are still in experimental mode via OxQL. But as customers start to consume the data for monitoring purposes, service availability will become more important than before.

Besides redundancy, distributing metrics collection across different sleds would also help balance the network traffic load among them. The sled_data_link:bytes_sent|received metrics on rack2 show that oximeter is the heaviest consumer of network bandwidth among all the non-crucible control plane services.

@jgallagher
Contributor

I think this is trickier than it sounds; by design, starting more oximeters wouldn't really be redundant:

  • For any given metric producer, Nexus chooses one oximeter collector to assign. If there are multiple oximeters Nexus will distribute the producers among them, but if one of those oximeters goes down, we'll lose all the metrics from producers that were assigned to it until it comes back.
  • We do support reassigning producers to a different collector, but only if the oximeter they were assigned to has been expunged. If we had multiple oximeters, expungement of a bad one wouldn't happen until a support window, at which point we're in a position to restore service whether or not there are multiple oximeters.
  • If we allow multiple oximeters to exist, we open the door to multiple oximeters collecting from the same producer at the same time and recording both samples in clickhouse. This shouldn't happen often, if at all (since collectors shouldn't overlap on producers), but the last time Ben and I chatted about multiple oximeters, expungement, reassignment, etc., it seemed that, depending on implementation details, there might be time periods where double collection is possible. (This was a very informal discussion; it's certainly possible we could be more precise with some more investigation.)
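
To make the first two points concrete, here is a minimal Python sketch of single-collector assignment. It is an illustration only, not Nexus code: the round-robin policy, the function names, and the sled/oximeter identifiers are all assumptions, since the thread does not say how Nexus actually picks a collector.

```python
def assign_producers(producers, collectors):
    """Give each producer exactly one collector.

    Round-robin is a hypothetical policy chosen for illustration.
    """
    if not collectors:
        raise ValueError("need at least one collector")
    return {p: collectors[i % len(collectors)] for i, p in enumerate(producers)}


def producers_without_collection(assignment, down):
    """Producers whose metrics are lost while the collectors in `down` are unavailable."""
    return sorted(p for p, c in assignment.items() if c in down)


def reassign_after_expungement(assignment, expunged, collectors):
    """Move producers off an expunged collector onto the remaining ones."""
    remaining = [c for c in collectors if c != expunged]
    if not remaining:
        raise ValueError("no collectors left to take over")
    orphaned = sorted(p for p, c in assignment.items() if c == expunged)
    updated = dict(assignment)
    for i, producer in enumerate(orphaned):
        updated[producer] = remaining[i % len(remaining)]
    return updated


if __name__ == "__main__":
    collectors = ["oximeter-a", "oximeter-b", "oximeter-c"]  # hypothetical names
    producers = [f"sled-{n}" for n in range(6)]              # hypothetical names
    assignment = assign_producers(producers, collectors)
    # While oximeter-b is down (but not expunged), its producers go uncollected.
    print(producers_without_collection(assignment, {"oximeter-b"}))
    # Only after expungement are those producers handed to another collector.
    print(reassign_after_expungement(assignment, "oximeter-b", collectors))
```

In this model, adding collectors spreads the producers out but does not add redundancy: each producer still has a single point of failure until its collector is expunged.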

Multiple oximeters would help distribute load. I'm not sure what information we'd need to go from "oximeter is the heaviest consumer" to "oximeter's consumption is too heavy and needs to be split".

@davepacheco
Collaborator

Despite the challenges @jgallagher mentioned, "high-availability metrics collection" is a reasonable product goal. If this becomes a priority, the first step may be a discussion about approaches and tradeoffs. For example, I could see having multiple collectors assigned to each producer with a way of dedup'ing data in Clickhouse (e.g., an implicit field on all metrics for "the oximeter that collected it"). It's a little tricky to deal with this on the querying side, but I think that problem may be intrinsic, and again we can work through various approaches and their tradeoffs. From what I understand, this isn't currently a priority.
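
One way the "implicit collector field" idea might look at query time is sketched below in Python. Everything here is hypothetical: the `Sample` shape, the `collector` field name, and the selection rule (per timeseries, keep only the samples from the collector that recorded the most, with ties broken by collector name) are assumptions for illustration, not a description of oximeter or its ClickHouse schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    producer: str
    timeseries: str
    timestamp: int   # collection time, arbitrary units
    value: float
    collector: str   # hypothetical implicit field: which oximeter recorded it


def dedup_by_collector(samples):
    """For each (producer, timeseries), keep samples from a single collector.

    The collector with the most samples wins; ties go to the
    lexicographically smallest collector name.
    """
    by_series = defaultdict(lambda: defaultdict(list))
    for s in samples:
        by_series[(s.producer, s.timeseries)][s.collector].append(s)

    result = []
    for key in sorted(by_series):
        per_collector = by_series[key]
        best = min(per_collector, key=lambda c: (-len(per_collector[c]), c))
        result.extend(sorted(per_collector[best], key=lambda s: s.timestamp))
    return result
```

Choosing one collector per series sidesteps the fact that two collectors would sample the same producer at slightly different times, so their timestamps would not line up. The cost is that a gap in the chosen collector's stream is not filled from the other one; stitching streams together is the harder querying problem the comment alludes to.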

@bnaecker
Collaborator

I agree with most of the points here. I think it's definitely important to have fault-tolerant metrics collection, and that we also haven't seen much reason to believe oximeter cannot keep up in the current design.

> Multiple oximeters would help distribute load. I'm not sure what information we'd need to go from "oximeter is the heaviest consumer" to "oximeter's consumption is too heavy and needs to be split".

I like this framing. Answering two questions seems useful: "How close is the single oximeter to not being able to handle its load?" and "To what extent is oximeter's traffic impacting other services on the sled?" That might help us nail down the priority of this work.

@davepacheco
Collaborator

We also want to be clear about whether the product goal is "horizontal scalability" or "high availability". These are different paths in general but especially with Oximeter.
