Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: publish trace decisions through redis pubsub #1362

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

VinozzZ
Copy link
Contributor

@VinozzZ VinozzZ commented Oct 3, 2024

Which problem is this PR solving?

When a sampling decision is made, the decider node needs to publish the decision alongside all metadata related to that decision to all peers.

Short description of the changes

  • introduce a new TraceDecision type to capture all information needed
  • separate redistributeNotifier into its own file to reduce the length of the collect.go file
  • break up send() into two functions makeDecision and send
  • create a new trace_decision pubsub topic for transferring trace decisions
  • trace decision messages are multiplexed on the same pubsub topic through prefix drop and kept
  • refactor MockSharder so that we can test trace ownership easier

Copy link
Contributor

@MikeGoldsmith MikeGoldsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some feedback, looks like it's going in the right direction.

collect/collect.go Outdated Show resolved Hide resolved
collect/collect.go Show resolved Hide resolved
collect/collect.go Show resolved Hide resolved
err error
)
if tr.Kept() {
// TODO: late span in this case won't get HasRoot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this okay? If a trace times out without receiving the root span, a decision is made without the root and the decision is cached. Root and non-root spans should then still honour the decision, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision is still honored. The late span just won't be decorated with metadata like span count and the it won't increase trace_send_got_root metric

KeptReason: keptReason,
SampleRate: tr.Rate(),
Count: uint32(tr.SpanCount()),
EventCount: uint32(tr.SpanEventCount()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's annoying we have to cast this type, would be nicer if they both uint32's

}
toDelete := generics.NewSet[string]()
for _, td := range tds {
trace := i.cache.Get(td.TraceID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should pull the trace just to check if it exists. Trying to remove items from the cache that don't exist should be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. My intention is to retrieve the trace object from the cache before processing it in the send() function. If it doesn't exist in the cache, we don't need to make sampling decision for it

collect/collect.go Outdated Show resolved Hide resolved
triggered chan struct{}
once sync.Once
}
i.sampleTraceCache.Record(trace, td.Kept, td.KeptReason)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this cache both for drop and kept traces?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this cache can be renamed to something like TraceDecisionCache to make it more clear

// Calculate the backoff interval using exponential backoff with a base time.
backoff := time.Duration(math.Min(float64(lastBackoff)*2, float64(r.maxDelay)))
// Add jitter to the backoff to avoid retry collisions.
jitter := time.Duration(rand.Float64() * float64(backoff) * 0.5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter if all nodes coming online may end up with the same value? I've send backoff and jitter values use a randomness components to try to prevent two processes polling at the same time.

switch data[0] {
case droppedPrefix:
for _, traceID := range strings.Split(data[1], ",") {
td = append(td, TraceDecision{TraceID: traceID})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the trace decision indicate kept = false here?

@VinozzZ VinozzZ marked this pull request as ready for review October 11, 2024 22:17
@VinozzZ VinozzZ requested a review from a team as a code owner October 11, 2024 22:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants