Add AlertLifeCycleObserver that allows consumers to hook into Alert life cycle #3461

Open

emanlodovice wants to merge 1 commit into main from alert-observer

Conversation

emanlodovice (Contributor):

What this pull request does

This pull request introduces a new AlertLifeCycleObserver interface that is accepted by the API, the Dispatcher, and the notification pipeline. The interface contains methods that allow tracking what happens to an alert inside Alertmanager.
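
For illustration only, here is a minimal sketch of what such an observer interface could look like. The method names are assumptions based on the call sites discussed in the review below (and the PR later moves to a single generic Observe method), so treat this as a sketch rather than the exact interface:

```go
package alertobserver

import "github.com/prometheus/alertmanager/types"

// AlertLifeCycleObserver is a sketch only; the method names are illustrative.
type AlertLifeCycleObserver interface {
	// Rejected is called when alerts are dropped before being stored,
	// e.g. because validation failed or the store returned an error.
	Rejected(reason string, alerts ...*types.Alert)
	// AddedToAggrGroup is called when the dispatcher inserts an alert
	// into an aggregation group.
	AddedToAggrGroup(alerts ...*types.Alert)
	// Notified is called once a notification attempt has finally
	// succeeded or failed.
	Notified(integration string, err error, alerts ...*types.Alert)
}
```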

Motivation

When a customer complains “I think my alert is delayed”, we currently have no straightforward way to troubleshoot. At minimum, we should be able to quickly identify if the problem is post-notification (we sent to the receiver on time but the receiver has some delay) or pre-notification.

By introducing a new interface that allows hooking into the alert life cycle, consumers of the Alertmanager package would be able to implement whatever observability solution works best for them.

@emanlodovice emanlodovice force-pushed the alert-observer branch 5 times, most recently from 50b60d1 to b284f53 Compare August 16, 2023 18:34
@emanlodovice emanlodovice marked this pull request as ready for review August 16, 2023 18:40
api/v1/api.go Outdated
@@ -447,6 +451,9 @@ func (api *API) insertAlerts(w http.ResponseWriter, r *http.Request, alerts ...*
	if err := a.Validate(); err != nil {
		validationErrs.Add(err)
		api.m.Invalid().Inc()
		if api.alertLCObserver != nil {
			api.alertLCObserver.Rejected("Invalid", a)
Contributor:

Can we change "Invalid" to the actual error?

Contributor Author:

updated

api/v1/api.go Outdated
@@ -456,8 +463,14 @@ func (api *API) insertAlerts(w http.ResponseWriter, r *http.Request, alerts ...*
		typ: errorInternal,
		err: err,
	}, nil)
	if api.alertLCObserver != nil {
		api.alertLCObserver.Rejected("Failed to create", validAlerts...)
Contributor:

Why is this rejecting?

Contributor Author:

This is when alerts.Put fails. Since we don't end up recording the alert, I considered it as rejected.

@@ -153,6 +154,20 @@ func TestAddAlerts(t *testing.T) {
	body, _ := io.ReadAll(res.Body)

	require.Equal(t, tc.code, w.Code, fmt.Sprintf("test case: %d, StartsAt %v, EndsAt %v, Response: %s", i, tc.start, tc.end, string(body)))

	observer := alertobserver.NewFakeAlertLifeCycleObserver()
Contributor:

nit: maybe create a separate test case?

Contributor Author:

updated

	}
	require.Equal(t, 1, len(recorder.Alerts()))
	require.Equal(t, inputAlerts[0].Fingerprint(), observer.AggrGroupAlerts[0].Fingerprint())
	o, ok := notify.AlertLCObserver(dispatcher.ctx)
Contributor:

Can we create a fake observer that, for example, increments a counter, and then verify that the observer's function gets called?

Contributor Author:

Yes, we already do that. On line 598 we create a fake observer and on line 616 we verify that the function was called by checking the recorded alert.
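
For readers following along, a fake observer of this kind is typically just a recorder. Below is a hedged sketch with field names guessed from the test snippets in this thread; the PR's actual test helper may differ:

```go
package alertobserver

import "github.com/prometheus/alertmanager/types"

// FakeAlertLifeCycleObserver is a sketch of a test double that records
// the alerts it is handed so tests can assert on them afterwards.
type FakeAlertLifeCycleObserver struct {
	RejectedAlerts  []*types.Alert
	AggrGroupAlerts []*types.Alert
}

func NewFakeAlertLifeCycleObserver() *FakeAlertLifeCycleObserver {
	return &FakeAlertLifeCycleObserver{}
}

func (f *FakeAlertLifeCycleObserver) Rejected(reason string, alerts ...*types.Alert) {
	f.RejectedAlerts = append(f.RejectedAlerts, alerts...)
}

func (f *FakeAlertLifeCycleObserver) AddedToAggrGroup(alerts ...*types.Alert) {
	f.AggrGroupAlerts = append(f.AggrGroupAlerts, alerts...)
}
```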

	d.ctx, d.cancel = context.WithCancel(context.Background())
	ctx := context.Background()
	if d.alertLCObserver != nil {
		ctx = notify.WithAlertLCObserver(ctx, d.alertLCObserver)
Contributor:

Should we put the observer into the stages rather than in the ctx?

Contributor Author:

You mean pass it as one of the arguments in the Exec call instead of adding it to the context?
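
For context, the helper names WithAlertLCObserver and AlertLCObserver appear in the diff above; the implementation below is only a sketch of how such context helpers are commonly written, not the PR's exact code:

```go
package notify

import (
	"context"

	"github.com/prometheus/alertmanager/alertobserver"
)

// keyAlertLCObserver is an unexported key type so the observer value
// cannot collide with other context values.
type keyAlertLCObserver struct{}

// WithAlertLCObserver stores the observer in the context so that the
// notification pipeline stages can retrieve it later.
func WithAlertLCObserver(ctx context.Context, o alertobserver.AlertLifeCycleObserver) context.Context {
	return context.WithValue(ctx, keyAlertLCObserver{}, o)
}

// AlertLCObserver returns the observer stored in the context, if any.
func AlertLCObserver(ctx context.Context) (alertobserver.AlertLifeCycleObserver, bool) {
	o, ok := ctx.Value(keyAlertLCObserver{}).(alertobserver.AlertLifeCycleObserver)
	return o, ok
}
```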

@grobinson-grafana (Contributor):

This is great! I've been thinking about doing something similar, for the exact reasons mentioned:

when a customer complains “I think my alert is delayed”, we currently have no straightforward way to troubleshoot. At minimum, we should be able to quickly identify if the problem is post-notification (we sent to the receiver on time but the receiver has some delay) or pre-notification.

"github.com/prometheus/alertmanager/types"
)

type AlertLifeCycleObserver interface {
Contributor:

Instead of having a large interface with a method per event, have you considered having a generic Observe method that accepts metadata?

For example:

Suggested change:

-type AlertLifeCycleObserver interface {
+type LifeCycleObserver interface {
+	Observe(event string, alerts []*types.Alert, meta Metadata)
+}

The metadata could be something as simple as:

type Metadata map[string]interface{}
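
To make the shape of a call site concrete, the producer side could then look roughly like the sketch below; the observeRejected helper and the event string are invented for the example:

```go
package alertobserver

import "github.com/prometheus/alertmanager/types"

// LifeCycleObserver and Metadata restate the suggestion above.
type LifeCycleObserver interface {
	Observe(event string, alerts []*types.Alert, meta Metadata)
}

type Metadata map[string]interface{}

// observeRejected is a hypothetical call site showing how code inside
// Alertmanager could emit a life cycle event with some metadata.
func observeRejected(o LifeCycleObserver, err error, alerts ...*types.Alert) {
	if o == nil {
		return
	}
	o.Observe("rejected", alerts, Metadata{"reason": err.Error()})
}
```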

Member:

Agreed, I'm not a fan of large interfaces either.

Contributor Author:

Sure, I can update the code as suggested. Thanks for checking 🙇

Contributor Author:

updated 🙇

@simonpasquier (Member) left a comment:

I'm not 100% sure I understand how it would be used outside of prometheus/alertmanager. Can you share some code?
Also, though it's not exactly the same, I wonder whether we shouldn't implement tracing inside Alertmanager to provide this visibility into "where's my alert?".

"github.com/prometheus/alertmanager/types"
)

type AlertLifeCycleObserver interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I'm not a fan of large interfaces either.

@emanlodovice (Contributor Author):

> I'm not 100% sure I understand how it would be used outside of prometheus/alertmanager. Can you share some code? Also, though it's not exactly the same, I wonder whether we shouldn't implement tracing inside Alertmanager to provide this visibility into "where's my alert?".

The use case we are thinking of is simply adding logs for these events. It effectively becomes an alert history that we can query when a customer comes in. We would like the flexibility to decide how we collect and format the logs and how we will store them.
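
As a hedged sketch of that kind of consumer, a log-based observer could be as small as the following; the type and the exact Observe signature are assumptions based on the generic interface discussed above:

```go
package alerthistory

import (
	"log"

	"github.com/prometheus/alertmanager/types"
)

// historyObserver sketches the "alert history via logs" idea: every life
// cycle event becomes one searchable log line that can be queried when a
// customer asks "where is my alert?".
type historyObserver struct{}

func (historyObserver) Observe(event string, alerts []*types.Alert, meta map[string]interface{}) {
	for _, a := range alerts {
		log.Printf("fingerprint=%s event=%s meta=%v", a.Fingerprint(), event, meta)
	}
}
```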

@@ -338,6 +345,9 @@ func (d *Dispatcher) processAlert(alert *types.Alert, route *Route) {
	// function, to make sure that when the run() will be executed the 1st
	// alert is already there.
	ag.insert(alert)
	if d.alertLCObserver != nil {
Contributor:

Do we need an event at d.metrics.aggrGroupLimitReached.Inc()?

notify/notify.go Outdated
	m := alertobserver.AlertEventMeta{
		"ctx":         ctx,
		"msg":         "Unrecoverable error",
		"integration": r.integration.Name(),
Contributor:

Do we care about each retry? Should we just record the final failure or final success in func (r RetryStage) Exec()?

@emanlodovice (Contributor Author) commented Oct 10, 2023:

I don't think we should care about retries here; currently we only record the final success or failure, hence the if !retry.

Contributor:

I mean, why not put this into the Exec function at line 758?

Contributor Author:

I updated the code to log the sent alerts instead because that is the correct list of alerts that were sent. I think that because we don't return the sent alerts, we have to keep the code where it currently is.

@qinxx108 (Contributor) commented Oct 9, 2023:

Just some nits but overall looks good!

@emanlodovice emanlodovice force-pushed the alert-observer branch 4 times, most recently from 7eb6d7b to fef64c8 Compare October 10, 2023 20:42
@emanlodovice emanlodovice force-pushed the alert-observer branch 12 times, most recently from 9a6a3ea to c700916 Compare October 11, 2023 20:31
@emanlodovice (Contributor Author):

@grobinson-grafana @simonpasquier could you have a look at this PR when you have time? Thank you

@emanlodovice (Contributor Author):

Rebased PR and fixed conflicts

@emanlodovice (Contributor Author):

@simonpasquier, this draft PR in Cortex gives the general idea of our use case for this feature: https://github.com/cortexproject/cortex/pull/5602/commits

@emanlodovice emanlodovice force-pushed the alert-observer branch 2 times, most recently from 4de8e25 to 34e94ef Compare November 16, 2023 07:20
@emanlodovice (Contributor Author):

@gotjosh good day. Can you take a look at this one?
