
adding safeguard alert #7233

Open · wants to merge 2 commits into master

Conversation

@viennaa (Contributor) commented Oct 14, 2024

  • there was no critical alerting when vROps is gone entirely
  • playbook still needs to be added here before merging

The last part is crucial; please add the playbook before merging!

@sapcc-bot (Contributor)

Failed to validate the Prometheus rules. Details. Readme.

@richardtief (Contributor) left a comment

I would vote for adding the absent condition to the existing alert above; there is already a VropsAPIDown alert. Its condition is no longer appropriate either.

```promql
(
  sum by (target) (vrops_api_response)
  /
  count by (target) (vrops_api_response) > 500
)
or absent(vrops_api_response)
```

Co-authored-by: Richard Tief <[email protected]>
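
For illustration, folding the suggested expression into the existing VropsAPIDown rule could look roughly like the sketch below. Only the expression comes from the suggestion above and only the name VropsAPIDown appears in this thread; the `for` duration, severity, and annotations are assumed placeholders, not the actual rule file.

```yaml
# Minimal sketch, not the actual rule: everything except the expression is assumed.
- alert: VropsAPIDown
  expr: |
    (
      sum by (target) (vrops_api_response)
      /
      count by (target) (vrops_api_response) > 500
    )
    or absent(vrops_api_response)
  for: 30m                 # placeholder duration
  labels:
    severity: critical     # placeholder severity
  annotations:
    # Caveat raised in the next comment: {{ $labels.target }} is empty
    # when the absent() branch is the one that fires.
    summary: "vROps API {{ $labels.target }} is failing or not reporting at all."
```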
@viennaa (Contributor, Author) commented Oct 14, 2024

Initially I thought the same; however, I would rather not add it to the existing alert, since that alert uses $labels.target in its meta and summary annotations, which won't be populated when the absent condition fires.
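
For comparison, a separate safeguard alert could sidestep the label problem by keeping its annotations static. This is only a sketch; the alert name, severity, and `for` duration are placeholders, since they are still being discussed below.

```yaml
# Minimal sketch with static annotations; name, severity and duration are placeholders.
- alert: VropsMetricsAbsent
  expr: absent(vrops_api_response)
  for: 2h                  # slow-firing, per the discussion below
  labels:
    severity: critical     # severity is still under discussion in this thread
  annotations:
    # No $labels.target here: absent() returns a series without the target label.
    summary: "No vrops_api_response metrics at all - vROps monitoring appears to be gone entirely."
```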

@viennaa (Contributor, Author) commented Oct 14, 2024

Another thought: Sunny wanted this as a warning. So we could give it the same name with severity warning and a different message, and we would be good to go.

@himanip94 (Contributor)

I am not sure why we would keep the alert as a warning if all collectors are down. Alternatively, we could change the previous alert to warning and set the new one to critical, using the same alert name?

@viennaa (Contributor, Author) commented Oct 14, 2024

To me both of them are worth critical. The recent case went unnoticed since October 9th.
I would generally tend to increase the `for` condition to 2h here, so it is a rather slow-firing alert.

Sunny's argument was that this can easily fire during maintenance activities. If those activities take more than 2h, it would be wise to silence the alert beforehand. I can give quick guidance on how to do this preventatively.
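
For reference, such a preventive silence could be created with amtool ahead of a maintenance window. The alert name, Alertmanager URL, and duration below are placeholders, not agreed values:

```sh
# Sketch only: alertname, URL and duration are placeholders.
amtool silence add alertname="VropsMetricsAbsent" \
  --alertmanager.url="https://alertmanager.example.com" \
  --duration="4h" \
  --author="viennaa" \
  --comment="Planned vROps maintenance - pre-emptive silence for the safeguard alert"
```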

@himanip94 (Contributor)

I agree with silencing alerts before maintenance activities; that way we can keep both as critical.

@sunnyrarora (Contributor)

Silencing is one way, but I believe we are escalating every new alert to critical; in that sense, it defeats the purpose of monitoring. The InfraOps on-duty's primary task is to handle MIMs; I don't think it would be wise to involve them in troubleshooting monitoring tasks. Here the secondary can assist better. That's my two cents. You are free to keep it primary, but one needs to be sure they are ready to handle it.

@viennaa (Contributor, Author) commented Oct 15, 2024

Just to keep in mind: no critical alerts will fire from the affected vc if a node goes down or anything else happens. Technically nothing in production is affected at this point, but if something were affected from then on, no alerting would take place.

Maybe you can take this into your meeting and come back with a decision here.

@viennaa (Contributor, Author) commented Oct 18, 2024

any decision?

@himanip94 (Contributor)

Will discuss this in Tuesday monitoring call
