Improve Prow cluster management #824

Open
NymanRobin opened this issue Jul 25, 2024 · 3 comments
Labels
triage/accepted Indicates an issue is ready to be actively worked on.

Comments

@NymanRobin
Member

Current Situation

Currently there are no clear instructions on when or how to update the Prow cluster, besides a small note in the prow README: "Apply the changes and then create a PR with the changes." This can lead to the configuration in the repository and the live cluster diverging, for example when two people work on the cluster at the same time and overwrite each other's work. A recent example was the image bumps, where there was no clear process: one PR was left hanging and main diverged from the live cluster.

  1. PR was merged without applying: Update k8s-prow images as needed #777
  2. PR was on hold waiting for someone to apply: Update k8s-prow images as needed #802

Potential Solution

What would be beneficial is a process so that all updates are handled in one way, plus some automation to support it.
One idea for the automation is to apply changes automatically, although this carries the risk of a bad change breaking the automation itself. Another approach would be to check the diff between the live cluster and a PR and only allow a merge when the PR's changes can be found in the cluster, or to have a periodic job that alerts when there is a diff between main and the live cluster.
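
To make the periodic diff idea a bit more concrete, here is a minimal sketch of the comparison step only. It assumes both the repository config and the live cluster config have already been loaded into plain nested dictionaries; how they are fetched is not shown. The function name `config_drift` and the example image names are hypothetical and not taken from the project.

```python
from typing import Any


def config_drift(repo: dict[str, Any], live: dict[str, Any], prefix: str = "") -> list[str]:
    """Return one message per dotted path that differs between repo and live config."""
    drift: list[str] = []
    for key in sorted(set(repo) | set(live)):
        path = f"{prefix}.{key}" if prefix else key
        if key not in live:
            drift.append(f"{path}: missing in live cluster")
        elif key not in repo:
            drift.append(f"{path}: missing in repository")
        elif isinstance(repo[key], dict) and isinstance(live[key], dict):
            # Recurse so that only the differing leaf values are reported.
            drift.extend(config_drift(repo[key], live[key], path))
        elif repo[key] != live[key]:
            drift.append(f"{path}: repository={repo[key]!r} live={live[key]!r}")
    return drift


# Hypothetical example data: main bumped one image, the live cluster was not updated.
repo_config = {"deck": {"image": "example/deck:v2"}, "hook": {"image": "example/hook:v2"}}
live_config = {"deck": {"image": "example/deck:v1"}, "hook": {"image": "example/hook:v2"}}

drift = config_drift(repo_config, live_config)
for line in drift:
    print(line)
```

A periodic job could alert whenever the returned list is non-empty, and a PR check could run the same comparison against the PR's config instead of main.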

@metal3-io-bot metal3-io-bot added the needs-triage Indicates an issue lacks a `triage/foo` label and requires one. label Jul 25, 2024
@NymanRobin
Member Author

There seems to already be some kind of check-prow-config job
https://prow.apps.test.metal3.io/view/s3/prow-logs/pr-logs/pull/metal3-io_project-infra/821/check-prow-config/1815332764979826688

Maybe this can be used to block PRs until the config is correct, but it needs to be double-checked whether it works as expected 🤔

@tuminoid
Member

tuminoid commented Aug 5, 2024

check-prow-config just validates that the config is syntactically correct and won't explode Prow when deployed. It does nothing (or very little at most) to address the config otherwise.

I do agree wholeheartedly that PR merging -> config deployment should be automated, not two independent operations. We may not need a test cluster to deploy to: if this is properly automated, we can just revert the config and manually merge the revert to restore the cluster. Whether we need a canary cluster is up for discussion.

@Rozzii
Member

Rozzii commented Aug 7, 2024

/triage accepted

@metal3-io-bot metal3-io-bot added triage/accepted Indicates an issue is ready to be actively worked on. and removed needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels Aug 7, 2024