Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Usually once" promotions #214

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from
Draft

"Usually once" promotions #214

wants to merge 6 commits into from

Conversation

squaremo
Copy link
Contributor

@squaremo squaremo commented Nov 23, 2023

  • Attempt each ready promotion just once
    - [ ] Retry failed promotions (needs a minor bit of design -- e.g., retry how many times?)
  • Record particulars of promotion, e.g., PR URL, in status

EDIT: I'm removing the retry requirement from this PR, so I don't tangle the PR in more design. Recording the failure will be enough for here.

@squaremo squaremo changed the title 197 reliable promotions "Usually once" promotions Nov 23, 2023
 - promotion status goes in the environment status, and is optional since it might not be happening
   - revision gives the logical point in time that the record applies to, so you can judge whether it's relevant or not
   - lastAttemptedTime is informational
   - succeeded is true if it's done, false if it needs to be retried
 - PR details (optional, fill out if it's a PR)
This is the least test that pipeline fills in promotion status (and it
doesn't pass, yet).
This way it can be used to check an invariant.
This commit records what happens when a promotion is attempted, in the
pipeline status. This is enough to get the test to pass.

Two slightly wider changes are included here:

 - be more careful about checking whether a particular environment
   status exists, when looping through them;

 - don't start with an empty map for environment status. This used to
   be reasonable, since we were building the whole picture up again by
   looking at the targets; now, it would lose the promotion
   status, which can't be observed each time.

Signed-off-by: Michael Bridgen <[email protected]>
If you've set up promotions via PR, or CI, or (hypothetical) chatbot,
there will be a human-scale lag between a promotion being triggered
and it taking effect in the target environment. If the controller
processes the pipeline during that time, it shouldn't trigger the
promotion again.

This commit expands the mock promotion machinery into a little state
machine, so the gap between promotion trigger and completion can be
explicit in test cases. It asserts that no promotion of {env, version}
will be attempted twice, which (_provided_ we don't roll any
environment back a version) is what we want to prove here.

There's a race between the cache getting an object with the promotion
status, and the reconciler processing the object again. It's possible
for the reconciler, having just triggered a promotion, to run again
and see a version of the pipeline object that doesn't have the
promotion recorded -- and that will mean attempting the promotion
again.

In testing, this happens almost always, because patches will cause the
object to be requeued before it's finished reconciling, and it'll run
again as soon as its exited. I would expect it to also be common in
normal operation; and the outcome is that a promotion is re-run, which
is not great!

To mitigate the race, I've added a predicate to the watch which will
drop updates that didn't change the spec. This means that the patches
earlier in Reconcile won't cause the object to be requeued, and it is
much, much less likely that it'll run again straight away. It's still
possible -- maybe the spec changed while it was running.

Other, more involved ways to avoid it:

 - use conflict detection on write to bail before running a promotion,
   if the observed pipeline has changed

 - create Promotion objects and guard promotions by trying to create
   (another form of detecting conflict on write)
If a cache has not yet synced, or a target is not to be found, and the
controller has incomplete information, then the safest thing to do is
not make any decisions on promotions, and instead requeue the
pipeline for another try later.

This fixes a problem that was exposed by adding a predicate to the
watch on Pipeline: if a cluster cache isn't ready when the controller
looks, and the target doesn't exist, there's nothing to trigger
another go-round. This presented itself as a problem with the remote
and Kustomization tests, which explicitly check for a "not found"
target status.

(I fixed another problem with those tests, in passing -- if they are
going to check for an updated target status, they had better fetch the
object again.)

Signed-off-by: Michael Bridgen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant