Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: timeout protection #224

Merged
merged 11 commits into from
Jan 10, 2023
Merged

feat: timeout protection #224

merged 11 commits into from
Jan 10, 2023

Conversation

flavio
Copy link
Member

@flavio flavio commented Dec 19, 2022

This is quite a big change. It's required to allow consumption of the wasmtime epoch interruption capability. Which we leverage to allow policies to be stopped after they have exceeded a given run time.

This is required to fix kubewarden/policy-server#254

While working on this code the following changes have been done:

  • feat!: burrego - epoch deadline feature + builder: thinking about Policy Server, each worker has just one engine, which is then used to run both waPC and burrego policies. Once the engine has the epoch deadline feature turned on, some operations have to be performed to ensure the policy can be executed (add some epoch ticks to the wasmtime::Store). Because of that, Burrego had to be "made aware" of epoch interruption. I took advantage of this chance to do some cleanup and introduce a builder helper for the burrego::Evaluator
  • feat!: expose epoch deadline feature: at a higher level, PolicyEvaluator now exposes epoch deadline configuration. The builder then handles the enablement of this feature both inside of waPC and burrego
  • makefile: add check target: a small change done to make the development experience smoother
  • Add envrc file: similar to the previous point
  • wapc policy: reset evaluator after epoch interruption: while testing the feature, I realized that once a waPC guest is interrupted by the epoch deadline feature, the wasm module might be left in a broken state. See the long explanation about that inside of the code. With this commit, the waPC-based PolicyEvaluator is reset after an epoch interruption happens
  • refactor![burrego]: get rid of anyhow: this is a huge refactor inside of burrego. Now the public - and internal - APIs of burrego have dedicated typed errors. This is required for later
  • feat! [burrego]:introduce ExecutionDeadlineExceeded errors: catch the epoch interrupt errors raised by Wasmtime and convert them into a dedicated burrego error
  • [burrego] tag v0.3.0 release: no more changes are required to burrego, doing a minor version bump
  • feat: burrego evaluator - perform reset after epoch interrupt: now that burrego has typed errors, ensure that the burego evaluator is reset after a wasmtime epoch interruption error is raised

@flavio flavio requested a review from a team as a code owner December 19, 2022 13:14
@flavio flavio added the kind/enhancement New feature or request label Dec 19, 2022
@flavio flavio self-assigned this Dec 19, 2022
@flavio
Copy link
Member Author

flavio commented Dec 19, 2022

@ereslibre can you take a look at this PR if you have some spare time?

@jvanz
Copy link
Member

jvanz commented Dec 22, 2022

@flavio is this a breaking change? I saw in the first comment bunch of feat!. If so, we should tag this as kind/breaking-change. Just to ensure that when we added the release-drafter CI the change will be properly considered.

@jvanz jvanz self-requested a review December 22, 2022 11:53
crates/burrego/src/evaluator.rs Show resolved Hide resolved
src/runtimes/wapc.rs Show resolved Hide resolved
Copy link
Member

@viccuad viccuad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, amazing PR!

I would like a WAPC_EPOCH_INTERRUPTION_ERR_MSG unit test. Apart from that, I have some questions just for my benefit.

@flavio
Copy link
Member Author

flavio commented Jan 9, 2023

@flavio is this a breaking change? I saw in the first comment bunch of feat!. If so, we should tag this as kind/breaking-change. Just to ensure that when we added the release-drafter CI the change will be properly considered.

The public API of burrego has been changed in a breaking way.

The same does not apply to policy-evaluator

Allow burrego to handle `wasmtime::Engines` that have the
epoch deadline feature enabled.

Also, introduce the new `EvaluatorBuilder` struct that is now used to
create `Evaluator` instances.

Signed-off-by: Flavio Castelli <[email protected]>
Epoch deadline is a feature of wasmtime that allows the execution of a
wasm module after some time has elapsed.

This is needed to implement kubewarden/policy-server#254

Signed-off-by: Flavio Castelli <[email protected]>
Adding the check target, this simplifies the development process now
that we have to set K8S_OPENAPI_ENABLED_VERSION via an environment
variable.

Signed-off-by: Flavio Castelli <[email protected]>
TL;DR: after code execution is interrupted because of an
epoch deadline being reached, we have to reset the waPC host
to ensure further invocations of the policy work as expected.

The waPC host is using the wasmtime_provider, which internally
uses a wasmtime::Engine and a wasmtime::Store.
The Store keeps track of the stateful data of the policy. When an
epoch deadline is reached, wasmtime::Engine stops the execution of
the wasm guest. There's NO CLEANUP code called inside of the guest.
It's like unplugging the power cord from a turned on computer.

When the guest function is invoked again, the previous state stored
inside of wasmtime::Store is used.
That can lead to unexpected issues. For example, if the guest makes
uses of a Mutex, something like that can happen (I've witnessed that):

* Guest code 1st run:
  - Mutex.lock
* Host: interrupt code execution because of epoch deadline
* Guest code 2nd run:
  - The Mutex is still locked, because that's what is stored inside
    of the wasmtime::Store
  - Guest attempts to `lock` the Mutex -> error is raised

The guest code will stay in this broken state forever. The only
solution to that is to reinitialize the wasmtime::Store.
It's hard to provide a facility for that inside of WapcHost, because
epoch deadline is a feature provided only by the wasmtime backend.
Hence it's easier to just recreate the wapc_host associated with this
policy evaluator

On top of this new code, a small refactor of the `runtimes/mod.rs`
was done to splits wapc and burrego code into dedicated `.rs` files.

Signed-off-by: Flavio Castelli <[email protected]>
Now all the errors raised by burrego are typed.

Signed-off-by: Flavio Castelli <[email protected]>
Ensure a BurregoError::ExecutionDeadlineExceeded error is returned
whenever an epoch deadline interruption happens inside of Wasmtime.

While working on that, DRY the code by using some Rust macros.

Signed-off-by: Flavio Castelli <[email protected]>
Remove a warning raised by clippy.

Signed-off-by: Flavio Castelli <[email protected]>
Bump the minor version of burrego

Signed-off-by: Flavio Castelli <[email protected]>
Ensure also the burrego evaluator is reset after the policy execution is
interrupted because of a wasmtime epoch interruption.

This should never happen, because Rego policies are not Turing complete,
hence there's no risk of infinite loops. Moreover, Rego policies do not
interact with external (slow) APIs like registries, dns,... at least for
now.

Signed-off-by: Flavio Castelli <[email protected]>
This file sets the K8S_OPENAPI_ENABLED_VERSION environment variable
required by k8s-openapi.

This, paired with the envrc utility, allows to invoke commands like
`cargo check` without having to use the makefile target

Signed-off-by: Flavio Castelli <[email protected]>
Introduce a new unit tests that ensures the error raised by waPC
whenever an epoch interruption happens is the one we expect.

Signed-off-by: Flavio Castelli <[email protected]>
@flavio
Copy link
Member Author

flavio commented Jan 10, 2023

@viccuad, @jvanz: I think this is ready for one last round of reviews. I just rebased the code against main and implemented the unit test required by @viccuad

Copy link
Member

@jvanz jvanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The WebAssembly code is very interesting.

Copy link
Member

@viccuad viccuad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@flavio flavio merged commit b3d9797 into kubewarden:main Jan 10, 2023
@flavio flavio deleted the timeout-protection branch January 10, 2023 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add timeout to policy evaluation
3 participants