Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support "--disallow-inconsistent-metadata" flag in cli-migrations docker image #10599

Open
chardo opened this issue Nov 14, 2024 · 4 comments
Open
Labels
k/enhancement New feature or improve an existing feature

Comments

@chardo
Copy link

chardo commented Nov 14, 2024

Is your proposal related to a problem?

My team runs the community edition of the Hasura graphql-engine on Kubernetes in a horizontally replicated configuration. We use rolling updates to deploy, use the cli-migrations (v2) docker image as our base image in production (as described here), and use the /healthz?strict=true endpoint as our container's readiness/liveness check. This allows us to deploy Hasura in prod without exposing the console or metadata API, while still allowing for git-driven continuous delivery of metadata changes by version-controlling our metadata files and packaging them into new docker images that get deployed to our Kube cluster on each commit to main. This works great most of the time, allowing us to roll out changes to our GraphQL API just by merging metadata changes to main.

However, this configuration has one major drawback. The cli-migrations entrypoint script (ref) runs hasura-cli metadata apply without the --disallow-inconsistent-metadata flag. This means that if we accidentally merge inconsistent metadata to main (eg, by applying metadata changes before a production DB migration has completed), the new pods for our updated Kubernetes deployment will apply that metadata and persist it to the database regardless of any issues. This metadata update in turn gets picked up by the existing/old pods, which listen for changes to the metadata DB, and "leaks" into them even though they (in theory) were built from an older, functioning revision of our code.

In effect, this defeats the purpose of using a deployment tool like Kubernetes, because what should have been a failed deploy ends up taking down the existing pods, rather than simply failing to spin up new ones. This breaks from the standard mental model of rolling updates, in which a pre-existing stable deployment should stay available until new pods are ready to receive traffic. This scenario befell us earlier this week, leading to a full API outage instead of what we'd expect for any other API that can't become ready - a failed rollout of the new without affecting the old.

Related to #8095

Describe the solution you'd like

I'd love to have some way to configure the cli-migrations image to disallow the application of inconsistent metadata, the same way I know is possible using the CLI's hasura metadata apply --disallow-inconsistent-metadata command. I'm open to whether that should be possible via a global Hasura configuration (ie, configure this as a setting on the server itself) or just a flag/env-var that can be passed to the cli-migrations image similar to the HASURA_GRAPHQL_METADATA_DIR variable. I expect the latter would be simpler to implement.

Doing this would allow us to have a fully safe CI/CD setup in Kubernetes by preventing pods which contain inconsistent metadata from poisoning the shared metadata DB and taking down the entire API.

Just noting - I would be very happy to work on this feature myself, in particular if we'd prefer to scope it to a change to the cli-migrations image only. This would be really helpful for my team, and I'd love a chance to help contribute back to Hasura given how much use we've gotten out of the graphql engine over the last year or two!

Describe alternatives you've considered

I see a few different ways of working around this, which I'll explain my POV on inline:

  • Stop using the cli-migrations image, and use the CLI to apply metadata in production directly. I'd really prefer not to do this for my team, because it would mean exposing the metadata API in prod, which would then make it accessible to anyone who has the admin key. This would be a pretty major step up in the risk/impact of leaking that key, which I'd rather not take on.
  • Don't enforce "strict=true" on the /healthz endpoint for our readiness and liveness checks, allow inconsistent metadata to applied to prod, and monitor for it some other way. I'd also prefer to avoid this - given it is possible to determine prior to deployment whether a batch of metadata files will be inconsistent (and thus lead to some amount of API degradation), I see no good reason we'd want our prod service to declare itself ready for traffic in those scenarios.
  • Improve our own testing/quality gates prior to merging to reduce the odds of inconsistent metadata making it into main in the first place. Of course I'd love to do this as well, but no testing is going to cover 100% of possible mistakes, and the change I'm recommending seems (to me) like a surefire way to prevent broken deploys anyways.
@chardo chardo added the k/enhancement New feature or improve an existing feature label Nov 14, 2024
chardo added a commit to chardo/graphql-engine that referenced this issue Nov 14, 2024
…ow use of the --disallow-inconsistent-metadata flag when applying metadata (fixes hasura#10599)
chardo added a commit to chardo/graphql-engine that referenced this issue Nov 15, 2024
…ow use of the --disallow-inconsistent-metadata flag when applying metadata (fixes hasura#10599)
@chardo
Copy link
Author

chardo commented Nov 18, 2024

I ended up putting together an approach for this in PR #10602 - not sure if that's the approach the maintenance team wants to take, or if this is a feature you all actually want to pick up or not. If not, no worries, just wanted to take a stab at it cause I was curious if I could get it working. Let me know either way!

@jflambert
Copy link
Contributor

Please link to #8095

@chardo
Copy link
Author

chardo commented Nov 29, 2024

thanks @jflambert, and good call! when I opened this, that issue hadn't had any updates for 2 years, and it seemed like the flags/env var names had changed in that time, so I wasn't positive if this was related or not. glad to have confirmation though. I'll reference your issue from my pull request as well!

@jflambert
Copy link
Contributor

jflambert commented Nov 29, 2024

Let it be known that I've actually stopped using the hasura-cli-migrations image specifically due to this issue almost 2 years ago! We wrote our own god damn hasura-operator in kubernetes LOL (basically your first alternative)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
k/enhancement New feature or improve an existing feature
Projects
None yet
Development

No branches or pull requests

2 participants