[RFC] Flyte Admin RBAC + Project/Domain Isolation #5871

Open

Sovietaced wants to merge 2 commits into flyteorg:master from Sovietaced:rbac-rfc

Contributor

Sovietaced commented Oct 20, 2024 •

Tracking issue

Related to #5189
Related to #4622


          Flyte Admin RBAC RFC draft

f0dd120

Signed-off-by: Jason Parraga <[email protected]>

codecov bot commented Oct 20, 2024 •

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 36.71%. Comparing base (bdaf79f) to head (ef505e2).
Report is 16 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5871   +/-   ##
=======================================
  Coverage   36.71%   36.71%           
=======================================
  Files        1304     1304           
  Lines      130081   130081           
=======================================
  Hits        47764    47764           
  Misses      78147    78147           
  Partials     4170     4170

Flag	Coverage Δ
unittests-datacatalog	`51.58% <ø> (ø)`
unittests-flyteadmin	`54.41% <ø> (ø)`
unittests-flytecopilot	`11.73% <ø> (ø)`
unittests-flytectl	`62.40% <ø> (ø)`
unittests-flyteidl	`6.89% <ø> (ø)`
unittests-flyteplugins	`53.62% <ø> (ø)`
unittests-flytepropeller	`42.84% <ø> (ø)`
unittests-flytestdlib	`54.78% <ø> (ø)`


          Updates

ef505e2

Signed-off-by: Jason Parraga <[email protected]>

Sovietaced force-pushed the rbac-rfc branch from 89f3cac to ef505e2 Compare

October 21, 2024 05:48

Sovietaced marked this pull request as ready for review

October 21, 2024 05:48

Contributor

eapolinario commented Oct 21, 2024

cc: @robert-ulbrich-mercedes-benz

katrogan reviewed

rfc/system/0000-flyteadmin-rbac.md

+                 * This is a dedicated utility function used in repository code to create resource since you cannot add a WHERE clause filter for records that don't exist yet :)
+                 * ```go
+                   func (r *ExecutionRepo) Create(ctx context.Context, input models.Execution, executionTagModel []*models.ExecutionTag) error {
+                       if err := util.AuthorizeResourceCreation(ctx, input.Project, input.Domain); err != nil {

Contributor

katrogan Oct 22, 2024

This is interesting, I like the middleware approach because it lets you intercept new service methods for free. In this proposal, how do you ensure that new DB methods always call the authorization util?

Contributor Author

Sovietaced Oct 23, 2024

It is cooperative, so you cannot ensure it. The implementation I have adds it for every resource possible, IIRC, so it would probably have to be enforced through code review.

rfc/system/0000-flyteadmin-rbac.md

+                  - name: read-only
+                    rules:
+                      - name: "read everything"
+                        methodPattern: "Get.*|List.*"

Contributor

katrogan Oct 22, 2024

Curious about your thoughts on brittleness for regex parsing here. We mostly try to follow RESTful gRPC method names but we have a few notable exceptions like TerminateExecution, RecoverExecution, RelaunchExecution. If upstream OSS Flyte merges a change that adds a new syntax, is it the responsibility of whoever deploys Flyte at their company and manages these policies to update the regexes appropriately?

Contributor Author

Sovietaced Oct 23, 2024

Using something besides .* on the method pattern is opt in so yeah I think it would be up to the company to update their regexes and I'd imagine that is the behavior they would want. Ideally they validate the ux in non-prod and roll out to prod.

I'm also open to supporting different matching types like prefix and exact matches.
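A rough sketch of how regex, prefix, and exact matching could sit side by side. The `MethodMatcher` type and the `MatchType` values are hypothetical, and anchoring the regex to the whole method name is an assumption, not something the RFC states.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// MatchType selects how Pattern is compared to a method name (hypothetical).
type MatchType string

const (
	MatchRegex  MatchType = "regex"
	MatchPrefix MatchType = "prefix"
	MatchExact  MatchType = "exact"
)

// MethodMatcher is a hypothetical representation of a rule's method pattern.
type MethodMatcher struct {
	Type    MatchType
	Pattern string
}

// Matches reports whether the gRPC method name satisfies the matcher.
func (m MethodMatcher) Matches(method string) (bool, error) {
	switch m.Type {
	case MatchExact:
		return method == m.Pattern, nil
	case MatchPrefix:
		return strings.HasPrefix(method, m.Pattern), nil
	case MatchRegex:
		// Wrap the pattern so every alternative must match the full name.
		re, err := regexp.Compile("^(?:" + m.Pattern + ")$")
		if err != nil {
			return false, err
		}
		return re.MatchString(method), nil
	default:
		return false, fmt.Errorf("unknown match type %q", m.Type)
	}
}

func main() {
	readOnly := MethodMatcher{Type: MatchRegex, Pattern: "Get.*|List.*"}
	methods := []string{"GetExecution", "ListTasks", "TerminateExecution", "RelaunchExecution"}
	for _, method := range methods {
		ok, err := readOnly.Matches(method)
		fmt.Println(method, ok, err)
	}
}
```

With this shape, the non-RESTful names such as TerminateExecution fall outside `Get.*|List.*`, so a read-only role stays read-only, while a write role built from `Create.*|Update.*` would need to list those methods explicitly.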

rfc/system/0000-flyteadmin-rbac.md

+                    rules:
+                      - name: "r/w for every project in production"
+                        methodPattern: ".*"
+                        domain: production # you can wildcard project and declare domain level access

Contributor

katrogan Oct 22, 2024

Does the same work for wildcarding the domain if you want to, for example, map a team's write access to a project across all domains?

Contributor Author

Sovietaced Oct 23, 2024

Yes absolutely
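As a sketch of how wildcarding either dimension might be evaluated, assume that an omitted `project` or `domain` in a rule means "any". The `Rule` struct below and its `Covers` method are hypothetical; the RFC excerpt only shows the `name`, `methodPattern`, and `domain` keys.

```go
package main

import "fmt"

// Rule is a hypothetical in-memory form of a policy rule.
type Rule struct {
	Name    string
	Project string // empty is assumed to mean "any project"
	Domain  string // empty is assumed to mean "any domain"
}

// Covers reports whether the rule applies to the given project and domain.
func (r Rule) Covers(project, domain string) bool {
	projectOK := r.Project == "" || r.Project == project
	domainOK := r.Domain == "" || r.Domain == domain
	return projectOK && domainOK
}

func main() {
	// Project wildcarded: applies to every project in production.
	prod := Rule{Name: "r/w for every project in production", Domain: "production"}
	// Domain wildcarded: applies to one team's project in every domain.
	team := Rule{Name: "r/w for team-a in every domain", Project: "team-a"}

	fmt.Println(prod.Covers("team-b", "production"))
	fmt.Println(prod.Covers("team-b", "development"))
	fmt.Println(team.Covers("team-a", "staging"))
	fmt.Println(team.Covers("team-b", "staging"))
}
```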

rfc/system/0000-flyteadmin-rbac.md

+              #### Authorization Utils + DB Layer
+              The final piece of the puzzle is what performs resource level authorization and filtering. Historically, I have found
+              that the best (albeit challenging) way to do this is at the database layer for a few reasons:

Contributor

katrogan Oct 22, 2024

This proposal makes sense, but there's some notable instances where we could have unauthorized side effects

specifically, for create execution, a user who has read access may be able to read the respective launch plan, we then create the CRD and only then insert the model in the DB (which gets rejected) here:

flyte/flyteadmin/pkg/manager/impl/execution_manager.go

Lines 1027 to 1062 in 5f69589

    
    workflowExecutor := plugins.Get[workflowengineInterfaces.WorkflowExecutor](m.pluginRegistry, plugins.PluginIDWorkflowExecutor)
    execInfo, execErr := workflowExecutor.Execute(ctx, workflowengineInterfaces.ExecutionData{
        Namespace:                namespace,
        ExecutionID:              workflowExecutionID,
        ReferenceWorkflowName:    workflow.Id.Name,
        ReferenceLaunchPlanName:  launchPlan.Id.Name,
        WorkflowClosure:          workflow.Closure.CompiledWorkflow,
        WorkflowClosureReference: storage.DataReference(workflowModel.RemoteClosureIdentifier),
        ExecutionParameters:      executionParameters,
        OffloadedInputsReference: inputsURI,
    })
    if execErr != nil {
        createExecModelInput.Error = execErr
        m.systemMetrics.PropellerFailures.Inc()
        logger.Infof(ctx, "failed to execute workflow %+v with execution id %+v and inputs %+v with err %v",
            request, workflowExecutionID, executionInputs, execErr)
    } else {
        m.systemMetrics.AcceptanceDelay.Observe(acceptanceDelay.Seconds())
        createExecModelInput.Cluster = execInfo.Cluster
    }
    executionModel, err := transformers.CreateExecutionModel(createExecModelInput)
    if err != nil {
        logger.Infof(ctx, "Failed to create execution model in transformer for id: [%+v] with err: %v",
            workflowExecutionID, err)
        return nil, nil, nil, err
    }
    executionTagModel, err := transformers.CreateExecutionTagModel(createExecModelInput)
    if err != nil {
        logger.Infof(ctx, "Failed to create execution tag model in transformer for id: [%+v] with err: %v",
            workflowExecutionID, err)
        return nil, nil, nil, err
    }
    return ctx, executionModel, executionTagModel, nil

So in this scenario, we create a rogue Flyte CRD that never gets committed to the DB (the operation fails with permission denied). Propeller will still attempt to process that CRD and send event updates for it, leading to potential system weirdness.

Contributor Author

Sovietaced Oct 23, 2024

That is a good point, I wasn't aware of that interaction.

I think there are a couple ways to work around this.

1. Bring authorization for creation into the application layer (in addition to the db layer I suppose).
2. Rework the code such that the execution is written to the database (maybe partially) prior to creating the CRD. And then writing the rest to the DB after the CRD is created if necessary.

I am being a bit naive about the complexity of approach number 2, but I do think this is a general problem with distributed systems. For example, the way the code is written today makes the system susceptible to ghost CRDs, since the system could crash before writing to the DB. I also understand this issue is probably much more likely to happen with authorization errors though :)

The architectural issue I see with this type of behavior boils down to how business logic does what is effectively a distributed transaction across other services (submitting CRDs in this case) and writing to the database. The most robust way to attempt this sort of behavior that I've seen is to always write to the DB first (this is the source of truth) and then opportunistically do the rest of the distributed transaction. For cases where the server crashes after writing to the DB you'll basically need some anti entropy background task to handle edge cases.

Contributor

katrogan Oct 23, 2024

For sure, it's definitely a problem now even without authz checks failing. I recall we waffled on this approach initially because we didn't want to commit the entry to the db until after the CRD was created (the workflowengine package gracefully handles "already exists" errors on retries, whereas falsely implying an execution was created in the reversed case was more confusing).

if we do bring authorization to the application layer (even for these exceptional cases) I'm a bit concerned we're now depending on code review to enforce authorization checks in both the db and application levels

rfc/system/0000-flyteadmin-rbac.md

+              #### Authorization Config
+              The authorization config will be used to configure the authorization behavior. The first thing it describes is a way to
+              resolve roles from the user's identity context. Ideally this should be flexible enough to resolve a role from different

Contributor

katrogan Oct 22, 2024

for a multi-cluster set-up, where the control plane (admin) and data plane (propeller) reside on separate clusters, propeller authenticates with a set of client credentials against flyteadmin to send execution events and create executions (child workflows, e.g. calling a launch plan in code)

if I'm understanding correctly, in this case we could use the userID of the propeller credentials to permit deployment-wide RW access? this isn't quite userID, but I think your suggestion can very well support applicationID using this approach. My understanding is that it's a bit more tricky to add scopes & claims to applications rather than user identities

Contributor Author

Sovietaced Oct 23, 2024

We use a multi-cluster setup, and admittedly our implementation of this proposal actually has a separate config field for allow-listing certain user IDs, which in our case are Okta client IDs for the propeller Okta app. This allow list gives them full access to all APIs and all projects/domains. This was more of an incremental thing I added when I realized very quickly that authorization broke propeller :)

When I was writing the proposal I realized that I could probably do away with user ID allow lists in our implementation and support role resolution more generally by allowing users to pick out which values from the token they want to potentially match with policies. So yeah, I think for Flyte workloads you'd just turn on the user ID or application ID resolution strategy and create a role like so, not having to deal with scopes at all.

  - name: 0oahjhk34aUxGnWcZ0h7 # the names can even include things like okta app IDs
    rules:
      - name: "flyte propeller"
        methodPattern: ".*"

I guess to be clear, the role resolution strategies configured will ultimately generate a set of strings and then try and match those strings against the policy map keys.

But I'm open to doing something special for the Flyte workloads
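A minimal sketch of that resolution step, assuming each configured strategy extracts candidate strings from an already validated token and the candidates are then looked up as keys in the policy map. `Identity`, `Strategy`, the `From*` helpers, and `ResolveRoles` are hypothetical names; only the idea of matching generated strings against policy keys comes from the comment above.

```go
package main

import (
	"fmt"
	"sort"
)

// Identity is a hypothetical view of the claims taken from a validated token.
type Identity struct {
	UserID string
	AppID  string
	Scopes []string
}

// Strategy turns an identity into candidate role names.
type Strategy func(Identity) []string

// FromUserID proposes the user ID as a role name, if present.
func FromUserID(id Identity) []string {
	if id.UserID == "" {
		return nil
	}
	return []string{id.UserID}
}

// FromAppID proposes the application (client) ID as a role name, if present.
func FromAppID(id Identity) []string {
	if id.AppID == "" {
		return nil
	}
	return []string{id.AppID}
}

// FromScopes proposes every token scope as a role name.
func FromScopes(id Identity) []string { return id.Scopes }

// Rule is a trimmed-down policy rule for this example.
type Rule struct{ Name, MethodPattern string }

// ResolveRoles keeps only the candidates that are keys in the policy map,
// deduplicated and sorted for stable output.
func ResolveRoles(id Identity, strategies []Strategy, policies map[string][]Rule) []string {
	seen := map[string]bool{}
	var roles []string
	for _, strategy := range strategies {
		for _, candidate := range strategy(id) {
			if _, ok := policies[candidate]; ok && !seen[candidate] {
				seen[candidate] = true
				roles = append(roles, candidate)
			}
		}
	}
	sort.Strings(roles)
	return roles
}

func main() {
	policies := map[string][]Rule{
		"0oahjhk34aUxGnWcZ0h7": {{Name: "flyte propeller", MethodPattern: ".*"}},
		"read-only":            {{Name: "read everything", MethodPattern: "Get.*|List.*"}},
	}
	propeller := Identity{AppID: "0oahjhk34aUxGnWcZ0h7"}
	fmt.Println(ResolveRoles(propeller, []Strategy{FromUserID, FromAppID, FromScopes}, policies))
}
```

An identity whose candidates match no policy key resolves to no roles, so under this sketch the default is to deny.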

Contributor

katrogan Oct 23, 2024

this proposal sounds like it unblocks propeller access! I'm curious how this works for supporting dynamically provisioned app credentials (besides just propeller, but for things like CICD, etc) since I believe Okta doesn't allow you to attach claims to applications unless you use custom claims in the authorization server which is a bit more involved

davidmirror-ops added the rfc label
