Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move all scope parsing to 'experimental' #752

Merged
merged 18 commits into from
Aug 3, 2023

Conversation

sirosen
Copy link
Member

@sirosen sirosen commented Jun 15, 2023

This is a large body of work which I've had sitting in a branch for some time.
I haven't been able to fully review it although I rebased it, cleaned it up, and got tests working against it.

We also haven't discussed the pros/cons of this being available in globus_sdk.experimental.
My summary of what I was able to pull together follows, but there may be important details missing. Given that this is meant for experimental, I think we can accept some gaps in terms of direction (e.g. Should it go through the Graph IR at all? I'm not fully sure).


Scope parsing is now exposed via Scope.parse, which is only available in globus_sdk.experimental.scope_parser as a method of parsing scopes.

The scope parsing passes through two IRs -- first a tree, then a graph -- which reject cycles, normalize any optional/non-optional redundancies, and ensure that the tree is built in the same way that it will be during authorization flows in Globus Auth.

This replaces our "shadow" implementation of MutableScope, which had been left in scopes.scope_definition under the name Scope but was never exported or documented. All of the features of Scope under the old location are now integrated into
globus_sdk.experimental.scope_parser.Scope.

Scope parsing is tested independently of other scope tests, moving and expanding these tests into tests/unit/experimental/.


📚 Documentation preview 📚: https://globus-sdk-python--752.org.readthedocs.build/en/752/

Scope parsing is now exposed via `Scope.parse`, which is only
available in `globus_sdk.experimental.scope_parser` as a method of
parsing scopes.

The scope parsing passes through two IRs -- first a tree, then a graph
-- which reject cycles, normalize any optional/non-optional
redundancies, and ensure that the tree is built in the same way that
it will be during authorization flows in Globus Auth.

This replaces our "shadow" implementation of MutableScope, which had
been left in `scopes.scope_definition` under the name `Scope` but was
never exported or documented. All of the features of `Scope` under the
old location are now integrated into
`globus_sdk.experimental.scope_parser.Scope`.

Scope parsing is tested independently of other scope tests, moving and
expanding these tests into `tests/unit/experimental/`.
pylint flagged `graph` (used earlier in a function) as a redefinition
of a name from outer scope. This happens because the block under the
`if __name__ == "__main__"` check is technically in module scope.
Therefore, at analysis-time, there is a module-scoped variable named
`graph` and we "need" to avoid redefinition.

This could be tucked into a `def _main()` to "hide" the name from
module scope, but a rename is equally simple to dodge the
overly-aggressive pylint check.
Copy link
Member

@kurtmckee kurtmckee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to discuss this further before finishing the review.

The scope trees produced by parsing do not necessarily describe real consent structures

This acknowledgement concerns me.

docs/experimental/scope_parser.rst Outdated Show resolved Hide resolved
Comment on lines 22 to 23
representation of consent trees. The scope trees produced by parsing do
not necessarily describe real consent structures in Globus Auth.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scope trees produced by parsing do not necessarily describe real consent structures

I don't like this. I think this has been discussed before, but I'm not keen to introduce something that doesn't match the actual Globus Auth implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one thing that I want to have very clearly stated is that it is not possible to represent the consent structures in Auth as a string. Scope string parsing can only match the consent structure produced by a single authentication and consent flow.

The implementation does match what Auth does for a scope string used in a single consent flow. We're just bound by the limitations of the string representation vis-a-vis Auth's more elaborate internal state.

Copy link
Member Author

@sirosen sirosen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the typo fixes (applied)!

I do want us to discuss this more, and I'm sure we'll have a productive conversation when we find time for this.
I've left one inline note about not matching the Auth consent trees, and there is one other thing I'd like to share up-front: I'm motivated pretty strongly to take this implementation -- most of which exists but is not exposed in the project today -- and move it out of the "normal" parts of the source tree.

Comment on lines 22 to 23
representation of consent trees. The scope trees produced by parsing do
not necessarily describe real consent structures in Globus Auth.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one thing that I want to have very clearly stated is that it is not possible to represent the consent structures in Auth as a string. Scope string parsing can only match the consent structure produced by a single authentication and consent flow.

The implementation does match what Auth does for a scope string used in a single consent flow. We're just bound by the limitations of the string representation vis-a-vis Auth's more elaborate internal state.

Copy link
Contributor

@ada-globus ada-globus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel comfortable with the correctness given the advisories about usage in the code.

I have a couple things that felt a little odd to me as I was reading, but none that make me want to block this from heading to experimental (but might be worth considering if we want to promote this out of experimental at some point):

  • In thinking about what we're exposing on the ScopeGraph, I'm wondering if we want to lock down the public interface a bit more strictly as it seems possible to fairly easily break the representation by manipulating the public attributes (e.g., naively as a user I might think it safe to call self.edges.add(foo) w/o realizing that there is a private attribute that needs to track changes here)
  • It's surprising that when I access ScopeGraph.nodes I'm not working with ScopeTreeNodes (I get that one is ScopeTree and one is ScopeGraph but we do build a ScopeGraph using ScopeTreeNodes so it all gets kind of 😵‍💫).

While I haven't thought about it super deeply, I'm wondering if it might make sense to add something like a ScopeGraph.add_scope() method that will accept strings or ScopeTreeNodes and allow that to manage the properties of the graph so we don't need to leak some of these details, which I think would let us make nodes and edges private, and make top_level_scopes read-only...? But very possible I haven't thought through the downstream consequences sufficiently.

Regardless, this is non-blocking feedback IMO, so I am simultaneously hitting approve.

src/globus_sdk/experimental/scope_parser/parser.py Outdated Show resolved Hide resolved
Copy link
Member Author

@sirosen sirosen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to adjust in response to one bit of inline feedback but I'm also going to move the entire parsing implementation into a privately named module, _parser.

Neither ScopeTreeNode nor ScopeGraph are, in their current implementations, sanitized enough to be user consumable. I tend to think that the absence of docstrings is a deal-breaker on that in its own right. This is currently not as clear as it could be, and the presence of an underscore in the name should serve as a stronger warning sign (for users) and a better disambiguator (for contributors).

src/globus_sdk/experimental/scope_parser/parser.py Outdated Show resolved Hide resolved
In order to better indicate that the parser is an implementation
detail, move several components into a `_parser.py` module. The
`errors` are separately declared and exposed to better represent that
they are part of the public interface of the package.
The 'dependencies' property was never used. Remove it.
In the experimental scope graph component, the cycle detection code
was not correct. The simplicity of the approach used did not properly
account for a number of scenarios like `foo[bar[baz] baz]`, which
would be incorrectly detected as a cycle. These errors were masked
by a secondary flaw, an incorrect call to `set.clear()` which wiped
out important information.

Rather than continuing to pursue this approach, for a first draft of
this component simply use a well-understood recursive cycle detection
algorithm which operates linearly in the size of the graph.

The implementation can and should be unrolled to be iterative in the
future.
Our goal is still correctness, but the use of Tarjan's strongly
connected components algorithm makes the ScopeGraph class
significantly more complex in pursuit of efficiency.
Worse, the prior implementation was recursive, which carries
significant costs in python.

Revert to a simpler iterative cycle detector which is adjusted
to be certain to be correct but somewhat inefficient.
Make cycle detection much more efficient in time (but inefficient in
space) by employing a BFS graph traversal which tracks the ancestry
paths through the graph which have been discovered.

If a back-edge is ever found, that means that there is a cycle.
@sirosen sirosen force-pushed the add-experimental-scope-parsing branch from 4c113aa to b3f0e8d Compare July 29, 2023 17:26
@sirosen
Copy link
Member Author

sirosen commented Jul 29, 2023

I've just pushed through some relatively significant refinements to the cycle detection used to find and reject foo[bar[baz]] baz[bar], which I think merit new and fresh review.

Basically what happened was this:

  1. Found a flaw in cycle detection while introducing some test cases for it
  2. Did some research and implemented a classic algorithm for cycle detection (Tarjan's, details in the commits), which was fun and interesting but required a recursive implementation
  3. Did some tests on absurdly large scope strings and found the recursive call performance to be unacceptable
  4. Looked for any information from reliable sources on iterative unrolling of the original recursive algorithm and found none. In fact, I found numerous incorrect implementations in SO, Rosetta Code, etc. But nothing even remotely usable or cite-able.
  5. Rewrote back to the "simplest possible" cycle detector taking O(n^2) time.
  6. Final rewrite to a space-inefficient but time-efficient BFS-based cycle detector. Include a performance test in the testsuite (with appropriate warnings about slow hardware and flakiness).
  7. The "maybe flaky" perf test fails in GitHub Actions. Adjust threshold up and amend.

All in all it was a fun adventure. I would tend to keep the commit history in this case. The simpler (correct) cycle detector and the full implementation of Tarjan's could be useful to us in the future if we need to do more work on this.

The "time efficient" approach in the end does several list copying operations which are, of course, O(n) in their inputs. Those inputs are typically lengths of paths in the graph, so that's some flavor of log(n) amortized. My fun graph theory research told me that the theoretical bounds are linear in the size of the set of vertices and edges -- O(V+E). I think we probably have something like O(V*log(E)) here in the average case, but it's a bit beyond me to figure out the exact theoretical complexity. The wall time is more important to me.

These tests were sensitive to the exact whitespace generated by
__str__, which is a cosmetic (non-semantic) detail.
It is relatively simple to remove blank lines, which makes the tests
more robust to any refinements we make in the future.
@joshbryan-globus
Copy link
Contributor

joshbryan-globus commented Jul 31, 2023

Reading through the conversation here I want to leave a couple of notes for consideration. None of this should be considered blocking critique, but rather just food for thought as you discuss. By and large I'm fine with the PR.

Regarding "it is not possible to represent the consent structures in Auth as a string", while true from perspective of representing them as a single scope string that can passed to /authorize, I think some of the disagreement may be arguing semantics. Going from a scope string to a tree representation of the consents is always possible (auth itself does this), the slightly surprising bit is just that the parse trees are a subset of the consent trees that are created by that string (e.g. a[b] b[c] creates a consent tree that includes an a->b->c even though the parse trees only have the a->b tree). Going from a consent tree as stored in auth to a single scope string isn't possible, but you can always create a sequence of scope strings that would generate that consent structure. E.g. if the trees a->b and b->c existed in auth, then a[b] followed by b[c] would generate that structure.

In reality, neither the scope mini-dsl used for dynamic dependencies nor our internal data structure in auth for storing consents represent them explicitly as a graph. The "graph" nature comes from the fact that "static dependencies" do impose a graph structure on scopes and the semantics of the scope mini-dsl say "for the purposes of this auth call assume static dependencies exist where edges between scopes exist in the parse tree of this string". I think some of the confusion in the description comes from a desire to equate a consent tree and a scope parse tree. It might be worth being explicit about it. E.g. rather than turn a scope string into a ScopeTree into a ScopeGraph, perhaps the final representation should be a ConsentTree (either via a ScopeGraph or directly).

Finally, a quick note on cycle detection (mostly to dig the rabbit hole stephen went down a bit deeper). Tarjan's algorithm is "efficient" if you want to find all cycles, but, as you discovered complex to make iterative. Since we only want to abort when "some" cycle is detected, BFS and DFS are options (just abort the first time you add a node to a path already in the path). Both BFS and DFS should have the same worst case time complexity, but if you care about space complexity, DFS will win as you only need to keep the "current" path in memory. It can be a little less intuitive to unroll into a non-recursive form, but the key difference between the two algorithms is that BFS will use a queue (paths_to_explorer.append / shift) and DFS will use a stack (.pop).

Experimental scope parsing had a vague warning which did not properly
announce the limitations of the scope parsing interface.
The replacement note instead clarifies only what a parse represents
and notes that there is no external callout for validation.
@sirosen
Copy link
Member Author

sirosen commented Aug 1, 2023

I think our meeting today synchronized us on most elements of this.
I mostly wanted to clarify that because there's no single-string for an arbitrary Consent Forest, scope parsing produces only a specific subset of Consent Forests (notably, those which can be created via a single authentication flow). I think by being overly attentive to this detail, I may have unfortunately confused people a bit and possibly sent some misleading signals (e.g. the documentation warning, now downgraded to a note and clarified).

Finally, a quick note on cycle detection (mostly to dig the rabbit hole stephen went down a bit deeper). Tarjan's algorithm is "efficient" if you want to find all cycles, but, as you discovered complex to make iterative. Since we only want to abort when "some" cycle is detected, BFS and DFS are options (just abort the first time you add a node to a path already in the path). Both BFS and DFS should have the same worst case time complexity, but if you care about space complexity, DFS will win as you only need to keep the "current" path in memory. It can be a little less intuitive to unroll into a non-recursive form, but the key difference between the two algorithms is that BFS will use a queue (paths_to_explorer.append / shift) and DFS will use a stack (.pop).

Yeah, I think I got a full half-hour into iterative Tarjan's before realizing that I wasn't sure it was correct anymore. "Complex" almost understates it!
The funny thing about that algorithm is that it relies heavily on the recursive steps happening in the middle of a series of other operations. I almost built a small state machine to do "manual recursion". 😬

I'm looking at the cycle checker right now to reconsider the potential merit of switching it DFS. I think it's the most efficient approach I'm willing to implement for these purposes, but I still can't quite make the implementation as simple and "obvious" as the current BFS one. If I find a phrasing of it which is clear, I think I'll be satisfied.

There are three changes here which make cycle detection faster.
1. Use DFS rather than BFS, which will find cycles faster in some
   cases and is less memory intensive for large graphs
2. Represent paths using `(frozenset, str)` tuples, since we don't
   care about ordering at all. This reduces the information in any
   given "path" from the prior implementation to a bag of nodes and
   the current terminus. Order is lost, but frozenset construction and
   membership checking is faster than the list equivalents
3. Rename such that `get_child_edges` becomes `adjacency_matrix`,
   converting a function call to an attribute lookup and skipping an
   unnecessary set construction step
@sirosen
Copy link
Member Author

sirosen commented Aug 1, 2023

I couldn't help it. Somehow it bothered me too much that I couldn't reshape that BFS to DFS cleanly, when my past experience has usually been that DFS is faster and simpler. Turns out, it was just a matter of phrasing it the right way.
Since I was in there, I started playing performance optimization games (I almost started putting __slots__ on things!), and got the implementation ~25% faster on the 1000-nested-scopes test. It seemed a shame not to commit the results, and here we are...

@joshbryan-globus
Copy link
Contributor

I couldn't help it. Somehow it bothered me too much that I couldn't reshape that BFS to DFS cleanly, when my past experience has usually been that DFS is faster and simpler. Turns out, it was just a matter of phrasing it the right way. Since I was in there, I started playing performance optimization games (I almost started putting __slots__ on things!), and got the implementation ~25% faster on the 1000-nested-scopes test. It seemed a shame not to commit the results, and here we are...

Ha ... you beat me to it by 20 minutes ... I had just slapped together this general idea example and then came to post it and see you already got there ... not to let a good nerd sniping go to waste, here it is:

g = {
    'a': ['b', 'c', 'd'],
    'b': ['d', 'c'],
    'c': ['e'],
    'd': ['e'],
    'e': ['d'],
}

def has_cycle(g, start_node):
    stack = [({start_node} , g[start_node])]
    
    while stack:
        print(stack)
        path_set, (next_child, *remaining_children) = stack.pop()
        
        if next_child in path_set:
            # we've visited this node, abort!
            return True
        
        if remaining_children:
            # if there are more children to explorer, re-insert the current path on the stack to visit again
            stack.append((path_set, remaining_children))
        
        if g[next_child]:
            # if the next node has children then we need to push it on the stack and explorer
            stack.append((path_set | {next_child}, g[next_child]))
            
    return False

has_cycle(g, 'a')

Anyway, I agree today's discussion was useful, and I'm glad to see we landed on a phrasing we are all happy with.

1. Use a dataclass for parsed tokens for a cleaner, more declarative
   style
2. Move the test script into a `__main__.py` module to allow `-m`
   execution for testing purposes

In (2), this now also includes some timing tests which reflect some of
the work which was done in order to tune and optimize the cycle
detector.
Copy link
Member

@kurtmckee kurtmckee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving, but I had wanted to test timings on my PC as well and couldn't. Would you double-check that the from...import line is still working on your machine?

@sirosen sirosen force-pushed the add-experimental-scope-parsing branch 2 times, most recently from 5cbe131 to 5d7161a Compare August 3, 2023 19:15
- add missing annotation
- fix parameter usage in the non-timing test
- test for wide scopes as well as deep ones for timing
- report more precise timing stats, including `min`

Some of the timing data manipulation was informed by reading the
`timeit` module source and recommendations.
@sirosen sirosen force-pushed the add-experimental-scope-parsing branch from 5d7161a to 928b774 Compare August 3, 2023 19:29
@sirosen sirosen merged commit e70cd23 into globus:main Aug 3, 2023
13 checks passed
@sirosen sirosen deleted the add-experimental-scope-parsing branch August 3, 2023 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants