Move all scope parsing to 'experimental' #752

sirosen · 2023-06-15T22:17:24Z

This is a large body of work which I've had sitting in a branch for some time.
I haven't been able to fully review it although I rebased it, cleaned it up, and got tests working against it.

We also haven't discussed the pros/cons of this being available in globus_sdk.experimental.
My summary of what I was able to pull together follows, but there may be important details missing. Given that this is meant for experimental, I think we can accept some gaps in terms of direction (e.g. Should it go through the Graph IR at all? I'm not fully sure).

Scope parsing is now exposed via Scope.parse, which is only available in globus_sdk.experimental.scope_parser as a method of parsing scopes.

The scope parsing passes through two IRs -- first a tree, then a graph -- which reject cycles, normalize any optional/non-optional redundancies, and ensure that the tree is built in the same way that it will be during authorization flows in Globus Auth.

This replaces our "shadow" implementation of MutableScope, which had been left in scopes.scope_definition under the name Scope but was never exported or documented. All of the features of Scope under the old location are now integrated into
globus_sdk.experimental.scope_parser.Scope.

Scope parsing is tested independently of other scope tests, moving and expanding these tests into tests/unit/experimental/.

📚 Documentation preview 📚: https://globus-sdk-python--752.org.readthedocs.build/en/752/

Scope parsing is now exposed via `Scope.parse`, which is only available in `globus_sdk.experimental.scope_parser` as a method of parsing scopes. The scope parsing passes through two IRs -- first a tree, then a graph -- which reject cycles, normalize any optional/non-optional redundancies, and ensure that the tree is built in the same way that it will be during authorization flows in Globus Auth. This replaces our "shadow" implementation of MutableScope, which had been left in `scopes.scope_definition` under the name `Scope` but was never exported or documented. All of the features of `Scope` under the old location are now integrated into `globus_sdk.experimental.scope_parser.Scope`. Scope parsing is tested independently of other scope tests, moving and expanding these tests into `tests/unit/experimental/`.

pylint flagged `graph` (used earlier in a function) as a redefinition of a name from outer scope. This happens because the block under the `if __name__ == "__main__"` check is technically in module scope. Therefore, at analysis-time, there is a module-scoped variable named `graph` and we "need" to avoid redefinition. This could be tucked into a `def _main()` to "hide" the name from module scope, but a rename is equally simple to dodge the overly-aggressive pylint check.

kurtmckee

I'd like to discuss this further before finishing the review.

The scope trees produced by parsing do not necessarily describe real consent structures

This acknowledgement concerns me.

changelog.d/20230719_164657_sirosen_add_experimental_scope_parsing.rst

docs/experimental/scope_parser.rst

kurtmckee · 2023-07-24T13:34:12Z

docs/experimental/scope_parser.rst

+    representation of consent trees. The scope trees produced by parsing do
+    not necessarily describe real consent structures in Globus Auth.


The scope trees produced by parsing do not necessarily describe real consent structures

I don't like this. I think this has been discussed before, but I'm not keen to introduce something that doesn't match the actual Globus Auth implementation.

The one thing that I want to have very clearly stated is that it is not possible to represent the consent structures in Auth as a string. Scope string parsing can only match the consent structure produced by a single authentication and consent flow.

The implementation does match what Auth does for a scope string used in a single consent flow. We're just bound by the limitations of the string representation vis-a-vis Auth's more elaborate internal state.

Co-authored-by: Kurt McKee <[email protected]>

sirosen

Thanks for the typo fixes (applied)!

I do want us to discuss this more, and I'm sure we'll have a productive conversation when we find time for this.
I've left one inline note about not matching the Auth consent trees, and there is one other thing I'd like to share up-front: I'm motivated pretty strongly to take this implementation -- most of which exists but is not exposed in the project today -- and move it out of the "normal" parts of the source tree.

sirosen · 2023-07-24T14:27:51Z

docs/experimental/scope_parser.rst

+    representation of consent trees. The scope trees produced by parsing do
+    not necessarily describe real consent structures in Globus Auth.


The one thing that I want to have very clearly stated is that it is not possible to represent the consent structures in Auth as a string. Scope string parsing can only match the consent structure produced by a single authentication and consent flow.

The implementation does match what Auth does for a scope string used in a single consent flow. We're just bound by the limitations of the string representation vis-a-vis Auth's more elaborate internal state.

ada-globus

I feel comfortable with the correctness given the advisories about usage in the code.

I have a couple things that felt a little odd to me as I was reading, but none that make me want to block this from heading to experimental (but might be worth considering if we want to promote this out of experimental at some point):

In thinking about what we're exposing on the ScopeGraph, I'm wondering if we want to lock down the public interface a bit more strictly as it seems possible to fairly easily break the representation by manipulating the public attributes (e.g., naively as a user I might think it safe to call self.edges.add(foo) w/o realizing that there is a private attribute that needs to track changes here)
It's surprising that when I access ScopeGraph.nodes I'm not working with ScopeTreeNodes (I get that one is ScopeTree and one is ScopeGraph but we do build a ScopeGraph using ScopeTreeNodes so it all gets kind of 😵‍💫).

While I haven't thought about it super deeply, I'm wondering if it might make sense to add something like a ScopeGraph.add_scope() method that will accept strings or ScopeTreeNodes and allow that to manage the properties of the graph so we don't need to leak some of these details, which I think would let us make nodes and edges private, and make top_level_scopes read-only...? But very possible I haven't thought through the downstream consequences sufficiently.

Regardless, this is non-blocking feedback IMO, so I am simultaneously hitting approve.

src/globus_sdk/experimental/scope_parser/parser.py

sirosen

I'm going to adjust in response to one bit of inline feedback but I'm also going to move the entire parsing implementation into a privately named module, _parser.

Neither ScopeTreeNode nor ScopeGraph are, in their current implementations, sanitized enough to be user consumable. I tend to think that the absence of docstrings is a deal-breaker on that in its own right. This is currently not as clear as it could be, and the presence of an underscore in the name should serve as a stronger warning sign (for users) and a better disambiguator (for contributors).

src/globus_sdk/experimental/scope_parser/parser.py

In order to better indicate that the parser is an implementation detail, move several components into a `_parser.py` module. The `errors` are separately declared and exposed to better represent that they are part of the public interface of the package.

The 'dependencies' property was never used. Remove it.

In the experimental scope graph component, the cycle detection code was not correct. The simplicity of the approach used did not properly account for a number of scenarios like `foo[bar[baz] baz]`, which would be incorrectly detected as a cycle. These errors were masked by a secondary flaw, an incorrect call to `set.clear()` which wiped out important information. Rather than continuing to pursue this approach, for a first draft of this component simply use a well-understood recursive cycle detection algorithm which operates linearly in the size of the graph. The implementation can and should be unrolled to be iterative in the future.

Our goal is still correctness, but the use of Tarjan's strongly connected components algorithm makes the ScopeGraph class significantly more complex in pursuit of efficiency. Worse, the prior implementation was recursive, which carries significant costs in python. Revert to a simpler iterative cycle detector which is adjusted to be certain to be correct but somewhat inefficient.

Make cycle detection much more efficient in time (but inefficient in space) by employing a BFS graph traversal which tracks the ancestry paths through the graph which have been discovered. If a back-edge is ever found, that means that there is a cycle.

sirosen · 2023-07-29T17:45:32Z

I've just pushed through some relatively significant refinements to the cycle detection used to find and reject foo[bar[baz]] baz[bar], which I think merit new and fresh review.

Basically what happened was this:

Found a flaw in cycle detection while introducing some test cases for it
Did some research and implemented a classic algorithm for cycle detection (Tarjan's, details in the commits), which was fun and interesting but required a recursive implementation
Did some tests on absurdly large scope strings and found the recursive call performance to be unacceptable
Looked for any information from reliable sources on iterative unrolling of the original recursive algorithm and found none. In fact, I found numerous incorrect implementations in SO, Rosetta Code, etc. But nothing even remotely usable or cite-able.
Rewrote back to the "simplest possible" cycle detector taking O(n^2) time.
Final rewrite to a space-inefficient but time-efficient BFS-based cycle detector. Include a performance test in the testsuite (with appropriate warnings about slow hardware and flakiness).
The "maybe flaky" perf test fails in GitHub Actions. Adjust threshold up and amend.

All in all it was a fun adventure. I would tend to keep the commit history in this case. The simpler (correct) cycle detector and the full implementation of Tarjan's could be useful to us in the future if we need to do more work on this.

The "time efficient" approach in the end does several list copying operations which are, of course, O(n) in their inputs. Those inputs are typically lengths of paths in the graph, so that's some flavor of log(n) amortized. My fun graph theory research told me that the theoretical bounds are linear in the size of the set of vertices and edges -- O(V+E). I think we probably have something like O(V*log(E)) here in the average case, but it's a bit beyond me to figure out the exact theoretical complexity. The wall time is more important to me.

These tests were sensitive to the exact whitespace generated by __str__, which is a cosmetic (non-semantic) detail. It is relatively simple to remove blank lines, which makes the tests more robust to any refinements we make in the future.

joshbryan-globus · 2023-07-31T23:00:44Z

Reading through the conversation here I want to leave a couple of notes for consideration. None of this should be considered blocking critique, but rather just food for thought as you discuss. By and large I'm fine with the PR.

Regarding "it is not possible to represent the consent structures in Auth as a string", while true from perspective of representing them as a single scope string that can passed to /authorize, I think some of the disagreement may be arguing semantics. Going from a scope string to a tree representation of the consents is always possible (auth itself does this), the slightly surprising bit is just that the parse trees are a subset of the consent trees that are created by that string (e.g. a[b] b[c] creates a consent tree that includes an a->b->c even though the parse trees only have the a->b tree). Going from a consent tree as stored in auth to a single scope string isn't possible, but you can always create a sequence of scope strings that would generate that consent structure. E.g. if the trees a->b and b->c existed in auth, then a[b] followed by b[c] would generate that structure.

In reality, neither the scope mini-dsl used for dynamic dependencies nor our internal data structure in auth for storing consents represent them explicitly as a graph. The "graph" nature comes from the fact that "static dependencies" do impose a graph structure on scopes and the semantics of the scope mini-dsl say "for the purposes of this auth call assume static dependencies exist where edges between scopes exist in the parse tree of this string". I think some of the confusion in the description comes from a desire to equate a consent tree and a scope parse tree. It might be worth being explicit about it. E.g. rather than turn a scope string into a ScopeTree into a ScopeGraph, perhaps the final representation should be a ConsentTree (either via a ScopeGraph or directly).

Finally, a quick note on cycle detection (mostly to dig the rabbit hole stephen went down a bit deeper). Tarjan's algorithm is "efficient" if you want to find all cycles, but, as you discovered complex to make iterative. Since we only want to abort when "some" cycle is detected, BFS and DFS are options (just abort the first time you add a node to a path already in the path). Both BFS and DFS should have the same worst case time complexity, but if you care about space complexity, DFS will win as you only need to keep the "current" path in memory. It can be a little less intuitive to unroll into a non-recursive form, but the key difference between the two algorithms is that BFS will use a queue (paths_to_explorer.append / shift) and DFS will use a stack (.pop).

Experimental scope parsing had a vague warning which did not properly announce the limitations of the scope parsing interface. The replacement note instead clarifies only what a parse represents and notes that there is no external callout for validation.

sirosen · 2023-08-01T21:37:16Z

I think our meeting today synchronized us on most elements of this.
I mostly wanted to clarify that because there's no single-string for an arbitrary Consent Forest, scope parsing produces only a specific subset of Consent Forests (notably, those which can be created via a single authentication flow). I think by being overly attentive to this detail, I may have unfortunately confused people a bit and possibly sent some misleading signals (e.g. the documentation warning, now downgraded to a note and clarified).

Finally, a quick note on cycle detection (mostly to dig the rabbit hole stephen went down a bit deeper). Tarjan's algorithm is "efficient" if you want to find all cycles, but, as you discovered complex to make iterative. Since we only want to abort when "some" cycle is detected, BFS and DFS are options (just abort the first time you add a node to a path already in the path). Both BFS and DFS should have the same worst case time complexity, but if you care about space complexity, DFS will win as you only need to keep the "current" path in memory. It can be a little less intuitive to unroll into a non-recursive form, but the key difference between the two algorithms is that BFS will use a queue (paths_to_explorer.append / shift) and DFS will use a stack (.pop).

Yeah, I think I got a full half-hour into iterative Tarjan's before realizing that I wasn't sure it was correct anymore. "Complex" almost understates it!
The funny thing about that algorithm is that it relies heavily on the recursive steps happening in the middle of a series of other operations. I almost built a small state machine to do "manual recursion". 😬

I'm looking at the cycle checker right now to reconsider the potential merit of switching it DFS. I think it's the most efficient approach I'm willing to implement for these purposes, but I still can't quite make the implementation as simple and "obvious" as the current BFS one. If I find a phrasing of it which is clear, I think I'll be satisfied.

There are three changes here which make cycle detection faster. 1. Use DFS rather than BFS, which will find cycles faster in some cases and is less memory intensive for large graphs 2. Represent paths using `(frozenset, str)` tuples, since we don't care about ordering at all. This reduces the information in any given "path" from the prior implementation to a bag of nodes and the current terminus. Order is lost, but frozenset construction and membership checking is faster than the list equivalents 3. Rename such that `get_child_edges` becomes `adjacency_matrix`, converting a function call to an attribute lookup and skipping an unnecessary set construction step

sirosen · 2023-08-01T22:07:29Z

I couldn't help it. Somehow it bothered me too much that I couldn't reshape that BFS to DFS cleanly, when my past experience has usually been that DFS is faster and simpler. Turns out, it was just a matter of phrasing it the right way.
Since I was in there, I started playing performance optimization games (I almost started putting __slots__ on things!), and got the implementation ~25% faster on the 1000-nested-scopes test. It seemed a shame not to commit the results, and here we are...

joshbryan-globus · 2023-08-01T22:31:19Z

I couldn't help it. Somehow it bothered me too much that I couldn't reshape that BFS to DFS cleanly, when my past experience has usually been that DFS is faster and simpler. Turns out, it was just a matter of phrasing it the right way. Since I was in there, I started playing performance optimization games (I almost started putting __slots__ on things!), and got the implementation ~25% faster on the 1000-nested-scopes test. It seemed a shame not to commit the results, and here we are...

Ha ... you beat me to it by 20 minutes ... I had just slapped together this general idea example and then came to post it and see you already got there ... not to let a good nerd sniping go to waste, here it is:

g = {
    'a': ['b', 'c', 'd'],
    'b': ['d', 'c'],
    'c': ['e'],
    'd': ['e'],
    'e': ['d'],
}

def has_cycle(g, start_node):
    stack = [({start_node} , g[start_node])]
    
    while stack:
        print(stack)
        path_set, (next_child, *remaining_children) = stack.pop()
        
        if next_child in path_set:
            # we've visited this node, abort!
            return True
        
        if remaining_children:
            # if there are more children to explorer, re-insert the current path on the stack to visit again
            stack.append((path_set, remaining_children))
        
        if g[next_child]:
            # if the next node has children then we need to push it on the stack and explorer
            stack.append((path_set | {next_child}, g[next_child]))
            
    return False

has_cycle(g, 'a')

Anyway, I agree today's discussion was useful, and I'm glad to see we landed on a phrasing we are all happy with.

1. Use a dataclass for parsed tokens for a cleaner, more declarative style 2. Move the test script into a `__main__.py` module to allow `-m` execution for testing purposes In (2), this now also includes some timing tests which reflect some of the work which was done in order to tune and optimize the cycle detector.

kurtmckee

I'm approving, but I had wanted to test timings on my PC as well and couldn't. Would you double-check that the from...import line is still working on your machine?

tests/unit/experimental/test_scope_parser.py

src/globus_sdk/experimental/scope_parser/__main__.py

Scope parsing performance

- add missing annotation - fix parameter usage in the non-timing test - test for wide scopes as well as deep ones for timing - report more precise timing stats, including `min` Some of the timing data manipulation was informed by reading the `timeit` module source and recommendations.

sirosen requested review from aaschaer, ada-globus, derek-globus and kurtmckee as code owners June 15, 2023 22:17

sirosen added 2 commits July 19, 2023 16:45

Add changelog for the scope parser addition

f296e4f

kurtmckee requested changes Jul 24, 2023

View reviewed changes

Apply suggestions from code review

0236670

Co-authored-by: Kurt McKee <[email protected]>

sirosen commented Jul 24, 2023

View reviewed changes

ada-globus approved these changes Jul 28, 2023

View reviewed changes

src/globus_sdk/experimental/scope_parser/parser.py Outdated Show resolved Hide resolved

sirosen commented Jul 28, 2023

View reviewed changes

src/globus_sdk/experimental/scope_parser/parser.py Outdated Show resolved Hide resolved

sirosen added 5 commits July 28, 2023 17:05

Remove unused attr from experimental ScopeGraph

fe33944

The 'dependencies' property was never used. Remove it.

sirosen force-pushed the add-experimental-scope-parsing branch from 4c113aa to b3f0e8d Compare July 29, 2023 17:26

ada-globus approved these changes Jul 31, 2023

View reviewed changes

Make ScopeGraph __str__ tests less fragile

d8759e2

These tests were sensitive to the exact whitespace generated by __str__, which is a cosmetic (non-semantic) detail. It is relatively simple to remove blank lines, which makes the tests more robust to any refinements we make in the future.

kurtmckee approved these changes Aug 2, 2023

View reviewed changes

Avoid list appends and joins to construct token strings

ca0dc6e

kurtmckee and others added 3 commits August 3, 2023 11:50

Avoid dataclasses and enums in favor of speed

aa05cc1

Use sets, which appear to be faster

6d6ad89

Merge pull request #7 from globus/scope-parsing-performance

6d407ce

Scope parsing performance

kurtmckee approved these changes Aug 3, 2023

View reviewed changes

sirosen force-pushed the add-experimental-scope-parsing branch 2 times, most recently from 5cbe131 to 5d7161a Compare August 3, 2023 19:15

sirosen force-pushed the add-experimental-scope-parsing branch from 5d7161a to 928b774 Compare August 3, 2023 19:29

sirosen merged commit e70cd23 into globus:main Aug 3, 2023
13 checks passed

sirosen deleted the add-experimental-scope-parsing branch August 3, 2023 20:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move all scope parsing to 'experimental' #752

Move all scope parsing to 'experimental' #752

sirosen commented Jun 15, 2023 •

edited by github-actions bot

Loading

kurtmckee left a comment

kurtmckee Jul 24, 2023

sirosen Jul 24, 2023

sirosen left a comment

sirosen Jul 24, 2023

ada-globus left a comment

sirosen left a comment

sirosen commented Jul 29, 2023

joshbryan-globus commented Jul 31, 2023 •

edited

Loading

sirosen commented Aug 1, 2023

sirosen commented Aug 1, 2023

joshbryan-globus commented Aug 1, 2023

kurtmckee left a comment

		representation of consent trees. The scope trees produced by parsing do
		not necessarily describe real consent structures in Globus Auth.

Move all scope parsing to 'experimental' #752

Move all scope parsing to 'experimental' #752

Conversation

sirosen commented Jun 15, 2023 • edited by github-actions bot Loading

kurtmckee left a comment

Choose a reason for hiding this comment

kurtmckee Jul 24, 2023

Choose a reason for hiding this comment

sirosen Jul 24, 2023

Choose a reason for hiding this comment

sirosen left a comment

Choose a reason for hiding this comment

sirosen Jul 24, 2023

Choose a reason for hiding this comment

ada-globus left a comment

Choose a reason for hiding this comment

sirosen left a comment

Choose a reason for hiding this comment

sirosen commented Jul 29, 2023

joshbryan-globus commented Jul 31, 2023 • edited Loading

sirosen commented Aug 1, 2023

sirosen commented Aug 1, 2023

joshbryan-globus commented Aug 1, 2023

kurtmckee left a comment

Choose a reason for hiding this comment

sirosen commented Jun 15, 2023 •

edited by github-actions bot

Loading

joshbryan-globus commented Jul 31, 2023 •

edited

Loading