-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move all scope parsing to 'experimental' #752
Conversation
Scope parsing is now exposed via `Scope.parse`, which is only available in `globus_sdk.experimental.scope_parser` as a method of parsing scopes. The scope parsing passes through two IRs -- first a tree, then a graph -- which reject cycles, normalize any optional/non-optional redundancies, and ensure that the tree is built in the same way that it will be during authorization flows in Globus Auth. This replaces our "shadow" implementation of MutableScope, which had been left in `scopes.scope_definition` under the name `Scope` but was never exported or documented. All of the features of `Scope` under the old location are now integrated into `globus_sdk.experimental.scope_parser.Scope`. Scope parsing is tested independently of other scope tests, moving and expanding these tests into `tests/unit/experimental/`.
pylint flagged `graph` (used earlier in a function) as a redefinition of a name from outer scope. This happens because the block under the `if __name__ == "__main__"` check is technically in module scope. Therefore, at analysis-time, there is a module-scoped variable named `graph` and we "need" to avoid redefinition. This could be tucked into a `def _main()` to "hide" the name from module scope, but a rename is equally simple to dodge the overly-aggressive pylint check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to discuss this further before finishing the review.
The scope trees produced by parsing do not necessarily describe real consent structures
This acknowledgement concerns me.
changelog.d/20230719_164657_sirosen_add_experimental_scope_parsing.rst
Outdated
Show resolved
Hide resolved
docs/experimental/scope_parser.rst
Outdated
representation of consent trees. The scope trees produced by parsing do | ||
not necessarily describe real consent structures in Globus Auth. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scope trees produced by parsing do not necessarily describe real consent structures
I don't like this. I think this has been discussed before, but I'm not keen to introduce something that doesn't match the actual Globus Auth implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one thing that I want to have very clearly stated is that it is not possible to represent the consent structures in Auth as a string. Scope string parsing can only match the consent structure produced by a single authentication and consent flow.
The implementation does match what Auth does for a scope string used in a single consent flow. We're just bound by the limitations of the string representation vis-a-vis Auth's more elaborate internal state.
Co-authored-by: Kurt McKee <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the typo fixes (applied)!
I do want us to discuss this more, and I'm sure we'll have a productive conversation when we find time for this.
I've left one inline note about not matching the Auth consent trees, and there is one other thing I'd like to share up-front: I'm motivated pretty strongly to take this implementation -- most of which exists but is not exposed in the project today -- and move it out of the "normal" parts of the source tree.
docs/experimental/scope_parser.rst
Outdated
representation of consent trees. The scope trees produced by parsing do | ||
not necessarily describe real consent structures in Globus Auth. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one thing that I want to have very clearly stated is that it is not possible to represent the consent structures in Auth as a string. Scope string parsing can only match the consent structure produced by a single authentication and consent flow.
The implementation does match what Auth does for a scope string used in a single consent flow. We're just bound by the limitations of the string representation vis-a-vis Auth's more elaborate internal state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel comfortable with the correctness given the advisories about usage in the code.
I have a couple things that felt a little odd to me as I was reading, but none that make me want to block this from heading to experimental
(but might be worth considering if we want to promote this out of experimental at some point):
- In thinking about what we're exposing on the
ScopeGraph
, I'm wondering if we want to lock down the public interface a bit more strictly as it seems possible to fairly easily break the representation by manipulating the public attributes (e.g., naively as a user I might think it safe to callself.edges.add(foo)
w/o realizing that there is a private attribute that needs to track changes here) - It's surprising that when I access
ScopeGraph.nodes
I'm not working withScopeTreeNodes
(I get that one isScopeTree
and one isScopeGraph
but we do build aScopeGraph
usingScopeTreeNodes
so it all gets kind of 😵💫).
While I haven't thought about it super deeply, I'm wondering if it might make sense to add something like a ScopeGraph.add_scope()
method that will accept strings or ScopeTreeNodes
and allow that to manage the properties of the graph so we don't need to leak some of these details, which I think would let us make nodes
and edges
private, and make top_level_scopes
read-only...? But very possible I haven't thought through the downstream consequences sufficiently.
Regardless, this is non-blocking feedback IMO, so I am simultaneously hitting approve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to adjust in response to one bit of inline feedback but I'm also going to move the entire parsing implementation into a privately named module, _parser
.
Neither ScopeTreeNode
nor ScopeGraph
are, in their current implementations, sanitized enough to be user consumable. I tend to think that the absence of docstrings is a deal-breaker on that in its own right. This is currently not as clear as it could be, and the presence of an underscore in the name should serve as a stronger warning sign (for users) and a better disambiguator (for contributors).
In order to better indicate that the parser is an implementation detail, move several components into a `_parser.py` module. The `errors` are separately declared and exposed to better represent that they are part of the public interface of the package.
The 'dependencies' property was never used. Remove it.
In the experimental scope graph component, the cycle detection code was not correct. The simplicity of the approach used did not properly account for a number of scenarios like `foo[bar[baz] baz]`, which would be incorrectly detected as a cycle. These errors were masked by a secondary flaw, an incorrect call to `set.clear()` which wiped out important information. Rather than continuing to pursue this approach, for a first draft of this component simply use a well-understood recursive cycle detection algorithm which operates linearly in the size of the graph. The implementation can and should be unrolled to be iterative in the future.
Our goal is still correctness, but the use of Tarjan's strongly connected components algorithm makes the ScopeGraph class significantly more complex in pursuit of efficiency. Worse, the prior implementation was recursive, which carries significant costs in python. Revert to a simpler iterative cycle detector which is adjusted to be certain to be correct but somewhat inefficient.
Make cycle detection much more efficient in time (but inefficient in space) by employing a BFS graph traversal which tracks the ancestry paths through the graph which have been discovered. If a back-edge is ever found, that means that there is a cycle.
4c113aa
to
b3f0e8d
Compare
I've just pushed through some relatively significant refinements to the cycle detection used to find and reject Basically what happened was this:
All in all it was a fun adventure. I would tend to keep the commit history in this case. The simpler (correct) cycle detector and the full implementation of Tarjan's could be useful to us in the future if we need to do more work on this. The "time efficient" approach in the end does several list copying operations which are, of course, O(n) in their inputs. Those inputs are typically lengths of paths in the graph, so that's some flavor of |
These tests were sensitive to the exact whitespace generated by __str__, which is a cosmetic (non-semantic) detail. It is relatively simple to remove blank lines, which makes the tests more robust to any refinements we make in the future.
Reading through the conversation here I want to leave a couple of notes for consideration. None of this should be considered blocking critique, but rather just food for thought as you discuss. By and large I'm fine with the PR. Regarding "it is not possible to represent the consent structures in Auth as a string", while true from perspective of representing them as a single In reality, neither the scope mini-dsl used for dynamic dependencies nor our internal data structure in auth for storing consents represent them explicitly as a graph. The "graph" nature comes from the fact that "static dependencies" do impose a graph structure on scopes and the semantics of the scope mini-dsl say "for the purposes of this auth call assume static dependencies exist where edges between scopes exist in the parse tree of this string". I think some of the confusion in the description comes from a desire to equate a consent tree and a scope parse tree. It might be worth being explicit about it. E.g. rather than turn a scope string into a ScopeTree into a ScopeGraph, perhaps the final representation should be a ConsentTree (either via a ScopeGraph or directly). Finally, a quick note on cycle detection (mostly to dig the rabbit hole stephen went down a bit deeper). Tarjan's algorithm is "efficient" if you want to find all cycles, but, as you discovered complex to make iterative. Since we only want to abort when "some" cycle is detected, BFS and DFS are options (just abort the first time you add a node to a path already in the path). Both BFS and DFS should have the same worst case time complexity, but if you care about space complexity, DFS will win as you only need to keep the "current" path in memory. It can be a little less intuitive to unroll into a non-recursive form, but the key difference between the two algorithms is that BFS will use a queue ( |
Experimental scope parsing had a vague warning which did not properly announce the limitations of the scope parsing interface. The replacement note instead clarifies only what a parse represents and notes that there is no external callout for validation.
I think our meeting today synchronized us on most elements of this.
Yeah, I think I got a full half-hour into iterative Tarjan's before realizing that I wasn't sure it was correct anymore. "Complex" almost understates it! I'm looking at the cycle checker right now to reconsider the potential merit of switching it DFS. I think it's the most efficient approach I'm willing to implement for these purposes, but I still can't quite make the implementation as simple and "obvious" as the current BFS one. If I find a phrasing of it which is clear, I think I'll be satisfied. |
There are three changes here which make cycle detection faster. 1. Use DFS rather than BFS, which will find cycles faster in some cases and is less memory intensive for large graphs 2. Represent paths using `(frozenset, str)` tuples, since we don't care about ordering at all. This reduces the information in any given "path" from the prior implementation to a bag of nodes and the current terminus. Order is lost, but frozenset construction and membership checking is faster than the list equivalents 3. Rename such that `get_child_edges` becomes `adjacency_matrix`, converting a function call to an attribute lookup and skipping an unnecessary set construction step
I couldn't help it. Somehow it bothered me too much that I couldn't reshape that BFS to DFS cleanly, when my past experience has usually been that DFS is faster and simpler. Turns out, it was just a matter of phrasing it the right way. |
Ha ... you beat me to it by 20 minutes ... I had just slapped together this general idea example and then came to post it and see you already got there ... not to let a good nerd sniping go to waste, here it is:
Anyway, I agree today's discussion was useful, and I'm glad to see we landed on a phrasing we are all happy with. |
1. Use a dataclass for parsed tokens for a cleaner, more declarative style 2. Move the test script into a `__main__.py` module to allow `-m` execution for testing purposes In (2), this now also includes some timing tests which reflect some of the work which was done in order to tune and optimize the cycle detector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm approving, but I had wanted to test timings on my PC as well and couldn't. Would you double-check that the from...import
line is still working on your machine?
5cbe131
to
5d7161a
Compare
- add missing annotation - fix parameter usage in the non-timing test - test for wide scopes as well as deep ones for timing - report more precise timing stats, including `min` Some of the timing data manipulation was informed by reading the `timeit` module source and recommendations.
5d7161a
to
928b774
Compare
This is a large body of work which I've had sitting in a branch for some time.
I haven't been able to fully review it although I rebased it, cleaned it up, and got tests working against it.
We also haven't discussed the pros/cons of this being available in
globus_sdk.experimental
.My summary of what I was able to pull together follows, but there may be important details missing. Given that this is meant for
experimental
, I think we can accept some gaps in terms of direction (e.g. Should it go through the Graph IR at all? I'm not fully sure).Scope parsing is now exposed via
Scope.parse
, which is only available inglobus_sdk.experimental.scope_parser
as a method of parsing scopes.The scope parsing passes through two IRs -- first a tree, then a graph -- which reject cycles, normalize any optional/non-optional redundancies, and ensure that the tree is built in the same way that it will be during authorization flows in Globus Auth.
This replaces our "shadow" implementation of MutableScope, which had been left in
scopes.scope_definition
under the nameScope
but was never exported or documented. All of the features ofScope
under the old location are now integrated intoglobus_sdk.experimental.scope_parser.Scope
.Scope parsing is tested independently of other scope tests, moving and expanding these tests into
tests/unit/experimental/
.📚 Documentation preview 📚: https://globus-sdk-python--752.org.readthedocs.build/en/752/