Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Token spans for non-AST nodes #61

Open
ssbr opened this issue Sep 15, 2020 · 10 comments
Open

Token spans for non-AST nodes #61

ssbr opened this issue Sep 15, 2020 · 10 comments
Assignees

Comments

@ssbr
Copy link
Contributor

ssbr commented Sep 15, 2020

The main pain point I have with asttokens is when there's a non-AST object inside of the AST that semantically has a token span, but which asttokens does not expose (to the best of my knowledge). Two examples:

  1. If you match foo.bar, what is the token span for bar? Unfortunately, the bar is a string object in the AST (Attribute.attr) and gets no first_token / last_token attributes. (Similar issues exist for alias nodes; see Strange behavior with tokens of import nodes #27, which describes an attempted workaround) and other parts of the AST.) This makes it harder to replace one attribute with another, or change one import to another, things like that, without replacing the entire expr/stmt.

  2. If you match [], and want to insert a new element to the list literal, you need to manually do the math, as there is no declared token span for the elts attribute, which is just an empty list. (For nonempty lists, you can use the token spans of the members of the list.)


Feature request: I would like to define a method on ASTTokens, something like get_span(node: AST, attr: str) -> Optional[(Token, Token)], which returns the first_token / last_token of an attribute, instead of of the node itself, as a best effort attempt. These would be either written down explicitly in mark_tokens off to the side (as e.g. a different attribute), or deduced on the fly.

Also, I would like some way to mark these inclusive/exclusive -- for example, in the empty list case, the token span for elts should be identical to the token span for the container, but exclusive of the [ and ] etc. Maybe an inclusive attribute on Token, or maybe just an extra pair of bool return values for get_span, something like this.

Context links: (1) is the main motivator for this limitation in my asttokens-using tool, and (2) will become a problem in ssbr/refex#6 (support for globs, like [$x...], which includes the empty list).

(If this sounds good, I can volunteer to do at least enough to make this work for the empty list case and probably the function call case, which I will need, and maybe some trivial easy ones like Attribute.attr. I don't want to try to solve it for everything though. :))

@alexmojaki
Copy link
Contributor

I haven't tried it myself but you might be interested in https://github.com/Instagram/LibCST

@dsagal
Copy link
Member

dsagal commented Sep 15, 2020

For attribute lookup, the Attribute node's last_token will always be the name of the attribute itself. It's a single token. So I think that's a simple case, and not sure that adding any layer to that would make it simpler. (I could be mistaken though.)

For brackets, that's a different issue. Haven't looked into it, but it reminds me of a somewhat similar needs involving parentheses: #11

@ssbr
Copy link
Contributor Author

ssbr commented Sep 15, 2020

For attribute lookup, the Attribute node's last_token will always be the name of the attribute itself. It's a single token. So I think that's a simple case, and not sure that adding any layer to that would make it simpler. (I could be mistaken though.)

It's not so much "making it simpler" as avoiding hardcoding all of these rules -- they should mostly be just as simple, but they start piling up.

For example, alias nodes are similarly simple -- as it's constrained to the form $x and $x as $y, where $x is n tokens and $y is 1 token, you can pretty easily get the token span for each if you have the token span for the whole alias node. (If it has a non-None asname, then $x is all but the last 2 tokens, and $y is the last token. Otherwise, $y doesn't exist and $x is all of the alias tokens.)

It'd be nice to centralize all of these simple rules inside asttokens, although I can still write them by hand in my library-that-uses-asttokens if need be.

(This is also true of use case 2 / brackets.)

For brackets, that's a different issue. Haven't looked into it, but it reminds me of a somewhat similar needs involving parentheses: #11

They're definitely related.

A motivating use case for this span is to e.g. replace (a, b) with (a, b, c), which requires knowledge of where the elts is without the parens to insert the c. The degenerate case is transforming () to (c,), where there's no way to know where the elts is without manually stripping parens.

If the resolution for #11 is to include parens, then the Tuple node would include the parens, and to strip them, you would access the span for elts. (This adds an element of choice, as well as enabling things like the (a, b, c) rewrite.)

@ssbr
Copy link
Contributor Author

ssbr commented Sep 15, 2020

I haven't tried it myself but you might be interested in https://github.com/Instagram/LibCST

Sorry, I forgot to reply to this.

I'd be open to (additionally) using LibCST, but I really like how asttokens is built on the existing AST, and would prefer something like LibCST that layers itself on top of the AST. (Which is kind of what I view asttokens as, though it's missing many desirable things such as this FR, or knowledge of parenthesis-expressions.)

@dsagal
Copy link
Member

dsagal commented Sep 15, 2020 via email

@ssbr
Copy link
Contributor Author

ssbr commented Sep 15, 2020

(Replying out of order)

To handle empty spans, perhaps better to return (startTok, endTok), where the endTok is AFTER the last token? That would allow covering an empty token span, while still exposing a relevant position in the text.

Oh, actually, yeah, that should work perfectly! My only concern is that this is totally different from last_token semantics, but I guess that's just something callers need to know.

So you mean a helper like get_span(attrNode, “attr”) or get_span(tupleNode, “elts”)? That sounds good to me, and seems helpful to include into the asttokens library (rather than a separate one on top of it).

Yes, exactly right! Only catch is that it can't be a top-level function, I think it needs to either be a method, or accept an ASTTokens object as a parameter, so that it can call next_token(). For example, the implementation for elts would presumably be something like:

return (self.next_token(container_node.first_token), container_node.last_token)

If this plan looks good to everyone, I can volunteer to implement this for several easy cases and send a PR implementing those -- at the least Attribute.attr and container elts, and I would like to try to do something for functions as well. I would continue to contribute as cases arise for me, but those are the important cases for me right at this moment.

@dsagal
Copy link
Member

dsagal commented Sep 16, 2020

This sounds fine to me.

I did a little thinking to clarify the goals for this. Such rules (e.g. for getting parts of Attribute or List nodes) only make sense in a particular context, and would usually occur along with other hard-coded logic specific to this context. One needs to deal with a specific kind of node to make use of get_span(node, "attr") or get_span(node, "elts"). In other words, it doesn't really enable any generic usage.

But there are other benefits:

  1. Convenience. It's easy enough to remove open/close bracket from a range of tokens, but better to save others the trouble of figuring out the same few lines of code.
  2. Documentation. It would be valuable to have this documented well (what attributes it supports for what nodes, with an example of the result), since it's kind of hard to know from the standard-library documentation what's available in what node.
  3. Consistency. These node-parts differ a bit across versions, e.g. I think attr is called attrname in Python2 and in astroid. This method would allow code that doesn't need to be aware of the AST version.

There are some related needs for which it would be useful to have a method with similar usage. E.g. imagine that get_span(node, "with_parens") returns an expanded range of tokes that includes any parentheses around the node (#11). Also get_span(node, "node") could be similar to .get_tokens(node), but return the (startTok, endTok) pair to match the semantics of other get_span() usage.

Interface-wise, separate methods may make more sense (e.g. get_span_attr(attr_node), get_span_elts(list_node)) but it's definitely an advantage to keep them grouped one way or another, so that it's easier for the user to be aware of what's available.

@ssbr
Copy link
Contributor Author

ssbr commented Sep 16, 2020

I did a little thinking to clarify the goals for this. Such rules (e.g. for getting parts of Attribute or List nodes) only make sense in a particular context, and would usually occur along with other hard-coded logic specific to this context. One needs to deal with a specific kind of node to make use of get_span(node, "attr") or get_span(node, "elts"). In other words, it doesn't really enable any generic usage.

If by generic usage you mean like, usage that is not aware of specific AST object type, that is still possible. For example, if you were writing some tool that dumped the AST and information about token spans for debugging, you might write something like this:

def dump_ast(astt, node):
  print(ast.dump(node))
  print(f"{node.first_token.startpos}-{node.last_token.endpos}")
  for field, child in ast.iter_fields(node):
    child_span = astt.get_span(node, field)
    if child_span is not None:
      print(f"    {field}: {span[0].startpos}-{span[1].startpos}")
    else:
      print(f"    {field} (no span)")

But yeah, there isn't much use for this if you don't know what the type is -- the use case is generally that you want to do some specific transform. Without that, there is not as much use for token data, and it could really be arbitrary or missing.

There are some related needs for which it would be useful to have a method with similar usage. E.g. imagine that get_span(node, "with_parens") returns an expanded range of tokes that includes any parentheses around the node (#11). Also get_span(node, "node") could be similar to .get_tokens(node), but return the (startTok, endTok) pair to match the semantics of other get_span() usage.

I really want that with_parens thing, in some API 😍 (also probably without_parens) -- that's further along in my very long todo list though.

There is a corner case where an AST node has a list of strings, instead of a list of nodes, where you might want to know the span of one of those substrings -- for example, global a, b, c has the AST Global(names=['a', 'b', 'c']) (and same with nonlocal). If you wanted to rename all instances of a to new_a, you'd be in trouble (without rewriting the entire global statement). I think this sort of drives towards two different API shapes:

  1. a single get_span function, callable as get_span(node) to get the span of the node itself, get_span(node, attr) to get the span of an attribute, and get_span(node, attr, index) to get the span of that element of the list attribute.

  2. three functions, something like get_span(node), get_attr_span(node, attr), and get_item_span(node, attr, index) respectively

Interface-wise, separate methods may make more sense (e.g. get_span_attr(attr_node), get_span_elts(list_node)) but it's definitely an advantage to keep them grouped one way or another, so that it's easier for the user to be aware of what's available.

TBH I worry that there'd be far too many as things get added over time. If this is the way to go, I'd suggest a separate module and accepting the ASTTokens as a parameter, so that they're all grouped in one place as you say.

It also starts getting worrisome if two different AST objects share attributes with the same name but different semantics (idk if that happens already or not).

One could imagine monkeypatching them onto the AST nodes, much like first_token and last_token. I'm actually super not a fan of that to begin with, but it's already the pattern. Maybe:

  • my_tuple_node.token_span_elts()
  • my_tuple_node.asttokens.span_elts()
  • asttokens.info(my_tuple_node).span_elts() (the way it would look without monkeypatching -- internally there's a global dictionary that maps from AST object to NodeInfo object or whatever, where the NodeInfo carries enough information to answer questions like this.)

This has the property of being naturally grouped and avoiding magic string constants.


Explosion of possibilities here, and I don't mean to bikeshed too much. As long as we're clear on the basic approach, I can start implementing, and am happy to work out the relatively superficial detail of exactly how the API looks concurrently.

@ssbr
Copy link
Contributor Author

ssbr commented Sep 16, 2020

Looks like I can't edit the issue, but feel free to assign to me.

@dsagal dsagal assigned dsagal and ssbr and unassigned dsagal Sep 16, 2020
@dsagal
Copy link
Member

dsagal commented Sep 16, 2020

I assigned to you. You make a good point about iter_fields (in the generic example) -- it makes the string constants less magic.

Thanks for the analysis of issues and possibilities. They all make sense, and I agree that it seems fine to start with something and iterate. Thanks for tackling this! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants