Include parentheses around expressions in token range #11
Comments
Since the parens of lhs and rhs seem to be included in the parent expression range, can't you partition the code at the operator and swap the strings? A bad implementation would be...
|
Splitting the whole parent expression on the first occurrence of the parent expression’s operator will obviously not work:
What can work is to take the range between the end of
Not the most elegant way, but it will do for now. |
Well, I wrote "A bad implementation would be..." So of course my naive implementation doesn't work in all cases ;) And BTW regex also doesn't work 100%. After all, expressions can also contain arbitrary whitespace and comments. E.g. what about this expression:
I'd not worry about parentheses in particular. Instead I'd get the location of the operator by locating the actual token of the operator. If comments and whitespace between sub-expressions and the operator are not included in sub-expressions' text, then you'd still need custom code to handle this.
Then get sublist based on |
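To illustrate the suggestion of locating the operator by its actual token instead of searching the string, here is a minimal sketch. It uses only the standard library (`ast` and `tokenize`) rather than asttokens, needs Python 3.8+ for `end_lineno`/`end_col_offset`, and assumes ASCII source, because `ast` column offsets count UTF-8 bytes while `tokenize` columns count characters. The name `find_operator_token` is hypothetical and not part of any library.

```python
import ast
import io
import tokenize


def find_operator_token(source):
    """Return the tokenize token of the top-level binary operator in `source`.

    Assumption: `source` is a single ASCII expression whose root is a BinOp.
    """
    node = ast.parse(source, mode='eval').body
    if not isinstance(node, ast.BinOp):
        raise ValueError('expected a binary operation')
    # Positions of the operands as reported by ast; these exclude any
    # parentheses, comments and whitespace around the operands.
    left_end = (node.left.end_lineno, node.left.end_col_offset)
    right_start = (node.right.lineno, node.right.col_offset)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Between the operands, the only OP tokens are parentheses
        # wrapping an operand and the operator itself.
        if (tok.type == tokenize.OP
                and tok.string not in ('(', ')')
                and left_end <= tok.start
                and tok.end <= right_start):
            return tok
    raise ValueError('operator token not found')


tok = find_operator_token("(a # lhs\n * # op\n (b))")
print(tok.string, tok.start)
```

Because comments and newlines are separate tokens, they are skipped without any regex work; only the parentheses around operands need explicit filtering.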
Ugh. I played a bit with your example. It’s much uglier than you expected. The
Which is syntactically correct but will not meet any reasonable user expectations. |
I see :D Well, in any case the authors of such convoluted expressions are probably not the target audience of your extension ;) Good luck with your project! Is it on GitHub? |
Not yet, I’m still assessing if it’s viable. |
Parentheses are a bit special in that they don't cause creation of extra nodes in the AST (only affect order of operations). So they need a bit of extra work to capture them. Luckily,
import token
import asttokens

def range_with_parens(atokens, node):
    """Returns (startpos, endpos) of the text containing node and any surrounding parentheses."""
    first = node.first_token
    last = node.last_token
    while (asttokens.util.match_token(atokens.prev_token(first), token.OP, '(') and
           asttokens.util.match_token(atokens.next_token(last), token.OP, ')')):
        first = atokens.prev_token(first)
        last = atokens.next_token(last)
    return (first.startpos, last.endpos)

def text_with_parens(atokens, node):
    """Returns text of node including any surrounding parentheses."""
    first, last = range_with_parens(atokens, node)
    return atokens.text[first:last]
|
Thanks, this kinda works.
#!/usr/bin/python3

import sys
import textwrap
import token

import asttokens


def conservative_extent(atokens, node):
    prev = atokens.prev_token(node.first_token)
    next = atokens.next_token(node.last_token)
    if (asttokens.util.match_token(prev, token.OP, '(') and
            asttokens.util.match_token(next, token.OP, ')')):
        return prev.endpos, next.startpos
    return node.first_token.startpos, node.last_token.endpos


def greedy_extent_limited(atokens, node, start_limit, end_limit):
    first = node.first_token
    last = node.last_token
    while True:
        prev = atokens.prev_token(first)
        next = atokens.next_token(last)
        if not (asttokens.util.match_token(prev, token.OP, '(') and
                asttokens.util.match_token(next, token.OP, ')') and
                prev.startpos >= start_limit and
                next.endpos <= end_limit):
            return (atokens.next_token(prev, True).startpos,
                    atokens.prev_token(next, True).endpos)
        first = prev
        last = next
    return (first.startpos, last.endpos)


def test_transpose(text):
    print(text)
    print('↓')
    atokens = asttokens.ASTTokens(text, parse=True)
    node = atokens.tree.body[0].value
    start, end = conservative_extent(atokens, node)
    l_start, l_end = greedy_extent_limited(atokens, node.left, start, end)
    r_start, r_end = greedy_extent_limited(atokens, node.right, start, end)
    print(''.join((text[:l_start],
                   text[r_start:r_end],
                   text[l_end:r_start],
                   text[l_start:l_end],
                   text[r_end:])))
    print()


def main():
    test_transpose(textwrap.dedent('''\
        ((a * b + c) * d)'''))
    test_transpose(textwrap.dedent('''\
        (a # this is my lhs
         # *** end of lhs ***
         *
         # *** start of rhs
         b # this is my rhs
        )'''))


if __name__ == '__main__':
    sys.exit(main() or 0)
The common case is nice and correct. The ugly case is still ugly but correct. |
Good point to limit to the parent's extent. You could probably simplify the interface a bit to still just take one node, and expand its range to include parentheses, limiting the range to the range of the node's parent. Although I am a little confused: in which case does limiting to the parent actually cause different behavior? Is one of your examples affected? |
As Mateusz pointed out, it’s not only about parentheses. I do want to expand each sub-node’s range to include parentheses, but I also want to include comments that are within the parent’s parentheses. Consider the “ugly example” above. I will tokenize it for clarity and mark out the various ranges involved. I will also add a few extra pairs of parentheses. Here’s the starting position. The original parent range is too small so limiting to it will not let me extend over the comment after
If I extend each range to include each matching pair of parentheses and only those, I get this:
Still not greedy enough. So I extend the operands’ ranges all the way up to but not including the nearest significant tokens that do not form a pair of parentheses:
After drawing these schematics, I do believe the parent range is not strictly necessary. Still, I’m inclined to keep it simply because it will remind my future self what the maximum possible scope of the transformation is.
|
Makes sense. The convention in Python is that end-of-line comments are associated with the code on that line, and lines that only contain a comment are associated with the following line. So it might be a helpful utility to expand a node's token range to include any end-of-line comments AND any preceding comment-only lines (or maybe only preceding comment-only lines until you hit a blank line). And also to include surrounding parentheses. (One difference from your proposal is if you have an end-of-line comment after the operator There was just a discussion of a similar requirement in #10 -- about finding end-of-line comments after a token. Also, to make this completely general-purpose, we'd run into a problem with parentheses and function calls. E.g. in |
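The comment convention described here can be sketched roughly as follows. This is a simplified, line-based illustration using only the standard library `tokenize` module, not an asttokens utility: it works on whole lines rather than token ranges, so an end-of-line comment is covered simply because it shares a line with the code. The function names `comment_only_lines` and `expand_to_leading_comments` are hypothetical.

```python
import io
import tokenize


def comment_only_lines(source):
    """Return the set of 1-based line numbers that hold nothing but a comment."""
    lines = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # A comment-only line has nothing but whitespace before the comment.
        if tok.type == tokenize.COMMENT and not tok.line[:tok.start[1]].strip():
            lines.add(tok.start[0])
    return lines


def expand_to_leading_comments(source, first_line, last_line):
    """Extend a line range upward over directly preceding comment-only lines.

    Expansion stops at the first blank line or line of code, so comments
    separated from the code by a blank line are left alone.
    """
    comments = comment_only_lines(source)
    while first_line - 1 in comments:
        first_line -= 1
    return first_line, last_line


source = (
    "x = 1\n"
    "\n"
    "# unrelated\n"
    "\n"
    "# about y\n"
    "# more about y\n"
    "y = 2  # trailing\n"
    "z = 3\n"
)
print(expand_to_leading_comments(source, 7, 7))
```

Using `tokenize` instead of checking for a leading `#` avoids mistaking lines inside multi-line strings for comments.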
I have no problem with that. In my code above, I’m expanding
Yes, that’s what I had in mind when I said “parentheses that are not part of the parent node’s syntax” in the original comment. However, in case a function is called with a single argument, that argument’s expression can be wrapped in an arbitrary number of pairs of parentheses, and only the outermost is syntactic:
|
That all makes sense. Well, if you think such a utility belongs into |
I will try out the concept privately to gain some hands-on experience with it. If/when it proves useful, I may propose a pull request. Let me see if I get this right: you want it as a utility only, not as a change to the default behavior, right? |
I guess different use cases might want different concept of which range of tokens comprises a node. E.g. in |
While working on #36 / #28 I've found that in Python 3.8 the parentheses surrounding a tuple are now being included, because the
from asttokens import ASTTokens

source = """
(1, 2)
"""
atok = ASTTokens(source, parse=True)
t = atok.tree.body[0].value
print(t)
print(atok.get_text(t))
print(t.col_offset)
There's still no difference for the source So at least for tuples, the choices are:
I'm in favour of 3 because I think it's more intuitive. People tend to think of tuples as being defined by parentheses even if they're not, which is probably why Python has made this change. Of course this will mean changing the behaviour of the library which may cause some people problems. @dsagal what do you think? |
This issue with tuples and parentheses is also causing other bugs. Because asttokens tries to include trailing commas within a tuple, when it already has the surrounding parentheses it ends up including commas outside of the tuple. This causes
import ast
from asttokens import ASTTokens

source = """
foo((x,),)
"""
tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.Tuple):
        print(node, node.lineno, node.col_offset)
atok = ASTTokens(source, tree=tree)
t = tree.body[0].value.args[0]
print(t)
print(atok.get_text(t))
In Python 3.7, |
I agree it makes sense to change behavior to always include tuple parens, even pre-3.8. I don't know if anyone relies on them NOT being included, but since it's a change, it should perhaps get a major version bump. We'd need to keep in mind, however, that sometimes tuples exist without parens. |
I relied on them not being included so that I could have minimal diffs. (This change broke my tests. :]) I can work around it though. I don't agree that it makes sense. Whether parentheses are required is a property of the substitution-site, not the expression. Suppose I take a statement like So whether I want to eat as many parens as possible, or as few as possible, depends on what I am doing: if I'm deleting something and replacing it with something else, I probably want to delete all the parens and decide only how many I need for my replacement expression. And when I am inserting an expression into a blank spot where an expression belongs, I only want to add as many parens as are necessary, and no more. (The way I handle this during substitution is to try adding parentheses and see if it changes the parse tree. If it does, those parens are necessary to preserve the intended effect.) IMO the "ideal" is for asttokens to allow the user to decide what they want: as many parens as can possibly be attributed to this expression, or none of them. Doing it inconsistently between expression types definitely doesn't sound great, and choosing only one or the other forces the user to work around it somehow (although really that isn't too hard.) |
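The check described in this comment (add parentheses, see whether the parse tree changes) might look something like the sketch below. It is not the commenter's actual code, which is not shown in the thread. It uses only the standard library `ast` module, and the name `needs_parens` and the prefix/suffix interface are hypothetical. It relies on `ast.dump` omitting position attributes by default, so redundant parentheses produce identical dumps.

```python
import ast


def needs_parens(prefix, expr, suffix):
    """Return True if `expr` must be parenthesized when placed between
    `prefix` and `suffix` to keep the parse tree it has with parentheses."""
    bare = prefix + expr + suffix
    wrapped = prefix + '(' + expr + ')' + suffix
    try:
        bare_dump = ast.dump(ast.parse(bare))
    except SyntaxError:
        # Without parentheses the code does not even parse.
        return True
    # Redundant parentheses leave the tree unchanged, so any difference
    # means the parentheses carry meaning at this substitution site.
    return bare_dump != ast.dump(ast.parse(wrapped))


print(needs_parens('', 'a + b', ' * c'))
print(needs_parens('', 'a * b', ' + c'))
print(needs_parens('f(', 'a, b', ')'))
```

The same expression can need parentheses at one site and not at another, which matches the point that this is a property of the substitution site rather than of the expression.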
I disagree, I think more people would write |
Maybe, but I hope it's clear that making an opinionated decision here makes it less convenient for tools that build on top of asttokens to make different decisions. Edit: and we can probably agree on it entirely for other kinds of expressions, like arithmetic. |
That's an interesting point, and sorry, @ssbr, about creating hassle for you with the backward-incompatible change. In your example, in the expression What you were relying on wasn't a very useful behavior. In fact, I imagine to get what you want, you needed to add your own handling to strip unneeded parentheses from an expression like So I think adding an option for whether to include parens solves a very particular need, and doesn't solve it very well. At least given your description, a more useful feature would be a function that takes an AST node and returns a concise code for it. But that already exists, I think -- e.g. in |
Not sure what you were trying to say here @dsagal? There is no Expr node anywhere in
If I think the current behaviour is a good default behaviour which matches people's natural expectations for simple uses of the library. But clearly there are at least two people who want an option for different behaviour, albeit in opposite directions: @yurikhan wants to be able to include them more, and @ssbr less. Including such behaviour will at the very least require proposing an API, and will probably also need a PR as implementing this stuff is not trivial. |
Doh, you are right. Sorry to comment based only on recollections which were incorrect :) And without even looking carefully at which issue the thread is about. So scratch my comments, and listen to @alexmojaki instead. If anyone wants to have a go at offering an option for this, it seems to me that expanding token ranges to include parens, or stripping parens, is fine to do after parsing. I.e. any node's range could be expanded or stripped of surrounding parentheses with no change in meaning (might be wrong on this, but can't think of anything). In this case something along the lines of the |
@alexmojaki Thanks for the ping, will leave a comment on that issue. FWIW I am absolutely happy and fine with any decision asttokens makes. The worst it will do is break my tests and maybe make diffs include slightly more parens than they would've otherwise. I'll try to open source my thing this year instead of just using it internally at work so that I have lines of code I can point at. :) |
Not quite rely, but I'm pretty sure that That said, tuck already has its own logic for finding the outermost set of parens which belong to an expression: https://github.com/PeterJCLaw/tuck/blob/a542aacd000d25c409c6d68b8e62ca18c9375b8e/tuck/wrappers.py#L30-L62 I suspect the concern the original author had here (and one I share) is that the inclusion of parens sometimes feels inconsistent between node types. I can't immediately point to an example and I'm not sure how much of this is a result of Python's AST itself rather than what An alternative that would be non-breaking might be to have a util that can walk outwards from a token/pair of tokens/ast node and find the outer bounds, assuming we can find a useful semantic for what the outcome of that is. (I see there was some discussion about that above, though I'll admit I've not completely followed it all). If the spelling that I've linked to above would be useful either as-is or as the basis for something I'd be happy to see it upstreamed into |
I am trying to use asttokens for source-level transformations in a code editor. For example, I would like to position the cursor on a binary operation, press a key, and the operands would be transposed: (a+b) * c → c * (a+b).
The way I’m trying to do that is:
asttokens.ASTTokens. (The smaller the better because the whole file cannot be guaranteed to be syntactically correct all the time. For now, I operate at the level of function definitions.)
BinOp.
left and right operands, and swap them. (This is to preserve as much of the original code formatting as possible. For instance, re-generating the whole function using astor will strip all comments and possibly mess up spacing and line structure.)
However, I find that at step 3, if either operand is enclosed in parentheses, they do not count towards the token range:
so if I swap the corresponding character ranges, I get (c) * a + b which is wrong.
It would be nice if any parentheses that are not part of the parent node’s syntax were included in the token range of the child nodes:
(Or am I overlooking some other Obvious Way to Do It?)
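The problem described above can be reproduced with the standard library alone. The sketch below is an illustration, not the code from the editor extension: it uses the positions that `ast` reports on Python 3.8+ instead of asttokens token ranges, assumes a single-line ASCII expression, and the name `naive_transpose` is hypothetical.

```python
import ast


def naive_transpose(source):
    """Swap the operands of a top-level BinOp by character range,
    ignoring any parentheses around the operands."""
    node = ast.parse(source, mode='eval').body
    if not isinstance(node, ast.BinOp):
        raise ValueError('expected a binary operation')
    # Node positions do not include parentheses that merely group
    # an operand, which is exactly what breaks the swap.
    l_start, l_end = node.left.col_offset, node.left.end_col_offset
    r_start, r_end = node.right.col_offset, node.right.end_col_offset
    return ''.join((source[:l_start],
                    source[r_start:r_end],
                    source[l_end:r_start],
                    source[l_start:l_end],
                    source[r_end:]))


print(naive_transpose('(a+b) * c'))  # prints "(c) * a+b"
```

The parentheses stay where they were while the operand text moves, so the result groups the wrong operand and changes the meaning of the expression.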