Is your feature request related to a problem? Please describe.
Lookarounds are useful for many things and present interesting optimization opportunities. There are a few optimizations that I would like to request.
Describe the solution you'd like
I would like to see the following optimizations:
1. Inlining lookarounds at the end of the expression tree of a lookaround. E.g. >> "a" (>> "b") → >> "a" "b". For this optimization to be valid, the inner and outer lookarounds have to look in the same direction, the inner lookaround must not be negated, and the inner lookaround must be at the end of the expression tree of the outer one.
2. Outlining assertions at the start of non-negated lookarounds. E.g. >> ^ "a" → ^ (>> "a"). In general, any element at the start of a non-negated lookaround that does not consume characters can be outlined.
3. Removing trivially accepting assertions. E.g. (>> [w]) "foo" → "foo". Assertions that can be statically determined to always accept because of the surrounding pattern and that do not have any side effects (e.g. capturing groups) can be removed (= replaced with the empty string "", the neutral element of concatenation).
4. Removing trivially rejecting assertions. E.g. (>> "a") "foo" | "bar" → ∅ "foo" | "bar" (I'm using ∅ to denote the empty set). Assertions that can be statically determined to always reject because of the surrounding pattern and that do not have any side effects (e.g. capturing groups) can be removed (= replaced with the empty set, the absorbing element of concatenation).
5. Applying assertions. E.g. (>> [w]) C → [w], (!>> [w]) C → ![w]. Single-character lookarounds can be removed by applying them to the character before/after them. This is not only an optimization (these optimized regexes are around 4x faster in JavaScript), but it can also be used as a method to achieve character class intersection and subtraction.
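To make the rewrites above concrete, here is a rough sanity check in TypeScript. It assumes that `>>` corresponds to a JS lookahead `(?=...)`, `!>>` to `(?!...)`, and `[w]` to `\w`. The helper names (`allMatches`, `sameMatches`) and the sample inputs are made up for illustration. Comparing matches on a few strings is only a spot check, not a proof of equivalence.

```typescript
// Collect every match as "index:text" so two patterns can be compared.
function allMatches(source: string, input: string): string[] {
  const re = new RegExp(source, "g");
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(input)) !== null) {
    out.push(`${m.index}:${m[0]}`);
    if (m[0] === "") re.lastIndex++; // avoid looping forever on empty matches
  }
  return out;
}

// True if both patterns produce identical matches on every sample input.
function sameMatches(a: string, b: string, inputs: string[]): boolean {
  return inputs.every(
    (s) => allMatches(a, s).join("|") === allMatches(b, s).join("|")
  );
}

// Hypothetical sample inputs; a real test would use many more.
const samples = ["", "a", "ab", "ac", "xab", "foo", "a foo", "bar", "-foo", "afoo bar"];

// [description, pattern before the rewrite, pattern after the rewrite]
const rewrites: Array<[string, string, string]> = [
  ["1: inline nested lookahead", "(?=a(?=b))", "(?=ab)"],
  ["2: outline leading assertion", "(?=^a)", "^(?=a)"],
  ["3: remove accepting assertion", "(?=\\w)foo", "foo"],
  ["4: remove rejecting branch", "(?:(?=a)foo|bar)", "bar"],
  ["5: apply lookahead", "(?=\\w)[\\s\\S]", "\\w"],
  ["5: apply negated lookahead", "(?!\\w)[\\s\\S]", "\\W"],
];

const results = rewrites.map(
  ([name, before, after]) => `${name}: ${sameMatches(before, after, samples)}`
);
console.log(results.join("\n"));
```

The patterns are passed as strings to `new RegExp` so that the same helper can compare any pair of sources.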
Notes about 3 and 4:
These optimizations should work on branches of lookarounds, not the whole lookaround. E.g. (>> "a" | "-") [w]+ → (>> "a") [w]+.
Boundary and start/end assertions should also be included.
Determining whether an assertion always accepts or rejects is quite difficult, but there are some fast approximations that can be used. One idea is to only consider the first character that comes before/after an assertion. Knowing the next character before/after an assertion is enough to optimize start, end, and boundary assertions as well as a lot of lookarounds. We implement this approach in many optimization-related rules in eslint-plugin-regexp.
Optimizing rejecting assertions is especially useful for optimizing custom boundary assertions. E.g. using JS syntax (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))foo can be optimized to (?:(?<=\w)∅|(?<!\w)(?=\w))foo → (?:(?<!\w)(?=\w))foo → (?:(?<!\w))foo → (?<!\w)foo.
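The first-character approximation from the notes above could look roughly like the following TypeScript sketch. It is a simplified model, not how eslint-plugin-regexp implements it. It handles only single-character lookaheads and assumes that the element after the assertion always consumes a character. The names `Verdict`, `classifyLookahead`, and `matchIndices` are hypothetical. The last part spot-checks the custom boundary example on a few made-up inputs.

```typescript
type Verdict = "accept" | "reject" | "unknown";

// `accepts` models the character class inside the lookahead.
// `next` lists every character that can directly follow the assertion.
function classifyLookahead(
  accepts: (ch: string) => boolean,
  next: string[],
  negated: boolean
): Verdict {
  if (next.length === 0) return "unknown"; // nothing is known about what follows
  const all = next.every((ch) => accepts(ch));
  const none = !next.some((ch) => accepts(ch));
  if (all) return negated ? "reject" : "accept";
  if (none) return negated ? "accept" : "reject";
  return "unknown"; // some characters pass and some do not
}

const isWord = (ch: string): boolean => /^\w$/.test(ch);

// Both branches of the custom boundary are followed by "foo", so the next character is "f".
const negatedLookahead = classifyLookahead(isWord, ["f"], true); // (?!\w) before "foo"
const plainLookahead = classifyLookahead(isWord, ["f"], false); // (?=\w) before "foo"
const mixed = classifyLookahead(isWord, ["a", "-"], false); // cannot be decided

// Spot check: the original and the fully optimized pattern match at the same positions.
function matchIndices(source: string, input: string): number[] {
  const re = new RegExp(source, "g");
  const out: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(input)) !== null) {
    out.push(m.index);
    if (m[0] === "") re.lastIndex++; // guard against empty matches
  }
  return out;
}

const original = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))foo";
const optimized = "(?<!\\w)foo";
const inputs = ["foo", "afoo", " foo", "foofoo", "-foo bar foo", "food"];
const agree = inputs.every(
  (s) => matchIndices(original, s).join(",") === matchIndices(optimized, s).join(",")
);
console.log(negatedLookahead, plainLookahead, mixed, agree);
```

A verdict of "reject" means the branch containing the assertion can be replaced with ∅, "accept" means the assertion itself can be dropped, and "unknown" means the assertion has to stay.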