Optimizations for assertions and lookarounds #56

RunDevelopment · 2022-12-24T10:33:56Z

Is your feature request related to a problem? Please describe.

Lookarounds are useful for many things and present interesting optimization opportunities. There are a few optimizations that I would like to request.

Describe the solution you'd like

I would like to see the following optimizations:

Inlining lookarounds at the end of the expression tree of a lookaround. E.g. >> "a" (>> "b") → >> "a" "b". For this optimization to be valid, the inner and outer lookarounds have to look in the same direction, the inner lookaround must be not be negated, and the inner lookaround must be at the end of the expression tree of the one.
Outline assertions at the start of non-negated lookarounds. E.g. >> ^ "a" → ^ (>> "a"). In general, any element that does not consume characters at the start of a non-negated lookaround can be outlined.
Removing trivially accepting assertions. E.g. (>> [w]) "foo" → "foo". Assertions that can be statically determined to always accept because of the surrounding pattern and do not have any side effects (e.g. capturing groups) can be removed (= replaced with the empty string "", the neutral element of concatenation).
Removing trivially rejecting assertions. E.g. (>> "a") "foo" | "bar" → ∅ "foo" | "bar" (I'm using ∅ to denote the empty set). Assertions that can be statically determined to always accept because of the surrounding pattern and do not have any side effects (e.g. capturing groups) can be removed (= replaced with the empty set, the absorbing element of concatenation).
Applying assertions. E.g. (>> [w]) C → [w], (!>> [w]) C → ![w]. Single-character lookarounds can be removed by applying them to the character before/after them. This is not only an optimization (these optimized regexes are around 4x faster in JavaScript), but it can also be used as a method to achieve character class intersection and subtraction.

Notes about 3 and 4:

These optimizations should work on branches of lookarounds, not the whole lookaround. E.g. (>> "a" | "-") [w]+ → (>> "a") [w]+.
Boundary and start/end assertions should also be included.
Determining whether an assertion always accepts or rejects is quite difficult, but there are some fast approximations that can be used. One idea is to only consider the first character that comes before/after an assertion. Knowing the next character before/after an assertion is enough to optimize start, end, and boundary assertions as well as a lot of lookarounds. We implement this approach in many optimization-related rules in eslint-plugin-regexp.
Optimizing rejecting assertions is especially useful for optimizing custom boundary assertions. E.g. using JS syntax (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))foo can be optimized to (?:(?<=\w)∅|(?<!\w)(?=\w))foo → (?:(?<!\w)(?=\w))foo → (?:(?<!\w))foo → (?<!\w)foo.

The text was updated successfully, but these errors were encountered:

RunDevelopment added the enhancement New feature or request label Dec 24, 2022

Aloso added the C-optimize Issue or feature request for an optimization label Dec 28, 2022

Aloso mentioned this issue Oct 10, 2023

Optimizations (tracking issue) #95

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimizations for assertions and lookarounds #56

Optimizations for assertions and lookarounds #56

RunDevelopment commented Dec 24, 2022 •

edited

Loading

Optimizations for assertions and lookarounds #56

Optimizations for assertions and lookarounds #56

Comments

RunDevelopment commented Dec 24, 2022 • edited Loading

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

RunDevelopment commented Dec 24, 2022 •

edited

Loading