-
-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Please define "portable" #55
Comments
Regarding the space: The only differences seems to be that
This can be polyfilled with |
A problem with this is that lookbehind is not supported in all JS engines (notably, Safari/WebKit doesn't support it), which I want to write a compiler warning for. I don't know any way to emulate |
AFAIK, there is no general way to emulate it without lookbehinds. Lookaround optimizations can eliminate many lookarounds within patterns, but some simple cases (e.g. |
That being said, the examples I gave weren't meant to be bug reports. I just wanted to illustrate that pomsky doesn't seem to be portable, with the definition of "portable" being "compiled regexes behave the same in all target languages". So could you please clarify what portable means for pomsky? What are pomsky's goals when it comes to supporting compile targets? |
"Portable" means that a Pomsky expression that works for one regex flavor should also work in other flavors, unless it uses an unsupported feature (like backreferences in Rust). One example is that named capturing groups use different syntax in JS than in PCRE, but in Pomsky the same syntax can be used to target any flavor. I acknowledge that Pomsky isn't portable in all situations; I initially hoped that I could get rid of all the differences with polyfills, but it has become clear that this is not always possible, so Pomsky's portability is a best-effort kinda thing. |
That pomsky accounts for the syntax of different flavors in a given IMO. After all, I don't know any regex flavors with I am more interested in what portability means for semantics. Right now, pomsky specifies the semantics of its elements (e.g. which sets of characters That being said, I do see where you're coming from, but I don't think that the current approach to portability is good. The fact, that users don't know which approximations flavors might or might not use, makes pomsky unreliable. I think letting users choose whether they want approximations would be best. This would neatly separate pomsky's semantics and flavor-specific limitations and approximations. Pomsky could define a list of all approximations that flavors might perform, and then allow users to enable/disable approximations per flavor. Of course, shorthands for sets of options would be nice (e.g. (JS and lookbehinds are a compelling case for this fine-grained approach. While regexes used on the web shouldn't use lookbehinds because of Safari, regexes that used in e.g. electron apps can use lookbehinds. So why should all JS regexes use an approximation that is only necessary for one browser?) |
First of all, I consider all 3 examples you pointed out to be bugs. It would be best to create separate issues for them and discuss only the bigger picture here, without looking at each portability issue in detail. There are multiple possible solutions to portability issues:
5. seems like a good solution because it maintains correctness without sacrificing functionality. However, it is the least portable and goes against my goal to make Pomsky feel like a single, consistent language. What I think you're suggesting is a configuration option to choose between 2. ( I intend to communicate this more clearly on the website. I think your idea of a "strict" mode is cool, but this sounds like a bigger project, so I probably won't have time to implement this in the near future. However, if you (or someone else) want to contribute this feature, that would be appreciated! |
👍
This issue was basically me feeling out the goals and ambitions of this project. I wanted to contribute to a Rust project for a while now, so this will be a good first. Expect PRs in the coming year :) |
@RunDevelopment I created a proposal to add an ASCII-only mode, addressing the problem with |
Hello!
I am someone interested in regexes, so pomsky caught my eye. This project has a lot of good ideas (I especially like ranges), but there is one thing that really bugs me: How portable is pomsky?
Pomsky advertises itself as "a portable, modern regular expression language", but I don't see it as very portable.
Examples:
Predefined character classes are often defined very differently among languages. E.g.
\s
is defined as[ \t\n\x0B\f\r]
or[\p{IsWhite_Space}]
=[\x09-\x0d \x85\xa0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]
in Java and as[\f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
in JavaScript. However, pomsky doesn't address those differences.[s]
compiles to\s
for both JavaScript and Java.%
compiles to\b
in JavaScript, which is a huge issue because it is not Unicode-aware. Documenting this doesn't make it more portable :(Forward references behave differently in different languages. Here is how the same regex behaves in Python and JavaScript:
This is because Python and JavaScript have different semantics for how capturing groups capture text. In JavaScript, entering a group (capturing or not) resets the captured text of all capturing groups in it. So when
(?:(a)|\1\1)
is entered in the second iteration, the captured text of(a)
is reset, which is why it does not accept"aaa"
. Python, on the other hand, only resets the captured text of a capturing group after the capturing group has captured a different part of the string.Python and JavaScript also differ in the semantics of backreferences in one edge case. If a backreference references a group that has no captured text (either because it has never captured text or because its captured text was reset), then the backreference will always reject (being equivalent to
(?!)
) in Python, and it will always accept (being equivalent to(?:)
) in JavaScript.Since, pomsky doesn't seem to guarantee that regexes behave the same in different languages, what does portability mean for pomsky?
I am concerned with portability, because semantic differences can lead to security vulnerabilities.
E.g.
( [s]{2} | [ U+feff ] )+ $
will be vulnerable to exponential backtracking in some target language but not others. So if a developer uses a static analyzer or fuzzer to verify absence of exponential backtracking in one target language, they might assume that other languages are safe too.Of course, semantic differences can cause problems in other ways too. E.g. if pomsky-generated regexes are used for input validation, then the JavaScript frontend might correctly filter out bad inputs but the Java backend does not.
Assuming that portability means "the generated regexes behave the same in all target languages", the above examples could be addressed as follows:
\s
can be emulated with the right character class across languages. If people actually want the\s
of their respective target language, then they can useregex \s
.%
should just behave the same in all target languages. It's unfortunate that JS regexes will be quite long, but long regexes are better than wrong regexes.The text was updated successfully, but these errors were encountered: