Support for POSIX regular expression operators #2236

ghost · 2022-04-05T14:51:32Z

Environment

PostgreSQL version: 13+
PostgREST version: v9.0.0
Operating system: Debian

Description of issue

PostgREST should support comparison operators as defined in: https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP

It already does support LIKE/ILIKE, so an extension here would be very beneficial.
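For illustration, a request using such an operator might be built as in the sketch below. The `match` and `imatch` operator names are the ones added by the PRs referenced in this issue (for `~` and `~*` respectively); the base URL, the `people` table, and the `name` column are hypothetical.

```python
from urllib.parse import quote, urlencode


def regex_filter_url(base_url, table, column, pattern, case_insensitive=False):
    """Build a PostgREST-style URL that filters `column` by a POSIX regex.

    `match` stands for `~` and `imatch` for `~*`. All other names here
    (base URL, table, column) are placeholders for illustration only.
    """
    operator = "imatch" if case_insensitive else "match"
    # Percent-encode the pattern so characters like ^ and $ survive transport.
    query = urlencode({column: f"{operator}.{pattern}"}, quote_via=quote)
    return f"{base_url}/{table}?{query}"


if __name__ == "__main__":
    print(regex_filter_url("http://localhost:3000", "people", "name", "^J"))
    print(regex_filter_url("http://localhost:3000", "people", "name", "^j",
                           case_insensitive=True))
```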


steve-chavez · 2022-04-05T17:15:40Z

We shied away from exposing regex because of ReDoS in the past #539 (comment).

Another concern I have: can we ensure a regex op will be a fast operation?

According to this SO question, an index won't help to speed up a regex operation.

ghost · 2022-04-05T17:30:14Z

@steve-chavez The ReDoS issue shouldn't exist anymore since those ancient PostgreSQL versions have long been EOL'd. If there are still people out there using them, I bet there are other syntactical issues when trying to run PostgREST on those.

For the indexing you are totally correct as there currently isn't any way to speed them up, but this is true for many other operators. The mere possibility of creating an index doesn't mean that one actually exists, and even if it does, that doesn't imply the query planner will actually make use of it.

So I see no guarantees anywhere, at least not something that PostgREST would be able to hold in practice. In the end, it's all in the operator's/DBA's hands to do optimizations or not.

We could put a warning into the docs to emphasize the fact that there are no ways to speed up execution of regular expressions using indexes, so when not combined with other filters, they will always result in a full table scan.

steve-chavez · 2022-04-05T18:18:21Z

> The ReDoS issue shouldn't exist anymore since those ancient PostgreSQL versions have long been EOL'd. If there are still people out there using them, I bet there are other syntactical issues when trying to run PostgREST on those.

True, agree.

> For the indexing you are totally correct as there currently isn't any way to speed them up, but this is true for many other operators.

Hm, what other operators do we expose that cannot be sped up with an index?

> We could put a warning into the docs to emphasize the fact that there are no ways to speed up execution of regular expressions using indexes, so when not combined with other filters, they will always result in a full table scan.

Oh, but that would be a problem for every API (that might not need a regex op) since clients could probe this regex operator and deplete the db resources.

> will always result in a full table scan

If using the regex will always result in an expensive plan, maybe we should apply the idea we've been discussing in #915 (comment).

Basically, for operations that are known to be expensive, we run an additional EXPLAIN statement, get the statement cost, and reject the request if the cost is greater than a configurable threshold. If this threshold is not defined, then expensive operations are not possible; this would also make the regex operator opt-in, which suits APIs that don't want it.
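The cost check described above could look roughly like the following sketch. It assumes the plan was obtained with `EXPLAIN (FORMAT JSON)`, whose output is a JSON array with a top-level `Plan` object carrying a `Total Cost` field. The function names, the sample plan, and the threshold values are hypothetical and not part of PostgREST.

```python
import json

# Hypothetical EXPLAIN (FORMAT JSON) output for a regex filter on a table
# without a usable index; the numbers are made up for illustration.
SAMPLE_PLAN = json.dumps([
    {
        "Plan": {
            "Node Type": "Seq Scan",
            "Relation Name": "people",
            "Startup Cost": 0.0,
            "Total Cost": 1834.0,
            "Plan Rows": 10,
            "Plan Width": 36,
        }
    }
])


def plan_total_cost(explain_json):
    """Return the planner's estimated total cost of the top-level plan node."""
    plans = json.loads(explain_json)
    return float(plans[0]["Plan"]["Total Cost"])


def is_request_allowed(explain_json, max_cost):
    """Apply the threshold rule from the discussion.

    With no threshold configured (None), expensive operations are rejected
    outright, which makes them opt-in. Otherwise the request is rejected
    when the estimated cost is greater than the threshold.
    """
    if max_cost is None:
        return False
    return plan_total_cost(explain_json) <= max_cost


if __name__ == "__main__":
    for threshold in (None, 1000.0, 5000.0):
        print(threshold, is_request_allowed(SAMPLE_PLAN, threshold))
```

Since the cost is only a planner estimate, a check like this inherits whatever inaccuracies the table statistics have.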

@enote-kane WDYT?

wolfgangwalther · 2022-04-05T18:25:00Z

> Oh, but that would be a problem for every API (that might not need a regex op) since clients could probe this regex operator and deplete the db resources.

Hm. How would you protect against that, even with the current operators?

As long as a big table is accessible anonymously, you will always have that problem. It's unlikely you'll create indexes for every column and all operators, right? Even more so, because it's a big table...

Edit: So basically, I think, the question of whether it's possible to create an index for an operator is orthogonal to the question of DoS attacks. The cost threshold is a way to protect against that. But the attack surface would not really be different, even if we introduced other operators.

ghost · 2022-04-05T18:41:43Z

@steve-chavez I generally like the idea with protecting expensive operations with an EXPLAIN and a threshold.

However, if PostgREST goes this route then it would have to do it for almost any filter, since no filter is "guaranteed" to perform well. Also, statistics may be out-of-date so the planner would choose the wrong plan anyway, and that could result in situations where requests sometimes get rejected (data modification bursts) while most of the time the requests are working just fine and are lightning fast.

I think that would be even more confusing for users.

I've recently had a situation where a query took 1.5 hours and after ANALYZE of 3 involved tables (joins) the exact same query went down to 2.6 seconds - classic nested loop vs. hash join issue. Fun fact: the execution plan for the nested-loop variant came with a lower cost than the hash join variant:

https://explain.depesz.com/s/eQsa

So I think the additional EXPLAIN check should be highly configurable including an option to simply turn it off.

steve-chavez · 2022-04-05T22:27:41Z

Hm, but for the regex we'd have the guarantee that it always performs badly (full table scan). The current defenses against slow queries are not perfect but they're something.

> Also, statistics may be out-of-date so the planner would choose the wrong plan anyway, and that could result in situations where requests sometimes get rejected (data modification bursts) while most of the time the requests are working just fine and are lightning fast.

All true regarding the EXPLAIN cost, I guess we need to look at other options.

For this particular case, should we restrict the regex input so at least we know that an index can speed it up?

I just saw this article about pg_trgm.

> In general, pg_trgm can help when:
> - You want to speed up LIKE, ILIKE, ~ or ~*.
> - You want to search for patterns that aren't left-anchored (e.g. %john%). Such patterns aren't supported by B-tree indexes.

So perhaps we can forbid left-anchored inputs.

@enote-kane Would that still be flexible enough for your use case? Any other ideas wrt ensuring indexes speed up the operation?

ghost · 2022-04-05T23:30:17Z

@steve-chavez Just tested it and indeed, adding trgm indexes can turn a sequential scan into an index scan for regular expression operations (also case-insensitive ones). Both index types (GIN/GiST) seem to work the same here.

So, we DO have a possible optimization for DBAs. And this is totally suitable for my use case.

steve-chavez · 2022-04-06T16:25:29Z

> So, we DO have a possible optimization for DBAs. And this is totally suitable for my use case.

Great news. So for now could you add the non-left-anchored input safeguard to your PR?

steve-chavez · 2022-04-06T16:32:43Z

The pg_trgm docs also say:

> The index search works by extracting trigrams from the regular expression and then looking these up in the index. The more trigrams that can be extracted from the regular expression, the more effective the index search is.
>
> For both LIKE and regular-expression searches, keep in mind that a pattern with no extractable trigrams will degenerate to a full-index scan.

So we should also ensure there's at least one trigram present?
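A trigram-presence check of this kind might be sketched as below for LIKE patterns. This is a deliberately conservative approximation and not pg_trgm's actual extraction logic: it only considers runs of ASCII letters and digits between wildcards, and it ignores the blank padding pg_trgm adds at word boundaries as well as LIKE escape sequences. The function names are hypothetical.

```python
import re


def literal_trigrams(like_pattern):
    """Collect trigrams from the literal alphanumeric runs of a LIKE pattern.

    Simplification: the wildcards % and _ and any other non-alphanumeric
    character split the pattern into runs, and only runs of three or more
    characters contribute trigrams. Real pg_trgm also pads word boundaries
    with blanks, so it can extract more than this sketch does.
    """
    trigrams = set()
    for run in re.findall(r"[a-z0-9]+", like_pattern.lower()):
        for start in range(len(run) - 2):
            trigrams.add(run[start:start + 3])
    return trigrams


def has_extractable_trigram(like_pattern):
    """Return True if at least one trigram can be extracted (per this sketch)."""
    return bool(literal_trigrams(like_pattern))


if __name__ == "__main__":
    for pattern in ("%john%", "%ab%", "%%"):
        print(repr(pattern), sorted(literal_trigrams(pattern)))
```

Because the sketch under-counts trigrams, it would reject some patterns that an index could in fact serve; it errs on the side of refusing.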

ghost · 2022-04-06T18:12:13Z

@steve-chavez For LIKE/ILIKE there is no such check as far as I can see, and using GIN/GiST indexes makes the left-anchored argument more or less irrelevant. Sadly, the DbStructure doesn't contain index information, so we can't make a guess here.

For checking for trigrams: how do you envision that to be implemented? I am really new to Haskell and the internal concepts of PostgREST. Curious to learn more here.

steve-chavez · 2022-04-07T20:27:51Z

> For checking for trigrams: how do you envision that to be implemented? I am really new to Haskell and the internal concepts of PostgREST

Oh, we don't have a dedicated module for checking a filter's input (we have never done it) so it would have to be brand new. Probably just in ApiRequest.hs for starters.

> For LIKE/ILIKE there is no such check as far as I can see, and using GIN/GiST indexes makes the left-anchored argument more or less irrelevant.

Right, and I've noted that one can also force a seq scan with LIKE (ref). It also seems complex to make sure the input doesn't trigger a seq scan.

%%: uses the index
%&%: uses the index
%?%: uses the index
%foo & bar%: uses the index
%foo ? bar%: uses the index
%foo && bar%: uses the index
%foo ?? bar%: uses the index
%&&%: triggers a full table scan
%??%: triggers a full table scan

So whatever solution we land on (probably the cost plus some smartness), it should also be applied to LIKE. This can be done in a later enhancement - not a requirement for the regex op.

Will review your PR now 👍


ghost · 2022-04-11T16:14:37Z

@steve-chavez Would you mind creating a new issue for the checks (which most likely need more research)? Then we could close this one as the PRs are merged already.

steve-chavez · 2022-04-18T16:08:34Z

@enote-kane Sure, sorry for the delay. Opened #2249.


ghost pushed a commit to eNote-GmbH/postgrest that referenced this issue Apr 5, 2022

feat(query): add basic regular expression operator support - refs Pos…

795a1c4

…tgREST#2236

ghost pushed a commit to eNote-GmbH/postgrest that referenced this issue Apr 5, 2022

feat(query): add basic regular expression operator support - refs Pos…

cbb49b0

…tgREST#2236

ghost pushed a commit to eNote-GmbH/postgrest-docs that referenced this issue Apr 5, 2022

feat(api): add match/imatch operators for regular expression support …

20e59ac

…- refs PostgREST/postgrest#2236

This was referenced Apr 5, 2022

feat(api): add match/imatch operators for regular expression support PostgREST/postgrest-docs#524

Merged

feat(query): add basic regular expression operator support #2237

Merged

ghost pushed a commit to eNote-GmbH/postgrest that referenced this issue Apr 8, 2022

feat(query): add basic regular expression operator support - refs Pos…

84808ec

…tgREST#2236

ghost pushed a commit to eNote-GmbH/postgrest-docs that referenced this issue Apr 8, 2022

feat(api): add match/imatch operators for regular expression support …

2ede435

…- refs PostgREST/postgrest#2236

steve-chavez added the enhancement a feature, ready for implementation label Apr 8, 2022

wolfgangwalther pushed a commit that referenced this issue Apr 11, 2022

feat(query): add basic regular expression operator support - refs #2236

007f49a

wolfgangwalther pushed a commit to PostgREST/postgrest-docs that referenced this issue Apr 11, 2022

feat(api): add match/imatch operators for regular expression support …

6f79b41

…- refs PostgREST/postgrest#2236

steve-chavez mentioned this issue Apr 18, 2022

Ensure an index is applied for LIKE/ILIKE and regex #2249

Open

steve-chavez closed this as completed Apr 18, 2022

fjf2002 pushed a commit to fjf2002/postgrest that referenced this issue Apr 22, 2022

feat(query): add basic regular expression operator support - refs Pos…

a8f29b7

…tgREST#2236

laurenceisla mentioned this issue Aug 23, 2022

Add missing changes/fixes/features for the new version PostgREST/postgrest-docs#563

Merged

16 tasks
