Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for POSIX regular expression operators #2236

Closed
ghost opened this issue Apr 5, 2022 · 13 comments
Closed

Support for POSIX regular expression operators #2236

ghost opened this issue Apr 5, 2022 · 13 comments
Labels
enhancement a feature, ready for implementation

Comments

@ghost
Copy link

ghost commented Apr 5, 2022

Environment

  • PostgreSQL version: 13+
  • PostgREST version: v9.0.0
  • Operating system: Debian

Description of issue

PostgREST should support comparison operators as defined in: https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP

It already does support LIKE/ILIKE, so an extension here would be very beneficial.

@steve-chavez
Copy link
Member

We shied away from exposing regex because of ReDoS in the past #539 (comment).

Another concern I have: can we ensure a regex op will be a fast operation?

According to this so question, an index won't help to speed up a regex operation.

@ghost
Copy link
Author

ghost commented Apr 5, 2022

@steve-chavez The ReDoS issue shouldn't exist anymore since those ancient PostgreSQL versions have long been EOL'd. If there are still people out there using them, I bet there are other syntactical issues when trying to run PostgREST on those.

For the indexing you are totally correct as there currently isn't any way to speed them up, but this is true for many other operators. Only the possibility to create an index doesn't mean that it actually exists and even if it does also doesn't imply that the query planner will actually make use of it.

So I see no guarantees anywhere, at least not something that PostgREST would be able to hold in practice. In the end, it's all in the operators/DBA's hands to do optimizations or not.

We could put a warning into the docs to emphasize the fact that there are no ways to speed up execution of regular expressions using indexes, so when not combined with other filters will always result in a full table scan.

@steve-chavez
Copy link
Member

The ReDoS issue shouldn't exist anymore since those ancient PostgreSQL versions have long been EOL'd. If there are still people out there using them, I bet there are other syntactical issues when trying to run PostgREST on those.

True, agree.

For the indexing you are totally correct as there currently isn't any way to speed them up, but this is true for many other operators.

Hm, what other operators do we expose that cannot be sped up with an index?

We could put a warning into the docs to emphasize the fact that there are no ways to speed up execution of regular expressions using indexes, so when not combined with other filters will always result in a full table scan.

Oh, but that would be a problem for every API(that might not need a regex op) since clients could probe this regex operator and deplete the db resources.

will always result in a full table scan

If using the regex will always result in an expensive plan, maybe we should apply the idea we've been discussing in #915 (comment).

Basically, for operations that are known to be expensive, we run an additional EXPLAIN statement, we get the statement cost and if it's greater than a configurable threshold we reject them. If this threshold is not defined then expensive operations are not possible - this would make the regex opt-in for APIs that don't want it as well.

@enote-kane WDYT?

@wolfgangwalther
Copy link
Member

wolfgangwalther commented Apr 5, 2022

Oh, but that would be a problem for every API(that might not need a regex op) since clients could probe this regex operator and deplete the db resources.

Hm. How would you protect against that, even with the current operators?

As long as a big table is accessible anonymously, you will always have that problem. It's unlikely you'll create indexes for every column and all operators, right? Even more so, because it's a big table...

Edit: So basically, I think, the question of whether it's possible to create an index for an operator is orthogonal to the question of DoS attacks. The cost threshold is a way to protect against that. But the attack surface would not really be different, even if we introduced other operators.

@ghost
Copy link
Author

ghost commented Apr 5, 2022

@steve-chavez I generally like the idea with protecting expensive operations with an EXPLAIN and a threshold.

However, if PostgREST goes this route then it would have to do it for almost any filter, since no filter is "guaranteed" to perform well. Also, statistics may be out-of-date so the planner would chose the wrong plan anyway and that could result into situations where sometimes, requests get rejected (data modification bursts) while most of the time the requests are working just fine and are lightning fast.

I think that would be even more confusing for users.

I've recently had a situation where a query took 1.5 hours and after ANALYZE of 3 involved tables (joins) the exact same query went down to 2.6 seconds - classic nested loop vs. hash join issue. Fun fact: the execution plan for the nested-loop variant came with a lower cost than the hash join variant:

So I think the additional EXPLAIN check should be highly configurable including an option to simply turn it off.

@steve-chavez
Copy link
Member

Hm, but for the regex we'd have the guarantee to always to perform bad(full table scan). The current defenses against slow queries are not perfect but they're something.

Also, statistics may be out-of-date so the planner would chose the wrong plan anyway and that could result into situations where sometimes, requests get rejected (data modification bursts) while most of the time the requests are working just fine and are lightning fast.

All true regarding the EXPLAIN cost, I guess we need to look at other options.

For this particular case, should we restrict the regex input so at least we know that an index can speed it up?

I just saw this article about pg_trgm.

In general, pg_trgm can help when:
You want to speed up LIKE, ILIKE, ~ or ~*.
You want to search for patterns that aren't left-anchored (e.g. %john%). Such patterns aren't supported by B-tree indexes.

So perhaps we can forbid left-anchored inputs.

@enote-kane Would that still be flexible enough for your use case? Any other ideas wrt ensuring indexes speed up the operation?

@ghost
Copy link
Author

ghost commented Apr 5, 2022

@steve-chavez Just tested it and indeed, adding trgm indexes can convert from a sequential scan into an index scan for regular expression operations (also case-insensitive). Both types seem to work the same here (GIN/GIST).

So, we DO have a possible optimization for DBA's. And this is totally suitable for my use case.

@steve-chavez
Copy link
Member

So, we DO have a possible optimization for DBA's. And this is totally suitable for my use case.

Great news. So for now could you add the non-left-anchored input safeguard to your PR?

@steve-chavez
Copy link
Member

The pg_trgm docs also say:

The index search works by extracting trigrams from the regular expression and then looking these up in the index. The more trigrams that can be extracted from the regular expression, the more effective the index search is.

For both LIKE and regular-expression searches, keep in mind that a pattern with no extractable trigrams will degenerate to a full-index scan.

So we should also ensure there's at least one trigram present?

@ghost
Copy link
Author

ghost commented Apr 6, 2022

@steve-chavez For like/ilke there is no such check as far as I can see and using GIN/GIST indexes makes the left-anchored argument more or less irrelevant. Sadly, the DbStructure doesn't contain index information, so we can't make a guess here.

For checking for trigrams: how do you envision that to be implemented? I am really new to Haskell and the internal concepts of PostgREST. Curious to learn more here.

@steve-chavez
Copy link
Member

For checking for trigrams: how do you envision that to be implemented? I am really new to Haskell and the internal concepts of PostgREST

Oh, we don't have a dedicated module for checking a filter's input(we have never done it) so it would have to be brand new. Probably just in ApiRequest.hs for starters.

For like/ilke there is no such check as far as I can see and using GIN/GIST indexes makes the left-anchored argument more or less irrelevant.

Right, and I've noted that one can also force a seq scan with LIKE(ref). It also seems complex to make sure the input doesn't trigger a seq scan.

%%: uses the index
%&%: uses the index
%?%: uses the index
%foo & bar%: uses the index
%foo ? bar%: uses the index
%foo && bar%: uses the index
%foo ?? bar%: uses the index
%&&%: triggers a full table scan
%??%: triggers a full table scan

So whatever solution we land on(probably the cost plus some smartness), it should also be applied for LIKE. This can be done in a later enhancement - not a requirement for the regex op.

Will review your PR now 👍

ghost pushed a commit to eNote-GmbH/postgrest that referenced this issue Apr 8, 2022
ghost pushed a commit to eNote-GmbH/postgrest-docs that referenced this issue Apr 8, 2022
@steve-chavez steve-chavez added the enhancement a feature, ready for implementation label Apr 8, 2022
@ghost
Copy link
Author

ghost commented Apr 11, 2022

@steve-chavez Would you mind creating a new issue for the checks (which most likely needs more research)? Then we could close this one as the PR's are merged already.

@steve-chavez
Copy link
Member

@enote-kane Sure, sorry for the delay. Opened #2249.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement a feature, ready for implementation
Development

No branches or pull requests

2 participants