Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for PGroonga operators #2028

Open
steve-chavez opened this issue Nov 18, 2021 · 31 comments
Open

Support for PGroonga operators #2028

steve-chavez opened this issue Nov 18, 2021 · 31 comments
Labels
enhancement a feature, ready for implementation

Comments

@steve-chavez
Copy link
Member

PGroonga offers performance and usability advantages over pg native textsearch and pg_trgm.

Tutorial: https://pgroonga.github.io/tutorial/

Reference : https://pgroonga.github.io/reference/

Proposal

Since pgroonga operators work on regular data types like text and jsonb, we can conditionally enable them on a column if it's indexed. This is to prevent expensive searches on unindexed columns.

CREATE INDEX index_name ON table USING pgroonga (column);

Then to do SELECT * FROM table WHERE column &@~ 'PostgreSQL';:

GET /table?column=groonga_query.PostgreSQL

(&@~ is groonga_query for now).

Then we can document this in an integrations(or extensions) page like we did for timescale.

@steve-chavez steve-chavez added the idea Needs of discussion to become an enhancement, not ready for implementation label Nov 18, 2021
@wolfgangwalther
Copy link
Member

like we did for timescale.

Why is this only in the old v5.2 docs? Any specific reason for removing it from v6 onwards?

@wolfgangwalther
Copy link
Member

&@~ is groonga_query for now

I was wondering, whether we could somehow add metadata to custom operators in the database, to make PostgREST be able to identify and map them automatically. Kind of similar to what we're doing with CREATE FUNCTION ... SET ... in #1582. Of course, SET does not work on operators. I'm pretty sure this was discussed before and turned down for good reason, but what about using COMMENT to add metadata to database objects?

We could use a yaml front matter block to add metadata, while keeping the current openapi comments support backwards compatible. This would also prevent issues with inlining, when using SET on functions as mentioned in #1582 (comment).

COMMENT ON OPERATOR "&@~" $$
---
pgrst:
  filter:
    name: groonga_query
    requires_index: pgroonga
---
This operator will provide fulltext magic!
$$;


-- similar to the original idea in #1582
COMMENT ON FUNCTION ret_image(name text) $$
---
pgrst:
  accept:
    - image/png
---
This function will respond to Accept: image/png requests!
$$;


-- related to improved routing
COMMENT ON TABLE really_long_and_inaccessible_name $$
---
pgrst:
  route: short_name
---
This table can now be queried at /short_name instead.
$$;


-- moving more of the config closer to database objects
COMMENT ON SCHEMA my_extensions $$
---
pgrst:
  search_path: true
---
This replaces db-extra-search-path in the config.
$$;

@steve-chavez
Copy link
Member Author

I'm pretty sure this was discussed before and turned down for good reason, but what about using COMMENT to add metadata to database objects?

It was not thoroughly discussed IIRC. I thought it would not be a good idea because it "pollutes" the comments with postgrest-specific metadata(output of slash commands like \d+ would be weird).

Creating a DSL on top of comments, like shown on #1179 (comment), was also described as "icky" on a hacker news comment. I somewhat agree, as it feels "invasive" and prone to errors(also no syntax highlight).

Another problem is that if other tools also (ab)use the comments then we'd "clash" with their DSL.

We could use a yaml front matter block to add metadata

yaml certainly looks better. I guess we'd have to reach consensus with other tools though.

I still think the correct way to do this would be the way mentioned in #1710 (comment), whether is our in-db config or postgrest-contrib functions, that way our metadata would be namespaced.


Since pgroonga operators work on regular data types like text and jsonb, we can conditionally enable them on a column if it's indexed

Do you see a problem with this interface? I like the idea of being strict and requiring indexes for enabling certain filters.

I also think we'll need a more generic interface eventually, but for now we don't have too many operators for this to be a problem.


Why is this only in the old v5.2 docs? Any specific reason for removing it from v6 onwards?

It was an agreement between me and Timescale(former backer) sometime ago, where we'd show their logo + integration doc for the duration of their sponsorship on Patreon, which only lasted for a portion of v5.2 lifetime.

I think we can reincorporate the timescale doc though, it's good content and an extension section(with pgroonga and postgis later on) would look good.

@steve-chavez
Copy link
Member Author

&@~ is groonga_query for now

One thought, now that we have many client libraries, we can expose the original operator and leave the urlencoding to them, so we don't need to make an alias. This was discussed before on #938 (comment):

However I don't think we should have aliases for all operators, having to remember e.g. that to use -|- I have to tell posgret adj is pointless trivia that will only trip me up; and more importantly: make it something else you have to know to use a postgrest API. If I'm reading misc http client, I'm not goint to be able to infer what adj means. The less you have to know the better.

Or, we could also offer both, the original operator and the url-safe alias.

@wolfgangwalther
Copy link
Member

One thought, now that we have many client libraries, we can expose the original operator and leave the urlencoding to them, so we don't need to make an alias. This was discussed before on #938 (comment):

Hm. I don't buy the argument "it's hard to remember aliases" very much. I doubt people really know all those PG operators out of their head, because the regularly use them. For me, it's more a "I know there is something that does this... let's look up the operator in the docs.". Also the client-side code is probably read by a lot more people that don't know PostgreSQL well, and the aliased operators are much more meaningful to them.

If you really want to support the original operator (where I don't really see much value in), we absolutely have to keep supporting the url-safe aliases.

@wolfgangwalther
Copy link
Member

It was not thoroughly discussed IIRC. I thought it would not be a good idea because it "pollutes" the comments with postgrest-specific metadata(output of slash commands like \d+ would be weird).

I guess that depends on what you're expecting to see in that case. Assuming we are reading comments of objects in the public (or other postgrest specific) schema only, it could also be helpful. The objects in the api schema should only be used for postgrest anyway, so having the postgrest specific information readily in one place would be much better than having them stored away in some big json config setting.

Of course we can try to see what kind of format we can put the metadata comments in to have them show in a nice way in \d+ etc.

Creating a DSL on top of comments, like shown on #1179 (comment) [...] Another problem is that if other tools also (ab)use the comments then we'd "clash" with their DSL.

I would assume all other tools to be somewhat resistant against stuff that doesn't match their DSL - and the same for us. As long as we are prefixing this on our side, we should be fine.

Do you see a problem with this interface? I like the idea of being strict and requiring indexes for enabling certain filters.

I think it's a great idea to be strict about requiring indexes. I like it so much, that I added requires_index: ... to my example above, too ;)

I think we can reincorporate the timescale doc though, it's good content and an extension section(with pgroonga and postgis later on) would look good.

Fully agree.

@wolfgangwalther
Copy link
Member

I also think we'll need a more generic interface eventually, but for now we don't have too many operators for this to be a problem.

Looks like an extensible interface could already be useful, as new operators have now been asked for repeatedly:

@steve-chavez
Copy link
Member Author

@wolfgangwalther (Ab)using function names, how about this for exposing new operators

CREATE FUNCTION "pgrst.projects.name.groonga_query"(test.projects, test.projects.name%TYPE) RETURNS bool
AS $_$ 
  SELECT $1.name  &@~ $2 
$_$ LANGUAGE SQL IMMUTABLE;

This would allow calling

GET /projects?name=groonga_query.stuff

I've checked and the resulting queries are inlinable too

select * from test.projects x where "pgrst.projects.name.groonga_query"(x, 2);

Related to #2442 (comment)


Might be possible to make this more generic(for all columns) with some conventions on the function naming

@steve-chavez
Copy link
Member Author

I was mixing the concern about restricting the operator columns on #2442, there's a more plausible idea there now.

Adding new operators could be simply:

CREATE FUNCTION "pgrst.groonga_query"(text, text) RETURNS bool
AS $_$ 
  SELECT $1 &@~ $2 
$_$ LANGUAGE SQL IMMUTABLE;

Meaning any function name with our prefix pgrst would be exposed as a filter operator.

@wolfgangwalther
Copy link
Member

Ab)using function names, how about this for exposing new operators

In the end this is just a function wrapper around an operator to attach metadata to it via the name prefix. You could achieve the same by creating a custom operator with the same name prefix. Why the function wrapper?

But both of those approaches are very limited in terms of how much metadata we can attach. The example in #2028 (comment) already mentions something like requires_index, which wouldn't be possible this way.

I think we should sort out the "metadata" problem once and for good - and then we can build a lot of things on top of that.

@steve-chavez
Copy link
Member Author

You could achieve the same by creating a custom operator with the same name prefix

You can't create an operator with letters, no "eq" or "lt" https://www.postgresql.org/docs/current/sql-createoperator.html. The interface is much more complex too.

The example in #2028 (comment) already mentions something like requires_index, which wouldn't be possible this way.

Yeah, that I think we'll not need anymore. As discussed on #2249, it's not certain that an index will be applied. What we should do instead:

@wolfgangwalther
Copy link
Member

You could achieve the same by creating a custom operator with the same name prefix

You can't create an operator with letters, no "eq" or "lt" https://www.postgresql.org/docs/current/sql-createoperator.html. The interface is much more complex too.

The example in #2028 (comment) already mentions something like requires_index, which wouldn't be possible this way.

Yeah, that I think we'll not need anymore.

To put it the other way around: We do need to solve the "metadata for database objects" problems. Once we do that, adding the information on which operators to expose through this metadata is much nicer than through function names as proposed here.

So no matter whether we need requires_index right now or not, or whether the operator prefix would in fact need to be a schema name... - let's solve the metadata stuff first and then a lot of pieces will fall into place here and in other issues.

@steve-chavez
Copy link
Member Author

steve-chavez commented Nov 10, 2022

Now that the allowlist on #2442 (comment) seems to be the way to go to restrict operators on columns, I think this can be kept simpler.

We could just have a config that maps keywords to operators:

db-extra-operators = "groonga_query:&@~, similarity:%, geo_within:st_within"
ALTER ROLE authenticator SET pgrst.extra_operators = 'groonga_query:&@~, similarity:%, geo_within:st_within'

Later one we might have a COMMENT ON or SECURITY LABEL for metadata but for now this looks like the simplest way to add operators.

@steve-chavez
Copy link
Member Author

ALTER ROLE authenticator SET pgrst.extra_operators = 'groonga_query:&@~, similarity:%, geo_within:st_within'

A problem with the above proposal is that it cannot express the FTS operators, unless we support something like:

ALTER ROLE postgrest_test_authenticator 
SET pgrst.extra_operators = 'my_fts:@@ to_tsquery, my_plain_fts: @@ plainto_tsquery'

Which is not that clean or expressive IMO.

The above function interface is much more expressive. To make it cleaner, how about removing the naming convention:

CREATE FUNCTION groonga_query(text, text) RETURNS bool
AS $_$ 
  SELECT $1 &@~ $2 
$_$ LANGUAGE SQL IMMUTABLE;

So we expose a function as an operator if:

  • is inside db-schema
    • It would get exposed under /rpc as well, doesn't seem like a problem. Or we could try to hide it from rpc.
    • We cannot detect this type of functions on db-extra-search-path because the extensions/public schemas would be here and we would expose many functions as operators
  • is IMMUTABLE
    • This would take care that we don't expose a complicated business logic function as an operator. Those operate internally on existing tables, so they'd have to be STABLE or VOLATILE.
  • takes two parameters of the same type and returns a bool

@steve-chavez
Copy link
Member Author

The function interface is complex, I'm not convinced by it.

I'm in need of the pgroonga operators though.

Until we settle on an interface, how about just adding the db-extra-operators mentioned above. It's not complex at all to add that config and could be removed later if we have a better inteface. This reminds me of raw-media-types, which is not ideal but it has been really low maintenance since it required few lines of code.

@steve-chavez
Copy link
Member Author

Regarding aliasing to avoid urlencoding on operator syntax, maybe that's something that should be done in core. It doesn't feel right to let customers break their APIs by changing an alias on the config. I feel that design is brittle.

Could we be smart and transform symbols to letters? Like @=at, then @@ would be col=atat.txt.. that doesn't look good, so no.


Another option could be having a special raw operator:

col=raw(@@).content

That does look more consistent with our syntax. Plus aliases doesn't seem necessary now.

To allowlist the exposed operators:

raw-operators = "&&, @@"

I think this can also keep the QueryParams parsers a bit more simple. Disallowed operators can be rejected at the Plan level.

@steve-chavez
Copy link
Member Author

steve-chavez commented Jan 10, 2023

The syntax for the raw operators could be:

SEARCH /table?column=$op.@@.$body.val
{"..."}

Related to the dollar operators proposal #2125 (comment)

(Kinda looks like the OPERATOR(schema.@@) syntax in PostgreSQL.)


Could also be

SEARCH /table?column=$op(@@).$body.val
{"..."}

@steve-chavez
Copy link
Member Author

I've realized this is somewhat the same problem as #2578. We also need a way to call filters with multiple inputs.

@wolfgangwalther
Copy link
Member

Could also be

SEARCH /table?column=$op(@@).$body.val
{"..."}

I like this the most of the suggestions above.

@steve-chavez
Copy link
Member Author

steve-chavez commented May 3, 2023

How about this for custom operators.

Since IMMUTABLE is really a pure function, we could expose these as operators.

These can be enabled per table/view and column like this:

-- name is a type for a pg identifier
CREATE FUNCTION groonga_query(test.projects, text, text, name default 'id,name,client_id') RETURNS bool
AS $_$ 
  SELECT $2 &@~ $3 
$_$ LANGUAGE SQL IMMUTABLE;

They can be enabled per table/view and type like this:

CREATE FUNCTION groonga_query(test.projects, text, text) RETURNS bool
AS $_$ 
  SELECT $2  &@~ $3 
$_$ LANGUAGE SQL IMMUTABLE;

They can be enabled for all table/view and type like this:

CREATE FUNCTION groonga_query(anyelement, text, text) RETURNS bool
AS $_$ 
  SELECT $2 &@~ $3 
$_$ LANGUAGE SQL IMMUTABLE;

Inlining advantages apply as mentioned above.


We would need to scan the filters and search them in the schema cache for this, which would incur in perf loss.

Or we could somehow mark them as custom, like:

GET /projects?name=$groonga_query.stuff

@steve-chavez
Copy link
Member Author

Come to think of it, the above might solve #2442 as well.

We could create:

CREATE FUNCTION eq(anyelement, anyelement) RETURNS bool
AS $_$ 
  SELECT $1 = $2 
$_$ LANGUAGE SQL IMMUTABLE;

Which would represent our default eq filter.

To disallow it globally:

REVOKE EXECUTE ON eq FROM anon;
REVOKE EXECUTE ON eq FROM webuser;

Then the user can enable it per column or per table as mentioned above.


It might leave a big footprint in the db but I think it could work.

@steve-chavez
Copy link
Member Author

Using unnamed parameters for custom operator functions is key to differentiate them from #2578

@steve-chavez
Copy link
Member Author

These can be enabled per table/view and column like this:

-- name is a type for a pg identifier
CREATE FUNCTION groonga_query(test.projects, text, text, name default 'id,name,client_id') RETURNS bool
AS $$
SELECT $2 &@~ $3
$
$ LANGUAGE SQL IMMUTABLE;

An improvement on the above would be:

CREATE FUNCTION groonga_query(test.projects, text, text, id name) RETURNS bool
AS $_$ 
  SELECT $2 &@~ $3 
$_$ LANGUAGE SQL IMMUTABLE;

Where the named parameter at the end indicates the column for the operator.

@steve-chavez
Copy link
Member Author

So far looks like #2805 (comment) is the simplest implementation for custom operators.

@wolfgangwalther I was wondering if you have any concerns about that?

I guess one disadvantage is that we lose type checking when adding new operators through a custom config. It will also be harder to infer correct typescript types later.

Hm, perhaps we leave the configurable query grammar just for restricting operators and not adding them?


One alternative now that we have the pre-config and hence the postgrestschema(docs). Is that we could just add immutable functions to the postgrest schema:

CREATE FUNCTION postgrest.eq(anyelement, anyelement) RETURNS bool
AS $_$ 
  SELECT $1 = $2 
$_$ LANGUAGE SQL IMMUTABLE;

CREATE FUNCTION postgrest.fts(anyelement, anyelement) RETURNS bool
AS $_$ 
  SELECT $1 @@ to_tsquery($2) 
$_$ LANGUAGE SQL IMMUTABLE;

-- later on my thinking is that we could support
CREATE FUNCTION postgrest.fts(text, anyelement) RETURNS bool
AS $_$ 
  SELECT $1 @@ to_tsquery($2) 
$_$ LANGUAGE SQL IMMUTABLE;
-- that's basically saying that clients can only use the `fts` operators on text columns

And then we enable them as operators for the query grammar. That doesn't sound bad actually.

@wolfgangwalther
Copy link
Member

So far looks like #2805 (comment) is the simplest implementation for custom operators.

@wolfgangwalther I was wondering if you have any concerns about that?

I don't like it. We should not do stuff that should be done via database objects, via config.

PostgREST's goal is to "expose a, potentially already existing, database schema". So the first thing to do should be do interpret the existing schema and expose it in a way that makes sense.

You showed multiple examples of how we could use IMMUTABLE functions to map operators. Maybe the right question to ask is this: Given the following function in the exposed schema - is there any way to interpret it's "meaning" other than as an operator?

create function eq(text, text) returns bool
immutable
return $1 = $2;

We can't use it as an RPC, because we don't support multiple unnamed arguments for those. It's returning a bool, which is a strong indication, that it's supposed to be used in a WHERE clause.

Our query/request syntax has the following "identifiers" it can use:

  • table/view/function names in path
  • column/function names in select=
  • column/function names on the left side of a filter
  • table/view/function/column/foreign key names in embedding
  • in the future: domain names for mimetypes

None of those are applicable to this here. Short of some constants and keywords for syntax, we only have operators left in our query syntax. So it's pretty natural to use this.

One alternative now that we have the pre-config and hence the postgrestschema(docs).

I don't see how "the postgrest schema" is a thing there. I could put the pre-config (same as pre-request) function in any schema - but there is not a separate config setting for the schema those functions are in. So we certainly can't use this schema for anything else.

Also: We want to expose those operators.. so they should be in the exposed schema, right?

Hm, perhaps we leave the configurable query grammar just for restricting operators and not adding them?

I like the suggestion you made above more:

  • Treat the existing operators as built-in operators / shortcuts for functions like the one above.
  • Allow overriding those built-ins operator functions.
  • When there is a matching override - use it.
  • Restrictions can then be made via permissions - the most natural way to do so.

An improvement on the above would be:

CREATE FUNCTION groonga_query(test.projects, text, text, id name) RETURNS bool
AS $_$ 
  SELECT $2 &@~ $3 
$_$ LANGUAGE SQL IMMUTABLE;

Where the named parameter at the end indicates the column for the operator.

I don't like that either. That's a lot of magic.

I didn't have the use-case to disable some operators (or other things) for specific routes (or enable them only for one etc.). If we do anything remotely like that, we should make it some kind of general solution that would work with much more than just this specific case.

Operators should not depend on context. They should only depend on their two arguments.

@steve-chavez
Copy link
Member Author

Agree overall. If anything adding new operators via functions looks like the way forward.

However this part:

Restrictions can then be made via permissions - the most natural way to do so.

Still doesn't solve enabling operators per column, say only enabling eq for PKs. With #2805 that's doable.

But we can leave that for later. I'm not sure if #2805 is the best way.

@steve-chavez steve-chavez added the enhancement a feature, ready for implementation label Jun 13, 2023
@steve-chavez
Copy link
Member Author

Seems implementing this is not trivial. Right now we validate operators on QueyParams.hs, which doesn't involve the schema cache. We need to move operator parsing to the Plan.hs level.

Implementation is kinda similar to #2825 in that we have builtins and we can override/enhance them.

When implementing this we might also consider supporting using the operator on the left side of the query param #2066

@steve-chavez
Copy link
Member Author

You showed multiple examples of how we could use IMMUTABLE functions to map operators. Maybe the right question to ask is this: Given the following function in the exposed schema - is there any way to interpret it's "meaning" other than as an operator?

create function eq(text, text) returns bool
immutable
return $1 = $2;

We can't use it as an RPC, because we don't support multiple unnamed arguments for those. It's returning a bool, which is a strong indication, that it's supposed to be used in a WHERE clause.

Looks like the above holds up pretty good. I've been testing:

WITH
x AS (
SELECT n.nspname as "Schema",
  p.proname as "Name",
  pg_catalog.pg_get_function_result(p.oid) as "Result data type",
  pg_catalog.pg_get_function_arguments(p.oid) as "Argument data types",
 CASE p.prokind
  WHEN 'a' THEN 'agg'
  WHEN 'w' THEN 'window'
  WHEN 'p' THEN 'proc'
  ELSE 'func'
 END as "Type",
 CASE
  WHEN p.provolatile = 'i' THEN 'immutable'
  WHEN p.provolatile = 's' THEN 'stable'
  WHEN p.provolatile = 'v' THEN 'volatile'
 END as "Volatility",
 CASE
  WHEN p.proparallel = 'r' THEN 'restricted'
  WHEN p.proparallel = 's' THEN 'safe'
  WHEN p.proparallel = 'u' THEN 'unsafe'
 END as "Parallel",
 pg_catalog.pg_get_userbyid(p.proowner) as "Owner",
 CASE WHEN prosecdef THEN 'definer' ELSE 'invoker' END AS "Security",
 pg_catalog.array_to_string(p.proacl, E'\n') AS "Access privileges",
 l.lanname as "Language",
 COALESCE(pg_catalog.pg_get_function_sqlbody(p.oid), p.prosrc) as "Source code",
 pg_catalog.obj_description(p.oid, 'pg_proc') as "Description"
FROM pg_catalog.pg_proc p
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
     LEFT JOIN pg_catalog.pg_language l ON l.oid = p.prolang
WHERE pg_catalog.pg_function_is_visible(p.oid)
ORDER BY 1, 2, 4)
SELECT * FROM x
WHERE "Description" ILIKE '%operator%' AND "Result data type" = 'boolean' AND "Volatility" = 'immutable';

-[ RECORD 1 ]-------+---------------------------------------------------------
Schema              | pg_catalog
Name                | aclitemeq
Result data type    | boolean
Argument data types | aclitem, aclitem
Type                | func
Volatility          | immutable
Parallel            | safe
Owner               | postgres
Security            | invoker
Access privileges   | 
Language            | internal
Source code         | aclitem_eq
Description         | implementation of = operator
-[ RECORD 2 ]-------+---------------------------------------------------------
Schema              | pg_catalog
Name                | array_eq
Result data type    | boolean
Argument data types | anyarray, anyarray
Type                | func
Volatility          | immutable
Parallel            | safe
Owner               | postgres
Security            | invoker
Access privileges   | 
Language            | internal
Source code         | array_eq
Description         | implementation of = operator
-[ RECORD 3 ]-------+---------------------------------------------------------
Schema              | pg_catalog
Name                | array_ge
Result data type    | boolean
Argument data types | anyarray, anyarray
Type                | func
Volatility          | immutable
Parallel            | safe
Owner               | postgres
Security            | invoker
Access privileges   | 
Language            | internal
Source code         | array_ge
Description         | implementation of >= operator
...

And every operator we expose is inside that list. Some like below, above, might be useful to expose too.

The main problem is the naming, most functions contain a good name but it's usually concatenated with the types.

If we could get a good name from the internal functions then we wouldn't need to hardcode the operators in the code nor we would need to instruct the user to create new functions.

@steve-chavez
Copy link
Member Author

Considering the that it's not easy to add support for custom operators (ref). I'm tempted to just hardcode the pgroonga operators for now.

They recently released a how-to for postgREST that uses dynamic SQL which doesn't look good at all: https://pgroonga.github.io/ja/how-to/postgrest.html

pgroonga is on nixpkgs too, which should make things simpler.

@steve-chavez

This comment was marked as outdated.

@steve-chavez

This comment was marked as outdated.

@steve-chavez steve-chavez removed the idea Needs of discussion to become an enhancement, not ready for implementation label Oct 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement a feature, ready for implementation
Development

Successfully merging a pull request may close this issue.

2 participants