From 15abb8af61e23482eb2518351fcd6af071cc0edb Mon Sep 17 00:00:00 2001 From: st1page <1245835950@qq.com> Date: Tue, 1 Aug 2023 20:19:00 +0800 Subject: [PATCH 1/6] add Background --- rfcs/xxxx-lambda-expression | 155 ++++++++++++++++++++++++++++++++++++ 1 file changed, 155 insertions(+) create mode 100644 rfcs/xxxx-lambda-expression diff --git a/rfcs/xxxx-lambda-expression b/rfcs/xxxx-lambda-expression new file mode 100644 index 00000000..4be3d8d1 --- /dev/null +++ b/rfcs/xxxx-lambda-expression @@ -0,0 +1,155 @@ +--- +feature: my_excited_feature +authors: + - "st1page" +start_date: "2023/08/01" +--- + +# Lambda expression + +## Background + +To better support user's semi-structure data, We have supported `Array` and `Struct` types compatible with PG. But in some user's requirement, We found we can not support them easily. + +### Induction + +```SQL +CREATE TABLE t( + arr_i int[], + arr_json jsonb[], + k int primary key +); +``` + +Consider we have the table definition and want to get the sum of the `arr_i` and sum of `arr_json`'s field `x`. In PG user can write the correlated subquery like this. + +```SQL +dev=# explain (verbose) + select *, + (select sum(i) from unnest(arr_i) i) as sum_i, + (select sum((i->'x')::int) from unnest(arr_json) i) as sum_x + from t; + QUERY PLAN +------------------------------------------------------------------------------------------ + Seq Scan on public.t (cost=0.00..294.75 rows=850 width=84) + Output: t.arr_i, t.arr_json, t.k, (SubPlan 1), (SubPlan 2) + SubPlan 1 + -> Aggregate (cost=0.13..0.14 rows=1 width=8) + Output: sum(i.i) + -> Function Scan on pg_catalog.unnest i (cost=0.00..0.10 rows=10 width=4) + Output: i.i + Function Call: unnest(t.arr_i) + SubPlan 2 + -> Aggregate (cost=0.18..0.19 rows=1 width=8) + Output: sum(((i_1.i -> 'x'::text))::integer) + -> Function Scan on pg_catalog.unnest i_1 (cost=0.00..0.10 rows=10 width=32) + Output: i_1.i + Function Call: unnest(t.arr_json) +``` + +The correlated column is in the table function `Unnest` and it is hard to do the sub-query decorrelation. Unfortunately, we do not implement `Apply` operator in streaming execution. +Their are some equivalent query that I think could be decorrelated plan. But they all are complex and with high cost. + +```SQL + select * from t t1 + join ( + select k, sum(i) + from (select k, unnest(arr_i) i from t) u group by k + ) t2 on t1.k = t2.k + join ( + select k, sum((i->'x')::int) + from (select k, unnest(arr_json) i from t) u group by k + ) t3 on t1.k = t3.k; + /* Or */ + select distinct on(k) + k, arr_i, arr_json, + sum(i) over (partition by k), + sum((j->'x')::int) over (partition by k) + from ( select *, unnest(arr_i) i, unnest(arr_json) j from t) foo; +``` + +In investigation of other SQL systems, there are 3 levels' enhancement on this kind of query, this RFC will be concerned with the three levels. + +### level 1: array aggregate expression + +It is a kind of SQL **scalar expression** which accept a array type and return the array's aggregation result, such as `array_sum`, `array_max`. It can solve some part of the issue but can not handle the complex nested datatype in the array. + +```SQL +select *, array_sum(arr_i) from t; +``` + +### level 2: scalar lambda expression + +Implement lambda expression function to make user can define there `transform` logic for array's each elements. +It can be used with the array aggregate expression together. + +```SQL +select *, + array_sum(arr_i), + array_sum(transform(arr_json, v -> (v->'x')::int)) +from t; +``` + +### level 3: aggregate lambda expression + +Based on the level 2, these system support the `reduce` to make user use lambda function implement the aggregation logic. + +```SQL +select *, + reduce(arr_i, 0, (s, v) -> s + v), + reduce(arr_json, 0, (s, v) -> s + (v->'x')::int)), +from t; +``` + +Btw, It often can be used to express UDAF too. + +## Investigation + +Simple array expressions + +| array_sum | array_max/min | array_sort | array_zip | array_zip | +| ---- | ---- | +| | | +| | | + +array expressions with lambda expression paramter + +### PrestoDB + +PrestoDB supports many array aggregate expressions . Also it implement both scalar and aggregation lambda expression on array + +- + +And Trino support them too. + +### Starrocks + +### Doris + +### Databricks + +### ClickHouse + + + +## Motivation + +Why are you want to introduce the feature? + +## Design + +Explain the feature in detail. + +## Unresolved questions + +- Are there some questions that haven't been resolved in the RFC? +- Can they be resolved in some future RFCs? +- Move some meaningful comments to here. + +## Alternatives + +What other designs have been considered and what is the rationale for not choosing them? + +## Future possibilities + +Some potential extensions or optimizations can be done in the future based on the RFC. From ebd4ed532136d9901ad0adfb06927b7280062359 Mon Sep 17 00:00:00 2001 From: st1page <1245835950@qq.com> Date: Wed, 2 Aug 2023 03:57:21 +0800 Subject: [PATCH 2/6] investigation --- rfcs/xxxx-lambda-expression | 103 +++++++++++++++++++++++++----------- 1 file changed, 71 insertions(+), 32 deletions(-) diff --git a/rfcs/xxxx-lambda-expression b/rfcs/xxxx-lambda-expression index 4be3d8d1..f982583e 100644 --- a/rfcs/xxxx-lambda-expression +++ b/rfcs/xxxx-lambda-expression @@ -105,40 +105,79 @@ Btw, It often can be used to express UDAF too. ## Investigation -Simple array expressions - -| array_sum | array_max/min | array_sort | array_zip | array_zip | -| ---- | ---- | -| | | -| | | - -array expressions with lambda expression paramter - -### PrestoDB - -PrestoDB supports many array aggregate expressions . Also it implement both scalar and aggregation lambda expression on array - -- - -And Trino support them too. - -### Starrocks - -### Doris - -### Databricks - -### ClickHouse - - - -## Motivation - -Why are you want to introduce the feature? + Array expressions support. (All of them supports the `array_size/cardinality`, `unnest/flatten/explode` which is not listed in the table.) + +| | array_cnt | array_sum | array_cum_sum | array_avg | array_max/min | array_sort | array_distinct | array_zip | Unnest
with idx | +| ---------- | :-------: | :-------: | :-----------: | :-------: | :--------------: | :--------: | :------------: | :-------: | :----------------: | +| Databricks | | | | | Y | Y | Y | Y | Y | +| Trino | | | | | Y | Y | Y | Y | Y | +| PrestoDB | Y | Y | Y | Y | Y | Y | Y | Y | Y | +| Starrocks | Y | Y | Y | Y | Y | Y | Y | Y | | +| Doris | Y | Y | Y | Y | Y | Y | Y | Y | | +| ClickHouse | Y | Y | Y | Y | Y | Y | Y | Y | | +| DuckDB | Y | Y | Y | Y | Y | Y | Y | | | +| Vertica | Y | Y | | Y | Y | | | | Y | +| FlinkSQL | | | | Y | only MAX, no MIN | | | | | +| Snowflake | | | | | | | Y | | Y | + + + +Array expressions with lambda expression paramter + +| | transform | transform
with idx | filter | filter
with idx | sort cmp | reduce | UDAF | Lambda Limitations | +| ---------- | --------- | --------------------- | ------ | ------------------ | -------- | ------ | :---: | ---------------------------------------------------------------------------------------------------------------------- | +| Databricks | Y | Y | Y | Y | Y | Y | Y | No subquery, No SQL UDF | +| Trino | Y | | Y | | Y | Y | Y | No subquery, No Aggregations | +| PrestoDB | Y | | Y | | Y | Y | Y | No subquery, No Aggregations | +| Starrocks | Y | | Y | | Y | | | | +| Doris | Y | | Y | | Y | | | | +| ClickHouse | Y | Y | Y | Y | Y | | | | +| DuckDB | Y | | Y | | | | | | +| Vertica | Y | Y | Y | Y | Y | | | arg name [limitations](https://docs.vertica.com/23.3.x/en/sql-reference/language-elements/lambda-functions/#arguments) | +| FlinkSQL | | | | | | | | | +| Snowflake | | | | | | | | | + + - No capture in lambda function, I think No system support refer column or sub query in the lambda function + - Snowflake do not anything about it and fall back to the `unnest/flatten/explode` https://stackoverflow.com/questions/62807878/how-to-get-maximum-value-from-an-array-column-in-snowflake + - [to confirm] Accroding to the doc, FlinkSQL only support flatten the array in join clause https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#array-expansion + - There are some Syntactic sugar. + - Trino/PrestoDB/Starrocks: support pass multiple arrays in `unnest/flatten/explode` so that the arrays will do zip first. + - ClickHousr/Starrocks: `array_sum(lambda_function, arr1,arr2...) = array_sum(array_map(lambda_function, arr1,arr2...))` + - ClickHouse support any aggregation function on array directly, e.g. `SELECT arrayReduce('max', [1, 2, 3]);` + +### Refers + - Databricks + - https://docs.databricks.com/sql/language-manual/sql-ref-lambda-functions.html + - https://docs.databricks.com/sql/language-manual/sql-ref-functions-builtin-alpha.html + - Trino + - https://trino.io/docs/current/functions/lambda.html + - https://trino.io/docs/current/functions/array.html?highlight=array#reduce + - https://trino.io/docs/current/functions/aggregate.html?highlight=lambda+expressions#lambda-aggregate-functions + - https://trino.io/docs/current/functions/array.html?highlight=array#array_max + - PrestoDB + - https://prestodb.io/blog/2020/03/02/presto-lambda + - https://prestodb.io/docs/current/functions/lambda.html + - https://prestodb.io/docs/current/functions/aggregate.html#reduce_agg + - Starrocks + - https://docs.starrocks.io/en-us/3.0/sql-reference/sql-functions/Lambda_expression + - Doris + - https://doris.apache.org/docs/dev/sql-manual/sql-functions/array-functions/array_max + - https://doris.apache.org/search?q=lambda + - ClickHouse + - https://clickhouse.com/docs/en/sql-reference/functions/array-functions + - https://clickhouse.com/docs/en/sql-reference/functions#higher-order-functions---operator-and-lambdaparams-expr-function + - DuckDB + - https://duckdb.org/docs/sql/functions/nested#list-aggregates + - https://duckdb.org/docs/sql/functions/nested#lambda-functions + - Vertica + - https://docs.vertica.com/23.3.x/en/sql-reference/language-elements/lambda-functions/ + - FlinkSQL + - https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/systemfunctions/ ## Design - -Explain the feature in detail. +[TODO] +We can support missing array function's at first and they has been implemented for aggregators so we can reuse the code. +For the lambda expression, it is like a anonymous SQL UDF. We can implement the SQL UDF first and design lambda expression later. ## Unresolved questions From bb05310ba9587178814a3178a37839c4c4b1c49e Mon Sep 17 00:00:00 2001 From: st1page <1245835950@qq.com> Date: Wed, 2 Aug 2023 03:57:49 +0800 Subject: [PATCH 3/6] fin --- rfcs/xxxx-lambda-expression | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/rfcs/xxxx-lambda-expression b/rfcs/xxxx-lambda-expression index f982583e..d934a02a 100644 --- a/rfcs/xxxx-lambda-expression +++ b/rfcs/xxxx-lambda-expression @@ -180,15 +180,16 @@ We can support missing array function's at first and they has been implemented f For the lambda expression, it is like a anonymous SQL UDF. We can implement the SQL UDF first and design lambda expression later. ## Unresolved questions +[TODO] - Are there some questions that haven't been resolved in the RFC? - Can they be resolved in some future RFCs? - Move some meaningful comments to here. ## Alternatives - +[TODO] What other designs have been considered and what is the rationale for not choosing them? ## Future possibilities - +[TODO] Some potential extensions or optimizations can be done in the future based on the RFC. From db68201ab6e0556f91acf800b6dd6c61c074057b Mon Sep 17 00:00:00 2001 From: st1page <1245835950@qq.com> Date: Wed, 2 Aug 2023 12:00:03 +0800 Subject: [PATCH 4/6] rename --- rfcs/{xxxx-lambda-expression => 0069-lambda-expression.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename rfcs/{xxxx-lambda-expression => 0069-lambda-expression.md} (100%) diff --git a/rfcs/xxxx-lambda-expression b/rfcs/0069-lambda-expression.md similarity index 100% rename from rfcs/xxxx-lambda-expression rename to rfcs/0069-lambda-expression.md From 0084b771e3e41e77b8ecd6ca0fe7c1a91e43190f Mon Sep 17 00:00:00 2001 From: stonepage <40830455+st1page@users.noreply.github.com> Date: Tue, 8 Aug 2023 14:50:40 +0800 Subject: [PATCH 5/6] Update rfcs/0069-lambda-expression.md Co-authored-by: TennyZhuang --- rfcs/0069-lambda-expression.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/0069-lambda-expression.md b/rfcs/0069-lambda-expression.md index d934a02a..42eb2acc 100644 --- a/rfcs/0069-lambda-expression.md +++ b/rfcs/0069-lambda-expression.md @@ -1,5 +1,5 @@ --- -feature: my_excited_feature +feature: lambda_expression authors: - "st1page" start_date: "2023/08/01" From 0c39fa02a3b591011da5a65b6d3ec1844312569f Mon Sep 17 00:00:00 2001 From: Eric Fu Date: Fri, 11 Aug 2023 15:51:18 +0800 Subject: [PATCH 6/6] fix typo Co-authored-by: CAJan93 --- rfcs/0069-lambda-expression.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/0069-lambda-expression.md b/rfcs/0069-lambda-expression.md index 42eb2acc..aa25a9a1 100644 --- a/rfcs/0069-lambda-expression.md +++ b/rfcs/0069-lambda-expression.md @@ -11,7 +11,7 @@ start_date: "2023/08/01" To better support user's semi-structure data, We have supported `Array` and `Struct` types compatible with PG. But in some user's requirement, We found we can not support them easily. -### Induction +### Introduction ```SQL CREATE TABLE t(