
feat: add i32 return type for length functions #606

Open · wants to merge 3 commits into main
Conversation

@richtia (Member) commented Feb 23, 2024:

This PR adds an i32 return type for a few length functions, based on existing engine support.

Postgres, DataFusion, and Acero all have i32 return types for these functions; DuckDB has i64.

@EpsilonPrime (Member) left a comment:

Does having the same set of input arguments but different output arguments cause an issue? The name of a function is usually defined by the input arguments so that would be one potential source of conflict.

From a Substrait point of view, defining the type to always be i64 is fairly interoperable, as anything implementing the interface could simply upcast an i32 to i64 if that is what it had.

@westonpace (Member) commented:

> Does having the same set of input arguments but different output arguments cause an issue?

David is right. We cannot overload functions based only on return type:

> Every compound function signature must be unique. If two function implementations in a YAML file would generate the same compound function signature, then the YAML file is invalid and behavior is undefined.
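The uniqueness rule above can be sketched concretely. In this hypothetical Python sketch (the key format mirrors Substrait's `name:arg0_arg1` compound-name convention; the impl dicts are invented for illustration), two implementations that differ only in return type collapse to the same key:

```python
# Hypothetical sketch: a compound function signature is built from the
# simple name and the argument types only. The return type is not part
# of the key, so two impls differing only in return type collide.

def compound_signature(name, arg_types):
    """Build a Substrait-style compound name, e.g. 'char_length:str'."""
    return f"{name}:{'_'.join(arg_types)}"

impl_i64 = {"args": ["str"], "return": "i64"}
impl_i32 = {"args": ["str"], "return": "i32"}

sig_a = compound_signature("char_length", impl_i64["args"])
sig_b = compound_signature("char_length", impl_i32["args"])

# Both impls map to the same key, so a YAML containing both is invalid.
print(sig_a == sig_b)  # True: 'char_length:str' for both
```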

@westonpace (Member) left a comment:

I'm afraid you will need to create a new function. Perhaps char_length_32? I'm not sure we've had to tackle this yet so we don't have a convention.

@richtia (Member, Author) commented Feb 27, 2024:

> I'm afraid you will need to create a new function. Perhaps char_length_32? I'm not sure we've had to tackle this yet so we don't have a convention.

That works for me... but I also guess, to @EpsilonPrime's point, I'm wondering whether it's too much of a problem to just have the producer/consumer handle any casting to i64 if needed in the situation we have today.

@westonpace (Member) commented:

> That works for me... but I also guess, to @EpsilonPrime's point, I'm wondering if it's too much of a problem to just have the producer/consumer handle any type of casting to i64 if needed in the situation we have today.

I'm good with only i64. It's more work on the consumer either way. Either they have to support two functions (one of them probably with a cast) or they need to recognize and add an appropriate cast so they can consume a non-matching function.

On the producer end of things I think, in most situations, the user doesn't care about this detail. If they do then they can insert a cast. A savvy consumer will hopefully be able to optimize the resulting internal plan of consumer_char_length(...) -> cast(i64) -> cast(i32) into my_char_length(...).
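The cast-collapsing idea above could look something like this peephole rewrite. Everything here (the tuple-based expression shape, the `native_char_length_i32` kernel name) is invented for illustration:

```python
# Hypothetical peephole sketch of the optimization described above:
# collapsing char_length(...) -> cast(i64) -> cast(i32) back into a
# native i32 length kernel, eliding both casts.

def collapse_casts(expr):
    # Match cast(cast(char_length(x), i64), i32) and rewrite it to the
    # consumer's native i32 kernel.
    if (expr[0] == "cast" and expr[2] == "i32"
            and expr[1][0] == "cast" and expr[1][2] == "i64"
            and expr[1][1][0] == "char_length"):
        return ("native_char_length_i32", expr[1][1][1])
    return expr

plan = ("cast", ("cast", ("char_length", "col0"), "i64"), "i32")
print(collapse_casts(plan))  # ('native_char_length_i32', 'col0')
```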

@westonpace (Member) commented:

If we go the i64-only route it might be nice to document this somewhere. It would be nice if there were a README.md for https://github.com/substrait-io/substrait/tree/main/extensions that defined:

  • the scope of the folder (e.g. these are a fairly minimal set of functions shared by many consumers today)
  • any conventions (e.g. prefer i64 because the consumer can easily insert a cast and we don't want to double the # of methods)
  • Etc.

  - name: char_length_32
    description: >-
      Return the number of characters in the input string. The length includes trailing spaces.
A Contributor commented on the diff:

Does "characters" mean code points?

A Contributor commented:

Ping @richtia for clarity on what "character" means. In some systems this means bytes; in others it means UTF-8 code points.
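A quick illustration of why the definition matters; the counts below are plain Python, contrasting byte and code-point lengths of the same rendered string:

```python
# The same string can have different "lengths" depending on whether
# "characters" means bytes, code points, or grapheme clusters.

s = "caf\u00e9"  # 'café' with a precomposed é

print(len(s))                  # 4 code points
print(len(s.encode("utf-8")))  # 5 bytes (é is 2 bytes in UTF-8)

# With a combining accent instead, even the code-point count changes:
t = "cafe\u0301"  # 'e' + combining acute accent
print(len(t))                  # 5 code points, though it renders as 'café'
```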

@Blizzara (Contributor) commented:

I was thinking about these return types recently, given that Spark and DataFusion disagree on the return types in many cases. I wonder if there is any reason/benefit overall to having the return type specified on the extension? Maybe it could be removed?

Given that:

  1. We cannot (at least currently) have functions with the same name and the same args but different return types, as the return type is not part of the compound key. Having separately named functions for separate return types feels just wrong.

  2. A Substrait producer usually goes from internal plan -> Substrait, i.e. one direction only. It is not able to modify the plan to fit the Substrait spec, so if its internal return type for a function doesn't match what Substrait expects, there is nothing it can do (I think?).

  3. Function calls in the proto definitions already contain output types, like here. It would seem reasonable to me that the producer picks the function impl with the right args but then writes the output type it has, and the consumer would be responsible for adhering to that output type by casting if its implementation differs.

I don't know about the other consumers, but DataFusion doesn't use the extension's return type (well, doesn't use the extensions at all 😅), and the Substrait Java lib has the validation as a no-op because it's too hard to do properly.
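The handshake described in point 3 could be sketched roughly like this (the dict/tuple plan representation and function names are invented, not the actual proto API):

```python
# Hypothetical sketch of point 3: the producer records the output type
# it actually uses, and the consumer inserts a cast when its own
# kernel's return type differs from the plan's declared type.

def produce_call(name, arg_types, producer_output_type):
    # The proto-level function call already carries an output type field.
    return {"name": name, "args": arg_types, "output_type": producer_output_type}

def consume_call(call, kernel_return_type):
    expr = ("call", call["name"], call["args"])
    if kernel_return_type != call["output_type"]:
        # The consumer adheres to the plan's declared type by casting.
        expr = ("cast", expr, call["output_type"])
    return expr

call = produce_call("char_length", ["str"], "i32")
print(consume_call(call, "i64"))  # kernel returns i64, plan wants i32
# -> ('cast', ('call', 'char_length', ['str']), 'i32')
```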

@westonpace (Member) commented:

> We cannot (at least currently) have functions with the same name and the same args but different return types, as the return type is not part of the compound key. Having separately named functions for separate return types feels just wrong.

Isn't this how most languages work? Even those that allow overloading do not allow overloading based purely on return type.

> A Substrait producer usually goes from internal plan -> Substrait, i.e. one direction only. It is not able to modify the plan to fit the Substrait spec, so if its internal return type for a function doesn't match what Substrait expects, there is nothing it can do (I think?).

This is true even if Substrait doesn't exist. If a producer and a consumer do not agree on how a function should behave then there is a problem.

> Function calls in the proto definitions already contain output types, like here. It would seem reasonable to me that the producer picks the function impl with the right args but then writes the output type it has, and the consumer would be responsible for adhering to that output type by casting if its implementation differs.

Yes, a consumer is free to act this way today. The return type in the YAML does not prevent this behavior. I assume most consumers will never consult the YAML. They will look at the fully qualified function name, the input types, and the output type, and then figure out if they have a kernel (usually a lookup in some kind of registry) that can satisfy the request.
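That registry-style lookup might look roughly like this; the qualified name and kernel identifier below are made up for illustration:

```python
# Hypothetical sketch of the consumer-side lookup described above: a
# registry keyed by (qualified name, input types, output type),
# consulted without ever reading the YAML.

registry = {
    ("functions_string.char_length", ("str",), "i64"): "kernel_char_length_i64",
}

def find_kernel(name, input_types, output_type):
    """Return a kernel id if the consumer can satisfy this exact request."""
    return registry.get((name, tuple(input_types), output_type))

print(find_kernel("functions_string.char_length", ["str"], "i64"))  # found
print(find_kernel("functions_string.char_length", ["str"], "i32"))  # None: no match
```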

> I wonder if there is any reason/benefit overall to having the return type specified on the extension? Maybe it could be removed?

I think angst around return type "width" (e.g. i32 vs i64 vs u32 vs u64) is a symptom of a slightly different problem. In a perfect world there would be only one integer type (integer) and the width of that integer would be an implementation detail (I think bkietz sold me on this idea long ago). That's part of the reason I think always using i64 is a perfectly fine approach.

I think there is still value in having the return type in the YAML. I think the main value is that the return type tells me the returned value is an integer and not a float / string / timestamp / etc.

  • Producers / consumers creating new functions can know what is expected.
  • I agree automatic casting between different integer widths is trivial. However, automatically introducing casts between int / float / string is probably not going to give the right answer / what the user expects.
  • Independent plan transformations tools that don't have the producer's knowledge or the consumer's list of functions will need this information.
  • Validation tools need this information.

@Blizzara (Contributor) commented:

> We cannot (at least currently) have functions with the same name and the same args but different return types, as the return type is not part of the compound key. Having separately named functions for separate return types feels just wrong.

> Isn't this how most languages work? Even those that allow overloading do not allow overloading based purely on return type.

I think there's a difference in how we view Substrait's function (and maybe my view is incorrect 😅 ). Am I correct in understanding you see Substrait proto (and simple extension format) as a language for defining one standard of how a relational query engine should work, and the YAML files as the definition for that query engine written in that language? (Or at least something like this.)

Difference being I've seen Substrait rather as a language for defining how a query engine (or a plan, really) does work, with the YAML files defining the different options.

Put in other words, the two options would be:

  1. Substrait (YAML) files define a behavior for query plan. Engines should behave like Substrait defines. vs
  2. Engines have existing behavior. Substrait should be able to represent that behavior.

I think both approaches would be valid goals. I would expect it to take a long time for engines like Spark to move towards compatibility with Substrait, so I see (2) as the more useful approach. However, the good thing is that with the simple and advanced extensions, (2) is quite doable for users (e.g. me) even if the project's main goal is (1), so that's cool.

> A Substrait producer usually goes from internal plan -> Substrait, i.e. one direction only. It is not able to modify the plan to fit the Substrait spec, so if its internal return type for a function doesn't match what Substrait expects, there is nothing it can do (I think?).

> This is true even if Substrait doesn't exist. If a producer and a consumer do not agree on how a function should behave then there is a problem.

That's a different question. Even if the producer and consumer agree on the output type, if the Substrait YAML has a different output type, then there's nothing the producer can really do to produce valid plans according to that YAML. (Note: the producer can decline to use the Substrait YAML and ship its own, and that solves this problem.)

> I wonder if there is any reason/benefit overall to having the return type specified on the extension? Maybe it could be removed?

> I think angst around return type "width" (e.g. i32 vs i64 vs u32 vs u64) is a symptom of a slightly different problem. In a perfect world there would only be one integer type (integer) and the width of that integer would be an implementation detail (I think bkietz sold me on this idea long ago). That's part of the reason I think i64 always is a perfectly fine approach.

> I think there is still value in having the return type in the YAML. I think the main value is that the return type tells me the returned value is an integer and not a float / string / timestamp / etc.

There are differences apart from int width; for example, Spark's floor/ceil return ints while Substrait specifies floats. But yeah, would you agree the right answer here is to create new YAMLs for Spark rather than even trying to fit it into the YAMLs in the Substrait repo?

@westonpace (Member) commented:

> Am I correct in understanding you see Substrait proto (and the simple extension format) as a language for defining one standard of how a relational query engine should work, and the YAML files as the definition for that query engine written in that language? (Or at least something like this.)

I'm not entirely sure I follow. I view the proto files / substrait as defining operator behavior (in the world of relational algebra) and the YAML as defining function behavior (I don't know what this world is called 😆).

> But yea, would you agree the right answer here is to create new YAMLs for Spark, rather than even trying to fit it into the YAMLs in the Substrait repo?

Yes, I think creating Spark-specific YAMLs would be a perfectly fine approach.

> There are differences apart from int width, for example Spark's floor/ceil return ints while Substrait specifies floats.

Yes, these differences seem like they could cause trouble. For example, imagine the plan is SELECT (CEIL(0.3) / 2). If some engines return 0.5 and other engines return 0 then I think that is a bad thing. Spark should either reject the plan (I don't have a function ceil that behaves as you want) or convert substrait ceil into spark ceil + cast.
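Weston's CEIL(0.3) / 2 example can be reproduced in plain Python, contrasting a float-returning ceil with an integer ceil combined with integer division:

```python
# Illustrating the divergence in the CEIL(0.3) / 2 example: an engine
# whose ceil returns a float divides as floats, while one whose ceil
# returns an integer may end up dividing as integers.
import math

float_semantics = float(math.ceil(0.3)) / 2  # ceil -> 1.0, result 0.5
int_semantics = math.ceil(0.3) // 2          # ceil -> 1, integer division -> 0

print(float_semantics)  # 0.5
print(int_semantics)    # 0
```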

@Blizzara (Contributor) commented:

> Am I correct in understanding you see Substrait proto (and the simple extension format) as a language for defining one standard of how a relational query engine should work, and the YAML files as the definition for that query engine written in that language? (Or at least something like this.)

> I'm not entirely sure I follow. I view the proto files / substrait as defining operator behavior (in the world of relational algebra) and the YAML as defining function behavior (I don't know what this world is called 😆).

I think the question is do the YAMLs aim to define one target function behavior or do they aim to define all existing behaviors. With the former, it absolutely makes sense to have the return type (and to have only a single return type).

@jacques-n (Contributor) commented Aug 1, 2024:

> I think the question is do the YAMLs aim to define one target function behavior or do they aim to define all existing behaviors. With the former, it absolutely makes sense to have the return type (and to have only a single return type).

The way you asked this is unclear to me.

Each leaf argument/name combination in a YAML is trying to define the exact semantics of a function (which may exist in anywhere from zero to many systems). If two different functions have different behavior, they should be distinguished in one of the following ways. Note that for some kinds of variation in behavior you can choose from multiple options; for others, your choices are limited.

| Difference | Via Options (same yaml & variant) | Separate Variants (same yaml) | Separate Yaml |
| --- | --- | --- | --- |
| Different output value semantics for same argument types and output type derivation | Yes | Yes | Yes |
| Difference in input argument types or count | No | Yes | Yes |
| Different output type semantics for same name and input types | No | No | Yes |

@jacques-n dismissed westonpace’s stale review August 1, 2024 02:58

This appears resolved.

@jacques-n (Contributor) left a comment:

These look good to me other than the question about code points versus bytes.

@jacques-n added the awaiting-user-input label ("This issue is waiting on further input from users") Aug 1, 2024