Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Function names & URIs #634

Open
westonpace opened this issue Apr 18, 2024 · 12 comments
Open

Function names & URIs #634

westonpace opened this issue Apr 18, 2024 · 12 comments

Comments

@westonpace
Copy link
Member

westonpace commented Apr 18, 2024

This is in relation to some discussion that came up in #631

The duplicated names prevent substrait-java from being updated.

However, there is a question around whether names need to unique across ALL extensions, or just within a single extension file. While > these functions were added in error (I think) you could make the case that:

functions_arithmetic/min:timestamp,max:timestamp
functions_datetime/min:timestamp,max:timestamp

should be treated as different functions.

My understanding is that a fully qualified function name is the triple:

<function uri>,<function name>,<function arguments>

Therefore, yes, these are different functions (because the function uri is different). However, there are (at least) three problems that have never really been fully resolved:

  1. No one uses the function URI

The major producers that I am aware of today (isthmus, duckdb, ibis) either set the function URI to undefined, the empty string, or / (I think we actually have all three behaviors across the three producers 😮‍💨 )

Correspondingly, consumers tend to ignore this field. The one exception I'm aware of is Acero which will tolerate /, empty string, and undefined (Acero goes into a "fallback" mode where it does name-only matching and will match any registered function with the same name regardless of the URI) but which will accept URLs of the form https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml and also has a special URI urn:arrow:substrait_simple_extension_function which means "use the arrow compute function with the given name" (this is how we support UDFs).

  1. What is the canonical URL?

There are several choices. For example:

# Stable URI but potentially unstable definition
https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml
# Stable definition but unstable URI
https://github.com/substrait-io/substrait/blob/v0.47.0/extensions/functions_arithmetic.yaml

My preference is the former, for practical reasons.

  1. How is versioning handled?

This is discussed in more detail in #274

But the basic question is "what if we have tons of users and we decide to make a change to some function?"

@westonpace
Copy link
Member Author

I'm not sure I expect a solid answer but figured it would be good to solicit discussion and opinions.

@joellubi
Copy link
Contributor

joellubi commented May 1, 2024

I'm currently facing some design choices related to this and this is roughly where my thinking is at this point:

TLDR at bottom...

Generally, uniquely qualifying function names is not a problem that consumers have to deal with because the extension references are already resolved in the plan they are receiving.

For plan producers, it seems some specific preferences do need to be adopted in order to produce plans under practical constraints. Specifically, the input most producers are serializing into substrait will typically contain the function name and input types, but not the URI. Generally speaking this is coming from SQL but dataframes have the same issue.

In the statement SELECT a + b FROM c the function name can be determined by explicit mapping (i.e. + -> add) and the input types can be determined once the column references are resolved to the base schema. Assuming the signature resolves to add:i64_i64 most implementations will currently resolve the URI to be https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml but the choice is arbitrary under the current spec, as far as I can tell.

In general I think this is reasonable. If a plan producer wants to shadow the "default" add:i64_i64 with their own implementation, they can declare a new extension and handle their own disambiguation logic. Where this gets a little tricky is with the "default" extensions defined under https://github.com/substrait-io/substrait/blob/main/extensions. If there are multiple implementations of add:i64_i64 under different URIs that are NOT user-defined then the producer's choice of which one to use could end up being arbitrary and lead to surprises across implementations.

I think making this arbitrary choice is awkward but in practice most consumers don't do this at all which is a source of bugs. The original issue (#631) is an example and it looks like ibis-substrait has a similar silent bug where the second occurrence of a (name, inputs) pair in the extensions yaml will shadow the first.

TLDR

Given all this, I think one way to help simplify implementations would be to ensure that all functions defined under https://github.com/substrait-io/substrait/blob/main/extensions have a unique name:signature, so that the URI could be inferred from that pair. If the producer implementation additionally defines its own extension files then it is up to that implementation to decide which URI to prefer for a given function invocation.

Thoughts on this?

@westonpace
Copy link
Member Author

Generally, uniquely qualifying function names is not a problem that consumers have to deal with because the extension references are already resolved in the plan they are receiving.

Consumers generally need to map the fully qualified name to an actual function implementation though. Here is a simple python example showing the pattern I usually see:

###
# This part of the code base was written before substrait integration was considered
###
def plus(a, b):
  return a + b
def minus(a, b):
  return a - b
def l2_norm(x, y):
  return math.sqrt(x*x + y*y)
def evaluate_function(func, x, y):
  expr = f"func({x}, {y})"
  eval(expr)
###
# Later, initial substrait support is added, and a mapping needs to be created
###
function_mapping = {
  "add": "plus",
  "subtract": "minus",
  // No mapping for l2_norm since it doesn't exist in substrait
}
def evaluate_substrait_func(substrait_function, x, y):
  local_function_name = function_mapping[substrait_function]
  return evaluate_function(local_function_name, x, y)

Notice that URIs are not involved (because I don't commonly see it). This can lead to a problem. If a producer decides to shadow "add" and creates urn:my_custom_producer:add which means "return the string formed by converting each arg to a string and concatenating it" then they would expect the above consumer to reject the plan with "function urn:my_custom_producer:add is not defined" but instead it will just evaluate the standard add function.

Assuming the signature resolves to add:i64_i64 most implementations will currently resolve the URI to be https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml but the choice is arbitrary under the current spec, as far as I can tell.

I do not agree the choice is arbitrary. For example, let's decide we want to fix the above consumer to properly implement function mapping:

###
# A fixed example of substrait support
###
standard_function_mapping = {
  "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml": {
    "add": "plus",
    "subtract": "minus",
  },
  "https://some_other_site/substrait/linear_algebra.yaml": {
    "l2_norm": "l2_norm"
  }
}
def evaluate_substrait_func(sub_func_uri, sub_func_name, x, y):
  local_function_name = function_mapping[sub_func_uri][sub_func_name]
  return evaluate_function(local_function_name, x, y)

Now, if the producer provides urn:my_custom_producer:add then we will get an error, as we should, since the consumer cannot evaluate it.

However, if the producer arbitrarily decides on https://github.com/substrait-io/substrait/blob/v0.47.0/extensions/functions_arithmetic.yaml and they pass in add then the consumer should be able to satisfy it (it knows how to do the standard add) but will reject it because it doesn't recognize the URI the producer provided. The consumer and producer need to agree on what URI is used for any given function (including the standard functions). This means the choice of URI is not arbitrary.

If there are multiple implementations of add:i64_i64 under different URIs that are NOT user-defined then the producer's choice of which one to use could end up being arbitrary and lead to surprises across implementations.

How would you know if a URI is user-defined or not?

I think one way to help simplify implementations would be to ensure that all functions defined under https://github.com/substrait-io/substrait/blob/main/extensions have a unique name:signature, so that the URI could be inferred from that pair.

I partly agree. This is also called out here:

Every compound function signature must be unique. If two function implementations in a YAML file would generate the same compound function signature, then the YAML file is invalid and behavior is undefined.

I think it is technically legal for a function signature to be duplicated as long as the duplicate is in a different yaml file. This is because the yaml filename should be part of the URI. However, I would still discourage it, and am fine saying we don't want it, since it can be misleading.

@joellubi
Copy link
Contributor

joellubi commented May 2, 2024

Thanks for the response @westonpace. I pretty much agree with all of this, and would like to clarify that I don't think there's actually any ambiguity in the spec as it is currently defined: a function is uniquely specified by (uri,name,signature).

In practice many implementations don't handle URIs correctly, but the implications are different on the producer vs the consumer side. Consider an implementation executing SQL queries via substrait:

SQL -> Substrait Producer -> Substrait Consumer -> Query Engine

I believe you were focusing on the right-hand side in your response, where a substrait message is mapped to a physical plan for the query engine to execute. The reason I say there isn't ambiguity on the consumer side is because the plan it receives should already have an explicit URI in the Plan.ExtensionUris. Ignoring the URI is a mistake. Some implementations make this mistake, as you pointed out, but the guidance on the correct behavior is pretty clear IMO.

On the left side of the workflow a SQL query is getting planned and then serialized into a substrait message. Consider the query SELECT 1 + 2. Again, there is technically no ambiguity here. The producer may decide to map this to extensions/functions_arithmetic.yaml:add:i64_i64 or to shadow that implementation with its own e.g. my_extensions/functions_custom.yaml:add:i64_i64.

This only gets confusing IMO if we were to (hypothetically) define another extension file in the main substrait repo called for instance extensions/functions_math.yaml which also contained add:i64_i64. This is valid under the spec but it doesn't make it obvious to producers whether SELECT 1 + 2 should be serialized with a reference to extensions/functions_arithmetic.yaml:add:i64_i64 or extensions/functions_math.yaml:add:i64_i64. Again its the producer's choice which one it prefers but if we could at least avoid requiring this custom logic when supporting the standard extension files, it could help simplify adoption.

@vbarua
Copy link
Member

vbarua commented May 2, 2024

Correspondingly, consumers tend to ignore this field.

So substrait-java doesn't ignore the field, as I discovered during our version update internally in which plans using max:tstz in functions_arithmetic.yaml no longer worked, failing with:

java.lang.IllegalArgumentException: Unexpected aggregate function with key max:tstz. The namespace /functions_arithmetic.yaml is loaded but no aggregate function with this key was found.

Another example where URIs are very important. Substrait defines the avg function matching the semantics of the SQL spec in functions_arithmetic.yaml as:

  - name: "avg"
    impls:
      - args:
          - name: x
            value: i64
        options:
          overflow:
            values: [ SILENT, SATURATE, ERROR ]
        nullability: DECLARED_OUTPUT
        decomposable: MANY
        intermediate: "STRUCT<i64,i64>"
        return: i64?

However, we might want to have a variant of avg like

  - name: "avg"
    impls:
      - args:
          - name: x
            value: i64
        options:
          overflow:
            values: [ SILENT, SATURATE, ERROR ]
        nullability: DECLARED_OUTPUT
        decomposable: MANY
        intermediate: "STRUCT<i64,i64>"
        return: fp64?   # <-- RETURNS FLOATING POINT

in functions_postgres which matches Postgres semantics engine does.
(For extra cursed-ness these two variants have the same signature avg:i64)

A consumer supporting Postgres style avg only should reject any plan containing functions_arithmetic/avg:i64.
A producer targeting this consumer should generate plans using functions_postgres/avg:i64

More generally, a producer should only support a function if its implementation semantics match what is given in the function extension definition. A consumer should only use functions that match the semantics that they are trying to encode.

If there are multiple implementations of add:i64_i64 under different URIs that are NOT user-defined then the producer's choice of which one to use could end up being arbitrary and lead to surprises across implementations.

I actually agree that this is somewhat arbitrary in that a generic producer can chose any version that matches the semantics they are trying to encode. I would argue within the core substrait extensions we should avoid having functions that are identical except for URI

@vbarua
Copy link
Member

vbarua commented May 2, 2024

@joellubi

I believe you were focusing on the right-hand side

I'm definitely rhs focused as well 😅

For your example starting from SQL, consider something like:

SELECT avg(<i64 column>) FROM ...

From the SQL side of things, this avg could be either the one in functions_arithmetic or (the hypothetical) functions_postgres. What the producer chooses will be depend on what behaviour it wants and what behaviour the consumers it's talking to have available. I don't think there's a generally correct choice.

@joellubi
Copy link
Contributor

joellubi commented May 2, 2024

From the SQL side of things, this avg could be either the one in functions_arithmetic or (the hypothetical) functions_postgres. What the producer chooses will be depend on what behaviour it wants and what behaviour the consumers it's talking to have available. I don't think there's a generally correct choice.

This is a good point @vbarua, especially since functions_postgres would actually be very useful to have defined!

Currently some producers are auto-generating function definitions from the contents of https://github.com/substrait-io/substrait/blob/main/extensions. The function mapping for the SQL query above works fine as long avg:i64 is only defined once in those extension files. If we were to introduce functions_postgres then any implementation of the core substrait extensions would become non-trivial, and likely harder to auto-generate. The producer's preference of i64 or fp64 semantics would either need to be encoded somewhere in the application logic or taken as input from the end user. For example:

import fancy_client_that_uses_substrait as fc

res_i64 = fc.query("SELECT avg(order_quantity) FROM orders")
res_fp64 = fc.query("SELECT avg(order_quantity) FROM orders", dialect="postgres")

In this case the producer has the explicit information it needs to choose the correct extension file (if it assumes functions_arithmetic to be the default). I think the question this ultimately raises is whether the core extensions defined in the main substrait repo are prescriptive or not. Are they just example extensions that you may or may not use, or do we want to encourage producers/consumers to support all or as many of them as possible? Is auto-generating implementations from the core extension files something we want to make easier, or is it an anti-pattern?

@jacques-n
Copy link
Contributor

I am late to the party here and totally missed this thread. I need to figure out where these notification emails are going...

Given all this, I think one way to help simplify implementations would be to ensure that all functions defined under https://github.com/substrait-io/substrait/blob/main/extensions have a unique name:signature, so that the URI could be inferred from that pair. If the producer implementation additionally defines its own extension files then it is up to that implementation to decide which URI to prefer for a given function invocation.

Thoughts on this?

Not a fan. The intention has always been to create common function libraries in the substrait core project. So, postgres-functions, snowflake-functions, etc. In fact, I'm actually working on snowflake-functions right now and we're starting to introduce them. Forcing all the files to have different names for functions will just mean everyone is going to have to create mappings between differently named functions (yes, we called this avg_snowflake but you probably know it as avg).

either set the function URI to undefined, the empty string, or /

Not really true. substrait-java (and Isthmus) use the local paths (e.g. "/functions_arithmetic.yaml). A decent amount of work/infrastructure was put in place to support that patterns.

....

I agree that consumers and producers are ignoring the URI part of the spec and we have to adapt as best we can. I propose we add a few behaviors to incorporate laziness without excessively watering down the specification. The reality right now is that people could be lazy because they aren't that interested in binding to the right type patterns. E.g. output type derivation differences, etc. So:

  • State URI is optional. When unset or an "empty value", resolution is done using either an ordering. The ordering is a list of function URIs that should be use for named resolution in a "first wins" basis. In this mode, the first file that has a name/arg binding matching the function is the function that should be bound.
  • Define in substrait the default order of resolution of files in extensions using "first wins". This list should only include a minimal set of extensions, not all extensions.
  • Update plan format to optionally allow systems to override the default resolution ordering by specifying their own (e.g. I'm postgres and want the postgres functions to be resolved first).
  • Set a standard in substrait core that a change in default function ordering (e.g. introducing a function earlier in the default ordering that shadows a previously default function) is a backwards incompatible change.

Thoughts?

@joellubi
Copy link
Contributor

Not a fan. The intention has always been to create common function libraries in the substrait core project. So, postgres-functions, snowflake-functions, etc. In fact, I'm actually working on snowflake-functions right now and we're starting to introduce them. Forcing all the files to have different names for functions will just mean everyone is going to have to create mappings between differently named functions (yes, we called this avg_snowflake but you probably know it as avg).

This isn't quite what I meant, I agree with most of what you wrote. If I remember correctly, I think at the time I was asking about the "significance" of extensions defined in https://github.com/substrait-io/substrait/blob/main/extensions compared to arbitrary URI's. The examples you named are good ones to consider in this context e.g. postgres-functions, snowflake-functions.

It would be helpful to have a name for these "sets" of extensions. I don't think this concept exists currently in the substrait spec. IMHO it would be helpful if function signatures are distinct within one of these system-specific sets. So postgres-functions can only define add:i64_i64 once, but snowflake-functions is allowed to define its own set of functions.

I think these system-specific collections are a great idea, but it does bring back the question about the ones in https://github.com/substrait-io/substrait/blob/main/extensions. They aren't specific to any system. Are they intended to be a neutral set of "canonical" extensions that other system-specific collections map to/from? (this has been my interpretation)

It would be nice if it were possible to point to a "base-uri" and register functions within that set. Some examples:

  • github.com/substrait-io/substrait/blob/main/extensions/*
  • github.com/substrait-io/substrait/blob/main/extensions-postgres/*
  • github.com/substrait-io/substrait/blob/main/extensions-snowflake/*

If add:i64_i64 only appears once in each of these sets then it can be fairly straightforward to bind that collection of functions to a substrait producer/consumers internal function dispatcher (or however it's using the functions). If a signature can be defined multiple times in one of these sets, the work to determine which version of add:i64_i64 to dispatch for SELECT 1 + 1 is not trivial.

@EpsilonPrime
Copy link
Member

How do avg_snowflake and avg differ? Are there just missing options? Yes, most of the consumers don't handle options today but would handling them solve the problem more cleanly?

As to naming, the point of the URI scheme is to namespace the functions so having different prefixes should be different.

Another place we might need different function namespaces might be scalar vs aggregate vs window functions. It's entirely possible one would want an avg that works as all two or more of those types.

@westonpace
Copy link
Member Author

westonpace commented Jul 31, 2024

Checking my understanding. It sounds like both @jacques-n and @joellubi are describing a new concept that is a "collection of extensions" (my brain immediately wants to use "dialect" but I'll stick with "collection" for this comment at least).

Jacques is proposing an "ordering" suggesting that there can be duplicates in a collection but the order of the files determines how they are handled.

Joel is proposing no duplicates within a collection.

Both approaches solve the "lazy consumer" problem by allowing consumers to ignore URIs and putting the burden on producers (or in between components) to construct the appropriate collection that describes the consumer's behavior.

--

If I had to pick then I think I'm partial to the ordering approach. Mainly because an "ordering" allows you to build up a collection in layers. I could start with the base "core substrait" layer and then add a "postgres layer" which doesn't have to be complete, but both adds new functions and replaces existing functions.

@jacques-n
Copy link
Contributor

jacques-n commented Jul 31, 2024

We talked about this today in the sync. My sense of general consensus:

The spec is fine. The fact that people are ignoring it isn't a problem we need to fix in the spec. Adding multiple files with the function signature (name + args) is fine.

If that's an accurate synopsis of people's thinking, I think we close this out as "not a problem". I do think it is worthwhile to open tickets for "what is valid format for URI" but the main gist of this conversation has been more about the first point @westonpace made above which I think we're concluding is a non-issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants