-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
What is the level of targeted portability of Substrait? #596
Comments
Let me tackle this in a few parts: Substrait does have some mandates regarding record orderCertain relations can specify "orderedness": https://substrait.io/relations/basics/#orderedness which is defined as:
That being said, there are a few holes. For example, the read operator is not able to specify its ordering (often the input data can have a known ordering if it is written to disk in some sorted order). For the most part, it's up to an engine on how to actually implement this. For example, a fetch (e.g. limit) relation, that receives data with a defined ordering, should return the top X results according to that ordering. Maybe this means the engine switches to single threaded execution. Maybe the engine runs in parallel with some kind of "batch index". Substrait does not care. Another thing that Substrait does not mandate is how data is output from a query engine. For example, is it returned in memory in Arrow format? Is it written to some temporary file? So there is potentially some wiggle room for an engine to respect the ordering during plan execution and then deliver the results out of order but I would say this is bad form. I would think that, if the terminal / root relation has a well defined ordering, then a well behaved consumer should return data respecting that ordering. The default comparison behavior for type classes is undefined but shouldn't beCurrently, Substrait does not define exactly how to compare items. For example, where should NaNs be placed when ordering floats? (total ordering is quite nice). This has been a long-standing TODO and it is something that we expect Substrait will eventually declare. For example, there are spots to specify a custom comparison function. Substrait defines "type classes" and not "encodings"Substrait's type classes do not mandate how the data is actually represented in memory (or on disk). A timestamp could be represented by a single 64 bit value or it could be a pair of 64 bit values. The only thing that Substrait mandates for most type classes is precision and range (e.g. if a column has an INT8 type then it must not allow a value of 128 and should trigger an overflow even if the underlying engine stores all integers as 64-bit values) When it comes to the decimal type I think the precision and range is pretty straightforward. Functions are tricky. We want to be clear but...Specifying the exact behavior of functions is a daunting task. For a simple example we can consider integer addition. When an integer overflow occurs should the engine return "the highest possible value", "an error", "null", or "wrap around". Different engines today will have different behaviors. Also, "add" is an easy one. There are other behaviors which are not well defined by Substrait. What is the exact answer for a floating point sum operation? Because of floating point rounding errors the final answer will depend on the algorithm used (e.g. kahan summation). Consider the There has been some attempt to define these corner case behaviors using options (there is an This is still an ongoing effort. No producer that I am aware of actually sets these options. Most consumers ignore them completely. I think the most complete work here has been in the BFT which attempts to map various SQL (not necessarily just Substrait) engines to specific function options. One potential path forward, instead of requiring consumers to specify options, is to allow consumers to make "blank plans" (no options specified) and then users can "localize" the blank plan to whatever dialect they are used to (e.g. what behaviors they expect) using the information from the BFT. This will still require consumers to actually respect the options and reject plans accordingly. An ever longer term goal would be to translate plans from one dialect to another. ConclusionBoiling this all down I would say that the prime directive for Substrait is to be pragmatic about what's possible to specify today while providing a path towards clear and definite plans in the future. Given this, I would say that the current state of things is that different engines/consumers can respond to plans in very different (and sometimes invalid) ways. |
Thanks for the prompt and elaborate answer! This was exactly was I was looking for!
Thanks for that pointer. I hadn't looked at the specification in quite a while but I think it is quite clear now: the order is undefined except where specified otherwise. For me, this makes it clear what the specification mandates.
That would make sense. But there is something in the spec that I would have interpreted otherwise:
Good point! BTW, it may be worth specifying explicitly what string equality means (if not done so already). Collations in SQL may affect string ordering and even string equality in quite unexpected ways!
Right, I think I can related that to the specification. So the semantics are pretty well-defined. Just to confirm: this means that a Substrait consumer that has a different definition of, say,
Good point! BTW: this should carry over to aggregation with sum, right?
I'd argue that this is related to orderedness. The question arises for any function that is either not associative or not commutative (or both); floating-point addition is only one of them. Take Applied to floating-point addition and respecting the fact that it is not associative, this would mean that only one result is valid if the input order is specified, namely that of summing the inputs in their input order. However, for performance reasons, one could argue that we should pretend that the summation is associative, in which case any summation order would be fine. Just for completeness: Even integer addition is non-associative due to overflows. Whether or not an overflow occurs may depend on the summation order... I am wondering if associativity and commutativity aren't something that is missing in the definition of
Wow, I did not know about BFT but that looks extremely valuable! This is what I'd expect/hope for for the specification of functions in Substrait. Unless I am overlooking something, today, it isn't even stated explicitly that, say,
This makes complete sense to me! A Substrait producer would have to select options such that the set of possible answers corresponds to what they expect, most likely depending on the semantics of their input language. While there would be multiple possible options at various points in the plan and each might restrict the possible set of answers to more than one (i.e., there may still be multiple different "correct" answers due to indeterminism or implementation choice), it would still be (theoretically) possible to distinguish correct from incorrect answers.
What I think could be helpful here is an official test suite. Given a database and a Substrait plan, what is/are the expected result(s). The outcome of each test would probably one of "correct", "incorrect", and "unsupported"; the latter case letting engines report that they don't support a particular option. Deciding between correct and incorrect may not be trivial in many cases, namely, if several answers are acceptable; however, one can probably get quite far with very small inputs where it is still possible to simply enumerate all correct results. To circle back to my original question: would such a level of portability the desired end goal of Substrait? |
I'll start with the last question. In my opinion, yes. Though Substrait is a community project, others are free to have differing opinions. If I had to guess I think that most others are simply indifferent on this issue. It is viewed as too lofty a goal but innocuous.
Ah! Good catch. Looks like that isn't a hole after all.
Yes, I think we want to explicitly define default equality and ordering for all of the type classes (though maybe string is the only interesting one when it comes to equality).
If the types behave semantically different (e.g. infinite precision) then yes. If it's just different storage (e.g. variable-length binary) then it's fine. Although I don't expect emulation to be likely anytime soon (it's a supply and demand issue and the demand for Substrait isn't high enough yet). For now, I think the goal is to catch this and flag it. Users should be forced to say something like "allow non-standard types" or "allow alternative decimals". If anyone ever ends up adding a sqlite consumer this type of thing will be a big headache 😆 Integer overflow seamlessly transitions results into floats. We'd need to invent some kind of "number" user defined data type.
Any aggregate or window function that is either not associative or commutative. Scalar functions should yield the exact same answer regardless of the order of the inputs (although there is some contention / gray area here which is why seeded random is not yet a scalar function). Also, non-associative scalar functions can technically cause different results when plans get restructured (e.g. during an optimization pass) but I think that is a separate issue entirely 😵
Note that, aggregate functions whose output obviously depends on the order (e.g. In postgres you would do substrait/proto/substrait/algebra.proto Lines 1475 to 1479 in d9b9672
Nothing in Substrait would prevent you from defining an ordering for
https://github.com/substrait-io/consumer-testing Also, BFT has its own suite of tests (though focused on expressions, not plans, and not using Substrait) |
Just to keep track of this somewhere, and because it may be of interest for someone finding this issue: I learned about two related resources:
|
I am trying to understand how rigorous Substrait aims to be with its specification and, hence, how portable it will eventually be. More concretely, do we expect exactly the same results independently of the backend that this runs on?
Note that in the SQL standard and most SQL dialects, this isn't the case at least with respect to the order of the results (including any unsorted intermediate result) even for different executions in the same system. If not, how exactly do we define the semantics and hence the set of "correct" results of a plan and how do we test implementations for compliance? On the issue of result order, the Substrait specification could leave the order undefined, like in the SQL standard, and one could probably come up with some acceptance tests that allow any valid execution.
However, there are other issues, such as the exact behavior of
DECIMAL
s, which may result in different results unrelated to the order. (I just commented on ibis-project/ibis#8195, see details there.) If Substrait's specification said "the decimal operations do whatever the consumer does with decimals", then changing consumers might obviously result in a different query result. If it specified instead which exact bits need to be produced, it could be that some consumers need to emulate the specified behavior with more and/or more costly operations, which has its own downsides. What is Substrait's take here?The text was updated successfully, but these errors were encountered: