Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add test file format #680

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
138 changes: 138 additions & 0 deletions tests/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@
# Substrait Test Format

This document describes the format for Substrait scalar test files.
A test file consists of the following elements:

1. Version declaration
2. Optional include statements
3. One or more test groups, each containing one or more test cases

## Syntax

### Version Declaration
The version declaration must be the first line of the file. It specifies the version of the test file format. The version declaration must be in the following format:
```
### SUBSTRAIT_SCALAR_TEST: V1
```

### Include Statements
Include statements should have at least one include statement. The include statement specifies the path to substrait extension functions. The include statement must be in the following format:
```
### SUBSTRAIT_INCLUDE: /extensions/functions_aggregate_approx.yaml
scgkiran marked this conversation as resolved.
Show resolved Hide resolved
```

### Test Groups
scgkiran marked this conversation as resolved.
Show resolved Hide resolved
A test group is a collection of test cases that are logically related.
- **description**: A string describing the test group or case. The description must start with a `#` character.
```code
# Common Maths
```
### Test Cases
A test case consists of the following elements:

- **function**: The name of the function being tested. The function name must be a string.
- **arguments**: Comma-separated list of arguments to the function. The arguments must be literals.
scgkiran marked this conversation as resolved.
Show resolved Hide resolved
- **options**: Optional comma-separated list of options in `key:value` format. The options describe the behavior of the function. The test should be run only on dialects that support the options. If options are not specified, the test should be run for all permutations of the options.
scgkiran marked this conversation as resolved.
Show resolved Hide resolved
- **result**: The expected result of the function. Either `SUBSTRAIT_ERROR` or a literal value.
- **literal**: In the format `<name>::<datatype>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name? Maybe value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to value

- **description**: A string describing the test case

```code
add(126::i8, 1::i8) = 127::i8 # addition of two numbers
```

### Spec
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we define the specification here and in the Antlr grammar file will we end up changing both places all the time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, hopefully only when new types are introduced.


```
doc := <version>
(<include>)+
((<test_group>)?(<test_case>)+\n)+
version := ### SUBSTRAIT_SCALAR_TEST: <test_library_version>
include := ### SUBSTRAIT_INCLUDE: <uri>
test_group := # <description>
test_case := <function>(<arguments>) ([<options>])? = <result> (#<description>)?
description := string
function := string
Comment on lines +54 to +55
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the token string or is string defined somewhere? Can function have a space in it or special characters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[A-Za-z0-9] [A-Za-z0-9_]*
I will redefine the grammar in antlr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using the definition of string already implemented by the textplan Antlr grammar.

arguments := <argument>, <argument>, ... <argument>
argument := <literal>
literal := <name>::<datatype>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is name defined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to value. I described literal values in separate section.

result := SUBSTRAIT_ERROR | <literal>
options := <optLiteral>, <optLiteral>, ... <optLiteral>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Options are values passed to the functions but not Substrait literals. Probably best to avoid the confusion here.

optLiteral := <option_name>:<option_value>
scgkiran marked this conversation as resolved.
Show resolved Hide resolved
```

**TODO:** use ANTLR to describe the grammar and generate parser
### Literals

#### String
- **string**, **fixedchar**, **varchar**: A sequence of characters enclosed in single quotes. Example: 'Hello, world!'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are escapes handled?


#### Integer
Integers are represented as sequences of digits. Negative numbers are preceded by a minus sign.
- **i8**: 8-bit integer, range: -128 to 127
- **i16**: 16-bit integer, range: -32,768 to 32,767
- **i32**: 32-bit integer, range: -2,147,483,648 to 2,147,483,647
- **i64**: 64-bit integer, range: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I get the reason for using -32,768 to 32,767 (with commas) but those aren't valid literals. Would it be clearer to say -32768 to 32767?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed it.


#### Fixed Point decimals
- **decimal**: Fixed-point decimal number. Maximum 38 digits total, with up to 37 digits after the decimal point.
Example: 123.456

#### Floating Point numbers
- **float**: General floating-point number, can be represented as:
* Standard decimal notation: 123.456
* Scientific notation: 1.23e4
* Special values: `nan` (Not a Number), `+inf` (Positive Infinity), `-inf` (Negative Infinity)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these case sensitive?

Also, what about +0 vs -0 and "signaling NaN" vs "non-signaling NaN"? I think it's ok to not have a way to represent the distinction but we should be explicit that we don't in the description here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added +0 -0 snan

- **float32**: Single-precision float, approximately 6 significant digits, range: ~1.2e-38 to ~3.4e38
- **float64**: Double-precision float, approximately 15 significant digits, range: ~2.3e-308 to ~1.7e308
Comment on lines +94 to +95
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should a "test runner" only validate the answer within 6 digits? In other words, if one engine returns (float32) divide(1.0, 3.0) as 0.3333333432674407958984375 and another engine returns divide(1.0, 3.0) as 0.3333330452442169189453125 should the test runner consider both engines correct?


#### Boolean
- Valid values: TRUE, FALSE, NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is NULL called out explicitly here? Can't all data types be null? Also, we should mention what exactly the syntax for null is (is it null or NULL)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed null from here. I am thinking case sensitive null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these case sensitive TRUE and FALSE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking case sensitive true and false


#### Date and Time
All date and time literals use ISO 8601 format:

- **date**: `YYYY-MM-DD`, example: `2021-01-01`
- **time**: `HH:MM:SS[.fraction]`, example: `12:00:00.000`
- **timestamp**: `YYYY-MM-DD HH:MM:SS[.fraction]`, example: `2021-01-01 12:00:00`
- **timestamp_tz**: `YYYY-MM-DD HH:MM:SS[.fraction]±HH:MM`, example: `2021-01-01 12:00:00+05:30`
- **interval year**: `INTERVAL 'P[n]Y[n]M'`, example: `INTERVAL 'P2Y3M'` (2 years, 3 months)
- **interval days**: `INTERVAL 'P[n]DT[n]H[n]M[n]S'`, example: `INTERVAL 'P2DT3H2M9S'` (2 days, 3 hours, 2 minutes, 9 seconds)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This a somewhat recent change, but also need to encode precision for interval_day now.

See https://substrait.io/types/type_classes/#compound-types


#### Other complex types
**TODO** Add support for complex types like arrays, structs, maps etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are already defined in the text format. You do need a way of specifying these that is amenable to recursive parsing. The Substrait C++ type format took an approach that was more error prone.


### Data Types

- **bool**: Boolean
- **i8**: 8-bit signed integer
- **i16**: 16-bit signed integer
- **i32**: 32-bit signed integer
- **i64**: 64-bit signed integer
- **f32**: 32-bit floating point number
- **f64**: 64-bit floating point number
- **dec**: Fixed-point `decimal<P,S>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do short forms still need to be parameterized? Can we refer to the list elsewhere? Can long names be interchangeably used?

Copy link
Contributor Author

@scgkiran scgkiran Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes parametrization is needed. I prefer to not to use long names in the tests.

- **str**: Variable-length string
- **fchar**: Fixed-length string `fixedchar<N>`
- **vchar**: Variable-length string `varchar<N>`
- **vbin**: Fixed-length binary `fixedbinary<N>`
- **date**: Date
- **time**: Time
- **ts**: Timestamp
- **tstz**: Timestamp with timezone
scgkiran marked this conversation as resolved.
Show resolved Hide resolved
- **iyear**: Interval year
- **iday**: Interval days
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the short names defined in https://substrait.io/extensions/#function-signature-compound-names

If that is true can we link there (instead of or in addition to explicitly listing here)?

Also, can we get away with the short name? I think we can, but that means we will do inference based on the literal value. For example, if the type is fchar and the literal value is 'abc' then the tester should infer the full type is fixedchar<3>. Or, for a more interesting example, if the type is dec and the literal value is 01.00200 then the tester should infer the type is decimal<7,5>. Or if ts and 2021-03-20 01:17:00.123 then we can infer the type is precision_timestamp<3>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to retain the shortType name, and verify that literal value matches the type in the tester.
This will

  1. account the tests for each function signature and identify functions with no/low test coverage.
  2. Catch any unintended mis typed testcases.


### Example of a test file

```code
### SUBSTRAIT_SCALAR_TEST:V1
### SUBSTRAIT_INCLUDE: /extensions/functions_arithmetic.yaml

# Common Maths
add(126::i8, 1::i8) = 127::i8

# Arithmetic Overflow Tests
add(127::i8, 1::i8) [overflow:ERROR] = <!ERROR> #check overflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is <!ERROR> defined? I see something about SUBSTRAIT_ERROR in the grammar but nothing about a ! or <> (and SUBSTRAIT_ERROR != ERROR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it now.

```
The above test file has two test groups "Common Maths" and "Arithmetic Overflow Tests". Each has one test case. The test case in the second group has a name whereas case in the first one does not.
31 changes: 31 additions & 0 deletions tests/cases/arithmetic/add.test
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
### SUBSTRAIT_SCALAR_TEST: 1.0
### SUBSTRAIT_INCLUDE: /extensions/functions_arithmetic.yaml

# basic: Basic examples without any special cases
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the name the whole line or only up to the semicolon?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whole line

add(120::i8, 5::i8) = 125::i8
add(100::i16, 100::i16) = 200::i16
add(30000::i32, 30000::i32) = 60000::i32
add(2000000000::i64, 2000000000::i64) = 4000000000::i64

# overflow: Examples demonstrating overflow behavior
add(120::i8, 10::i8) [overflow:ERROR] = error
add(30000::i16, 30000::i16) [overflow:ERROR] = error
add(2000000000::i32, 2000000000::i32) [overflow:ERROR] = error
add(9223372036854775807::i64, 1::i64) [overflow:ERROR] = error
scgkiran marked this conversation as resolved.
Show resolved Hide resolved

# overflow: Examples demonstrating overflow behavior tests: overflow with SATURATE
add(120::i8, 10::i8) [overflow:SATURATE] = 127::i8
add(-120::i8, -10::i8) [overflow:SATURATE] = -128::i8

# overflow: Examples demonstrating overflow behavior tests: overflow with SILENT
add(120::i8, 10::i8) [overflow:SILENT] = undefined
scgkiran marked this conversation as resolved.
Show resolved Hide resolved

# floating_exception: Examples demonstrating exceptional floating point cases
add(1.5e+308::fp64, 1.5e+308::fp64) = inf::fp64
add(-1.5e+308::fp64, -1.5e+308::fp64) = -inf::fp64

# rounding: Examples demonstrating floating point rounding behavior
add(4.5::fp32, 2.500001::fp32) [rounding:TIE_TO_EVEN] = 7.000001::fp32
Comment on lines +27 to +28
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really demonstrate rounding error. If we're sticking with the 6/15 digits of precision thing then it is impossible to demonstrate rounding error. Should we just remove these tests?


# types: Examples demonstrating behavior of different data types
add(4.5::fp64, 2.5000007152557373::fp64) = 7.00000071525573::fp64
21 changes: 21 additions & 0 deletions tests/cases/arithmetic_decimal/power.test
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
### SUBSTRAIT_SCALAR_TEST: 1.0
### SUBSTRAIT_INCLUDE: extensions/functions_arithmetic_decimal.yaml

# basic: Basic examples without any special cases
power(8::decimal, 2::decimal<38, 0>) = 64::fp64
scgkiran marked this conversation as resolved.
Show resolved Hide resolved
power(1.0::decimal, -1.0::decimal<38, 0>) = 1.0::fp64
power(2.0::decimal<38, 0>, -2.0::decimal<38, 0>) = 0.25::fp64
power(13::decimal<38, 0>, 10::decimal<38, 0>) = 137858491849::fp64

# result_more_than_input_precison: Examples demonstrating result with more precision than input
power(16::decimal<2, 0>, 4::decimal<38, 0>) = 65536::fp64
Copy link
Member

@EpsilonPrime EpsilonPrime Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could reuse the text format's definition of literals?

https://github.com/substrait-io/substrait-cpp/blob/main/src/substrait/textplan/README.md

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would be to stick with :: format generally. It is a cast format consistent with many systems the people looking at these tests will be familiar with and I believe much more intuitive understood than underscore.

I'm similarly inclined to try to find more standard ways to represent other things as opposed to what the text format has done. There are several examples of alternative ways to implement things to what is commonly done in SQL (which has the most consistent way to approach these things across systems). Examples beyond the use of cast syntax versus underscores include. (1) String literals are double quoted and backticked while sql string literals are single quoted and (2) brackets being overloaded to mean any of intervals, lists, maps and structs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The :: format is a type subclass operator. As such it requires a mental context switch every time I look at it. The underscore was chosen in the text format to create a visual separation between the literal value and its type thus aiding in readability (and it's also requires less keystrokes).

Braces are used for literal values and angle brackets are used for type composition and parameterization. This helps to reinforce the distinction between value and type. I have yet to be able to determine what the value of any interval literal is in the proposed format. However the text format's use of labels makes it ultra-clear what each part of the interval is.

The text format's string implementation does draw from other sources than SQL but that's mostly because the SQL version has its own warts which make it harder to read. SQL's implicit concatenation is the main SQL string feature that is difficult for humans to parse (it requires the counting of active single quotes).

The text format was reviewed in the community meeting twice (February 1st, 2023 and April 12th, 2023) so presumably there is some community agreement about its design validity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that the text format defined in substrait-cpp isn't valid. It's definitely valid.

I think we come from different perspectives. I'm coming from the perspective that we should try to leverage existing patterns as much as possible. My expectation is that the vast majority of systems that will be tested are SQL systems so leveraging SQL patterns is desirable as much as reasonably possible and comes with the added benefit of being obvious in meaning to a large group of people reviewing the test case. If we look at the set of BFT dialects, for example, we have the following:

Dialect API Single Quotes for Literals Supports :: for Cast
cudf cudf api no no
datafusion SQL yes no
duckdb SQL yes yes
postgres SQL yes yes
snowflake SQL yes yes
sqllite SQL yes no
velox velox api no no

So a decent portion are SQL based. Single-quoted literals are the most common way to represent strings in those systems. 3/7 have support for the double colon as meaning cast. Given that, I prefer that over something invented.


# floating_exception: Examples demonstrating exceptional floating point cases
power(1.5e+10::decimal<38, 0>, 1.5e+20::decimal<38, 0>) = inf::fp64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be the short form names as noted above or are both legal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it should be short form. Fixed it.

power(-16::decimal<4, 0>, 1001::decimal<4, 0>) = -inf::fp64

# complex_number: Examples demonstrating complex number output
power(-1::decimal, 0.5::decimal<38,1>) [complex_number_result:NAN] = nan

# complex_number: Examples demonstrating complex number output tests: complex_number_result with ERROR
power(-1::decimal, 0.5::decimal<38,1>) [complex_number_result:ERROR] = error
25 changes: 25 additions & 0 deletions tests/cases/datetime/lt_datetime.test
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
### SUBSTRAIT_SCALAR_TEST: 1.0
### SUBSTRAIT_INCLUDE: /extensions/functions_datetime.yaml

# timestamps: examples using the timestamp type
lt(2016-12-31 13:30:15::timestamp, 2017-12-31 13:30:15::timestamp) = True::boolean
lt(2018-12-31 13:30:15::timestamp, 2017-12-31 13:30:15::timestamp) = False::boolean

# timestamp_tz: examples using the timestamp_tz type
lt(1999-01-08 01:05:05-08:00::timestamp_tz, 1999-01-08 04:05:06-05:00::timestamp_tz) = True::boolean
lt(1999-01-08 01:05:06-08:00::timestamp_tz, 1999-01-08 04:05:06-05:00::timestamp_tz) = False::boolean
scgkiran marked this conversation as resolved.
Show resolved Hide resolved

# date: examples using the date type
lt(2020-12-30::date, 2020-12-31::date) = True::boolean
lt(2020-12-31::date, 2020-12-30::date) = False::boolean

# interval: examples using the interval type
lt(INTERVAL 'P7D'::interval, INTERVAL 'P6D'::interval) = False::boolean
scgkiran marked this conversation as resolved.
Show resolved Hide resolved
lt(INTERVAL 'P5D'::interval, INTERVAL '6D'::interval) = True::boolean
lt(INTERVAL 'P5Y'::interval, INTERVAL 'P6Y'::interval) = True::boolean
lt(INTERVAL 'P7Y'::interval, INTERVAL 'P6Y'::interval) = False::boolean

# null_input: examples with null args or return
lt(None::interval, INTERVAL 'P5D'::interval) = Null::boolean
lt(None::date, 2020-12-30::date) = Null::boolean
lt(None::timestamp, 2018-12-31 13:30:15::timestamp) = Null::boolean
scgkiran marked this conversation as resolved.
Show resolved Hide resolved
Loading