
[DISCUSSION] Properly support array values in new engine #1300

Closed
GumpacG opened this issue Jan 27, 2023 · 8 comments · Fixed by #3095 · May be fixed by #3118

@GumpacG
Collaborator

GumpacG commented Jan 27, 2023

What is the bug?

The new engine does not return array values as an array, while the legacy engine returns all values in a row as an array. Implementing the same support as V1 isn't the right way, because the legacy engine produces inconsistent values.

How can one reproduce the bug?

Steps to reproduce the behavior:

  1. Start OpenSearch server
  2. Delete the index if it was created before: curl -XDELETE 'http://localhost:9200/dbg'
  3. Create a simple index with automatic mapping: curl -X POST "localhost:9200/dbg/_doc/?pretty" -H 'Content-Type: application/json' -d '{"myNum": 5}'
  4. Query data: select * from dbg. Not bad so far.
  5. Add new doc: curl -X POST "localhost:9200/dbg/_doc/?pretty" -H 'Content-Type: application/json' -d '{"myNum": [3, 4]}'
  6. Check mapping: curl -X GET "localhost:9200/dbg?pretty"
"mappings" : {
  "properties" : {
	"myNum" : {
	  "type" : "long"
	}
  }
}
  7. Query in the new engine: curl -s -XPOST http://localhost:9200/_plugins/_sql -H 'Content-Type: application/json' -d '{"query": "select * from dbg"}'
{
  "schema": [
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "datarows": [
    [
      5
    ],
    [
      3
    ]
  ],
  "total": 2,
  "size": 2,
  "status": 200
}

(if you have only the second doc in the index)

"schema": [
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "datarows": [
    [
      3
    ],
    [
      4
    ]
  ]
  8. Query in the legacy engine: curl -s -XPOST http://localhost:9200/_plugins/_sql -H 'Content-Type: application/json' -d '{"query": "select * from dbg", "fetch_size": 20}'
{
  "schema": [
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "total": 2,
  "datarows": [
    [
      5
    ],
    [
      [
        3,
        4
      ]
    ]
  ],
  "size": 2,
  "status": 200
}

What is the expected behavior?

TBD

Why is the legacy response incorrect?

It declares the data type as long, but returns both a number and an array of numbers. Imagine a user has a parser for the response; what should the parser do with such values?
You can try our JDBC driver as an example of a customer application.

What is your host/environment?

main @ 6108ca1

@GumpacG GumpacG added bug Something isn't working untriaged labels Jan 27, 2023
@Yury-Fridlyand Yury-Fridlyand changed the title [BUG] New engine does not return array values as an array [DISCUSSION] Properly support array values in new engine Jan 31, 2023
@Yury-Fridlyand
Collaborator

We had a team discussion outside of GitHub; I'll list the ideas we came up with and notes on them.

Return all values as an array when there is a mix of array<type> and type

TL;DR Not applicable.
Imagine this fix is done and you repeat the experiment given above.

  1. When the index dbg has one doc, SQL returns type long, but once a second doc is posted to the index, SQL returns an array.
  2. When you query for the first doc only, SQL returns type long, but when you query for the second (or both), SQL returns an array.

The user didn't change the index/mapping/field/column type, yet the data type changed; that is unacceptable.

Always return type as array

TL;DR Not applicable.

A user would lose information about the data type of a column/field.

Add new REST argument to parametrize the query

It could be part of the URL, e.g. localhost:9200/_plugins/_sql?param=value, or part of the JSON body, e.g.

{
  "query": "select * from dbg",
  "param": "value"
}

In that case the user would be able to specify whether they want a loose response (like legacy produces now) or a strict one. TBD what to do with array values in the strict case: replace them with nulls? Don't return them at all? A sketch of such a request is shown below.
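
A minimal sketch of what such a request could look like, assuming the flag lives in the JSON body; the parameter name strict_arrays is purely illustrative, not an agreed-upon name:

curl -s -XPOST http://localhost:9200/_plugins/_sql \
  -H 'Content-Type: application/json' \
  -d '{"query": "select * from dbg", "strict_arrays": true}'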

Put all responsibility on user: CAST

Enforce strict mode. A user should use CAST to get array values. Without a cast, the values would be TBD (omitted? nulled?).
Example:

SELECT CAST(myNum AS ARRAY), myNum FROM dbg;

A cast to array would have to be implemented, though. A sketch of how this could look is shown below.
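
Purely for illustration, assuming an ARRAY target type and a matching schema type are added (neither exists today, and wrapping the scalar 5 as [5] is also an assumption), the request and response could look roughly like this:

SELECT CAST(myNum AS ARRAY) AS nums FROM dbg;

{
  "schema": [
    {
      "name": "nums",
      "type": "array"
    }
  ],
  "datarows": [
    [
      [5]
    ],
    [
      [3, 4]
    ]
  ]
}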

Put all responsibility on user: PartiQL

Ref link: https://partiql.org/tutorial.html#variables-can-range-over-data-with-different-types

Enforce strict mode. A user should specify how to interpret values using a combination of CASE and IS clauses. TBD what to do with the value if no instructions are given (omit? null? error?).

SELECT CASE WHEN (myNum IS ARRAY) THEN myNum[0]
       ELSE myNum END AS myNum
FROM dbg;

or

SELECT num AS myNum FROM dbg as d, d.myNum[0] as num

To be continued...

Ideas are welcome!

Notes

1. In any case, a user should be able to tell that there is a complex value in the index. Consider updating the response for the DESCRIBE query (see the sketch after this list).
2. If a complex value is produced by a function (e.g. nested) and can't be inserted into the response (for example, a CAST or CASE is missing), an error should be raised. It should say unambiguously what is missing in the query.
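
As a purely illustrative sketch, assuming the existing DESCRIBE TABLES LIKE syntax is extended (the array indicator is an assumption, not an existing feature), the metadata query could surface which fields hold array values in at least one document:

curl -s -XPOST http://localhost:9200/_plugins/_sql -H 'Content-Type: application/json' -d '{"query": "DESCRIBE TABLES LIKE dbg"}'

The returned column rows would then carry an array flag next to the existing type information; the exact shape is an open question.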

@dai-chen
Collaborator

dai-chen commented Feb 1, 2023

@Yury-Fridlyand Thanks for sharing the notes! Have we also considered the idea from Presto/Trino: #442 (comment)?

@dai-chen dai-chen added feature and removed bug Something isn't working untriaged labels Feb 1, 2023
@Yury-Fridlyand
Collaborator

Following that idea, the response would look like

{
  "schema": [
    {
      "name": "myNum",
      "type": "long",
      "array": true
    }
  ],
  "total": 2,
  "datarows": [
    [
      5
    ],
    [
      [
        3,
        4
      ]
    ]
  ],
  "size": 2,
  "status": 200
}

Why not?
It could be implemented along with the CAST/PartiQL approach. Keep in mind it is a breaking change.

@GumpacG
Collaborator Author

GumpacG commented Feb 8, 2023

This also concerns fields being passed into functions, since only the last value in the array is used. Also, note the incorrect type in the example below (a sketch of how such an index could be reproduced follows the output).
For example:
Query: SELECT upper(name), name FROM string_array
Output:

{
    "schema": [
        {
            "name": "upper(name)",
            "type": "double"
        },
        {
            "name": "name",
            "type": "keyword"
        }
    ],
    "total": 5,
    "datarows": [
        [
            "JOE",
            "joe"
        ],
        [
            "STEVE",
            "steve"
        ],
        [
            "BOB",
            "bob"
        ],
        [
            "ANDY",
            "andy"
        ],
        [
            "ADAM",
            [
                "david",
                "adam"
            ]
        ]
    ],
    "size": 5,
    "status": 200
}
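
For reference, an index like the one above can be reproduced roughly as follows (the documents are inferred from the query output above; the exact test data isn't shown here):

curl -X POST "localhost:9200/string_array/_doc" -H 'Content-Type: application/json' -d '{"name": "joe"}'
curl -X POST "localhost:9200/string_array/_doc" -H 'Content-Type: application/json' -d '{"name": "steve"}'
curl -X POST "localhost:9200/string_array/_doc" -H 'Content-Type: application/json' -d '{"name": "bob"}'
curl -X POST "localhost:9200/string_array/_doc" -H 'Content-Type: application/json' -d '{"name": "andy"}'
curl -X POST "localhost:9200/string_array/_doc" -H 'Content-Type: application/json' -d '{"name": ["david", "adam"]}'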

@akuzin1

akuzin1 commented Jun 8, 2023

This is currently a blocker for implementing array support in the JDBC driver, so it would be a great feature to have. Any status update on its priority?

@Yury-Fridlyand
Collaborator

I'm going to research possible solutions for this and share them for discussion.
Meanwhile, I found a trickier sample which we should consider as well.

POST dbg/_doc
{
  "obj": [[1, 2], [3, 4], 5]
}

search result:

{
  "took" : 361,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "dbg",
        "_id" : "uDBnoYgB5NyEnr3HyanK",
        "_score" : 1.0,
        "_source" : {
          "obj" : [
            [
              1,
              2
            ],
            [
              3,
              4
            ],
            5
          ]
        }
      }
    ]
  }
}

mapping:

{
  "dbg" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "obj" : {
          "type" : "long"
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1686168574104",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "LN-lYUQFSsi9MVpa5VT-Zg",
        "version" : {
          "created" : "136297827"
        },
        "provided_name" : "dbg"
      }
    }
  }
}

@Yury-Fridlyand
Collaborator

Please continue the discussion in #1733.

@acarbonetto
Collaborator

Supporting all of the above use cases will take multiple attempts, and each should be dealt with separately. We can split each use case into an individual issue.

To solve primitives/arrays expanding into multiple rows, we should try to use metadata (cues) to determine whether the data should be treated as a primitive or as an array (we cannot easily do both). Doing something like what Presto/Trino supports in the index mapping (https://trino.io/docs/current/connector/elasticsearch.html#array-types) would be simple. It would indicate that the mapped field should treat the data record as an array (or as an array of one if the value is not an array). A sketch of such a mapping hint is shown below.
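
A rough sketch of what such a mapping hint could look like, modeled on the Trino _meta convention; the key layout used here (an opensearch-sql section with an array flag) is an assumption, not an existing feature:

curl -X PUT "localhost:9200/dbg/_mapping" -H 'Content-Type: application/json' -d '
{
  "_meta": {
    "opensearch-sql": {
      "myNum": { "array": true }
    }
  }
}'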

We should PoC this and see if it works to solve #1733 (comment).
