
[DISCUSSION] Properly support array values in new engine #1300

Closed
GumpacG opened this issue Jan 27, 2023 · 8 comments · Fixed by #3095 · May be fixed by #3118

@GumpacG
Collaborator

GumpacG commented Jan 27, 2023

What is the bug?

The new engine does not return array values as an array, while the legacy engine returns all values in a row as an array. Implementing the same support as V1 isn't the right way, because the legacy engine produces inconsistent values.

How can one reproduce the bug?

Steps to reproduce the behavior:

  1. Start OpenSearch server
  2. Delete the index if it was created before: curl -XDELETE 'http://localhost:9200/dbg'
  3. Create a simple index with automatic mapping: curl -X POST "localhost:9200/dbg/_doc/?pretty" -H 'Content-Type: application/json' -d '{"myNum": 5}'
  4. Query data: select * from dbg. Not bad so far.
  5. Add new doc: curl -X POST "localhost:9200/dbg/_doc/?pretty" -H 'Content-Type: application/json' -d '{"myNum": [3, 4]}'
  6. Check mapping: curl -X GET "localhost:9200/dbg?pretty"
"mappings" : {
  "properties" : {
	"myNum" : {
	  "type" : "long"
	}
  }
}
  7. Query in the new engine: curl -s -XPOST http://localhost:9200/_plugins/_sql -H 'Content-Type: application/json' -d '{"query": "select * from dbg"}'
{
  "schema": [
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "datarows": [
    [
      5
    ],
    [
      3
    ]
  ],
  "total": 2,
  "size": 2,
  "status": 200
}

(if you have only the second doc in the index)

"schema": [
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "datarows": [
    [
      3
    ],
    [
      4
    ]
  ]
  8. Query in the legacy engine: curl -s -XPOST http://localhost:9200/_plugins/_sql -H 'Content-Type: application/json' -d '{"query": "select * from dbg", "fetch_size": 20}'
{
  "schema": [
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "total": 2,
  "datarows": [
    [
      5
    ],
    [
      [
        3,
        4
      ]
    ]
  ],
  "size": 2,
  "status": 200
}

What is the expected behavior?

TBD

Why is the legacy response incorrect?

It declares the data type as long, but returns both a number and an array of numbers. Imagine a user has a parser for the response; what should the parser do with such values?
You can try our JDBC driver as an example of a customer application.

What is your host/environment?

main @ 6108ca1

@GumpacG GumpacG added bug Something isn't working untriaged labels Jan 27, 2023
@Yury-Fridlyand Yury-Fridlyand changed the title [BUG] New engine does not return array values as an array [DISCUSSION] Properly support array values in new engine Jan 31, 2023
@Yury-Fridlyand
Collaborator

We had a team discussion outside of GitHub; I'll list the ideas we came up with and notes on them.

Return all values as an array when there is a mix of array<type> and type

TL;DR Not applicable.
Imagine this fix is done and you repeat the experiment given above.

  1. When the index dbg has one doc, SQL returns type long, but once a second doc is posted to the index, SQL returns an array.
  2. When you query for the first doc only, SQL returns type long, but when you query for the second (or both), SQL returns an array.

The user didn't change the index/mapping/field/column type, yet the data type changed; that is unacceptable.

Always return type as array

TL;DR Not applicable.

A user would lose information about the data type of a column/field.

Add new REST argument to parametrize the query

It could be part of the URL, e.g. localhost:9200/_plugins/_sql?param=value, or part of the JSON body, e.g.

{
  "query": "select * from dbg",
  "param": "value"
}

In that case the user would be able to specify whether they want a loose response (like legacy produces now) or a strict one. TBD what to do with array values in the strict case: replace them with nulls? Don't return them at all? A sketch of such a request is shown below.
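
A minimal sketch of what such a request could look like, assuming the flag lives in the JSON body; the parameter name strict_arrays is purely illustrative, not an agreed-upon name:

curl -s -XPOST http://localhost:9200/_plugins/_sql \
  -H 'Content-Type: application/json' \
  -d '{"query": "select * from dbg", "strict_arrays": true}'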

Put all responsibility on user: CAST

Enforce strict mode. A user should use CAST to get array values. Without a cast, the values would be TBD (omitted? nulled?).
Example:

SELECT CAST(myNum AS ARRAY), myNum FROM dbg;

A cast to array would have to be implemented, though. A sketch of how this could look is shown below.
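
Purely for illustration, assuming an ARRAY target type and a matching schema type are added (neither exists today, and wrapping the scalar 5 as [5] is also an assumption), the request and response could look roughly like this:

SELECT CAST(myNum AS ARRAY) AS nums FROM dbg;

{
  "schema": [
    {
      "name": "nums",
      "type": "array"
    }
  ],
  "datarows": [
    [
      [5]
    ],
    [
      [3, 4]
    ]
  ]
}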

Put all responsibility on user: PartiQL

Ref link: https://partiql.org/tutorial.html#variables-can-range-over-data-with-different-types

Enforce strict mode. A user should specify how to interpret values using a combination of CASE and IS clauses. TBD what to do with the value if no instructions are given (omit? null? error?).

SELECT CASE WHEN (myNum IS ARRAY) THEN myNum[0]
       ELSE myNum END AS myNum
FROM dbg;

or

SELECT num AS myNum FROM dbg as d, d.myNum[0] as num

To be continued...

Ideas are welcome!

Notes

1. In any case, a user should be able to tell that there is a complex value in the index. Consider updating the response for the DESCRIBE query (see the sketch after this list).
2. If a complex value is produced by a function (e.g. nested) and can't be inserted into the response (for example, a CAST or CASE is missing), an error should be raised. It should say unambiguously what is missing in the query.
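
As a purely illustrative sketch, assuming the existing DESCRIBE TABLES LIKE syntax is extended (the array indicator is an assumption, not an existing feature), the metadata query could surface which fields hold array values in at least one document:

curl -s -XPOST http://localhost:9200/_plugins/_sql -H 'Content-Type: application/json' -d '{"query": "DESCRIBE TABLES LIKE dbg"}'

The returned column rows would then carry an array flag next to the existing type information; the exact shape is an open question.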

@dai-chen
Collaborator

dai-chen commented Feb 1, 2023

@Yury-Fridlyand Thanks for sharing the notes! Have we also considered the idea from Presto/Trino: #442 (comment)?

@dai-chen dai-chen added feature and removed bug Something isn't working untriaged labels Feb 1, 2023
@Yury-Fridlyand
Collaborator

Following that idea, the response would look like

{
  "schema": [
    {
      "name": "myNum",
      "type": "long",
      "array": true
    }
  ],
  "total": 2,
  "datarows": [
    [
      5
    ],
    [
      [
        3,
        4
      ]
    ]
  ],
  "size": 2,
  "status": 200
}

Why not?
It could be implemented along with the CAST/PartiQL approach. Keep in mind it is a breaking change.

@GumpacG
Collaborator Author

GumpacG commented Feb 8, 2023

This also concerns fields being passed into functions, since only the last value in the array is used. Also, note the incorrect type in the example below (a sketch of how such an index could be reproduced follows the output).
For example:
Query: SELECT upper(name), name FROM string_array
Output:

{
    "schema": [
        {
            "name": "upper(name)",
            "type": "double"
        },
        {
            "name": "name",
            "type": "keyword"
        }
    ],
    "total": 5,
    "datarows": [
        [
            "JOE",
            "joe"
        ],
        [
            "STEVE",
            "steve"
        ],
        [
            "BOB",
            "bob"
        ],
        [
            "ANDY",
            "andy"
        ],
        [
            "ADAM",
            [
                "david",
                "adam"
            ]
        ]
    ],
    "size": 5,
    "status": 200
}
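
For reference, an index like the one above can be reproduced roughly as follows (the documents are inferred from the query output above; the exact test data isn't shown here):

curl -X POST "localhost:9200/string_array/_doc" -H 'Content-Type: application/json' -d '{"name": "joe"}'
curl -X POST "localhost:9200/string_array/_doc" -H 'Content-Type: application/json' -d '{"name": "steve"}'
curl -X POST "localhost:9200/string_array/_doc" -H 'Content-Type: application/json' -d '{"name": "bob"}'
curl -X POST "localhost:9200/string_array/_doc" -H 'Content-Type: application/json' -d '{"name": "andy"}'
curl -X POST "localhost:9200/string_array/_doc" -H 'Content-Type: application/json' -d '{"name": ["david", "adam"]}'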

@akuzin1

akuzin1 commented Jun 8, 2023

This is currently a blocker for implementing array support in the JDBC driver, so it would be a great feature to have. Any status update on its priority?

@Yury-Fridlyand
Collaborator

I'm going to research possible solutions for this and share them for discussion.
Meanwhile, I found a trickier sample which we should consider as well.

POST dbg/_doc
{
  "obj": [[1, 2], [3, 4], 5]
}

search result:

{
  "took" : 361,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "dbg",
        "_id" : "uDBnoYgB5NyEnr3HyanK",
        "_score" : 1.0,
        "_source" : {
          "obj" : [
            [
              1,
              2
            ],
            [
              3,
              4
            ],
            5
          ]
        }
      }
    ]
  }
}

mapping:

{
  "dbg" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "obj" : {
          "type" : "long"
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1686168574104",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "LN-lYUQFSsi9MVpa5VT-Zg",
        "version" : {
          "created" : "136297827"
        },
        "provided_name" : "dbg"
      }
    }
  }
}

@Yury-Fridlyand
Collaborator

Please continue the discussion in #1733.

@acarbonetto
Collaborator

Supporting all of the above use cases will take multiple attempts, and each should be dealt with separately. We can split each use case into an individual issue.

To solve primitives/arrays expanding into multiple rows, we should try to use metadata (cues) to determine whether the data should be treated as a primitive or as an array (we cannot easily do both). Doing something like what Presto/Trino supports in the index mapping (https://trino.io/docs/current/connector/elasticsearch.html#array-types) would be simple. It would indicate that the mapped field should treat the data record as an array (or as an array of one if the value is not an array). A sketch of such a mapping hint is shown below.
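
A rough sketch of what such a mapping hint could look like, modeled on the Trino _meta convention; the key layout used here (an opensearch-sql section with an array flag) is an assumption, not an existing feature:

curl -X PUT "localhost:9200/dbg/_mapping" -H 'Content-Type: application/json' -d '
{
  "_meta": {
    "opensearch-sql": {
      "myNum": { "array": true }
    }
  }
}'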

We should PoC this and see if it works to solve #1733 (comment).
