
Rust panicking through Python library when a delete predicate uses a nullable field #2019

Closed
liamphmurphy opened this issue Jan 2, 2024 · 8 comments · Fixed by #2033
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate bug Something isn't working

Comments

@liamphmurphy
Contributor

Environment

Delta-rs version: python-v0.15.0

Environment:

  • OS: MacOS, localstack simulating S3 in Docker

Bug

What happened:

We're using the delta-rs Python library to set up deletes from our existing Delta Lake tables in S3. The predicates are simple WHERE clauses like the following:

table.delete(predicate="properties.pageUrl = 'testing.com'")

However, I'm seeing a Rust panic:

thread 'tokio-runtime-worker' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-48.0.1/src/record_batch.rs:487:9:
assertion `left == right` failed: Cannot convert nullable StructArray to RecordBatch, see StructArray documentation
  left: 1
 right: 0
stack backtrace:
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[ERROR] DeltaError: Generic error: Failed to execute write task: task 48 panicked
Traceback (most recent call last):
  File "/var/task/aws_lambda_powertools/logging/logger.py", line 449, in decorate
    return lambda_handler(event, context, *args, **kwargs)
  File "/var/task/newrelic/api/lambda_handler.py", line 135, in _nr_lambda_handler_wrapper_
    result = wrapped(*args, **kwargs)
  File "/var/task/lambda.py", line 60, in lambda_handler
    dt.delete(predicate=predicate)
  File "/var/task/deltalake/table.py", line 1107, in delete
    metrics = self._table.delete(
END RequestId: 8df76707-d13c-45b2-b112-d1801161785c

This is happening on a field that is nullable.

What you expected to happen:

When a field is null / nullable, either the delete should succeed or, at minimum, a specific error should explain why you can't delete on nullable fields.

How to reproduce it:

Set up a Delta table with a nullable field, and attempt to perform a deletion on that field using DeltaTable's delete method.

More details:

@liamphmurphy liamphmurphy added the bug Something isn't working label Jan 2, 2024
@ion-elgreco
Collaborator

@liamphmurphy can you share your table schema?

@liamphmurphy
Contributor Author

@ion-elgreco Omitting some fields, but here's what I believe is the relevant part, obtained from `dt.schema().to_json()`:

{
    "type": "struct",
    "fields": [
        {
            "name": "properties",
            "type": {
                "type": "struct",
                "fields": [
                    {
                        "name": "pageUrl",
                        "type": "string",
                        "nullable": true,
                        "metadata": {}
                    }
                ]
            },
            "nullable": true,
            "metadata": {}
        }
    ]
}

@Blajda
Collaborator

Blajda commented Jan 3, 2024

@liamphmurphy how did you write data to this table? If you used our deltalake library, did you use it in conjunction with another data processing library?
I'm trying to replicate the above issue, but I can't create a null struct. It seems like arrow-rs might have a bug here; based on the Arrow spec for structs, I don't see an issue with your schema.


Here is test code that attempts to replicate the issue. When the null buffer is all true, there is no issue with constructing the record batch.

    #[tokio::test]
    async fn test_delete_nested() {
        use arrow::array::{ArrayRef, StringArray, StructArray};
        use arrow::buffer::NullBuffer;
        use arrow_schema::{DataType, Field, Fields, Schema as ArrowSchema};
        use std::sync::Arc;

        // Replicate issue with struct predicates
        let schema = Arc::new(ArrowSchema::new(vec![
            Field::new("id", DataType::Utf8, true),
            Field::new(
                "props",
                DataType::Struct(Fields::from(vec![Field::new("a", DataType::Utf8, true)])),
                true,
            ),
        ]));

        // The final `false` in the null buffer marks the entire fourth struct
        // as null; with an all-true buffer the record batch builds fine.
        let struct_array = StructArray::new(
            Fields::from(vec![Field::new("a", DataType::Utf8, true)]),
            vec![Arc::new(StringArray::from(vec![
                Some("2021-02-01"),
                Some("2021-02-02"),
                None,
                None,
            ])) as ArrayRef],
            Some(NullBuffer::from_iter(vec![true, true, true, false])),
        );

        let data = vec![
            Arc::new(StringArray::from(vec!["A", "B", "C", "D"])) as ArrayRef,
            Arc::new(struct_array) as ArrayRef,
        ];

        let table = DeltaOps::new_in_memory()
            .write(vec![RecordBatch::try_new(schema.clone(), data).unwrap()])
            .await
            .unwrap();

        let (table, _metrics) = DeltaOps(table)
            .delete()
            .with_predicate("props['a'] = '2021-02-02'")
            .await
            .unwrap();

        // Only row B matches the predicate; A, C, and D should remain.
        // (Exact rendering of the null struct values may differ.)
        let expected = [
            "+----+-----------------+",
            "| id | props           |",
            "+----+-----------------+",
            "| A  | {a: 2021-02-01} |",
            "| C  | {a: }           |",
            "| D  |                 |",
            "+----+-----------------+",
        ];
        let actual = get_data(&table).await;
        assert_batches_sorted_eq!(&expected, &actual);
    }

@Blajda
Collaborator

Blajda commented Jan 3, 2024

Root Cause is within the rust writer:

let child_batch = RecordBatch::from(StructArray::from(col.into_data()));

We can't convert a struct into a record batch like this since the validity array is lost.

@Blajda Blajda self-assigned this Jan 3, 2024
@rtyler rtyler added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Jan 3, 2024
@liamphmurphy
Contributor Author

liamphmurphy commented Jan 3, 2024

@Blajda Appreciate the quick response. We're using the Python bindings from this repo for the writes to the Delta lake.

But regarding additional data processing: currently we're converting our internal JSON schemas into equivalent PyArrow schemas, and then using that schema in the table creation. This is a manual process, so there could be something goofy there; the manual part is basically mapping JSON schema types to PyArrow types. An example:

elif prop_type == "boolean":
    prop_schema = pa.bool_()
... other types ...
fields.append(
    pa.field(prop_name, prop_schema)
)

Barring that, this is how we write to our delta lake:

write_deltalake(
    s3_path,
    table,
    schema=pyarrow_schema,
    mode="append",
    partition_by=["uid", "date", "hour"],
    overwrite_schema=True,
)

@ion-elgreco
Collaborator

@liamphmurphy just to note, overwrite_schema only works when you pass mode='overwrite'.

@liamphmurphy
Contributor Author

@ion-elgreco Ah thanks for the heads up!

@Blajda
Copy link
Collaborator

Blajda commented Jan 4, 2024

@liamphmurphy thanks for the insights on how you are writing. Appends to the Delta table use the PyArrow writer, but the delete command uses the Rust writer to rewrite any required files. The Rust writer has a bug with nullable structs, which is resolved by the PR I submitted.

Blajda added a commit that referenced this issue Jan 6, 2024
# Description
Fixes an issue where the writer attempts to convert an Arrow `Struct`
into a `RecordBatch`. This cannot be done since it drops the
validity array and would prevent structs with a value of `null` from
being stored correctly.

This PR also extends the predicate representation for struct field
access, list index access, and list range access.

# Related Issue(s)
- closes #2019
r3stl355 pushed a commit to r3stl355/delta-rs that referenced this issue Jan 10, 2024