Rust panicking through Python library when a delete predicate uses a nullable field #2019
Comments
@liamphmurphy can you share your table schema? |
@ion-elgreco Omitting some fields, but here's what I believe is the relevant part. Obtained from
|
@liamphmurphy how did you write data to this table? If you used our datalake library, did you use it in conjunction with another data processing library?
Below is test code that attempts to replicate the issue. When the null buffer is all true, there is no issue constructing the record batch.
#[tokio::test]
async fn test_delete_nested() {
use arrow_schema::{DataType, Field, Schema as ArrowSchema};
// Replicate issue with struct predicates
let schema = Arc::new(ArrowSchema::new(vec![
Field::new("id", DataType::Utf8, true),
Field::new("props", DataType::Struct(Fields::from(vec![Field::new("a", DataType::Utf8, true)])), true),
]));
let struct_array = StructArray::new(
Fields::from(vec![Field::new("a", DataType::Utf8, true)]),
vec![
Arc::new(arrow::array::StringArray::from(vec![
Some("2021-02-01"),
Some("2021-02-02"),
None,
None
])) as ArrayRef
],
Some(NullBuffer::from_iter(vec![true, true, true, false]))
);
let data = vec![
Arc::new(arrow::array::StringArray::from(vec!["A", "B", "C", "D"])) as ArrayRef,
Arc::new(struct_array) as ArrayRef
];
let table = DeltaOps::new_in_memory()
.write(vec![RecordBatch::try_new(schema.clone(), data).unwrap()])
.await
.unwrap();
dbg!("written");
let (table, _metrics) = DeltaOps(table)
.delete()
.with_predicate("props['a'] = '2021-02-02'")
.await
.unwrap();
let expected = [
"+----+-----------------+",
"| id | props           |",
"+----+-----------------+",
"| A  | {a: 2021-02-01} |",
"| C  | {a: 2021-02-03} |",
"+----+-----------------+",
];
let actual = get_data(&table).await;
assert_batches_sorted_eq!(&expected, &actual);
} |
The root cause is in the Rust writer:
We can't convert a struct into a record batch like this, since the struct's validity array is lost. |
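The validity loss described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not the arrow-rs or delta-rs API: `StructColumn` and both conversion helpers are hypothetical names, and the real writer panics instead of silently flattening. The point is that a record batch keeps only the child columns, so a struct-level null has nowhere to live.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StructColumn:
    """Toy stand-in for a struct array with one child field "a"."""

    child: List[Optional[str]]  # values of the child field
    validity: List[bool]  # False marks the whole struct as null at that row

    def to_rows(self) -> List[Optional[Dict[str, Optional[str]]]]:
        # A row is None when the struct itself is null, regardless of the child.
        return [{"a": v} if ok else None for v, ok in zip(self.child, self.validity)]


def to_record_batch_columns(col: StructColumn) -> Dict[str, List[Optional[str]]]:
    """Lossy step: only child columns are kept, the struct-level mask is dropped."""
    return {"a": list(col.child)}


def from_record_batch_columns(columns: Dict[str, List[Optional[str]]]) -> StructColumn:
    """Rebuild the struct; with no mask left, every row has to be treated as valid."""
    child = columns["a"]
    return StructColumn(child, [True] * len(child))


# Same shape as the test data in the thread: the last struct is null.
props = StructColumn(
    ["2021-02-01", "2021-02-02", None, None],
    [True, True, True, False],
)
round_tripped = from_record_batch_columns(to_record_batch_columns(props))
```

After the round trip, the last row is no longer a null struct but a struct whose child is null, which is a different value.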
@Blajda Appreciate the quick response. We're using the Python bindings from this repo to write to the Delta Lake. As for additional data processing, we currently convert our internal JSON schemas into equivalent PyArrow schemas and then use that schema when creating the table. This is a manual process, so there could be something goofy there. The manual part is basically mapping JSON schema types to PyArrow types. An example:
Barring that, this is how we write to our delta lake:
|
@liamphmurphy just to note, overwrite_schema only works when you pass mode='overwrite'. |
@ion-elgreco Ah thanks for the heads up! |
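One way to avoid the `overwrite_schema` pitfall mentioned above is to validate the option pair before calling the writer. The helper below is a hypothetical sketch, not part of the deltalake package; it only assumes that `write_deltalake` accepts `mode` and `overwrite_schema` keyword arguments, as in python-v0.15.0.

```python
def checked_write_options(mode: str, overwrite_schema: bool = False) -> dict:
    """Return keyword arguments for a write, rejecting a combination that has no effect.

    Hypothetical helper: fails loudly when overwrite_schema is requested
    without mode="overwrite".
    """
    if overwrite_schema and mode != "overwrite":
        raise ValueError(
            f"overwrite_schema=True only takes effect with mode='overwrite', got mode={mode!r}"
        )
    return {"mode": mode, "overwrite_schema": overwrite_schema}
```

Usage would look like `write_deltalake(table_uri, data, **checked_write_options("overwrite", overwrite_schema=True))`, where `table_uri` and `data` are whatever the caller already passes.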
@liamphmurphy thanks for the insights on how you are writing. When you append to the Delta table, the pyarrow writer is used, but the delete command uses the Rust writer to rewrite any required files. The Rust writer has a bug with nullable structs, which is resolved by the PR I submitted. |
# Description
Fixes an issue where the writer attempts to convert an Arrow `Struct` into a `RecordBatch`. This cannot be done, since it drops the validity array and prevents structs with a value of `null` from being stored correctly. This PR also extends the predicate representation for struct field access, list index access, and list range access.
# Related Issue(s)
- closes #2019
Environment
Delta-rs version: python-v0.15.0
Environment:
Bug
What happened:
We're using the delta-rs Python library to set up the ability to delete from our existing Delta Lake tables in S3. The deletes use simple WHERE clauses like the following:
table.delete(predicate="properties.pageUrl = 'testing.com'")
However, I'm seeing a Rust panic:
This is happening on a field that is nullable.
What you expected to happen:
When a field is null / nullable, either a success or at least some specific error explaining why you can't delete on nullable fields.
How to reproduce it:
Set up a Delta table with a nullable field, and attempt to perform a deletion on that field using DeltaTable's `delete` method.
More details:
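A minimal sketch of the reproduction steps, assuming the `deltalake` and `pyarrow` packages are installed. The column names follow the predicate quoted in this report; the row values, the `ROWS` and `PREDICATE` constants, and the `reproduce` function are illustrative assumptions, not the reporter's actual schema or code. Whether it panics depends on the installed version.

```python
# Rows with a nullable struct column: populated, struct with a null child,
# and a fully null struct. Values are made up for illustration.
ROWS = [
    {"id": "A", "properties": {"pageUrl": "example.com"}},
    {"id": "B", "properties": {"pageUrl": "testing.com"}},
    {"id": "C", "properties": {"pageUrl": None}},
    {"id": "D", "properties": None},
]

PREDICATE = "properties.pageUrl = 'testing.com'"


def reproduce(table_uri: str) -> None:
    """Write ROWS to a new Delta table at table_uri, then delete with PREDICATE."""
    # Imported lazily so this module loads even without the third-party packages.
    import pyarrow as pa
    from deltalake import DeltaTable, write_deltalake

    schema = pa.schema(
        [
            pa.field("id", pa.string(), nullable=True),
            pa.field(
                "properties",
                pa.struct([pa.field("pageUrl", pa.string(), nullable=True)]),
                nullable=True,
            ),
        ]
    )
    data = pa.Table.from_pylist(ROWS, schema=schema)

    # The initial write goes through the pyarrow writer.
    write_deltalake(table_uri, data)

    # The delete rewrites affected files with the Rust writer,
    # which is where the panic was reported.
    DeltaTable(table_uri).delete(predicate=PREDICATE)
```

Calling `reproduce` with a path to an empty local directory keeps the experiment off S3.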