feat: implementation for replaceWhere #1996
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
It is passing CI on my box, and with the docs added I'll remove the draft status once CI passes here (one of the Python tests is flaky and needs to be fixed).
Looks like a few extra checks were introduced recently, so lint is picking up changes which are not mine, e.g. crates/deltalake-core/src/delta_datafusion/mod.rs:149:37
Could you create a separate PR to address that so main is fixed?
Ahh, fixed and pushed here before seeing your response. Can do another PR if they still break but leaving that for another day, done for today.
Looks like I am close to removing the draft, only need to fix this flaky test.
I think this should be done. Delete is one of the things we should keep DRY, which will help when deletion vectors are implemented. All operations will benefit instead of needing to update code in different locations. My 2 cents is that we shouldn't modify the writer to implement this operation but rather should have a brand new operation structured in a manner similar to merge, delete, and update. Some builder interface like below
@Blajda I understand that you are suggesting a standalone operator just for
If this is what you were suggesting I can give it a try.
@r3stl355 Yes, I agree with point 1. For 2., one approach is to factor out a function that returns a tuple of files to be removed and an optional stream of record batches. I'm not sure what 3. would provide.
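A rough sketch of the factoring suggested above, in plain Python rather than the Rust codebase: one function takes the table's files and a predicate and returns the files to remove together with an optional stream of the rows that have to be written back. `DataFile`, `find_files_to_rewrite`, and the rows-as-dicts model are hypothetical names chosen only for illustration; they are not delta-rs APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Row = Dict[str, object]


@dataclass
class DataFile:
    """Hypothetical stand-in for a data file tracked in the table log."""
    path: str
    rows: List[Row]


def find_files_to_rewrite(
    files: List[DataFile],
    predicate: Callable[[Row], bool],
) -> Tuple[List[str], Optional[Iterator[Row]]]:
    """Return (paths to remove, optional stream of rows to keep).

    A file is removed if any of its rows matches the predicate. Rows in
    those files that do NOT match must be written back, so they are
    returned as a lazy stream. The stream is None when nothing survives.
    """
    to_remove: List[str] = []
    survivors: List[DataFile] = []
    for f in files:
        if any(predicate(r) for r in f.rows):
            to_remove.append(f.path)
            if not all(predicate(r) for r in f.rows):
                survivors.append(f)

    if not survivors:
        return to_remove, None

    def stream() -> Iterator[Row]:
        for f in survivors:
            for r in f.rows:
                if not predicate(r):
                    yield r

    return to_remove, stream()


files = [
    DataFile("part-0.parquet", [{"id": 1}, {"id": 2}]),
    DataFile("part-1.parquet", [{"id": 1}]),
    DataFile("part-2.parquet", [{"id": 3}]),
]
removed, kept = find_files_to_rewrite(files, lambda r: r["id"] == 1)
kept_rows = list(kept) if kept is not None else []
```

With a shape like this, delete, update, and a predicate-based overwrite could all call the same function and differ only in what they write afterwards.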
The 3. would allow to check if the new data conforms to the
I'm not sure if I agree with taking this out of write. It's still a write operation.
What's the user-facing public interface for Python?
@MrPowers Good old I also added docs and examples in docs for both Python and Rust (https://github.com/delta-io/delta-rs/pull/1996/files#diff-681ad64170174eef8ddc2c987ca2e99af32d90a79720b50b3f575a4a00ec4a50R55)
@r3stl355 - ah, got it. Here's the delta-spark syntax:

    (
        df.write.format("delta")
        .option("replaceWhere", "number > 2")
        .mode("overwrite")
        .save("tmp/my_data")
    )

Looks like the proposed syntax is as follows:

    write_deltalake(
        table_path,
        data,
        mode="overwrite",
        predicate="id = '1'",
        engine="rust",
    )

I almost feel like the
@MrPowers I don't think it makes sense to introduce another mode, because replace where without a predicate wouldn't be possible.
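To make the semantics under discussion concrete, here is a minimal in-memory sketch of an overwrite restricted by a predicate: existing rows that match the predicate are dropped and the new rows take their place, while everything else in the table is left alone. `replace_where` is a hypothetical helper working on lists of dicts; it is not the delta-rs or delta-spark implementation, and rejecting new rows that fall outside the predicate is an assumption based on the constraint check mentioned in this thread.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def replace_where(
    table: List[Row],
    new_rows: List[Row],
    predicate: Callable[[Row], bool],
) -> List[Row]:
    """Replace only the rows matching `predicate` with `new_rows`."""
    # Assumed behavior: new data must itself satisfy the predicate.
    bad = [r for r in new_rows if not predicate(r)]
    if bad:
        raise ValueError(f"{len(bad)} new row(s) do not satisfy the predicate")
    # Rows outside the predicate are untouched.
    kept = [r for r in table if not predicate(r)]
    return kept + new_rows


table = [{"id": "1", "value": 10}, {"id": "2", "value": 20}]
result = replace_where(
    table,
    [{"id": "1", "value": 99}],
    lambda r: r["id"] == "1",
)
```

The sketch also shows why a predicate is required: without one there is nothing to distinguish this from a plain overwrite.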
Can you add one more test case to match the pyarrow partition filter overwrite test cases? :)
A test case where the data does not meet the constraint. Also, I wonder if you could capture that constraint error and give a better error msg.
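One possible shape for the request above, sketched in plain Python: a generic check raises a low-level error, and the write path catches it and re-raises with a message that names the predicate. All names here (`ConstraintViolation`, `PredicateMismatchError`, `validate_for_overwrite`) are hypothetical and do not correspond to actual delta-rs errors or functions.

```python
class ConstraintViolation(Exception):
    """Hypothetical low-level error raised by a generic invariant check."""


class PredicateMismatchError(ValueError):
    """Hypothetical user-facing error for an overwrite with a predicate."""


def check_rows(rows, check):
    # Generic check: knows nothing about why the constraint exists.
    for i, row in enumerate(rows):
        if not check(row):
            raise ConstraintViolation(f"row {i} failed check")


def validate_for_overwrite(rows, predicate_sql, check):
    """Translate the generic failure into a message that names the predicate."""
    try:
        check_rows(rows, check)
    except ConstraintViolation as exc:
        raise PredicateMismatchError(
            f"Data does not match the overwrite predicate {predicate_sql!r}: {exc}"
        ) from exc


def test_overwrite_rejects_nonconforming_data():
    # The second row violates the predicate id = '1'.
    rows = [{"id": "1"}, {"id": "2"}]
    try:
        validate_for_overwrite(rows, "id = '1'", lambda r: r["id"] == "1")
    except PredicateMismatchError as err:
        return str(err)
    raise AssertionError("expected PredicateMismatchError")


message = test_overwrite_rejects_nonconforming_data()
```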
Sorry it took so long, but I finally got around to doing some reviewing.
I am a bit less happy with my code now after snapshot was made an Option
This is actually also one of my biggest annoyances right now, and kind of boils down to what purpose each of these structs serves in our codebase. Previously, `DeltaTableState` took no opinion on whether `Metadata` and `Protocol` exist. However, if we cannot get these, there really is not anything meaningful we can do with a delta table. `DeltaTable`, on the other hand, has some operations that work on empty locations etc. As a result, we were raising errors all throughout our operations if the back-then optional metadata / protocol actions were not available. While this requires some more refactoring, all operations (except the ones that can create a table) should always require a valid snapshot.
btw. `DeltaTableState` is by now also just a thin wrapper around a snapshot, which I also hope to remove soon :).
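The idea can be illustrated with a small Python sketch (the real code is Rust, and every name below is hypothetical): the table handle may have no snapshot, but operations obtain it through a single accessor that fails early, so they never have to check for missing metadata or protocol themselves. Only creation is allowed to work on an empty location.

```python
from dataclasses import dataclass
from typing import Optional


class NotATableError(Exception):
    """Raised when an operation needs a snapshot but none is loaded."""


@dataclass
class Snapshot:
    # In a real table, metadata and protocol would live here and always exist.
    version: int


@dataclass
class Table:
    location: str
    state: Optional[Snapshot] = None  # None for an empty location

    def snapshot(self) -> Snapshot:
        """Single place where the 'is this a table?' check happens."""
        if self.state is None:
            raise NotATableError(f"no table found at {self.location}")
        return self.state


def delete(table: Table) -> int:
    # Operations on an existing table ask for the snapshot up front instead
    # of handling missing metadata/protocol at every step.
    return table.snapshot().version + 1


def create(table: Table) -> int:
    # Creation is the exception: it works on an empty location.
    if table.state is None:
        table.state = Snapshot(version=0)
    return table.state.version


empty = Table("tmp/my_data")
try:
    delete(empty)
    failed = False
except NotATableError:
    failed = True
created_version = create(empty)
next_version = delete(empty)
```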
Thank you for the review @roeap
Thanks for your patience with this PR @r3stl355, and great work!
I'll leave it open for a bit, since @MrPowers had some questions around the APIs, but from my end passing the predicate to `write_deltalake` seems reasonable. While this function already takes a lot of parameters, I feel this PR does not make it worse :).
@roeap once we drop the PyArrow writer, `write_deltalake` will be much leaner.
Minimum standard achieved 😁. I'll create a couple of related issues.
@MrPowers do you still have those questions around the API? If not, then this should be ready to merge (and I squashed all my commits into one as agreed).
@r3stl355 - it seems we have seen a bigger rename of files before I could get that in :( - sorry you have to resolve once more. Since we haven't heard back, I would just merge once we are green again ...
Signed-off-by: Nikolay Ulmasov <[email protected]>
Thanks @roeap, should be good now.
It's been back and forth a few times, something that I understand and don't mind, but unfortunately I don't have time to handle this in the next few weeks, so parking it.
Thanks Niko! Great feature :) We are one step closer to deprecating the PyArrow writer now!
# Description First/naive implementation of `replaceWhere` for `write`. Code compiles and there is a test to verify the outcome. I would appreciate any feedback on improving the structure/implementation. For example, I copied the part of code from `delete` operation because there is no way to call that code in `delete` directly from `write` - should I look into extracting that code from `delete` to somewhere central? Seems to also works with partitions columns. # Related Issue(s) delta-io#1957 # Documentation Added a section in docs --------- Signed-off-by: Nikolay Ulmasov <[email protected]> Co-authored-by: Ion Koutsouris <[email protected]>
Description
First/naive implementation of `replaceWhere` for `write`. Code compiles and there is a test to verify the outcome. I would appreciate any feedback on improving the structure/implementation. For example, I copied part of the code from the `delete` operation because there is no way to call that code in `delete` directly from `write` - should I look into extracting that code from `delete` to somewhere central?
Seems to also work with partition columns.
Related Issue(s)
#1957
Documentation
Added a section in docs