docs: clarify locking mechanism requirement for S3 #2558

Merged
merged 8 commits into from
Jun 1, 2024
Changes from 2 commits
23 changes: 21 additions & 2 deletions docs/usage/writing/writing-to-s3-with-locking-provider.md
@@ -1,7 +1,10 @@
# Writing to S3 with a locking provider

A locking mechanism is needed to prevent unsafe concurrent writes to a
delta lake directory when writing to S3.
Delta Lake guarantees [ACID transactions](https://delta-io.github.io/delta-rs/how-delta-lake-works/delta-lake-acid-transactions/) when writing data. This works by default on all supported object stores except AWS S3. (Some S3-compatible stores, such as Cloudflare R2 or MinIO, may support atomic renames; refer to [this section](#enabling-concurrent-writes-for-alternative-clients) for more information.)

When writing to S3, delta-rs provides a locking mechanism to ensure that concurrent writes are safe. It is enabled by default, but you can opt out by setting the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to ``true``.
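
As a minimal sketch of the opt-out, the flag can be passed per call through ``storage_options``. The helper names `unsafe_rename_options` and `write_without_lock` are hypothetical and not part of deltalake; this is only appropriate when a single writer touches the table at a time.

```python
def unsafe_rename_options() -> dict[str, str]:
    """Storage options that opt out of the S3 locking mechanism.

    Hypothetical helper for illustration; the key and value come from
    the text above.
    """
    return {"AWS_S3_ALLOW_UNSAFE_RENAME": "true"}


def write_without_lock(table_uri, data):
    """Write `data` to `table_uri` without a locking provider (hypothetical helper)."""
    # Imported lazily so the options helper works without deltalake installed.
    from deltalake import write_deltalake

    write_deltalake(table_uri, data, storage_options=unsafe_rename_options())
```

A call would look like `write_without_lock("s3://my-bucket/my-table", df)`, where the bucket path is a placeholder.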

To enable safe concurrent writes to AWS S3, we must provide an external locking mechanism.

### DynamoDB
DynamoDB is currently the only locking provider available in delta-rs. To enable it, set ``AWS_S3_LOCKING_PROVIDER`` to 'dynamodb', either in ``storage_options`` or as an environment variable.
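
The two ways of setting the provider can be sketched as follows. The helper names are hypothetical, and the sketch assumes AWS credentials and the DynamoDB table are already configured elsewhere.

```python
import os


def dynamodb_lock_options() -> dict[str, str]:
    """Storage options selecting DynamoDB as the locking provider (hypothetical helper)."""
    return {"AWS_S3_LOCKING_PROVIDER": "dynamodb"}


def write_with_dynamodb_lock(table_uri, data):
    """Option 1: pass the setting per call through storage_options."""
    # Imported lazily so the helpers work without deltalake installed.
    from deltalake import write_deltalake

    write_deltalake(table_uri, data, storage_options=dynamodb_lock_options())


def enable_dynamodb_lock_via_env() -> None:
    """Option 2: set the same key as an environment variable for the whole process."""
    os.environ["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
```

Passing the option per call keeps the setting scoped to one write, while the environment variable applies to every write in the process.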
@@ -80,3 +83,19 @@ In DynamoDB, you need those permissions:
- dynamodb:Query
- dynamodb:PutItem
- dynamodb:UpdateItem

### Enabling concurrent writes for alternative clients

Unlike AWS S3, some S3-compatible stores support atomic renames when specific
headers are passed in requests.

For Cloudflare R2, passing the following in `storage_options` enables concurrent writes:

```python
storage_options = {
    "copy_if_not_exists": "header: cf-copy-destination-if-none-match: *",
}
```
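
As a hedged usage sketch, these options can be combined with an endpoint setting and handed to `write_deltalake`. The helper names, the `AWS_ENDPOINT_URL` key, and the endpoint value are assumptions for illustration and do not come from this change; credentials are assumed to be supplied through the environment.

```python
def r2_storage_options(endpoint_url: str) -> dict[str, str]:
    """Build storage options for Cloudflare R2 (hypothetical helper)."""
    return {
        # Assumption: the R2 endpoint is supplied via this key.
        "AWS_ENDPOINT_URL": endpoint_url,
        # From the docs above: enables atomic copy-if-not-exists on R2.
        "copy_if_not_exists": "header: cf-copy-destination-if-none-match: *",
    }


def write_to_r2(table_uri, data, endpoint_url: str):
    """Write `data` to a table on R2 without an external locking provider."""
    # Imported lazily so the options helper works without deltalake installed.
    from deltalake import write_deltalake

    write_deltalake(table_uri, data, storage_options=r2_storage_options(endpoint_url))
```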

Something similar can be done with MinIO, but verify the header to pass
in the MinIO documentation.
6 changes: 3 additions & 3 deletions python/deltalake/writer.py
@@ -208,9 +208,9 @@ def write_deltalake(
For higher protocol support use engine='rust', this will become the default
eventually.

A locking mechanism is needed to prevent unsafe concurrent writes to a
delta lake directory when writing to S3. For more information on the setup, follow
this usage guide: https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/
To enable safe concurrent writes when writing to S3, an additional locking
mechanism must be supplied. For more information on enabling concurrent writing to S3, follow this usage guide:
https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/

Args:
table_or_uri: URI of a table or a DeltaTable object.