chore: update docs minio/R2
ion-elgreco committed Aug 22, 2024
1 parent 480e8b6 commit b818f8b
Showing 5 changed files with 91 additions and 10 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -136,8 +136,8 @@ of features outlined in the Delta [protocol][protocol] is also [tracked](#protoc
| -------------------- | :-----: | :-----: | ---------------------------------------------------------------- |
| Local | ![done] | ![done] | |
| S3 - AWS | ![done] | ![done] | requires lock for concurrent writes |
-| S3 - MinIO | ![done] | ![done] | requires lock for concurrent writes |
-| S3 - R2 | ![done] | ![done] | No lock required when using `AmazonS3ConfigKey::CopyIfNotExists` |
+| S3 - MinIO | ![done] | ![done] | No lock required when using `AmazonS3ConfigKey::ConditionalPut` with `storage_options = {"conditional_put":"etag"}` |
+| S3 - R2 | ![done] | ![done] | No lock required when using `AmazonS3ConfigKey::ConditionalPut` with `storage_options = {"conditional_put":"etag"}` |
| Azure Blob | ![done] | ![done] | |
| Azure ADLS Gen2 | ![done] | ![done] | |
| Microsoft OneLake | ![done] | ![done] | |
83 changes: 83 additions & 0 deletions docs/integrations/object-storage/s3-like.md
@@ -0,0 +1,83 @@
# CloudFlare R2 & MinIO

`delta-rs` offers native support for using Cloudflare R2 and MinIO as a storage backend. R2 and MinIO support conditional puts, but this has to be enabled explicitly through the storage options. See the example below.

You don't need to install any extra dependencies to read/write Delta tables to S3 with engines that use `delta-rs`. You do need to configure your AWS access credentials correctly.

## Passing S3 Credentials

You can pass your AWS credentials explicitly by using:

- the `storage_options` kwarg
- environment variables (see the sketch below)
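
For example, a minimal sketch of the environment-variable approach; the endpoint and credential values below are placeholders, not real values:

```python
import os

# Placeholder credentials and endpoint for an R2/MinIO bucket (assumed values).
os.environ["AWS_ACCESS_KEY_ID"] = "<access_key_id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret_access_key>"
os.environ["AWS_ENDPOINT_URL"] = "https://<account_id>.r2.cloudflarestorage.com"

# With these set, engines built on delta-rs can typically resolve the
# credentials without an explicit storage_options argument.
```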

## Example

Let's work through an example with Polars. The same logic applies to other Python engines like Pandas, Daft, Dask, etc.

Follow the steps below to use Delta Lake on S3 (R2/MinIO) with Polars:

1. Install Polars and deltalake. For example, using:

`pip install polars deltalake`

2. Create a dataframe with some toy data.

`df = pl.DataFrame({'x': [1, 2, 3]})`

3. Set your `storage_options` correctly.

```python
storage_options = {
    'AWS_ACCESS_KEY_ID': <access_key_id>,
    'AWS_SECRET_ACCESS_KEY': <secret_access_key>,
    'AWS_ENDPOINT_URL': <endpoint_url>,  # Your R2 or MinIO endpoint URL.
    'conditional_put': 'etag',  # Conditional put provides safe concurrent writes.
}
```

4. Write data to the Delta table using the `storage_options` kwarg.

```python
df.write_delta(
"s3://bucket/delta_table",
storage_options=storage_options,
)
```
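
To verify the write, here is a minimal sketch that reads the table back with Polars; the path is just the placeholder from step 4:

```python
import polars as pl

# Read the Delta table back with the same storage_options used for the write.
df_roundtrip = pl.read_delta(
    "s3://bucket/delta_table",
    storage_options=storage_options,
)
print(df_roundtrip)
```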

## Delta Lake on S3: Safe Concurrent Writes

You need a locking provider to ensure safe concurrent writes when writing Delta tables to plain AWS S3, because AWS S3 does not guarantee mutual exclusion. On R2 and MinIO, the `conditional_put` option shown above already provides this guarantee, so the DynamoDB lock described below is optional there.

A locking provider guarantees that only one writer is able to create the same file. This prevents corrupted or conflicting data.

`delta-rs` uses DynamoDB to guarantee safe concurrent writes.

Run the command below in your terminal to create a DynamoDB table that will act as your locking provider.

```bash
aws dynamodb create-table \
    --table-name delta_log \
    --attribute-definitions AttributeName=tablePath,AttributeType=S AttributeName=fileName,AttributeType=S \
    --key-schema AttributeName=tablePath,KeyType=HASH AttributeName=fileName,KeyType=RANGE \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
```
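
Once the DynamoDB table exists, you point `delta-rs` at it through `storage_options`. A rough sketch, assuming the table name `delta_log` from the command above and placeholder credentials:

```python
storage_options = {
    'AWS_ACCESS_KEY_ID': <access_key_id>,
    'AWS_SECRET_ACCESS_KEY': <secret_access_key>,
    # Use the DynamoDB table created above as the locking provider.
    'AWS_S3_LOCKING_PROVIDER': 'dynamodb',
    'DELTA_DYNAMO_TABLE_NAME': 'delta_log',
}
```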
If for some reason you don't want to use DynamoDB as your locking mechanism, you can set the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to `true` to enable unsafe writes instead.

Read more in the [Usage](../../usage/writing/writing-to-s3-with-locking-provider.md) section.

## Delta Lake on S3: Required permissions

You need permissions to get, put and delete objects in the S3 bucket you're storing your data in. Note that you must be allowed to delete objects even if you're only appending to the Delta table, because temporary files are written to the log folder and deleted after use.

In AWS S3, you will need the following permissions:

- s3:GetObject
- s3:PutObject
- s3:DeleteObject

In DynamoDB, you will need the following permissions:

- dynamodb:GetItem
- dynamodb:Query
- dynamodb:PutItem
- dynamodb:UpdateItem
6 changes: 3 additions & 3 deletions docs/integrations/object-storage/s3.md
@@ -62,9 +62,9 @@ storage_options = {
)
```

-## Delta Lake on S3: Safe Concurrent Writes
+## Delta Lake on AWS S3: Safe Concurrent Writes

-You need a locking provider to ensure safe concurrent writes when writing Delta tables to S3. This is because S3 does not guarantee mutual exclusion.
+You need a locking provider to ensure safe concurrent writes when writing Delta tables to AWS S3. This is because AWS S3 does not guarantee mutual exclusion.

A locking provider guarantees that only one writer is able to create the same file. This prevents corrupted or conflicting data.

@@ -84,7 +84,7 @@ If for some reason you don't want to use DynamoDB as your locking mechanism you
Read more in the [Usage](../../usage/writing/writing-to-s3-with-locking-provider.md) section.
-## Delta Lake on S3: Required permissions
+## Delta Lake on AWS S3: Required permissions
You need to have permissions to get, put and delete objects in the S3 bucket you're storing your data in. Please note that you must be allowed to delete objects even if you're just appending to the Delta Lake, because there are temporary files in the log folder that are deleted after usage.
7 changes: 2 additions & 5 deletions docs/usage/writing/writing-to-s3-with-locking-provider.md
@@ -101,13 +101,10 @@ In DynamoDB, you need those permissions:
Unlike AWS S3, some S3 clients support atomic renames by passing some headers
in requests.

-For CloudFlare R2 passing this in the storage_options will enable concurrent writes:
+For CloudFlare R2 or MinIO, passing this in the `storage_options` will enable concurrent writes:

```python
storage_options = {
-    "copy_if_not_exists": "header: cf-copy-destination-if-none-match: *",
+    "conditional_put": "etag",
}
```
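
As a rough sketch, these options can then be passed straight to the `deltalake` writer; the bucket URI and DataFrame below are illustrative only, and the endpoint/credentials are assumed to be configured via environment variables:

```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"x": [1, 2, 3]})

storage_options = {
    "conditional_put": "etag",  # safe concurrent writes without a DynamoDB lock
}

# Endpoint and credentials are assumed to come from the environment.
write_deltalake("s3://my-bucket/my-table", df, storage_options=storage_options)
```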

-Something similar can be done with MinIO but the header to pass should be verified
-in the MinIO documentation.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -84,6 +84,7 @@ nav:
- Object Storage:
- integrations/object-storage/hdfs.md
- integrations/object-storage/s3.md
- integrations/object-storage/s3-like.md
- Arrow: integrations/delta-lake-arrow.md
- Daft: integrations/delta-lake-daft.md
- Dagster: integrations/delta-lake-dagster.md
