Python bindings and integrations for the excellent object_store
crate.
The main idea is to provide a common interface to various storage backends, including the object stores of most major cloud providers. The APIs are very focused and tailored towards modern cloud-native applications by hiding away many of the features (and complexities) encountered in full-fledged file systems.
Among the included backends are:
- Amazon S3 and S3 compliant APIs
- Google Cloud Storage Buckets
- Azure Blob Gen1 and Gen2 accounts (including ADLS Gen2)
- local storage
- in-memory store
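The concrete backend is always picked from the URL scheme, while the Python interface stays the same; a quick sketch follows (the cloud-backed stores additionally require credentials as described in the configuration sections below, and the file:// form for local storage is an assumption):

from object_store import ObjectStore

# in-memory store, useful for tests and demos
mem_store = ObjectStore("memory://")

# the other backends are selected the same way, e.g.:
# ObjectStore("file:///tmp/data")              # local storage (scheme assumed)
# ObjectStore("s3://<bucket>/<path>")          # Amazon S3 / S3-compatible APIs
# ObjectStore("az://<container>/<path>")       # Azure blob storage
# ObjectStore("gs://<bucket>/<path>")          # Google Cloud Storage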
The object-store-python package is available on PyPI and can be installed via

poetry add object-store-python

or using pip

pip install object-store-python
The main ObjectStore API mirrors the native object_store implementation, with some slight adjustments for ease of use in Python programs.
from object_store import ObjectStore, ObjectMeta, Path
# we use an in-memory store for demonstration purposes.
# data will not be persisted and is not shared across store instances
store = ObjectStore("memory://")
store.put(Path("data"), b"some data")
data = store.get("data")
assert data == b"some data"
# list all objects in the store and inspect metadata for a single key
blobs = store.list()
meta = store.head("data")

# read a byte range from the object
range_ = store.get_range("data", start=0, length=4)
assert range_ == b"some"

# copy the object to a new location within the store
store.copy("data", "copied")
copied = store.get("copied")
assert copied == data
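The remaining operations follow the same pattern; as a rough sketch, assuming the bindings also expose the rename and delete operations of the underlying object_store crate:

# assumed to mirror object_store's rename/delete operations
store.rename("copied", "renamed")
store.delete("renamed")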
from object_store import ObjectStore, ObjectMeta, Path
# we use an in-memory store for demonstration purposes.
# data will not be persisted and is not shared across store instances
store = ObjectStore("memory://")
path = Path("data")
await store.put_async(path, b"some data")
data = await store.get_async(path)
assert data == b"some data"
# list all objects in the store and inspect metadata for a single key
blobs = await store.list_async()
meta = await store.head_async(path)

# read a byte range from the object
range_ = await store.get_range_async(path, start=0, length=4)
assert range_ == b"some"

# copy the object to a new location within the store
await store.copy_async(Path("data"), Path("copied"))
copied = await store.get_async(Path("copied"))
assert copied == data
As much as possible, we aim to make access to the various storage backends dependent only on runtime configuration. The kind of service is always derived from the URL used to specify the storage location. Some basic configuration can also be derived from the URL string, depending on the chosen URL format.
from object_store import ObjectStore
storage_options = {
    "azure_storage_account_name": "<my-account-name>",
    "azure_client_id": "<my-client-id>",
    "azure_client_secret": "<my-client-secret>",
    "azure_tenant_id": "<my-tenant-id>"
}
store = ObjectStore("az://<container-name>", storage_options)
We can provide the same configuration via the environment.
import os
from object_store import ObjectStore
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "<my-account-name>"
os.environ["AZURE_CLIENT_ID"] = "<my-client-id>"
os.environ["AZURE_CLIENT_SECRET"] = "<my-client-secret>"
os.environ["AZURE_TENANT_ID"] = "<my-tenant-id>"
store = ObjectStore("az://<container-name>")
The recommended URL format is az://<container>/<path>, and Azure always requires azure_storage_account_name to be configured. The supported credential configurations are:
- shared key: azure_storage_account_key
- service principal: azure_client_id, azure_client_secret, azure_tenant_id
- shared access signature: azure_storage_sas_key (as provided by StorageExplorer)
- bearer token: azure_storage_token
- managed identity:
  - if no other credential can be created, managed identity will be tried
  - if using a user-assigned identity, one of azure_client_id, azure_object_id, azure_msi_resource_id
- workload identity: azure_client_id, azure_tenant_id, azure_federated_token_file
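For example, a minimal sketch of authenticating with a shared key (account name, key, and container are placeholders):

from object_store import ObjectStore

storage_options = {
    "azure_storage_account_name": "<my-account-name>",
    # shared key authentication
    "azure_storage_account_key": "<my-account-key>",
}

store = ObjectStore("az://<container-name>", storage_options)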
The recommended URL format is s3://<bucket>/<path>. S3 storage always requires a region to be specified via one of aws_region or aws_default_region. The supported credential configurations are:
- access key: aws_access_key_id, aws_secret_access_key
- session token: aws_session_token
- IMDS instance metadata: aws_metadata_endpoint
- profile: aws_profile
AWS supports virtual hosting of buckets, which can be configured by setting aws_virtual_hosted_style_request to "true". When an alternative implementation or a mocked service like LocalStack is used, the service endpoint needs to be explicitly specified via aws_endpoint.
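For example, a minimal sketch of combining access-key credentials with an explicit endpoint (all values are placeholders; depending on the library version, a plain-HTTP endpoint may need additional client configuration):

from object_store import ObjectStore

storage_options = {
    "aws_region": "<my-region>",
    "aws_access_key_id": "<my-access-key-id>",
    "aws_secret_access_key": "<my-secret-access-key>",
    # explicit endpoint, e.g. for LocalStack or another S3-compatible service
    "aws_endpoint": "<my-endpoint-url>",
}

store = ObjectStore("s3://<bucket-name>", storage_options)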
The recommended URL format is gs://<bucket>/<path>. The supported credential configurations are:
- service account: google_service_account
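A minimal sketch, assuming the service account is provided as a key file on disk (path and bucket name are placeholders):

from object_store import ObjectStore

storage_options = {
    # path to the service account key file (assumed format)
    "google_service_account": "<path/to/service-account.json>",
}

store = ObjectStore("gs://<bucket-name>", storage_options)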
The package also provides an ArrowFileSystemHandler that implements the pyarrow filesystem interface:

from pathlib import Path

import numpy as np
import pyarrow as pa
import pyarrow.fs as fs
import pyarrow.dataset as ds
import pyarrow.parquet as pq

from object_store import ArrowFileSystemHandler

table = pa.table({"a": range(10), "b": np.random.randn(10), "c": [1, 2] * 5})

# wrap the handler in a pyarrow filesystem rooted at the current working directory
base = Path.cwd()
store = fs.PyFileSystem(ArrowFileSystemHandler(str(base.absolute())))

# write the table as two parquet files and read them back as a single dataset
pq.write_table(table.slice(0, 5), "data/data1.parquet", filesystem=store)
pq.write_table(table.slice(5, 10), "data/data2.parquet", filesystem=store)

dataset = ds.dataset("data", format="parquet", filesystem=store)
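From there the dataset can be consumed with the usual pyarrow APIs; for instance, a sketch reading it back with a filter on the c column defined above:

# materialize the dataset into a table, pushing down a simple filter
result = dataset.to_table(filter=ds.field("c") == 1)
assert result.num_rows == 5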
If you do not have just
installed and do not wish to install it,
have a look at the justfile
to see the raw commands.
To set up the development environment and install a dev version of the native package, run:
just init
This will also configure pre-commit
hooks in the repository.
To run the Rust as well as the Python tests:
just test