
# Data Lakehouse with Delta Lake

This repository provides a Docker-based data lakehouse setup built on Apache Spark, Delta Lake, and MinIO (S3-compatible object storage).

## Components

- Apache Spark 3.4.0
- Delta Lake 2.4.0
- MinIO (S3-compatible object storage)
- Jupyter Notebook with PySpark

## Setup

1. Build the Docker images:

   ```sh
   make build
   ```

2. Start the services:

   ```sh
   make up
   ```

3. Access Jupyter Notebook.
4. Access the Spark UI.
5. Access the MinIO UI.

The host ports published for Jupyter, the Spark UI, and MinIO are defined in `docker-compose.yml`.

## Usage

1. Use the Jupyter Notebook interface to create and run PySpark notebooks.
2. Use Delta Lake for ACID transactions and time travel (a short example follows this list).
3. Store and retrieve data from MinIO using the S3A filesystem.
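As a minimal sketch of steps 2 and 3, assuming the Delta and S3A settings from `notebooks/spark-defaults.conf` are already in effect (the bucket name `lakehouse` is a placeholder and must exist in MinIO first):

```python
from pyspark.sql import SparkSession

# Session picks up Delta/S3A settings from spark-defaults.conf.
spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

# Write a small DataFrame as a Delta table in MinIO via the S3A filesystem.
# "lakehouse" is an assumed bucket name, not one defined by this repository.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("s3a://lakehouse/users")

# Each overwrite commits a new table version; time travel reads an
# earlier one by version number (or by timestamp).
old = spark.read.format("delta").option("versionAsOf", 0).load("s3a://lakehouse/users")
old.show()
```

Because an overwrite commits a new version rather than deleting old data files, `versionAsOf` can recover any earlier state of the table.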

## Configuration

- Spark configuration is available in `notebooks/spark-defaults.conf` (a sketch of typical contents follows below)
- Docker services are defined in `docker-compose.yml`
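The exact contents of the config file belong to the repository, but a programmatic equivalent of the settings a Delta-plus-MinIO stack typically needs looks like the following. The `minio:9000` endpoint and `minioadmin` credentials are assumptions for a local MinIO container, not values confirmed by this repository; the config keys themselves are the standard Delta Lake and S3A ones.

```python
from pyspark.sql import SparkSession

# Programmatic mirror of typical spark-defaults.conf entries for this stack.
# Endpoint and credentials are placeholders for a local MinIO container.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```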

## Additional Information

- The setup includes the JAR files needed for S3 connectivity and Delta Lake integration.
- The PySpark and Delta Spark libraries are pre-installed in the Jupyter environment (a quick version check is sketched below).
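To confirm from inside a notebook that the pre-installed libraries match the versions this setup targets (`pyspark` and `delta-spark` are the standard PyPI package names):

```python
from importlib.metadata import version

# Check the installed package versions against the ones listed above.
print(version("pyspark"))      # expected: 3.4.0
print(version("delta-spark"))  # expected: 2.4.0
```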

## Stopping the Services

To stop and remove the containers:

```sh
make down
```

## Restarting the Services

To restart the services:

```sh
make restart
```