Merge pull request #566 from sujee/getting-started-1
Getting started instructions and code tweak
Bytes-Explorer authored Sep 4, 2024
2 parents c81ed0d + 39a7567 commit 8b0ff7c
Showing 4 changed files with 910 additions and 14 deletions.
52 changes: 38 additions & 14 deletions README.md
@@ -24,6 +24,7 @@ The goal is to offer high-level APIs for developers to quickly get started in wo


## 📝 Table of Contents

- [About](#about)
- [Run your first transform](#gettingstarted)
- [Scaling transforms from laptop to cluster](#laptop_cluster)
@@ -32,6 +33,7 @@ The goal is to offer high-level APIs for developers to quickly get started in wo
- [Papers and Talks](#talks_papers)

## &#x1F4D6; About <a name = "about"></a>

Data Prep Kit is a toolkit for streamlining data preparation for developers looking to build LLM-enabled applications via fine-tuning, RAG or instruction-tuning.
Data Prep Kit contributes a set of modules that the developer can get started with to easily build data pipelines suitable for their use case.
These modules have been tested while producing pre-training datasets for the [Granite open source LLM models](https://huggingface.co/ibm-granite).
@@ -92,6 +94,7 @@ conda create -n data-prep-kit -y python=3.11
conda activate data-prep-kit
python --version
```

Check that the Python version is 3.11.

If you are using a Linux system, install gcc using the commands below:
@@ -100,28 +103,44 @@ If you are using a linux system, install gcc using the below commands:
conda install gcc_linux-64
conda install gxx_linux-64
```

Next, install the data prep toolkit library. This library installs both the Python and Ray versions of the transforms.

```bash
pip3 install data-prep-toolkit-transforms-ray
pip3 install jupyterlab ipykernel ipywidgets

## install custom kernel
python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
```

Test your installation. If you can import these data-prep-kit libraries successfully in Python, your installation has succeeded.

```bash
## start python interpreter
$ python

# import DPK libraries
>>> from data_processing_ray.runtime.ray import RayTransformLauncher
>>> from data_processing.runtime.pure_python import PythonTransformLauncher
```

If there are no errors, you are good to go!

### Run your first transform

Let's try a simple transform that extracts content from PDF files. The following notebooks demonstrate how to run it with the data-prep-kit.

- Option 1: Pure Python notebook: [examples/notebooks/Run_your_first_transform_python.ipynb](examples/notebooks/Run_your_first_transform_python.ipynb) - the easiest way to get started
- Option 2: Ray notebook, which uses the Ray framework for parallel execution while still allowing local processing: [examples/notebooks/Run_your_first_transform_ray.ipynb](examples/notebooks/Run_your_first_transform_ray.ipynb)

You can try either one, or both 😄

To run the notebooks, launch Jupyter from the same virtual environment you created, using the command below.

`jupyter lab`

After opening a notebook, change the kernel to `dataprepkit` so that all libraries are loaded properly.
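
If you prefer a plain script to a notebook, the pattern those notebooks follow boils down to a few lines: build a parameter dictionary, hand it to a launcher, and call `launch()`. The sketch below is a minimal illustration, not the notebooks' exact code; it assumes the `noop` transform's configuration class and the `ParamsUtils` helper as they appear in the repository's examples, and import paths may differ between releases.

```python
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils

# Assumption: the noop transform ships this configuration class;
# substitute the configuration class of the transform you want to run.
from noop_transform_python import NOOPPythonTransformConfiguration

# Point the transform at local input/output folders (hypothetical paths).
local_conf = {"input_folder": "test-data/input", "output_folder": "test-data/output"}
params = {"data_local_config": ParamsUtils.convert_to_ast(local_conf)}

# The launcher reads its configuration from the command line.
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=NOOPPythonTransformConfiguration())
launcher.launch()
```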

Explore more examples [here](examples/notebooks).

@@ -140,24 +159,28 @@ The annotator design also allows a user to verify the results of the processing

- **Filter**: A filter transform processes the data and outputs the transformed data, e.g., exact deduplication.
A general purpose [SQL-based filter transform](transforms/universal/filter) enables a powerful mechanism for identifying columns and rows of interest for downstream processing; a parameter sketch follows after this list.
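
To make the filter design concrete, here is a hedged sketch of how the SQL-based filter transform is typically parameterized. The parameter names (`filter_criteria_list`, `filter_logical_operator`, `filter_columns_to_drop`) and the column names in the criteria are assumptions drawn from the filter transform's examples; check [its documentation](transforms/universal/filter) for the authoritative list.

```python
from data_processing.utils import ParamsUtils

# Assumption: criteria are SQL WHERE-style expressions evaluated per row;
# with "AND", only rows satisfying every criterion are kept.
filter_params = {
    "filter_criteria_list": ParamsUtils.convert_to_ast(
        ["docq_total_words > 100", "lang = 'en'"]  # hypothetical column names
    ),
    "filter_logical_operator": "AND",
    # Drop helper columns that are no longer needed downstream.
    "filter_columns_to_drop": ParamsUtils.convert_to_ast(["extra_metadata"]),
}
```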

To add a new module, a user can pick the right design based on the processing to be applied. More details [here](transforms).

One can leverage Python-based processing logic and the Data Processing Library to easily build and contribute new transforms. We have provided an [example transform](transforms/universal/noop) that can serve as a template to add new simple transforms. Follow the step-by-step [tutorial](data-processing-lib/doc/simplest-transform-tutorial.md) to help you add your own new transform.
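
For a flavor of what such a transform looks like, here is a minimal sketch modeled on the noop template. The `AbstractTableTransform` base class and the `(list of tables, metadata dict)` return convention come from the Data Processing Library; the exact method signature may vary by version, so treat this as an outline and follow the tutorial above for the real thing.

```python
from typing import Any

import pyarrow as pa

from data_processing.transform import AbstractTableTransform


class WordCountTransform(AbstractTableTransform):
    """Toy annotator: adds a word-count column to each table."""

    def __init__(self, config: dict[str, Any]):
        super().__init__(config)

    def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]:
        # Annotate a hypothetical "contents" column with its word count.
        counts = [len(str(doc).split()) for doc in table.column("contents")]
        table = table.append_column("word_count", pa.array(counts))
        # Return the transformed table(s) plus metadata for the run summary.
        return [table], {"rows_processed": table.num_rows}
```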

For a deeper understanding of the library's architecture, its transforms, and available runtimes, we encourage the reader to consult the comprehensive [overview document](../../data-processing-lib/doc/overview.md) alongside dedicated sections on [transforms](../../data-processing-lib/doc/transforms.md) and [runtimes](../../data-processing-lib/doc/transform-runtimes.md).

Additionally, check out our [video tutorial](https://www.youtube.com/watch?v=0WUMG6HIgMg) for a visual, example-driven guide on adding custom modules.


## 💻 -> 🖥️☁️ From laptop to cluster <a name = "laptop_cluster"></a>
Data-prep-kit provides the flexibility to transition your projects from the proof-of-concept (PoC) stage to full-scale production, offering all the necessary tools to run your data transformations at high volume. In this section, we show you how to run your transforms at scale and how to automate them.

### Scaling of Transforms

To enable processing of large data volumes on multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html)
or [Spark](https://spark.apache.org) wrappers are provided to readily scale out the Python implementations.

A generalized workflow is shown [here](doc/data-processing.md).
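
As a rough illustration of the scale-out path, the single-process launch shown earlier switches to the Ray runtime by swapping the launcher and adding a few worker parameters. The keys below (`run_locally`, `runtime_num_workers`, `runtime_worker_options`) are drawn from the repository's Ray examples and are assumptions rather than a stable API contract.

```python
import sys

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher

# Assumption: the noop transform's Ray configuration class, as in the repo's examples.
from noop_transform_ray import NOOPRayTransformConfiguration

params = {
    "run_locally": True,  # spin up a local Ray cluster; omit to attach to an existing one
    "runtime_num_workers": 3,  # scale out across three parallel workers
    "runtime_worker_options": ParamsUtils.convert_to_ast({"num_cpus": 0.8}),
    "data_local_config": ParamsUtils.convert_to_ast(
        {"input_folder": "test-data/input", "output_folder": "test-data/output"}
    ),
}
sys.argv = ParamsUtils.dict_to_req(d=params)
RayTransformLauncher(NOOPRayTransformConfiguration()).launch()
```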

### Automation

The toolkit also supports transform execution automation based on
[Kubeflow pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (KFP),
tested on a locally deployed [Kind cluster](https://kind.sigs.k8s.io/) and external OpenShift clusters. There is an
@@ -179,6 +202,7 @@ You can run transforms via docker image or using virtual environments. This [doc


## 🎤 + 📄 Talks and Papers <a name = "talks_papers"></a>

1. [Granite Code Models: A Family of Open Foundation Models for Code Intelligence](https://arxiv.org/abs/2405.04324)
2. [Scaling Granite Code Models to 128K Context](https://arxiv.org/abs/2407.13739)
3. Talk on "Building Successful LLM Apps: The Power of high quality data" [Video](https://www.youtube.com/watch?v=u_2uiZBBVIE) [Slides](https://www.slideshare.net/slideshow/data_prep_techniques_challenges_methods-pdf-a190/271527890)
2 changes: 2 additions & 0 deletions examples/notebooks/.gitignore
@@ -0,0 +1,2 @@
Input-Test-Data
Output-Test-Data
