
Commit

Merge branch 'develop' into inference-endpoints-structured-gen
alvarobartt committed May 31, 2024
2 parents a934ff0 + 1624b1e commit c3cc487
Showing 53 changed files with 2,711 additions and 466 deletions.
42 changes: 42 additions & 0 deletions .github/workflows/codspeed.yml
@@ -0,0 +1,42 @@
name: Benchmarks

on:
  push:
    branches:
      - "main"
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  benchmarks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.12"
          # Looks like it's not working very well for other people:
          # https://github.com/actions/setup-python/issues/436
          # cache: "pip"
          # cache-dependency-path: pyproject.toml

      - uses: actions/cache@v3
        id: cache
        with:
          path: ${{ env.pythonLocation }}
          key: ${{ runner.os }}-python-${{ env.pythonLocation }}-${{ hashFiles('pyproject.toml') }}-benchmarks-v00

      - name: Install dependencies
        if: steps.cache.outputs.cache-hit != 'true'
        run: ./scripts/install_dependencies.sh

      - name: Run benchmarks
        uses: CodSpeedHQ/action@v2
        with:
          token: ${{ secrets.CODSPEED_TOKEN }}
          run: pytest tests/ --codspeed
18 changes: 8 additions & 10 deletions .github/workflows/test.yml
@@ -9,6 +9,12 @@ on:
    types:
      - opened
      - synchronize
  workflow_dispatch:
    inputs:
      tmate_session:
        description: Starts the workflow with tmate enabled.
        required: false
        default: "false"

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
@@ -19,7 +25,7 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10", "3.11"]
        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
      fail-fast: false

    steps:
@@ -42,14 +48,7 @@ jobs:

      - name: Install dependencies
        if: steps.cache.outputs.cache-hit != 'true'
        run: |
          python_version=$(python -c "import sys; print(sys.version_info[:2])")
          pip install -e .[dev,tests,anthropic,argilla,cohere,groq,hf-inference-endpoints,hf-transformers,litellm,llama-cpp,ollama,openai,outlines,vertexai,vllm]
          if [ "${python_version}" != "(3, 8)" ]; then
            pip install -e .[mistralai]
          fi;
          pip install git+https://github.com/argilla-io/LLM-Blender.git
        run: ./scripts/install_dependencies.sh

      - name: Lint
        run: make lint
@@ -59,4 +58,3 @@ jobs:

      - name: Integration Tests
        run: make integration-tests
        timeout-minutes: 5
5 changes: 2 additions & 3 deletions .pre-commit-config.yaml
@@ -11,11 +11,10 @@ repos:
          - --fuzzy-match-generates-todo

  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.1.4
    rev: v0.4.5
    hooks:
      - id: ruff
        args:
          - --fix
        args: [--fix]
      - id: ruff-format

ci:
4 changes: 2 additions & 2 deletions Makefile
@@ -2,12 +2,12 @@ sources = src/distilabel tests

.PHONY: format
format:
	ruff --fix $(sources)
	ruff check --fix $(sources)
	ruff format $(sources)

.PHONY: lint
lint:
	ruff $(sources)
	ruff check $(sources)
	ruff format --check $(sources)

.PHONY: unit-tests
38 changes: 38 additions & 0 deletions docs/sections/learn/advanced/distiset.md
@@ -70,6 +70,44 @@ distiset.push_to_hub(
)
```

### Save and load from disk

Saves the [`Distiset`][distilabel.distiset.Distiset] to disk, and optionally (enabled by default) also saves the dataset card, the pipeline config file and the logs:

```python
distiset.save_to_disk(
"my-dataset",
save_card=True,
save_pipeline_config=True,
save_pipeline_log=True
)
```

And a [`Distiset`][distilabel.distiset.Distiset] that was saved using [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk] can be loaded back just the same way:

```python
from distilabel.distiset import Distiset

distiset = Distiset.load_from_disk("my-dataset")
```

or from your cloud provider if that's where it was stored:

```python
import os

distiset = Distiset.load_from_disk(
    "s3://path/to/my_dataset",  # gcs:// or any filesystem tolerated by fsspec
    storage_options={
        "key": os.environ["S3_ACCESS_KEY"],
        "secret": os.environ["S3_SECRET_KEY"],
        ...
    }
)
```

Take into account that these methods work like `datasets.load_from_disk` and `datasets.Dataset.save_to_disk`, so the arguments are passed directly to those methods. This means you can also make use of the `storage_options` argument to save your [`Distiset`][distilabel.distiset.Distiset] in your cloud provider, including the distilabel artifacts (`pipeline.yaml`, `pipeline.log` and the `README.md` with the dataset card). You can read more in the `datasets` documentation [here](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets).
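
For instance, a minimal sketch of saving straight to a cloud provider with `storage_options` (the bucket path and credential keys below are placeholders, mirroring the loading example above):

```python
import os

distiset.save_to_disk(
    "s3://path/to/my_dataset",  # hypothetical bucket path
    storage_options={
        "key": os.environ["S3_ACCESS_KEY"],
        "secret": os.environ["S3_SECRET_KEY"],
    },
)
```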

Take a look at the remaining arguments at [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk] and [`Distiset.load_from_disk`][distilabel.distiset.Distiset.load_from_disk].

## Dataset card

Having this special type of dataset comes with an added advantage when calling [`Distiset.push_to_hub`][distilabel.distiset.Distiset.push_to_hub]: a dataset card is automatically generated on the Hugging Face Hub. Note that this is enabled by default, but can be disabled by setting `generate_card=False`.
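
A minimal sketch of such a call (the repository id below is a placeholder):

```python
distiset.push_to_hub(
    "my-org/my-dataset",  # hypothetical repository id
    generate_card=False,
)
```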
24 changes: 24 additions & 0 deletions docs/sections/learn/advanced/fs_to_pass_data.md
@@ -0,0 +1,24 @@
# Using a file system to pass data of batches between steps

In some situations, a batch can contain so much data that it is faster to write it to disk and read it back in the next step than to pass it using the queue. To solve this issue, `distilabel` uses [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/), which lets you provide a file system configuration and indicate whether that file system should be used to pass data between steps, via the `run` method of the `distilabel` pipelines:

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(
        ...,
        storage_parameters={"protocol": "gcs", "path": "gcs://my-bucket"},
        use_fs_to_pass_data=True
    )
```

The code above sets up a file system (in this case Google Cloud Storage) and sets the `use_fs_to_pass_data` flag to specify that the data of the batches should be passed to the steps using the file system. The `storage_parameters` argument is optional, and if it's not provided while `use_fs_to_pass_data==True`, `distilabel` will use the local file system.
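
As a sketch of that fallback, the flag can be enabled without any `storage_parameters`, in which case the batches are written to the local file system:

```python
# Minimal sketch: no storage_parameters provided, so the local file system is used.
distiset = pipeline.run(
    ...,
    use_fs_to_pass_data=True
)
```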

!!! NOTE

    As `GlobalStep`s receive all the data from the previous steps in a single batch, accumulating all the data, it's very likely that this batch will be too big to be passed using the queue. In that case, even if `use_fs_to_pass_data==False`, `distilabel` will use the file system to pass the data to the `GlobalStep`.

76 changes: 75 additions & 1 deletion docs/sections/learn/advanced/structured_generation.md
@@ -8,6 +8,14 @@

The [`LLM`][distilabel.llms.LLM] has an argument named `structured_output`[^1] that determines how we can generate structured outputs with it; let's see an example using [`LlamaCppLLM`][distilabel.llms.LlamaCppLLM].

!!! Note

For `outlines` integration to work you may need to install the corresponding dependencies:

```bash
pip install distilabel[outlines]
```

### JSON

We will start with a JSON example, where we initially define a `pydantic.BaseModel` schema to guide the generation of the structured output.
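
As an illustration, such a schema could look like the following (the class and field names here are hypothetical, just to show the shape of a `pydantic.BaseModel` used for guidance):

```python
from pydantic import BaseModel

class Character(BaseModel):
    # Hypothetical fields, only meant to illustrate a schema that guides the generation.
    name: str
    weapon: str
    level: int
```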
@@ -101,7 +109,7 @@ if match:

These were some simple examples, but one can see the options this opens.

!!! NOTE
!!! Tip
A full pipeline example can be seen in the following script:
[`examples/structured_generation_with_outlines.py`](../../pipeline_samples/examples/index.md#llama-cpp-with-outlines)

@@ -119,6 +127,72 @@ These were some simple examples, but one can see the options this opens.
curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf
```

## Instructor

When working with model providers behind an API, there's no direct way of accessing the internal logit processor the way `outlines` does, but thanks to [`instructor`](https://python.useinstructor.com/) we can still generate structured output from LLM providers. We have integrated `instructor` to work with the [`AsyncLLM`][distilabel.llms.AsyncLLM], so you can use it with the following LLMs: [`OpenAILLM`][distilabel.llms.OpenAILLM], [`AzureOpenAILLM`][distilabel.llms.AzureOpenAILLM], [`CohereLLM`][distilabel.llms.CohereLLM], [`GroqLLM`][distilabel.llms.GroqLLM], [`LiteLLM`][distilabel.llms.LiteLLM] and [`MistralLLM`][distilabel.llms.MistralLLM].

`instructor` works with `pydantic.BaseModel` objects internally, but in `distilabel` the generated examples contain their string representation, from which the `BaseModel` object can be regenerated.

!!! Note
For `instructor` integration to work you may need to install the corresponding dependencies:

```bash
pip install distilabel[instructor]
```

!!! Note
Take a look at [`InstructorStructuredOutputType`][distilabel.steps.tasks.structured_outputs.instructor.InstructorStructuredOutputType] to see the expected format
of the `structured_output` dict variable.

The following is the same example you can see in the `outlines` `JSON` section, for comparison purposes.

```python
from pydantic import BaseModel

class User(BaseModel):
    name: str
    last_name: str
    id: int
```

And then we provide that schema to the `structured_output` argument of the LLM:

!!! Note
In this example we are using *open-mixtral-8x22b*; keep in mind that not all models support the function calling functionality required for this example to work.

```python
from distilabel.llms import MistralLLM

llm = MistralLLM(
    model="open-mixtral-8x22b",
    structured_output={"schema": User}
)
llm.load()
```

And we are ready to pass our instruction as usual:

```python
import json

result = llm.generate(
    [[{"role": "user", "content": "Create a user profile for the following marathon"}]],
    max_new_tokens=256
)

data = json.loads(result[0][0])
data
# {'name': 'John', 'last_name': 'Doe', 'id': 12345}
User(**data)
# User(name='John', last_name='Doe', id=12345)
```

We get back a Python dictionary (formatted as a string) that we can parse using `json.loads`, or validate directly with `User`, which is a `pydantic.BaseModel` subclass.
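
As an alternative sketch, assuming pydantic v2 is available, the raw string can also be validated in a single step:

```python
# Assumes pydantic v2; validates the JSON string directly into a User instance.
user = User.model_validate_json(result[0][0])
# User(name='John', last_name='Doe', id=12345)
```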

!!! Tip
A full pipeline example can be seen in the following script:
[`examples/structured_generation_with_instructor.py`](../../pipeline_samples/examples/index.md#mistralai-with-instructor)

## OpenAI JSON

OpenAI offers a [JSON Mode](https://platform.openai.com/docs/guides/text-generation/json-mode) to deal with structured output via their API; let's see how to make use of it. JSON mode instructs the model to always return a JSON object following the instruction provided.
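
For reference, a minimal sketch of the underlying API call using the `openai` client directly; the model name and prompt below are placeholders, and the `distilabel` integration wraps this behaviour for you:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder; any JSON-mode capable model works
    response_format={"type": "json_object"},
    messages=[
        # JSON mode requires the word "JSON" to appear somewhere in the prompt.
        {"role": "user", "content": "Return a JSON object with the fields name, last_name and id."},
    ],
)
print(completion.choices[0].message.content)
```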
41 changes: 40 additions & 1 deletion docs/sections/pipeline_samples/examples/index.md
@@ -2,7 +2,7 @@

This section contains different example pipelines that showcase different tasks; maybe you can take inspiration from them.

### [llama.cpp with outlines](#llama-cpp-with-outlines)
### [llama.cpp with `outlines`](#llama-cpp-with-outlines)

Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.

@@ -21,3 +21,42 @@ Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.
```python title="structured_generation_with_outlines.py"
--8<-- "examples/structured_generation_with_outlines.py"
```


### [MistralAI with `instructor`](#mistralai-with-instructor)

Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`.

??? Example "See example"

This script makes use of [`MistralLLM`][distilabel.llms.mistral.MistralLLM] and the structured output capabilities thanks to [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics.

This example is translated from this [awesome example](https://python.useinstructor.com/examples/knowledge_graph/) from the `instructor` cookbook.

??? Run

```console
python examples/structured_generation_with_instructor.py
```

```python title="structured_generation_with_instructor.py"
--8<-- "examples/structured_generation_with_instructor.py"
```

??? "Visualizing the graphs"

Want to see how to visualize the graphs? You can test it using the following script. Generate some samples on your own and take a look:

!!! NOTE

This example uses `graphviz` to render the graph; you can install it with `pip` in the following way (a minimal rendering sketch is also included after this example):

```console
pip install graphviz
```

```console
python examples/draw_kg.py 2  # You can pass 0,1,2 to visualize each of the samples.
```

![Knowledge graph figure](../../../assets/images/sections/examples/knowledge-graph-example.png)
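
For reference, a minimal, hypothetical sketch of rendering a small graph with the `graphviz` Python package; the node and edge labels below are illustrative and not taken from `examples/draw_kg.py`:

```python
import graphviz

# Build a tiny directed graph; the labels are purely illustrative.
dot = graphviz.Digraph(comment="Knowledge graph sketch")
dot.node("1", "distilabel")
dot.node("2", "structured generation")
dot.edge("1", "2", label="supports")

# Writes kg_sketch (DOT source) and kg_sketch.png to the current directory.
dot.render("kg_sketch", format="png")
```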