
Commit

Merge branch 'develop' into inference-endpoints-structured-gen
alvarobartt committed May 31, 2024
2 parents a934ff0 + 1624b1e commit c3cc487
Showing 53 changed files with 2,711 additions and 466 deletions.
42 changes: 42 additions & 0 deletions .github/workflows/codspeed.yml
@@ -0,0 +1,42 @@
name: Benchmarks

on:
  push:
    branches:
      - "main"
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  benchmarks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.12"
          # Looks like it's not working very well for other people:
          # https://github.com/actions/setup-python/issues/436
          # cache: "pip"
          # cache-dependency-path: pyproject.toml

      - uses: actions/cache@v3
        id: cache
        with:
          path: ${{ env.pythonLocation }}
          key: ${{ runner.os }}-python-${{ env.pythonLocation }}-${{ hashFiles('pyproject.toml') }}-benchmarks-v00

      - name: Install dependencies
        if: steps.cache.outputs.cache-hit != 'true'
        run: ./scripts/install_dependencies.sh

      - name: Run benchmarks
        uses: CodSpeedHQ/action@v2
        with:
          token: ${{ secrets.CODSPEED_TOKEN }}
          run: pytest tests/ --codspeed
18 changes: 8 additions & 10 deletions .github/workflows/test.yml
@@ -9,6 +9,12 @@ on:
    types:
      - opened
      - synchronize
  workflow_dispatch:
    inputs:
      tmate_session:
        description: Starts the workflow with tmate enabled.
        required: false
        default: "false"

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
@@ -19,7 +25,7 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10", "3.11"]
        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
      fail-fast: false

    steps:
@@ -42,14 +48,7 @@ jobs:

      - name: Install dependencies
        if: steps.cache.outputs.cache-hit != 'true'
        run: |
          python_version=$(python -c "import sys; print(sys.version_info[:2])")
          pip install -e .[dev,tests,anthropic,argilla,cohere,groq,hf-inference-endpoints,hf-transformers,litellm,llama-cpp,ollama,openai,outlines,vertexai,vllm]
          if [ "${python_version}" != "(3, 8)" ]; then
            pip install -e .[mistralai]
          fi;
          pip install git+https://github.com/argilla-io/LLM-Blender.git
        run: ./scripts/install_dependencies.sh

      - name: Lint
        run: make lint
@@ -59,4 +58,3 @@ jobs:

      - name: Integration Tests
        run: make integration-tests
        timeout-minutes: 5
5 changes: 2 additions & 3 deletions .pre-commit-config.yaml
@@ -11,11 +11,10 @@ repos:
          - --fuzzy-match-generates-todo

  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.1.4
    rev: v0.4.5
    hooks:
      - id: ruff
        args:
          - --fix
        args: [--fix]
      - id: ruff-format

ci:
4 changes: 2 additions & 2 deletions Makefile
@@ -2,12 +2,12 @@ sources = src/distilabel tests

.PHONY: format
format:
	ruff --fix $(sources)
	ruff check --fix $(sources)
	ruff format $(sources)

.PHONY: lint
lint:
	ruff $(sources)
	ruff check $(sources)
	ruff format --check $(sources)

.PHONY: unit-tests
38 changes: 38 additions & 0 deletions docs/sections/learn/advanced/distiset.md
@@ -70,6 +70,44 @@ distiset.push_to_hub(
)
```

### Save and load from disk

Saves the [`Distiset`][distilabel.distiset.Distiset] to disk, and optionally (enabled by default) also saves the dataset card, the pipeline config file and the logs:

```python
distiset.save_to_disk(
"my-dataset",
save_card=True,
save_pipeline_config=True,
save_pipeline_log=True
)
```

And a [`Distiset`][distilabel.distiset.Distiset] that was saved using [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk] can be loaded back just the same way:

```python
from distilabel.distiset import Distiset

distiset = Distiset.load_from_disk("my-dataset")
```

or from your cloud provider if that's where it was stored:

```python
import os

distiset = Distiset.load_from_disk(
    "s3://path/to/my_dataset",  # gcs:// or any filesystem tolerated by fsspec
    storage_options={
        "key": os.environ["S3_ACCESS_KEY"],
        "secret": os.environ["S3_SECRET_KEY"],
        ...
    }
)
```

Take into account that these methods work like `datasets.load_from_disk` and `datasets.Dataset.save_to_disk`, so the arguments are passed directly to those methods. This means you can also make use of the `storage_options` argument to save your [`Distiset`][distilabel.distiset.Distiset] in your cloud provider, including the distilabel artifacts (`pipeline.yaml`, `pipeline.log` and the `README.md` with the dataset card). You can read more in the `datasets` documentation [here](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets).
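
For instance, a minimal sketch of saving straight to a cloud provider with `storage_options` (the bucket path and credential keys below are placeholders, mirroring the loading example above):

```python
import os

distiset.save_to_disk(
    "s3://path/to/my_dataset",  # hypothetical bucket path
    storage_options={
        "key": os.environ["S3_ACCESS_KEY"],
        "secret": os.environ["S3_SECRET_KEY"],
    },
)
```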

Take a look at the remaining arguments at [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk] and [`Distiset.load_from_disk`][distilabel.distiset.Distiset.load_from_disk].

## Dataset card

Having this special type of dataset comes with an added advantage when calling [`Distiset.push_to_hub`][distilabel.distiset.Distiset.push_to_hub]: a dataset card is automatically generated on the Hugging Face Hub. Note that this is enabled by default, but can be disabled by setting `generate_card=False`.
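
A minimal sketch of such a call (the repository id below is a placeholder):

```python
distiset.push_to_hub(
    "my-org/my-dataset",  # hypothetical repository id
    generate_card=False,
)
```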
24 changes: 24 additions & 0 deletions docs/sections/learn/advanced/fs_to_pass_data.md
@@ -0,0 +1,24 @@
# Using a file system to pass data of batches between steps

In some situations, a batch can contain so much data that it is faster to write it to disk and read it back in the next step than to pass it using the queue. To solve this issue, `distilabel` uses [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/), which lets you provide a file system configuration and indicate whether that file system should be used to pass data between steps, via the `run` method of the `distilabel` pipelines:

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(
        ...,
        storage_parameters={"protocol": "gcs", "path": "gcs://my-bucket"},
        use_fs_to_pass_data=True
    )
```

The code above sets up a file system (in this case Google Cloud Storage) and sets the `use_fs_to_pass_data` flag to specify that the data of the batches should be passed to the steps using the file system. The `storage_parameters` argument is optional, and if it's not provided while `use_fs_to_pass_data==True`, `distilabel` will use the local file system.
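
As a sketch of that fallback, the flag can be enabled without any `storage_parameters`, in which case the batches are written to the local file system:

```python
# Minimal sketch: no storage_parameters provided, so the local file system is used.
distiset = pipeline.run(
    ...,
    use_fs_to_pass_data=True
)
```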

!!! NOTE

    As `GlobalStep`s receive all the data from the previous steps in a single batch, accumulating all the data, it's very likely that this batch will be too big to be passed using the queue. In that case, even if `use_fs_to_pass_data==False`, `distilabel` will use the file system to pass the data to the `GlobalStep`.

76 changes: 75 additions & 1 deletion docs/sections/learn/advanced/structured_generation.md
@@ -8,6 +8,14 @@

The [`LLM`][distilabel.llms.LLM] has an argument named `structured_output`[^1] that determines how we can generate structured outputs with it; let's see an example using [`LlamaCppLLM`][distilabel.llms.LlamaCppLLM].

!!! Note

For `outlines` integration to work you may need to install the corresponding dependencies:

```bash
pip install distilabel[outlines]
```

### JSON

We will start with a JSON example, where we initially define a `pydantic.BaseModel` schema to guide the generation of the structured output.
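
As an illustration, such a schema could look like the following (the class and field names here are hypothetical, just to show the shape of a `pydantic.BaseModel` used for guidance):

```python
from pydantic import BaseModel

class Character(BaseModel):
    # Hypothetical fields, only meant to illustrate a schema that guides the generation.
    name: str
    weapon: str
    level: int
```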
@@ -101,7 +109,7 @@ if match:

These were some simple examples, but one can see the options this opens.

!!! NOTE
!!! Tip
A full pipeline example can be seen in the following script:
[`examples/structured_generation_with_outlines.py`](../../pipeline_samples/examples/index.md#llama-cpp-with-outlines)

@@ -119,6 +127,72 @@ These were some simple examples, but one can see the options this opens.
curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf
```

## Instructor

When working with model providers behind an API, there's no direct way of accessing the internal logit processor the way `outlines` does, but thanks to [`instructor`](https://python.useinstructor.com/) we can still generate structured output from LLM providers. We have integrated `instructor` to work with the [`AsyncLLM`][distilabel.llms.AsyncLLM], so you can use it with the following LLMs: [`OpenAILLM`][distilabel.llms.OpenAILLM], [`AzureOpenAILLM`][distilabel.llms.AzureOpenAILLM], [`CohereLLM`][distilabel.llms.CohereLLM], [`GroqLLM`][distilabel.llms.GroqLLM], [`LiteLLM`][distilabel.llms.LiteLLM] and [`MistralLLM`][distilabel.llms.MistralLLM].

`instructor` works with `pydantic.BaseModel` objects internally, but in `distilabel` the generated examples contain their string representation, from which the `BaseModel` object can be regenerated.

!!! Note
For `instructor` integration to work you may need to install the corresponding dependencies:

```bash
pip install distilabel[instructor]
```

!!! Note
Take a look at [`InstructorStructuredOutputType`][distilabel.steps.tasks.structured_outputs.instructor.InstructorStructuredOutputType] to see the expected format
of the `structured_output` dict variable.

The following is the same example you can see in the `outlines` `JSON` section, for comparison purposes.

```python
from pydantic import BaseModel

class User(BaseModel):
    name: str
    last_name: str
    id: int
```

And then we provide that schema to the `structured_output` argument of the LLM:

!!! Note
In this example we are using *open-mixtral-8x22b*; keep in mind that not all models support the function calling functionality required for this example to work.

```python
from distilabel.llms import MistralLLM

llm = MistralLLM(
    model="open-mixtral-8x22b",
    structured_output={"schema": User}
)
llm.load()
```

And we are ready to pass our instruction as usual:

```python
import json

result = llm.generate(
    [[{"role": "user", "content": "Create a user profile for the following marathon"}]],
    max_new_tokens=256
)

data = json.loads(result[0][0])
data
# {'name': 'John', 'last_name': 'Doe', 'id': 12345}
User(**data)
# User(name='John', last_name='Doe', id=12345)
```

We get back a Python dictionary (formatted as a string) that we can parse using `json.loads`, or validate directly with `User`, which is a `pydantic.BaseModel` subclass.
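
As an alternative sketch, assuming pydantic v2 is available, the raw string can also be validated in a single step:

```python
# Assumes pydantic v2; validates the JSON string directly into a User instance.
user = User.model_validate_json(result[0][0])
# User(name='John', last_name='Doe', id=12345)
```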

!!! Tip
A full pipeline example can be seen in the following script:
[`examples/structured_generation_with_instructor.py`](../../pipeline_samples/examples/index.md#mistralai-with-instructor)

## OpenAI JSON

OpenAI offers a [JSON Mode](https://platform.openai.com/docs/guides/text-generation/json-mode) to deal with structured output via their API; let's see how to make use of it. JSON mode instructs the model to always return a JSON object following the instruction provided.
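
For reference, a minimal sketch of the underlying API call using the `openai` client directly; the model name and prompt below are placeholders, and the `distilabel` integration wraps this behaviour for you:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder; any JSON-mode capable model works
    response_format={"type": "json_object"},
    messages=[
        # JSON mode requires the word "JSON" to appear somewhere in the prompt.
        {"role": "user", "content": "Return a JSON object with the fields name, last_name and id."},
    ],
)
print(completion.choices[0].message.content)
```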
41 changes: 40 additions & 1 deletion docs/sections/pipeline_samples/examples/index.md
@@ -2,7 +2,7 @@

This section contains different example pipelines that showcase different tasks; maybe you can take inspiration from them.

### [llama.cpp with outlines](#llama-cpp-with-outlines)
### [llama.cpp with `outlines`](#llama-cpp-with-outlines)

Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.

@@ -21,3 +21,42 @@ Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.
```python title="structured_generation_with_outlines.py"
--8<-- "examples/structured_generation_with_outlines.py"
```


### [MistralAI with `instructor`](#mistralai-with-instructor)

Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`.

??? Example "See example"

This script makes use of [`MistralLLM`][distilabel.llms.mistral.MistralLLM] and the structured output capabilities thanks to [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics.

This example is translated from this [awesome example](https://python.useinstructor.com/examples/knowledge_graph/) from the `instructor` cookbook.

??? Run

```console
python examples/structured_generation_with_instructor.py
```

```python title="structured_generation_with_instructor.py"
--8<-- "examples/structured_generation_with_instructor.py"
```

??? "Visualizing the graphs"

Want to see how to visualize the graphs? You can test it using the following script. Generate some samples on your own and take a look:

!!! NOTE

This example uses `graphviz` to render the graph; you can install it with `pip` in the following way (a minimal rendering sketch is also included after this example):

```console
pip install graphviz
```

```console
python examples/draw_kg.py 2  # You can pass 0,1,2 to visualize each of the samples.
```

![Knowledge graph figure](../../../assets/images/sections/examples/knowledge-graph-example.png)
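
For reference, a minimal, hypothetical sketch of rendering a small graph with the `graphviz` Python package; the node and edge labels below are illustrative and not taken from `examples/draw_kg.py`:

```python
import graphviz

# Build a tiny directed graph; the labels are purely illustrative.
dot = graphviz.Digraph(comment="Knowledge graph sketch")
dot.node("1", "distilabel")
dot.node("2", "structured generation")
dot.edge("1", "2", label="supports")

# Writes kg_sketch (DOT source) and kg_sketch.png to the current directory.
dot.render("kg_sketch", format="png")
```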