Document num_generations and group_generations attributes (#739)
* Add warning about having to install specific `fsspec` implementation

* Remove unused stuff

* Add documentation for `num_generations` and `group_generations`
attributes

* Remove unused image

* Update docs/sections/how_to_guides/basic/task/index.md

Co-authored-by: Agus <[email protected]>

---------

Co-authored-by: Agus <[email protected]>
gabrielmbmb and plaguss authored Jun 18, 2024
1 parent d736dd7 commit 9ea6d2e
Showing 4 changed files with 98 additions and 17 deletions.
14 changes: 12 additions & 2 deletions docs/sections/how_to_guides/advanced/fs_to_pass_data.md
@@ -2,6 +2,16 @@

In some situations, the batches can contain so much data that it is faster to write them to disk and read them back in the next step than to pass them using the queue. To solve this issue, `distilabel` uses [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/), allowing you to provide a file system configuration in the `run` method of the `distilabel` pipelines and to specify whether this file system should be used to pass data between steps:

!!! WARNING

    In order to use a specific file system/cloud storage, you will need to install the specific package providing the `fsspec` implementation for that file system. For instance, to use Google Cloud Storage you will need to install `gcsfs`:

    ```bash
    pip install gcsfs
    ```

    Check the available implementations: [fsspec - Other known implementations](https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations)

```python
from distilabel.pipeline import Pipeline

@@ -11,12 +21,12 @@ with Pipeline(name="my-pipeline") as pipeline:
if __name__ == "__main__":
distiset = pipeline.run(
...,
storage_parameters={"protocol": "gcs", "path": "gcs://my-bucket"},
storage_parameters={"path": "gcs://my-bucket"},
use_fs_to_pass_data=True
)
```

-The code above setups a file system (in this case Google Cloud Storage) and sets the flag `use_fs_to_pass_data` to specify that the data of the batches should be passed to the steps using the file system.The `storage_parameters` argument is optional, and in the case it's not provided but `use_fs_to_pass_data==True`, `distilabel` will use the local file system.
+The code above sets up a file system (in this case Google Cloud Storage) and sets the `use_fs_to_pass_data` flag to specify that the data of the batches should be passed to the steps using the file system. The `storage_parameters` argument is optional; if it's not provided but `use_fs_to_pass_data==True`, `distilabel` will use the local file system.
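For example, relying on that local file system fallback is just a matter of omitting `storage_parameters` (a minimal sketch based on the example above):

```python
if __name__ == "__main__":
    distiset = pipeline.run(
        ...,
        # No `storage_parameters` given: batches are written to and read
        # from the local file system while being passed between steps.
        use_fs_to_pass_data=True
    )
```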

!!! NOTE

12 changes: 4 additions & 8 deletions docs/sections/how_to_guides/basic/pipeline/index.md
@@ -48,8 +48,8 @@ with Pipeline("pipe-name", description="My first pipe") as pipeline:
```python
MistralLLM(model="mistral-large-2402"),
VertexAILLM(model="gemini-1.5-pro"),
):
-        task = TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm) # (1)
-        task.connect(load_dataset) # (2)
+        task = TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
+        task.connect(load_dataset)

...
```
@@ -68,7 +68,7 @@ from distilabel.steps.tasks import TextGeneration
```python
with Pipeline("pipe-name", description="My first pipe") as pipeline:
load_dataset = LoadDataFromHub(name="load_dataset")

-    combine_generations = CombineColumns( # (1)
+    combine_generations = CombineColumns(
name="combine_generations",
columns=["generation", "model_name"],
output_columns=["generations", "model_names"],
```
@@ -284,11 +284,7 @@ if __name__ == "__main__":

### Stopping the pipeline

-In case you want to stop the pipeline while it's running using the `Ctrl+C` (`Cmd+C` in MacOS), and the outputs will be stored in the cache. Repeating the command 2 times will force the pipeline to close.
-
-!!! Note
-    When pushing sending the signal to kill the process, you can expect to see the following log messages:
-    ![Pipeline ctrl+c](../../../../assets/images/sections/pipeline/pipeline-ctrlc.png)
+In case you want to stop the pipeline while it's running, you can press ++ctrl+c++ or ++cmd+c++ depending on your OS (or send a `SIGINT` to the main process), and the outputs will be stored in the cache. Pressing an additional time will force the pipeline to stop its execution, but this can lead to losing the generated outputs for certain batches.
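If you need to trigger the same behavior programmatically, a minimal sketch on a Unix-like OS might look like this (`pipeline_pid` is a hypothetical, known process ID, not something the library provides):

```python
import os
import signal

pipeline_pid = 12345  # hypothetical PID of the process running the pipeline

# Deliver the same signal as pressing Ctrl+C in that process' terminal;
# the pipeline should stop gracefully and store its outputs in the cache.
os.kill(pipeline_pid, signal.SIGINT)
```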

## Cache

89 changes: 82 additions & 7 deletions docs/sections/how_to_guides/basic/task/index.md
@@ -7,23 +7,25 @@ The [`Task`][distilabel.steps.tasks.Task] is a special kind of [`Step`][distilab
For example, the most basic task is the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task, which generates text based on a given instruction.

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
name="text-generation",
-    llm=OpenAILLM(model="gpt-4"),
+    llm=InferenceEndpointsLLM(
+        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
+        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
+    ),
)
task.load()

next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
# {
-#         "instruction": "What's the capital of Spain?",
-#         "generation": "The capital of Spain is Madrid.",
-#         "model_name": "gpt-4",
-#         "distilabel_metadata": {
-#             "raw_output_text-generation": "The capital of Spain is Madrid"
-#         }
+#         'instruction': "What's the capital of Spain?",
+#         'generation': 'The capital of Spain is Madrid.',
+#         'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
+#         'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# }
# ]
```
@@ -33,6 +35,79 @@ next(task.process([{"instruction": "What's the capital of Spain?"}]))

As shown above, the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task adds a `generation` based on the `instruction`. Additionally, it provides some metadata about the LLM call through `distilabel_metadata`. This can be disabled by setting the `add_raw_output` attribute to `False` when creating the task.
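For instance, a minimal sketch of disabling it for the task above (so the output rows no longer carry the raw LLM response) could be:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
    name="text-generation",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    add_raw_output=False,  # skip the raw output entry in `distilabel_metadata`
)
```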

## Specifying the number of generations and grouping generations

All the `Task`s have a `num_generations` attribute that allows defining the number of generations that we want to have per input. We can update the example above to generate 3 completions per input:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
name="text-generation",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
num_generations=3,
)
task.load()

next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': 'The capital of Spain is Madrid.',
# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# },
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': 'The capital of Spain is Madrid.',
# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# },
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': 'The capital of Spain is Madrid.',
# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# }
# ]
```

In addition, we might want to group the generations into a single output row, as a downstream step may expect a single row with multiple generations. We can achieve this by setting the `group_generations` attribute to `True`:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
name="text-generation",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
num_generations=3,
group_generations=True
)
task.load()

next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': ['The capital of Spain is Madrid.', 'The capital of Spain is Madrid.', 'The capital of Spain is Madrid.'],
# 'distilabel_metadata': [
# {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# {'raw_output_text-generation': 'The capital of Spain is Madrid.'}
# ],
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# }
# ]
```
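As a quick sanity check of the grouped output (the variable names below are illustrative, not part of the library):

```python
result = next(task.process([{"instruction": "What's the capital of Spain?"}]))

# With `group_generations=True` there is a single output row per input,
# and its 'generation' field holds all `num_generations` completions.
assert len(result) == 1
assert len(result[0]["generation"]) == 3
```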

## Defining custom Tasks

We can define a custom task by creating a new subclass of the [`Task`][distilabel.steps.tasks.Task] and defining the following:
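As a rough sketch, assuming the `Task` interface used throughout distilabel 1.x (the `inputs` and `outputs` properties plus the `format_input` and `format_output` methods), a custom task could look like this (the class and column names are illustrative):

```python
from typing import Any, Dict, List, Union

from distilabel.steps.tasks import Task
from distilabel.steps.tasks.typing import ChatType


class ReverseText(Task):
    """Toy task that asks the LLM to reverse the input text."""

    @property
    def inputs(self) -> List[str]:
        # Columns this task expects in each input row.
        return ["instruction"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        # Build the chat-formatted prompt sent to the LLM.
        return [{"role": "user", "content": f"Reverse this text: {input['instruction']}"}]

    @property
    def outputs(self) -> List[str]:
        # Columns this task adds to each output row.
        return ["reversed_text", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        # Map the raw LLM completion to the declared output columns.
        return {"reversed_text": output}
```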
