Document num_generations and group_generations attributes (#739)
* Add warning about having to install specific `fsspec` implementation

* Remove unused stuff

* Add documentation for `num_generations` and `group_generations`
attributes

* Remove unused image

* Update docs/sections/how_to_guides/basic/task/index.md

Co-authored-by: Agus <[email protected]>

---------

Co-authored-by: Agus <[email protected]>
gabrielmbmb and plaguss authored Jun 18, 2024
1 parent d736dd7 commit 9ea6d2e
Showing 4 changed files with 98 additions and 17 deletions.
14 changes: 12 additions & 2 deletions docs/sections/how_to_guides/advanced/fs_to_pass_data.md
@@ -2,6 +2,16 @@

In some situations, the batches can contain so much data that it is faster to write them to disk and read them back in the next step than to pass them using the queue. To solve this issue, `distilabel` uses [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/), allowing you to provide a file system configuration in the `run` method of the `distilabel` pipelines and to specify whether this file system should be used to pass data between steps:

!!! WARNING

    In order to use a specific file system/cloud storage, you will need to install the specific package providing the `fsspec` implementation for that file system. For instance, to use Google Cloud Storage you will need to install `gcsfs`:

    ```bash
    pip install gcsfs
    ```

    Check the available implementations: [fsspec - Other known implementations](https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations)

```python
from distilabel.pipeline import Pipeline

@@ -11,12 +21,12 @@ with Pipeline(name="my-pipeline") as pipeline:
if __name__ == "__main__":
distiset = pipeline.run(
...,
storage_parameters={"protocol": "gcs", "path": "gcs://my-bucket"},
storage_parameters={"path": "gcs://my-bucket"},
use_fs_to_pass_data=True
)
```

-The code above setups a file system (in this case Google Cloud Storage) and sets the flag `use_fs_to_pass_data` to specify that the data of the batches should be passed to the steps using the file system.The `storage_parameters` argument is optional, and in the case it's not provided but `use_fs_to_pass_data==True`, `distilabel` will use the local file system.
+The code above sets up a file system (in this case Google Cloud Storage) and sets the `use_fs_to_pass_data` flag to specify that the data of the batches should be passed to the steps using the file system. The `storage_parameters` argument is optional; if it's not provided but `use_fs_to_pass_data==True`, `distilabel` will use the local file system.
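For example, relying on that local file system fallback is just a matter of omitting `storage_parameters` (a minimal sketch based on the example above):

```python
if __name__ == "__main__":
    distiset = pipeline.run(
        ...,
        # No `storage_parameters` given: batches are written to and read
        # from the local file system while being passed between steps.
        use_fs_to_pass_data=True
    )
```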

!!! NOTE

12 changes: 4 additions & 8 deletions docs/sections/how_to_guides/basic/pipeline/index.md
@@ -48,8 +48,8 @@ with Pipeline("pipe-name", description="My first pipe") as pipeline:
```python
MistralLLM(model="mistral-large-2402"),
VertexAILLM(model="gemini-1.5-pro"),
):
-        task = TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm) # (1)
-        task.connect(load_dataset) # (2)
+        task = TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
+        task.connect(load_dataset)

...
```
@@ -68,7 +68,7 @@ from distilabel.steps.tasks import TextGeneration
```python
with Pipeline("pipe-name", description="My first pipe") as pipeline:
load_dataset = LoadDataFromHub(name="load_dataset")

-    combine_generations = CombineColumns( # (1)
+    combine_generations = CombineColumns(
name="combine_generations",
columns=["generation", "model_name"],
output_columns=["generations", "model_names"],
```
@@ -284,11 +284,7 @@ if __name__ == "__main__":

### Stopping the pipeline

-In case you want to stop the pipeline while it's running using the `Ctrl+C` (`Cmd+C` in MacOS), and the outputs will be stored in the cache. Repeating the command 2 times will force the pipeline to close.
-
-!!! Note
-    When pushing sending the signal to kill the process, you can expect to see the following log messages:
-    ![Pipeline ctrl+c](../../../../assets/images/sections/pipeline/pipeline-ctrlc.png)
+In case you want to stop the pipeline while it's running, you can press ++ctrl+c++ or ++cmd+c++ depending on your OS (or send a `SIGINT` to the main process), and the outputs will be stored in the cache. Pressing an additional time will force the pipeline to stop its execution, but this can lead to losing the generated outputs for certain batches.
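If you need to trigger the same behavior programmatically, a minimal sketch on a Unix-like OS might look like this (`pipeline_pid` is a hypothetical, known process ID, not something the library provides):

```python
import os
import signal

pipeline_pid = 12345  # hypothetical PID of the process running the pipeline

# Deliver the same signal as pressing Ctrl+C in that process' terminal;
# the pipeline should stop gracefully and store its outputs in the cache.
os.kill(pipeline_pid, signal.SIGINT)
```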

## Cache

89 changes: 82 additions & 7 deletions docs/sections/how_to_guides/basic/task/index.md
@@ -7,23 +7,25 @@ The [`Task`][distilabel.steps.tasks.Task] is a special kind of [`Step`][distilab
For example, the most basic task is the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task, which generates text based on a given instruction.

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
name="text-generation",
-    llm=OpenAILLM(model="gpt-4"),
+    llm=InferenceEndpointsLLM(
+        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
+        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
+    ),
)
task.load()

next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
# {
-#         "instruction": "What's the capital of Spain?",
-#         "generation": "The capital of Spain is Madrid.",
-#         "model_name": "gpt-4",
-#         "distilabel_metadata": {
-#             "raw_output_text-generation": "The capital of Spain is Madrid"
-#         }
+#         'instruction': "What's the capital of Spain?",
+#         'generation': 'The capital of Spain is Madrid.',
+#         'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
+#         'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# }
# ]
```
@@ -33,6 +35,79 @@ next(task.process([{"instruction": "What's the capital of Spain?"}]))

As shown above, the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task adds a `generation` based on the `instruction`. Additionally, it provides some metadata about the LLM call through `distilabel_metadata`. This can be disabled by setting the `add_raw_output` attribute to `False` when creating the task.
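For instance, a minimal sketch of disabling it for the task above (so the output rows no longer carry the raw LLM response) could be:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
    name="text-generation",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    add_raw_output=False,  # skip the raw output entry in `distilabel_metadata`
)
```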

## Specifying the number of generations and grouping generations

All the `Task`s have a `num_generations` attribute that allows defining the number of generations that we want to have per input. We can update the example above to generate 3 completions per input:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
name="text-generation",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
num_generations=3,
)
task.load()

next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': 'The capital of Spain is Madrid.',
# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# },
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': 'The capital of Spain is Madrid.',
# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# },
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': 'The capital of Spain is Madrid.',
# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# }
# ]
```

In addition, we might want to group the generations into a single output row, as a downstream step may expect a single row with multiple generations. We can achieve this by setting the `group_generations` attribute to `True`:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
name="text-generation",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
),
num_generations=3,
group_generations=True
)
task.load()

next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
# {
# 'instruction': "What's the capital of Spain?",
# 'generation': ['The capital of Spain is Madrid.', 'The capital of Spain is Madrid.', 'The capital of Spain is Madrid.'],
# 'distilabel_metadata': [
# {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
# {'raw_output_text-generation': 'The capital of Spain is Madrid.'}
# ],
# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
# }
# ]
```
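As a quick sanity check of the grouped output (the variable names below are illustrative, not part of the library):

```python
result = next(task.process([{"instruction": "What's the capital of Spain?"}]))

# With `group_generations=True` there is a single output row per input,
# and its 'generation' field holds all `num_generations` completions.
assert len(result) == 1
assert len(result[0]["generation"]) == 3
```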

## Defining custom Tasks

We can define a custom task by creating a new subclass of the [`Task`][distilabel.steps.tasks.Task] and defining the following:
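As a rough sketch, assuming the `Task` interface used throughout distilabel 1.x (the `inputs` and `outputs` properties plus the `format_input` and `format_output` methods), a custom task could look like this (the class and column names are illustrative):

```python
from typing import Any, Dict, List, Union

from distilabel.steps.tasks import Task
from distilabel.steps.tasks.typing import ChatType


class ReverseText(Task):
    """Toy task that asks the LLM to reverse the input text."""

    @property
    def inputs(self) -> List[str]:
        # Columns this task expects in each input row.
        return ["instruction"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        # Build the chat-formatted prompt sent to the LLM.
        return [{"role": "user", "content": f"Reverse this text: {input['instruction']}"}]

    @property
    def outputs(self) -> List[str]:
        # Columns this task adds to each output row.
        return ["reversed_text", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        # Map the raw LLM completion to the declared output columns.
        return {"reversed_text": output}
```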
