diff --git a/docs/assets/images/sections/pipeline/pipeline-ctrlc.png b/docs/assets/images/sections/pipeline/pipeline-ctrlc.png
deleted file mode 100644
index 33b5b171ae..0000000000
Binary files a/docs/assets/images/sections/pipeline/pipeline-ctrlc.png and /dev/null differ
diff --git a/docs/sections/how_to_guides/advanced/fs_to_pass_data.md b/docs/sections/how_to_guides/advanced/fs_to_pass_data.md
index 2851c3bc3c..178b3e5eac 100644
--- a/docs/sections/how_to_guides/advanced/fs_to_pass_data.md
+++ b/docs/sections/how_to_guides/advanced/fs_to_pass_data.md
@@ -2,6 +2,16 @@
 In some situations, it can happen that the batches contains so much data that is faster to write it to disk and read it back in the next step, instead of passing it using the queue. To solve this issue, `distilabel` uses [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to allow providing a file system configuration and whether if this file system should be used to pass data between steps in the `run` method of the `distilabel` pipelines:
 
+!!! WARNING
+
+    To use a specific file system or cloud storage, you will need to install the package providing the `fsspec` implementation for that file system. For instance, to use Google Cloud Storage you will need to install `gcsfs`:
+
+    ```bash
+    pip install gcsfs
+    ```
+
+    Check the available implementations: [fsspec - Other known implementations](https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations)
+
 ```python
 from distilabel.pipeline import Pipeline
 
@@ -11,12 +21,12 @@ with Pipeline(name="my-pipeline") as pipeline:
 if __name__ == "__main__":
     distiset = pipeline.run(
         ...,
-        storage_parameters={"protocol": "gcs", "path": "gcs://my-bucket"},
+        storage_parameters={"path": "gcs://my-bucket"},
         use_fs_to_pass_data=True
     )
 ```
 
-The code above setups a file system (in this case Google Cloud Storage) and sets the flag `use_fs_to_pass_data` to specify that the data of the batches should be passed to the steps using the file system.The `storage_parameters` argument is optional, and in the case it's not provided but `use_fs_to_pass_data==True`, `distilabel` will use the local file system.
+The code above sets up a file system (in this case Google Cloud Storage) and sets the `use_fs_to_pass_data` flag to specify that the data of the batches should be passed to the steps using the file system. The `storage_parameters` argument is optional; if it's not provided but `use_fs_to_pass_data==True`, `distilabel` will use the local file system.
 
 !!! NOTE
 
diff --git a/docs/sections/how_to_guides/basic/pipeline/index.md b/docs/sections/how_to_guides/basic/pipeline/index.md
index 3f14cee4c7..f4abcfee63 100644
--- a/docs/sections/how_to_guides/basic/pipeline/index.md
+++ b/docs/sections/how_to_guides/basic/pipeline/index.md
@@ -48,8 +48,8 @@ with Pipeline("pipe-name", description="My first pipe") as pipeline:
         MistralLLM(model="mistral-large-2402"),
         VertexAILLM(model="gemini-1.5-pro"),
     ):
-        task = TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)  # (1)
-        task.connect(load_dataset)  # (2)
+        task = TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
+        task.connect(load_dataset)
 
     ...
 ```
@@ -68,7 +68,7 @@ from distilabel.steps.tasks import TextGeneration
 with Pipeline("pipe-name", description="My first pipe") as pipeline:
     load_dataset = LoadDataFromHub(name="load_dataset")
 
-    combine_generations = CombineColumns(  # (1)
+    combine_generations = CombineColumns(
         name="combine_generations",
         columns=["generation", "model_name"],
         output_columns=["generations", "model_names"],
@@ -284,11 +284,7 @@ if __name__ == "__main__":
 ### Stopping the pipeline
 
-In case you want to stop the pipeline while it's running using the `Ctrl+C` (`Cmd+C` in MacOS), and the outputs will be stored in the cache. Repeating the command 2 times will force the pipeline to close.
-
-!!! Note
-    When pushing sending the signal to kill the process, you can expect to see the following log messages:
-    ![Pipeline ctrl+c](../../../../assets/images/sections/pipeline/pipeline-ctrlc.png)
+If you want to stop the pipeline while it's running, you can press ++ctrl+c++ or ++cmd+c++ depending on your OS (or send a `SIGINT` to the main process), and the outputs will be stored in the cache. Pressing it a second time will force the pipeline to stop its execution, but this can lead to losing the generated outputs for certain batches.
 
 ## Cache
 
diff --git a/docs/sections/how_to_guides/basic/task/index.md b/docs/sections/how_to_guides/basic/task/index.md
index 54c04483dc..70b118c3ea 100644
--- a/docs/sections/how_to_guides/basic/task/index.md
+++ b/docs/sections/how_to_guides/basic/task/index.md
@@ -7,23 +7,25 @@ The [`Task`][distilabel.steps.tasks.Task] is a special kind of [`Step`][distilab
 For example, the most basic task is the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task, which generates text based on a given instruction.
 
 ```python
+from distilabel.llms import InferenceEndpointsLLM
 from distilabel.steps.tasks import TextGeneration
 
 task = TextGeneration(
     name="text-generation",
-    llm=OpenAILLM(model="gpt-4"),
+    llm=InferenceEndpointsLLM(
+        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
+        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
+    ),
 )
 task.load()
 
 next(task.process([{"instruction": "What's the capital of Spain?"}]))
 # [
 #     {
-#         "instruction": "What's the capital of Spain?",
-#         "generation": "The capital of Spain is Madrid.",
-#         "model_name": "gpt-4",
-#         "distilabel_metadata": {
-#             "raw_output_text-generation": "The capital of Spain is Madrid"
-#         }
+#         'instruction': "What's the capital of Spain?",
+#         'generation': 'The capital of Spain is Madrid.',
+#         'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
+#         'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
 #     }
 # ]
 ```
@@ -33,6 +35,79 @@ next(task.process([{"instruction": "What's the capital of Spain?"}]))
 As shown above, the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task adds a `generation` based on the `instruction`. Additionally, it provides some metadata about the LLM call through `distilabel_metadata`. This can be disabled by setting the `add_raw_output` attribute to `False` when creating the task.
 
+## Specifying the number of generations and grouping generations
+
+All the `Task`s have a `num_generations` attribute that allows defining the number of generations we want per input.
+We can update the example above to generate 3 completions per input:
+
+```python
+from distilabel.llms import InferenceEndpointsLLM
+from distilabel.steps.tasks import TextGeneration
+
+task = TextGeneration(
+    name="text-generation",
+    llm=InferenceEndpointsLLM(
+        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
+        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
+    ),
+    num_generations=3,
+)
+task.load()
+
+next(task.process([{"instruction": "What's the capital of Spain?"}]))
+# [
+#     {
+#         'instruction': "What's the capital of Spain?",
+#         'generation': 'The capital of Spain is Madrid.',
+#         'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
+#         'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
+#     },
+#     {
+#         'instruction': "What's the capital of Spain?",
+#         'generation': 'The capital of Spain is Madrid.',
+#         'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
+#         'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
+#     },
+#     {
+#         'instruction': "What's the capital of Spain?",
+#         'generation': 'The capital of Spain is Madrid.',
+#         'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
+#         'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
+#     }
+# ]
+```
+
+In addition, we might want to group the generations into a single output row, because a downstream step may expect a single row with multiple generations. We can achieve this by setting the `group_generations` attribute to `True`:
+
+```python
+from distilabel.llms import InferenceEndpointsLLM
+from distilabel.steps.tasks import TextGeneration
+
+task = TextGeneration(
+    name="text-generation",
+    llm=InferenceEndpointsLLM(
+        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
+        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
+    ),
+    num_generations=3,
+    group_generations=True
+)
+task.load()
+
+next(task.process([{"instruction": "What's the capital of Spain?"}]))
+# [
+#     {
+#         'instruction': "What's the capital of Spain?",
+#         'generation': ['The capital of Spain is Madrid.', 'The capital of Spain is Madrid.', 'The capital of Spain is Madrid.'],
+#         'distilabel_metadata': [
+#             {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
+#             {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
+#             {'raw_output_text-generation': 'The capital of Spain is Madrid.'}
+#         ],
+#         'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
+#     }
+# ]
+```
+
 ## Defining custom Tasks
 
 We can define a custom step by creating a new subclass of the [`Task`][distilabel.steps.tasks.Task] and defining the following:
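
As a rough illustration of that last point, the sketch below shows one way such a subclass might look. This is only a hedged sketch: it assumes that a custom `Task` defines the `inputs` and `outputs` properties plus the `format_input` and `format_output` methods (mirroring the behaviour of the built-in tasks shown above), and the class and column names are purely illustrative.

```python
from typing import Any, Dict, List, Union

from distilabel.steps.tasks import Task


class MyCustomTask(Task):
    """Hypothetical custom task: sends each `instruction` to the LLM and stores
    the completion in a `generation` column."""

    @property
    def inputs(self) -> List[str]:
        # Columns that must be present in every input batch.
        return ["instruction"]

    def format_input(self, input: Dict[str, Any]) -> List[Dict[str, str]]:
        # Turn a row into the chat-like format the LLM expects.
        return [{"role": "user", "content": input["instruction"]}]

    @property
    def outputs(self) -> List[str]:
        # Columns this task adds to every output batch.
        return ["generation", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        # Map the raw LLM completion back into the declared output columns.
        return {"generation": output}
```

An instance of such a task could then be given any `LLM` (for example, the `InferenceEndpointsLLM` used earlier) and connected to other steps just like the built-in `TextGeneration`.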