`distilabel` v1.2.0 #659

alvarobartt · 2024-05-22T06:33:06Z

No description provided.

* Use `orjson` to serialize * Fix `KeyError` when leaf step didn't produce any data * Add `get_data` method * Update save and load methods for `_BatchManager` so it is fast * Fix no cache after receiving batch from generator step * Override `from_json` instead of `from_dict` * Add `cache` and `load_from_cache` methods * Update unit tests * Add missing unit tests * Fix key has to be `str`

* Add functionality to load/save distisets to/from disk * Add tests for saving/loading distiset from disk * Add functionality to load/save distisets to/from disk * Update docs * Include code blocks from Examples in docstrings * Add tests for the dataset card * Fix call to yaml.safe_load found in code review * Copy path movements from hugging face load_from_disk definition * Add universal_pathlib dependency to better deal with remote paths when calling Distiset.load_from_disk * Fix download of distiset and add option to write the data to a user specified dir * Remove parameter in test as it isn't really tested with a remote filesystem * Remove unnecessary markdown extension and fix type from variables * Update src/distilabel/distiset.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/distiset.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/distiset.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/distiset.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Cast Path to str --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* New module for the integration with instructor * Move common functions related to structured outputs to its own module * Draft instructor integration with openai * Add tests for openai integration * Add unit tests for the instructor integrations * Add tests for anthropic integration * Fix including anthropic wrapper * Update llms to deal with instructor * Update dependencies with instructor * Run tests with instructor only on python>=3.9 * Fix circular import with create_distiset * Define _prepare_structured_output as staticmethod * Remove rewritten variable * Remove dead code * Check on Enum.value instead of Enum class as it isn't pickleable * Add tests for utilities related to generation of BaseModel objects from json schema dicts * Add fix to deal with nested BaseModel objects * Fix call from instructor, this should be done on instructor end, but works for the moment * Add docstrings and typing info * Add script to generate a sample dataset and visualize the result * Update the docstring of the structured output expected format * Add reference in the docs to structured outputs with instructor * Add reference to the dependency installation * Update typing info * Fix test with new mocked client for mistral * Update docs/sections/learn/advanced/structured_generation.md Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update docs/sections/learn/advanced/structured_generation.md Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/steps/tasks/structured_outputs/instructor.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/steps/tasks/structured_outputs/instructor.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/steps/tasks/structured_outputs/utils.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/steps/tasks/structured_outputs/instructor.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/steps/tasks/structured_outputs/utils.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/steps/tasks/structured_outputs/utils.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/steps/tasks/structured_outputs/utils.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Add changes from code review * Fix type hint per code review * Update docs/sections/learn/advanced/structured_generation.md Co-authored-by: Alvaro Bartolome <[email protected]> * Remove repeated line --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]> Co-authored-by: Alvaro Bartolome <[email protected]>

* Refactor `BasePipeline.__init__` method * Setup `_WriteBuffer` in base and do not cache if `dry_run` * Add `write_batch` method * Add methods to write and read from filesystem * Make `log_queue` optional * Update unit tests * Add docs for passing data with file system * Add integration tests for fs * Remove integration tests timeout * Make integration test lighter * Remove test * Update `mkdocs.yml` * Verbose * Add `tmate` for debugging * Clean handlers only if not pytest * Fix loading batch manager * Add testing fs to pass data again This reverts commit 9246c4c. * Remove verbose * Fix test * Increase tests timeout * `join` after `terminate` * Set thread daemon * Proper termination of manager and pool * Fix pytest hanging because the queue was never closed and the thread of the queue was being kept alive (frustration) * Terminate pool first and call `stop_logging` * Remove `protocol` Co-authored-by: plaguss <[email protected]> --------- Co-authored-by: plaguss <[email protected]>

* Add Python 3.12 * Add `install_dependencies.sh` script * Update to `ruff==0.4.5` * Apply format * Update commands * Update to `argilla >= 1.29.0` * Update to setup tmate in 3.12 * Update `vllm` dependency * Use `uv` to install dependencies * Update dependencies * Fix regex message for 3.12

codspeed-hq · 2024-05-31T12:07:47Z

CodSpeed Performance Report

Merging #659 will not alter performance

Comparing develop (44bd633) with develop (ff3f484)

Summary

✅ 1 untouched benchmarks

…ceEndpointsLLM` (#680) * Fix linting issue from `develop` branch * Add `grammar` arg in `agenerate` (WIP) * Run `codespell` in `src/` and `docs/` * Add support for `StructuredGeneration` (WIP) - Now the `generate` method in the `LLM` can receive either a chat or a tuple with the chat and the grammar for that chat - `grammar` is an arg at `LLM` level - The `grammar` can be specified per row via the `StructuredGeneration`, while when specifying a global `grammar` then the `grammar` arg within the `LLM` can be used via the `TextGeneration` task instead * Add `flatten_dict` to avoid `pyarrow` issues with nested dicts * Handle `pyarrow.lib.ArrowInvalid` when nested unaligned dicts * Add `StructuredGeneration` docstrings * Fix `TextGeneration` docstring for `model_name` output * Rename `DefaultInput` to `StandardInput` and add missing docstrings * Update `LLM` subclasses type-hints * Add `StructuredGeneration` import in `distilabel.steps.tasks` * Add `InferenceEndpointsLLM` and `StructuredGeneration`
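The `flatten_dict` helper named in this commit is not shown in the thread. As a rough illustration of the general idea (collapsing nested dicts into one level so that every row exposes plain, aligned columns), here is a minimal sketch. The signature, the `_` separator, and the handling of non-dict values are assumptions made for this example and may not match the helper in `distilabel`.

```python
from typing import Any, Dict


def flatten_dict(
    nested: Dict[str, Any], parent_key: str = "", sep: str = "_"
) -> Dict[str, Any]:
    """Collapse nested dicts into a single level, joining keys with `sep`.

    Illustrative sketch only; not distilabel's implementation.
    """
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        # Prefix the key with its parent path, if any.
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts and merge their flattened keys.
            flat.update(flatten_dict(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat
```

With this sketch, `flatten_dict({"a": {"b": 1}, "c": 2})` yields a dict with the keys `a_b` and `c`.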

* Add `GenerateSentencePair` task * Update task to use system prompt * Fix `setup_logging` file location * Update `add_raw_output` to be `RuntimeParameter` and `True` by default * Fix system prompt for negative sentences * Add `GenerateSentencePair` unit tests * Fix unit tests after updating `add_raw_output` * Update docs to mention `add_raw_output` attribute * Update `add_raw_output` description Co-authored-by: alvarobartt <[email protected]> * Fix columns Co-authored-by: alvarobartt <[email protected]> * Add missing docstrings * Fix tests * Add `answer` generation action * Fix examples not being correctly rendered * Add examples --------- Co-authored-by: alvarobartt <[email protected]>

…onse` in template (#703) * Remove selecting final index from `response` for `EvolQuality.apply_mutation_template` * Add `_apply_random_mutation` unit test --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>

…ifferent formats (#691) * Add a GeneratorStep to read files from disk as datasets * Add tests for the new LoadFromDisk loader * Refactor generator step classes to new naming * Add deprecation warnings for previous loaders * Add assertion to remind removing the deprecated classes * Add docstrings for the new steps * Apply comments from code review and update dataset info read using exposed function from datasets * Fix dataloader tests with new class names * Fix import tests

…en is None. (#707) * Add a way to automatically gather the HF_TOKEN when calling distiset.push_to_hub and move constant value to distilabel.utils module * Update src/distilabel/distiset.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Refactor function to obtain huggingface token and move it to its own module --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>
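The fallback described in this commit (use an explicit token when given, otherwise pick one up automatically) can be sketched as follows. The function name `resolve_hf_token` is hypothetical, and the sketch only consults the `HF_TOKEN` environment variable; the actual helper in `distilabel` may check additional locations.

```python
import os
from typing import Optional


def resolve_hf_token(token: Optional[str] = None) -> Optional[str]:
    """Return the explicit token if provided, else fall back to `HF_TOKEN`.

    Hypothetical helper for illustration; not distilabel's function.
    """
    if token is not None:
        return token
    # Returns None when the variable is unset, leaving the decision to the caller.
    return os.environ.get("HF_TOKEN")
```

Returning `None` instead of raising lets the caller decide whether a missing token is an error, for example only when the target repository is private.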

* Set `input` as optional in `format_output` * Implement "Improving Text Embeddings with LLMs" (WIP) * Implement "Improving Text Embeddings with LLMs" (WIP) * Add `model_name` at the end of each batch * Move `text_embeddings.py` to `improving_text_embeddings.py` * Fix `re.sub` to also capture `\t` and `\r` * Add `MonolingualTripletGenerator` and `BitextRetrievalGenerator` * Move all `templates` from `str` to `jinja2` files * Update class naming and imports * Add some docstrings and fix `jinja2` file paths * Fix `prompt` across tasks * Add missing docstrings * Fix `process` method in `EmbeddingTaskGenerator` * Add unit tests for `...Generator` tasks * Add remaining unit tests * Remove duplicated imports in `distilabel.steps.tasks` * Add examples in docstrings and add notes

* Fix `ChatGeneration.format_input` exception * Bump `datasets` to 2.16.0 or higher To be able to efficiently use the cache via `load_dataset` whenever there's no connection * Add `benchmarks/arena_hard.py` (WIP) * Update `_get_hf_dataset_info` error message * Add `ArenaHard` and `ArenaHardResults` docstrings * Catch `ImportError` on `ArenaHardResults.load` * Add `ArenaHard` and `ArenaHardEval` imports * Add `arena-hard` extras * Install `arena-hard` extra in `test.yml` * Update `arena-hard` extra dependencies * Fix circular import in `arena_hard.py` * Apply suggestions from code review * Add missing examples in docstrings & fix type-hints * Add some future TODOs * Update `install_dependencies.sh` to install `arena-hard` extra * Add `ArenaHard` and `ArenaHardResults` unit tests

* Move classes to different files * Add `_send_to_step` abstractmethod * Move `_request_initial_batches` method to `BasePipeline` * Move `_notify_step_to_stop` method to `BasePipeline` * Move `_handle_batch_on_stop` method to `BasePipeline` * Move `LAST_BATCH_FLAG_SENT` constant * Move `_request_more_batches_if_needed` method to `BasePipeline` * Move `_register_batch` method to `BasePipeline` * Move `_get_successors` method to `BasePipeline` * Move `_get_step_from_batch` method to `BasePipeline` * Move `_manage_batch_flow` method to `BasePipeline` * Add `_get_from_step` abstract method * Add `_add_batches_back_to_batch_manager` method * Add `_consume_output_queue` method * Add `_create_step_input_queue` method * Add `_run_step` abstract method * Move `_handle_keyboard_interrupt` method * Add `_load_queue` * Add `_init_steps_load_status` method * Move `_all_steps_loaded` method * Move `_check_step_not_loaded_or_finished` method * Move `_handle_stop` method * Move `_run_output_queue_loop` method * Remove unused variables * Fix unit tests * Remove shared dict info and update `CudaDevicePlacementMixin` * Add `unload` method * Add `portalocker` dependency * Add missing unload * Add `_OLD_IMPORT_MODULE_ATTR` dict * Fix `override` import * Remove log message * Add missing call to `unload`

…n a dataset (#688) * Fix linting issue from `develop` branch * Add `grammar` arg in `agenerate` (WIP) * Run `codespell` in `src/` and `docs/` * Add support for `StructuredGeneration` (WIP) - Now the `generate` method in the `LLM` can receive either a chat or a tuple with the chat and the grammar for that chat - `grammar` is an arg at `LLM` level - The `grammar` can be specified per row via the `StructuredGeneration`, while when specifying a global `grammar` then the `grammar` arg within the `LLM` can be used via the `TextGeneration` task instead * Add `flatten_dict` to avoid `pyarrow` issues with nested dicts * Handle `pyarrow.lib.ArrowInvalid` when nested unaligned dicts * Update grammar argument to structured_output for consistency * Update tests to check the structured outputs on serialization * Add tests for the structured generation class * Update typing and testing according to instructor or outlines structured output parameters * Fix passing grammar via structured outputs * Remove debug log * Add tests for the new minibatches of the structured generation * Update outlines based structured generation * Update tests with new keyword for structured generation * Update api based llms to run with structured generation * Fix tests after refactor * Fix test after refactor * Fix vllm batch sorting mechanism * Fix error on vllm with sorting batches --------- Co-authored-by: Alvaro Bartolome <[email protected]>

* Update mkdocs version * Update align documentation with argilla SDK 2.0 * Updated naming of basics; moved CLI to advanced * Delete unneeded index pages * Update naming * Update navigation and content edit * Update naming of How to guides * Add popular issue and community page * Update GITHUB_ACCESS_TOKEN to GH_ACCESS_TOKEN due to protected naming * Update scoped reqs for token * Add GH_ACCESS_TOKEN to workflow * Delete literate nav * Update jinja templates to hide unrendered navigation * Update navigation ordering for API reference * Update docs/sections/how_to_guides/advanced/structured_generation.md Co-authored-by: Alvaro Bartolome <[email protected]> * docs: prose in guides (#721) * docs: make argilla prose talk about argilla * docs: simplify prose in generator and global steps * Update LLM page * Update LLM docs * Update Pipeline docs * Avoid using "function" * Update Step documentation * Update docs/sections/how_to_guides/basic/step/index.md Co-authored-by: Alvaro Bartolome <[email protected]> * Update docs/sections/how_to_guides/basic/step/index.md Co-authored-by: Alvaro Bartolome <[email protected]> * Update `Task` page * Update definition of `GeneratorTask` * Update Step documentation * Update advanced documentation --------- Co-authored-by: davidberenstein1957 <[email protected]> Co-authored-by: Agus <[email protected]> Co-authored-by: Alvaro Bartolome <[email protected]> * Make `GH_ACCESS_TOKEN` optional * Add `pandas>=2.0` to `docs` * Fix typo * Update default signature GeneratorStep * Update missing `mkdocs_autorefs` within API reference * Update API page * Update CHATML_TEMPLATE formatting to avoid autodoc issues * Add reference to token scopes required --------- Co-authored-by: Alvaro Bartolome <[email protected]> Co-authored-by: burtenshaw <[email protected]> Co-authored-by: Agus <[email protected]> Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update `RuntimeParametersMixin` to handle `list`s * Check if `generation_kwargs` is present * Add `get_generation_kwargs` method * Add `_num_generation_param_supported` attribute to avoid code duplication * Refactor `OllamaLLM` and `VertexAILLM` * Add `MixtureOfAgents` llm * Add docstrings * Fix unit tests * Update docstrings * Fix missing # * Update Arena Hard tasks docstrings * Fix cross-reference * Update unit tests * Rename to `MixtureOfAgentsLLM` * Update `_extra_serializable_fields` to work with `List[_Serializable]` attributes * Remove `from_dict` method for `Step` * Update to render list * Update handling `List[RuntimeParametersMixin]` attributes * Fix unit tests * Remove test code * Add `MixtureOfAgentsLLM` docstring example * Add alias for runtime parameters names

* Add warning about having to install specific `fsspec` implementation * Remove unused stuff * Add documentation for `num_generations` and `group_generations` attributes * Remove unused image * Update docs/sections/how_to_guides/basic/task/index.md Co-authored-by: Agus <[email protected]> --------- Co-authored-by: Agus <[email protected]>

plaguss and others added 2 commits May 20, 2024 16:23

Prepare branch for v1.2.0

fac5fdc

Add prometheus.md (#656)

7ba83d2

* Fix anchor in `structured_generation.md` * Fix reference to `ultrafeedback.md` in `argilla.md` * Add `prometheus.md` * Apply suggestions from code review Co-authored-by: Agus <[email protected]> --------- Co-authored-by: Agus <[email protected]>

alvarobartt added the release label May 22, 2024

alvarobartt added this to the 1.2.0 milestone May 22, 2024

alvarobartt and others added 11 commits May 22, 2024 08:33

Merge branch 'main' into develop

942cacd

[DOCS] Update theme styles and images (#667)

6eddaf8

* include svg logos * update badge font * fix dark mode logo * update theme color and remove scrollbar bg * remove file

Fix circular import due to DISTILABEL_METADATA_KEY (#675)

c9623fc

* Fix circular import due to DISTILABEL_METADATA_KEY * Update src/distilabel/distiset.py Co-authored-by: Alvaro Bartolome <[email protected]> --------- Co-authored-by: Alvaro Bartolome <[email protected]>

Deprecate conversation support in TextGeneration in favour of `Chat…

bce7da1

…Generation` (#676) * Deprecate conversation support in `TextGeneration` * Fix linting issue from `develop` merge

Fix docs of saving/loading distiset from disk (#679)

37f970e

Add codspeed benchmarks (#674)

0dc464e

* Add `codspeed` benchmarks * Make the test lighter * Make test ultra light * Use `python==3.12` for `codspeed` * Add concurrency config for `codspeed` workflow

gabrielmbmb and others added 14 commits May 31, 2024 14:27

Use pytest decorator for benchmark

1624b1e

Fix InferenceEndpointsLLM not using cached token (#690)

e61b598

* Fix `RuntimeError` closing event loop if not created by `AsyncLLM` * Update `InferenceEndpointsLLM` so it uses cached token * Fix test

Fix prepend batches (#696)

062f4fb

* Add `built_batches` attribute * Fix saving `built_batches` and tests

Add citation in README to simplify citing from academia (#712)

e5320a3

Move navigation to top bar (#708)

f7eef99

* Move navigation to top tabs instead of left side and include links to socials * Change site name to Distilabel Docs * Update fonts to use argilla ones

Add set -e to install_dependencies.sh (#713)

893cfa3

`set -e` will exit on every non-zero status code

Add context to guide the generate sentence pair task if informed (#706)

23b3b41

* Add context to guide the generate sentence pair task if informed * Include example of how to add context to generate sentence pairs * Invert order of anchor/context in prompt template

Add examples to the LLMs to be shown in the components gallery (#714)

1d53ee8

* Add example for TransformersLLM * Add examples in the LLMs docstrings * Fix typo from code review

alvarobartt and others added 10 commits June 12, 2024 12:34

Fix AzureOpenAILLM load method setting the correct path to mock the i…

3822c73

…nternal class (#725)

Components examples steps (#715)

ae6d7fa

* Update highlight colors to match the alembics elixir * Add examples for the combine step * Add examples of the steps for the components gallery

Add basic examples for tasks to show in the components gallery (#724)

ce8dde8

Update typing API reference (#729)

9d63f4a

docs: 730 docs add an index to the guide overview (#731)

806fd57

* Add index page to how-to guides * Apply suggestions from code review Co-authored-by: burtenshaw <[email protected]> --------- Co-authored-by: burtenshaw <[email protected]>

Update README.md

9d6a152

Remove emojis and connect steps via rshift instead
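"Connecting via rshift" refers to Python's `>>` operator, which a class can overload through `__rshift__`. The sketch below shows the mechanism with a toy `Node` class. `Node` and its `connect` method are hypothetical stand-ins written for this example, not `distilabel`'s `Step` implementation.

```python
from typing import List


class Node:
    """Toy stand-in for a pipeline step (hypothetical, for illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.successors: List["Node"] = []

    def connect(self, other: "Node") -> None:
        # Record `other` as a downstream node of this one.
        self.successors.append(other)

    def __rshift__(self, other: "Node") -> "Node":
        # `a >> b` connects a to b and returns b, so `a >> b >> c` chains.
        self.connect(other)
        return other


load, generate, keep = Node("load"), Node("generate"), Node("keep")
load >> generate >> keep  # same effect as load.connect(generate); generate.connect(keep)
```

Returning the right-hand operand is what makes chaining work: `load >> generate >> keep` evaluates as `(load >> generate) >> keep`, and the parenthesised part evaluates to `generate`.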

alvarobartt assigned alvarobartt, gabrielmbmb and plaguss Jun 14, 2024

gabrielmbmb and others added 2 commits June 18, 2024 12:00

gabrielmbmb force-pushed the develop branch from 9ea6d2e to 356a4a3 Compare June 18, 2024 10:09

alvarobartt and others added 3 commits June 18, 2024 14:00

Add examples/arena_hard.py and remove from distilabel core (#741)

6bf14d0

* Remove `arena-hard` extras * Remove `ArenaHard` and `ArenaHardResults` * Add `examples/arena_hard.py` * Add `arena_hard.py` example to `docs` * Remove files included due to merge conflict

Add serving LLM docs (#742)

2430e62

Merge branch 'main' into develop

63ee8c5

gabrielmbmb marked this pull request as ready for review June 18, 2024 12:26

gabrielmbmb merged commit 3910aca into main Jun 18, 2024
13 checks passed

gabrielmbmb deleted the develop branch June 18, 2024 12:36
