Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distilabel v1.2.0 #659

Merged
merged 42 commits into from
Jun 18, 2024
Merged

distilabel v1.2.0 #659

merged 42 commits into from
Jun 18, 2024

Conversation

alvarobartt
Copy link
Member

@alvarobartt alvarobartt commented May 22, 2024

No description provided.

plaguss and others added 2 commits May 20, 2024 16:23
* Fix anchor in `structured_generation.md`

* Fix reference to `ultrafeedback.md` in `argilla.md`

* Add `prometheus.md`

* Apply suggestions from code review

Co-authored-by: Agus <[email protected]>

---------

Co-authored-by: Agus <[email protected]>
@alvarobartt alvarobartt added this to the 1.2.0 milestone May 22, 2024
alvarobartt and others added 11 commits May 22, 2024 08:33
* Use `orjson` to serialize

* Fix `KeyError` when leaf step didn't produce any data

* Add `get_data` method

* Update save and load methods for `_BatchManager` so is fast

* Fix no cache after receiving batch from generator step

* Override `from_json` instead of `from_dict`

* Add `cache` and `load_from_cache` methods

* Update unit tests

* Add missing unit tests

* Fix key has to be `str`
* include svg logos

* update badge font

* fix dark mode logo

* update theme color and remove scrollbar bg

* remove file
* Fix circular import due to DISTILABEL_METADATA_KEY

* Update src/distilabel/distiset.py

Co-authored-by: Alvaro Bartolome <[email protected]>

---------

Co-authored-by: Alvaro Bartolome <[email protected]>
…Generation` (#676)

* Deprecate conversation support in `TextGeneration`

* Fix linting issue from `develop` merge
* Add functionality to load/save distisets to/from disk

* Add tests for saving/loading distiset from disk

* Add functionality to load/save distisets to/from disk

* Update docs

* Include code blocks from Examples in docstrings

* Add tests for the dataset card

* Fix call to yaml.safe_load found in code review

* Copy path movements from hugging face load_from_disk definition

* Add universal_pathlib dependency to better deal with remote paths when calling Distiset.load_from_disk

* Fix download of distiset and add option to write the data to a user specified dir

* Remove parameter in test as it isn't really tested with a remote filesystem

* Remove unnecessary markdown extension and fix type from variables

* Update src/distilabel/distiset.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/distiset.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/distiset.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/distiset.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Cast Path to str

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* New module for the integration with instructor

* Mode common functions related to structured outputs to it's own module

* Draft instructor integration with openai

* Add tests for openai integration

* Add unit tests for the instructor integrations

* Add tests for anthropic integration

* Fix including anthropic wrapper

* Update llms to deal with instructor

* Update dependencies with instructor

* Run tests with instructor only on python>=3.9

* Fix circular import with create_distiset

* Define _prepare_structured_output as staticmethod

* Remove rewritten variable

* Remove dead code

* Check on Enum.value instead of Enum class as it isn't pickleable

* Add tests for utilities related to generation of BaseModel objects from json schema dicts

* Add fix to deal with nested BaseModel objects

* Fix call from instructor, this should be done on instructor end, but works for the moment

* Add docstirngs and typing info

* Add script to generate a sample dataset and visualize the result

* Update the docstring of the structured output expected format

* Add reference in the docs to structured outputs with instructor

* Add reference to the dependency installation

* Update typing info

* Fix test with new mocked client for mistral

* Update docs/sections/learn/advanced/structured_generation.md

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update docs/sections/learn/advanced/structured_generation.md

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/steps/tasks/structured_outputs/instructor.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/steps/tasks/structured_outputs/instructor.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/steps/tasks/structured_outputs/utils.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/steps/tasks/structured_outputs/instructor.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/steps/tasks/structured_outputs/utils.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/steps/tasks/structured_outputs/utils.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/steps/tasks/structured_outputs/utils.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Add changes from code review

* Fix type hint per code review

* Update docs/sections/learn/advanced/structured_generation.md

Co-authored-by: Alvaro Bartolome <[email protected]>

* Remove repeated line

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
Co-authored-by: Alvaro Bartolome <[email protected]>
* Refactor `BasePipeline.__init__` method

* Setup `_WriteBuffer` in base and do not cache if `dry_run`

* Add `write_batch` method

* Add methods to write and read from filesystem

* Make `log_queue` optional

* Update unit tests

* Add docs for passing data with file system

* Add integration tests for fs

* Remove integration tests timeout

* Make integration test lighter

* Remove test

* Update `mkdocs.yml`

* Verbose

* Add `tmate` for debugging

* Clean handlers only if not pytest

* Fix loading batch manager

* Add testing fs to pass data again

This reverts commit 9246c4c.

* Remove verbose

* Fix test

* Increase tests timeout

* `join` after `terminate`

* Set thread daemon

* Proper termination of manager and pool

* Fix pytest hanging because of queue was never closed and the thread of
the queue was being kept alive (frustration)

* Terminate pool first and call `stop_logging`

* Remove `protocol`

Co-authored-by: plaguss <[email protected]>

---------

Co-authored-by: plaguss <[email protected]>
* Add Python 3.12

* Add `install_dependencies.sh` script

* Update to `ruff==0.4.5`

* Apply format

* Update commands

* Update to `argilla >= 1.29.0`

* Update to setup tmate in 3.12

* Update `vllm` dependency

* Use `uv` to install dependencies

* Update dependencies

* Fix regex message for 3.12
* Add `codspeed` benchmarks

* Make the test lighter

* Make test ultra light

* Use `python==3.12` for `codspeed`

* Add concurrency config for `codspeed` workflow
Copy link

codspeed-hq bot commented May 31, 2024

CodSpeed Performance Report

Merging #659 will not alter performance

Comparing develop (44bd633) with develop (ff3f484)

Summary

✅ 1 untouched benchmarks

gabrielmbmb and others added 14 commits May 31, 2024 14:27
…ceEndpointsLLM` (#680)

* Fix linting issue from `develop` branch

* Add `grammar` arg in `agenerate` (WIP)

* Run `codespell` in `src/` and `docs/`

* Add support for `StructuredGeneration` (WIP)

- Now the `generate` method in the `LLM` can receive either a chat or a tuple with the chat and the grammar for that chat
- `grammar` is an arg at `LLM` level
- The `grammar` can be specified per row via the `StructuredGeneration`, while when specifying a global `grammar` then the `grammar` arg within the `LLM` can be used via the `TextGeneration` task instead

* Add `flatten_dict` to avoid `pyarrow` issues with nested dicts

* Handle `pyarrow.lib.ArrowInvalid` when nested unaligned dicts

* Add `StructuredGeneration` docstrings

* Fix `TextGeneration` docstring for `model_name` output

* Rename `DefaultInput` to `StandardInput` and add missing docstrings

* Update `LLM` subclasses type-hints

* Add `StructuredGeneration` import in `distilabel.steps.tasks`

* Add `InferenceEndpointsLLM` and `StructuredGeneration`
* Fix `RuntimeError` closing event loop if not created by `AsyncLLM`

* Update `InferenceEndpointsLLM` so it uses cached token

* Fix test
* Add `GenerateSentencePair` task

* Update task to use system prompt

* Fix `setup_logging` file location

* Update `add_raw_output` to be `RuntimeParamater` and `True` by default

* Fix system prompt for negative sentences

* Add `GenerateSentencePair` unit tests

* Fix unit tests after updating `add_raw_output`

* Update docs to mention `add_raw_output` attribute

* Update `add_raw_output` description

Co-authored-by: alvarobartt <[email protected]>

* Fix columns

Co-authored-by: alvarobartt <[email protected]>

* Add missing docstrings

* Fix tests

* Add `answer` generation action

* Fix examples not being correctly rendered

* Add examples

---------

Co-authored-by: alvarobartt <[email protected]>
* Add `built_batches` attribute

* Fix saving `built_batches` and tests
…onse` in template (#703)

* Remove selecting final index from `response` for `EvolQuality.apply_mutation_template`

* Add `_apply_random_mutation` unit test

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
…ifferent formats (#691)

* Add a GeneratorStep to read files from disk as datasets

* Add tests for the new LoadFromDisk loader

* Refactor generator step classes to new naming

* Add deprecation warnings for previous loaders

* Add assertion to remind removing the deprecated classes

* Add docstrings for the new steps

* Apply comments from code review and update dataset info read using exposed function from datasets

* Fix dataloader tests with new class names

* Fix import tests
* Move navigation to top tabs instead of left side and include links to socials

* Change site name to Distilabel Docs

* Update fonts to use argilla ones
`set -e` will exit on every non-zero status code
* Add context to guide the generate sentence pair task if informed

* Include example of how to add context to generate sentence pairs

* Invert order of anchor/context in prompt template
* Add example for TransformersLLM

* Add examples in the LLMs docstrings

* Fix typo from code review
…en is None. (#707)

* Add a way to automatically gather the HF_TOKEN when calling distiset.push_to_hub and mode constant value to distilabel.utils module

* Update src/distilabel/distiset.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Refactor function to obtain huggingface token and move it to it's module

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Set `input` as optional in `format_output`

* Implement "Improving Text Embeddings with LLMs" (WIP)

* Implement "Improving Text Embeddings with LLMs" (WIP)

* Add `model_name` at the end of each batch

* Move `text_embeddings.py` to `improving_text_embeddings.py`

* Fix `re.sub` to also capture `\t` and `\r`

* Add `MonolingualTripletGenerator` and `BitextRetrievalGenerator`

* Move all `templates` from `str` to `jinja2` files

* Update class naming and imports

* Add some docstrings and fix `jinja2` file paths

* Fix `prompt` accross tasks

* Add missing docstrings

* Fix `process` method in `EmbeddingTaskGenerator`

* Add unit tests for `...Generator` tasks

* Add remaining unit tests

* Remove duplicated imports in `distilabel.steps.tasks`

* Add examples in docstrings and add notes
alvarobartt and others added 10 commits June 12, 2024 12:34
* Fix `ChatGeneration.format_input` exception

* Bump `datasets` to 2.16.0 or higher

To be able to efficiently use the cache via `load_dataset` whenever there's no connection

* Add `benchmarks/arena_hard.py` (WIP)

* Update `_get_hf_dataset_info` error message

* Add `ArenaHard` and `ArenaHardResults` docstrings

* Catch `ImportError` on `ArenaHardResults.load`

* Add `ArenaHard` and `ArenaHardEval` imports

* Add `arena-hard` extras

* Install `arena-hard` extra in `test.yml`

* Update `arena-hard` extra dependencies

* Fix circular import in `arena_hard.py`

* Apply suggestions from code review

* Add missing examples in docstrings & fix type-hints

* Add some future TODOs

* Update `install_dependencies.sh` to install `arena-hard` extra

* Add `ArenaHard` and `ArenaHardResults` unit tests
* Move classes to different files

* Add `_send_to_step` abstractmethod

* Move `_request_initial_batches` method to `BasePipeline`

* Move `_notify_step_to_stop` method to `BasePipeline`

* Move `_handle_batch_on_stop` method to `BasePipeline`

* Move `LAST_BATCH_FLAG_SENT` constant

* Move `_request_more_batches_if_needed` method to `BasePipeline`

* Move `_register_batch` method to `BasePipeline`

* Move `_get_successors` method to `BasePipeline`

* Move `_get_step_from_batch` method to `BasePipeline`

* Move `_manage_batch_flow` method to `BasePipeline`

* Add `_get_from_step` abstract method

* Add `_add_batches_back_to_batch_manager` method

* Add `_consume_output_queue` method

* Add `_create_step_input_queue` method

* Add `_run_step` abstract method

* Move `_handle_keyboard_interrupt` method

* Add `_load_queue`

* Add `_init_steps_load_status` method

* Move `_all_steps_loaded` method

* Move `_check_step_not_loaded_or_finished` method

* Move `_handle_stop` method

* Move `_run_output_queue_loop` method

* Remove unused variables

* Fix unit tests

* Remove shared dict info and update `CudaDevicePlacementMixin`

* Add `unload` method

* Add `portalocker` dependency

* Add missing unload

* Add `_OLD_IMPORT_MODULE_ATTR` dict

* Fix `override` import

* Remove log message

* Add missing call to `unload`
* Update highlight colors to match the alembics elixir

* Add examples for the combine step

* Add examples of the steps for the components gallery
…n a dataset (#688)

* Fix linting issue from `develop` branch

* Add `grammar` arg in `agenerate` (WIP)

* Run `codespell` in `src/` and `docs/`

* Add support for `StructuredGeneration` (WIP)

- Now the `generate` method in the `LLM` can receive either a chat or a tuple with the chat and the grammar for that chat
- `grammar` is an arg at `LLM` level
- The `grammar` can be specified per row via the `StructuredGeneration`, while when specifying a global `grammar` then the `grammar` arg within the `LLM` can be used via the `TextGeneration` task instead

* Add `flatten_dict` to avoid `pyarrow` issues with nested dicts

* Handle `pyarrow.lib.ArrowInvalid` when nested unaligned dicts

* Update grammar argument to structured_output for consistency

* Update tests to check the structured outputs on serialization

* Add tests for the structured generation class

* Update typing and testing according to instructor or outlines structured output parameters

* Fix passing grammar via structured outputs

* Remove debug log

* Add tests for the new minibatches of the structured generation

* Update outlines based structured generation

* Update tests with new keyword for structured generation

* Update api based llms to run with structured generation

* Fix tests after refactor

* Fix test after refactor

* Fix vllm batch sorting mechanism

* Fix error on vllm with sorting bathces

---------

Co-authored-by: Alvaro Bartolome <[email protected]>
* Update mkdocs version

* Update align documentation with argilla SDK 2.0

* Updated naming of basices
Moved CLI to advanced

* Delete unneeded index pages

* Update naming

* Update navigation and content edit

* Update naming of How to guides

* Add popular issue and community page

* Update GITHUB_ACCESS_TOKEN to GH_ACCESS_TOKEN due to protected naming

* Update scoped reqs for token

* Add GH_ACCESS_TOKEN to workflow

* Delete literate nav

* Update jinja templates to hide unrendered navigation

* Update navigation orderin for API reference

* Update docs/sections/how_to_guides/advanced/structured_generation.md

Co-authored-by: Alvaro Bartolome <[email protected]>

* docs: prose in guides (#721)

* docs: make argilla prose talk about argilla

* docs: simplify prose in generator and global steps

* Update LLM page

* Update LLM docs

* Update Pipeline docs

* Avoid using "function"

* Update Step documentation

* Update docs/sections/how_to_guides/basic/step/index.md

Co-authored-by: Alvaro Bartolome <[email protected]>

* Update docs/sections/how_to_guides/basic/step/index.md

Co-authored-by: Alvaro Bartolome <[email protected]>

* Update `Task` page

* Update definiton of `GeneratorTask`

* Update Step documentation

* Update advanced documentation

---------

Co-authored-by: davidberenstein1957 <[email protected]>
Co-authored-by: Agus <[email protected]>
Co-authored-by: Alvaro Bartolome <[email protected]>

* Make `GH_ACCESS_TOKEN` optional

* Add `pandas>=2.0` to `docs`

* Fix typo

* Update default signature GeneratorStep

* Update missing `mkdocs_autorefs` within API reference

* Update API page

* Update CHATML_TEMPLATE formatting to avoid autodoc issues

* Add reference to token scopes required

---------

Co-authored-by: Alvaro Bartolome <[email protected]>
Co-authored-by: burtenshaw <[email protected]>
Co-authored-by: Agus <[email protected]>
Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Add index page to how-to guides

* Apply suggestions from code review

Co-authored-by: burtenshaw <[email protected]>

---------

Co-authored-by: burtenshaw <[email protected]>
Remove emojis and connect steps via rshift instead
gabrielmbmb and others added 2 commits June 18, 2024 12:00
* Update `RuntimeParametersMixin` to handle `list`s

* Check if `generation_kwargs` is present

* Add `get_generation_kwargs` method

* Add `_num_generation_param_supported` attribute to avoid code
duplication

* Refactor `OllamaLLM` and `VertexAILLM`

* Add `MixtureOfAgents` llm

* Add docstrings

* Fix unit tests

* Update docstrings

* Fix missing #

* Update Arena Hard tasks docstrings

* Fix cross-reference

* Update unit tests

* Rename to `MixtureOfAgentsLLM`

* Update `_extra_serializable_fields` to work with `List[_Serializable]`
attributes

* Remove `from_dict` method for `Step`

* Update to render list

* Update handling `List[RuntimeParametersMixin]` attributes

* Fix unit tests

* Remove test code

* Add `MixtureOfAgentsLLM` docstring example

* Add alias for runtime parameters names
* Add warning about having to install specific `fsspec` implementation

* Remove unused stuff

* Add documentation for `num_generations` and `group_generations`
attributes

* Remove unused image

* Update docs/sections/how_to_guides/basic/task/index.md

Co-authored-by: Agus <[email protected]>

---------

Co-authored-by: Agus <[email protected]>
alvarobartt and others added 3 commits June 18, 2024 14:00
* Remove `arena-hard` extras

* Remove `ArenaHard` and `ArenaHardResults`

* Add `examples/arena_hard.py`

* Add `arena_hard.py` example to `docs`

* Remove files included due to merge conflict
@gabrielmbmb gabrielmbmb marked this pull request as ready for review June 18, 2024 12:26
@gabrielmbmb gabrielmbmb merged commit 3910aca into main Jun 18, 2024
13 checks passed
@gabrielmbmb gabrielmbmb deleted the develop branch June 18, 2024 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants