Merge branch 'develop' into argilla-2.0
alvarobartt committed Jun 12, 2024
2 parents d77dd11 + 0e8c752 commit d6f7131
Showing 44 changed files with 4,944 additions and 102 deletions.
12 changes: 12 additions & 0 deletions README.md
@@ -153,3 +153,15 @@ If you build something cool with `distilabel` consider adding one of these badge

To directly contribute with `distilabel`, check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose).

## Citation

```bibtex
@misc{distilabel-argilla-2024,
author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/argilla-io/distilabel}}
}
```
2 changes: 1 addition & 1 deletion docs/index.md
@@ -1,7 +1,7 @@
---
description: Distilabel is an AI Feedback (AIF) framework for building datasets with and for LLMs.
hide:
- toc
- navigation
---

<style>.md-typeset h1, .md-content__button { display: none;}</style>
5 changes: 4 additions & 1 deletion docs/stylesheets/extra.css
@@ -1,7 +1,10 @@
@import url('https://fonts.googleapis.com/css2?family=Inter:[email protected]&display=swap');

:root {
--md-primary-fg-color: #84b0c1;
--md-primary-fg-color--light: #84b0c1;
--md-primary-fg-color--dark: #84b0c1;
--md-text-font: "Inter";
}
[data-md-color-scheme="default"] {
--md-primary-fg-color: #000000;
@@ -16,4 +19,4 @@

.md-sidebar__scrollwrap:focus-within, .md-sidebar__scrollwrap:hover {
scrollbar-color: var(--md-default-fg-color--lighter) #0000;
}
}
Binary file added docs/stylesheets/fonts/FontAwesome.otf
Binary file added docs/stylesheets/fonts/fontawesome-webfont.eot
2,671 changes: 2,671 additions & 0 deletions docs/stylesheets/fonts/fontawesome-webfont.svg
Binary file added docs/stylesheets/fonts/fontawesome-webfont.ttf
Binary file added docs/stylesheets/fonts/fontawesome-webfont.woff
Binary file added docs/stylesheets/fonts/fontawesome-webfont.woff2
144 changes: 77 additions & 67 deletions mkdocs.yml
@@ -1,5 +1,5 @@
# Project information
site_name: distilabel
site_name: Distilabel Docs
site_url: https://argilla-io.github.io/distilabel
site_author: Argilla, Inc.
site_description: Distilabel is an AI Feedback (AIF) framework for building datasets with and for LLMs.
@@ -11,6 +11,15 @@ repo_url: https://github.com/argilla-io/distilabel
extra:
version:
provider: mike
social:
- icon: fontawesome/brands/linkedin
link: https://www.linkedin.com/company/argilla-io
- icon: fontawesome/brands/x-twitter
link: https://twitter.com/argilla_io
- icon: fontawesome/brands/youtube
link: https://www.youtube.com/channel/UCAIz8TmvQQrLqbD7sd-5S2A
- icon: fontawesome/brands/slack
link: https://join.slack.com/t/rubrixworkspace/shared_invite/zt-20wllqq29-Z11~kp2SeFYjJ0qevJRiPg

extra_css:
- stylesheets/extra.css
@@ -29,6 +38,7 @@ theme:
features:
- navigation.sections # Sections are included in the navigation on the left.
# - toc.integrate # # Table of contents is integrated on the left; does not appear separately on the right.
- navigation.tabs
- header.autohide # header disappears as you scroll
- content.code.copy
- content.code.annotate
@@ -118,74 +128,74 @@ plugins:
add_after_page: Learn

nav:
- Introduction: "index.md"
- Distilabel: "index.md"
- Getting started:
- Installation: "sections/installation.md"
- How-to-Guide: "sections/how_to_guide.md"
- Installation: "sections/installation.md"
- How-to-Guide: "sections/how_to_guide.md"
- Learn:
- "sections/learn/index.md"
- Tutorial:
- "sections/learn/tutorial/index.md"
- Step:
- "sections/learn/tutorial/step/index.md"
- GeneratorStep: "sections/learn/tutorial/step/generator_step.md"
- GlobalStep: "sections/learn/tutorial/step/global_step.md"
- Task:
- "sections/learn/tutorial/task/index.md"
- GeneratorTask: "sections/learn/tutorial/task/generator_task.md"
- LLM: "sections/learn/tutorial/llm/index.md"
- Pipeline: "sections/learn/tutorial/pipeline/index.md"
- CLI: "sections/learn/tutorial/cli/index.md"
- Advanced:
- "sections/learn/advanced/index.md"
- Argilla: "sections/learn/advanced/argilla.md"
- Caching: "sections/learn/advanced/caching.md"
- Distiset: "sections/learn/advanced/distiset.md"
- Structured Generation: "sections/learn/advanced/structured_generation.md"
- Using the file system to pass batch data: "sections/learn/advanced/fs_to_pass_data.md"
- "sections/learn/index.md"
- Tutorial:
- "sections/learn/tutorial/index.md"
- Step:
- "sections/learn/tutorial/step/index.md"
- GeneratorStep: "sections/learn/tutorial/step/generator_step.md"
- GlobalStep: "sections/learn/tutorial/step/global_step.md"
- Task:
- "sections/learn/tutorial/task/index.md"
- GeneratorTask: "sections/learn/tutorial/task/generator_task.md"
- LLM: "sections/learn/tutorial/llm/index.md"
- Pipeline: "sections/learn/tutorial/pipeline/index.md"
- CLI: "sections/learn/tutorial/cli/index.md"
- Advanced:
- "sections/learn/advanced/index.md"
- Argilla: "sections/learn/advanced/argilla.md"
- Caching: "sections/learn/advanced/caching.md"
- Distiset: "sections/learn/advanced/distiset.md"
- Structured Generation: "sections/learn/advanced/structured_generation.md"
- Using the file system to pass batch data: "sections/learn/advanced/fs_to_pass_data.md"
- Pipeline Samples:
- "sections/pipeline_samples/index.md"
- Examples: "sections/pipeline_samples/examples/index.md"
- Papers:
- "sections/pipeline_samples/papers/index.md"
- DEITA: "sections/pipeline_samples/papers/deita.md"
- Instruction Backtranslation: "sections/pipeline_samples/papers/instruction_backtranslation.md"
- Prometheus 2: "sections/pipeline_samples/papers/prometheus.md"
- UltraFeedback: "sections/pipeline_samples/papers/ultrafeedback.md"
- "sections/pipeline_samples/index.md"
- Examples: "sections/pipeline_samples/examples/index.md"
- Papers:
- "sections/pipeline_samples/papers/index.md"
- DEITA: "sections/pipeline_samples/papers/deita.md"
- Instruction Backtranslation: "sections/pipeline_samples/papers/instruction_backtranslation.md"
- Prometheus 2: "sections/pipeline_samples/papers/prometheus.md"
- UltraFeedback: "sections/pipeline_samples/papers/ultrafeedback.md"
- FAQ: "sections/faq.md"
- API Reference:
- Pipeline:
- "api/pipeline/index.md"
- Routing Batch Function: "api/pipeline/routing_batch_function.md"
- Typing: "api/pipeline/typing.md"
- Utils: "api/pipeline/utils.md"
- Step:
- "api/step/index.md"
- GeneratorStep: "api/step/generator_step.md"
- GlobalStep: "api/step/global_step.md"
- "@step": "api/step/decorator.md"
- Step Gallery:
- Argilla: "api/step_gallery/argilla.md"
- Columns: "api/step_gallery/columns.md"
- Extra: "api/step_gallery/extra.md"
- Task:
- "api/task/index.md"
- GeneratorTask: "api/task/generator_task.md"
- Task Gallery: "api/task_gallery/index.md"
- LLM:
- "api/llm/index.md"
- LLM Gallery:
- Anthropic: "api/llm/anthropic.md"
- Anyscale: "api/llm/anyscale.md"
- Azure (via OpenAI): "api/llm/azure.md"
- Groq: "api/llm/groq.md"
- Hugging Face: "api/llm/huggingface.md"
- LiteLLM: "api/llm/litellm.md"
- llama.cpp: "api/llm/llamacpp.md"
- Mistral: "api/llm/mistral.md"
- Ollama: "api/llm/ollama.md"
- OpenAI: "api/llm/openai.md"
- Together AI: "api/llm/together.md"
- Google Vertex AI: "api/llm/vertexai.md"
- vLLM: "api/llm/vllm.md"
- CLI: "api/cli.md"
- Pipeline:
- "api/pipeline/index.md"
- Routing Batch Function: "api/pipeline/routing_batch_function.md"
- Typing: "api/pipeline/typing.md"
- Utils: "api/pipeline/utils.md"
- Step:
- "api/step/index.md"
- GeneratorStep: "api/step/generator_step.md"
- GlobalStep: "api/step/global_step.md"
- "@step": "api/step/decorator.md"
- Step Gallery:
- Argilla: "api/step_gallery/argilla.md"
- Columns: "api/step_gallery/columns.md"
- Extra: "api/step_gallery/extra.md"
- Task:
- "api/task/index.md"
- GeneratorTask: "api/task/generator_task.md"
- Task Gallery: "api/task_gallery/index.md"
- LLM:
- "api/llm/index.md"
- LLM Gallery:
- Anthropic: "api/llm/anthropic.md"
- Anyscale: "api/llm/anyscale.md"
- Azure (via OpenAI): "api/llm/azure.md"
- Groq: "api/llm/groq.md"
- Hugging Face: "api/llm/huggingface.md"
- LiteLLM: "api/llm/litellm.md"
- llama.cpp: "api/llm/llamacpp.md"
- Mistral: "api/llm/mistral.md"
- Ollama: "api/llm/ollama.md"
- OpenAI: "api/llm/openai.md"
- Together AI: "api/llm/together.md"
- Google Vertex AI: "api/llm/vertexai.md"
- vLLM: "api/llm/vllm.md"
- CLI: "api/cli.md"
2 changes: 2 additions & 0 deletions scripts/install_dependencies.sh
@@ -1,5 +1,7 @@
#!/bin/bash

set -e

python_version=$(python -c "import sys; print(sys.version_info[:2])")

python -m pip install uv
7 changes: 7 additions & 0 deletions src/distilabel/distiset.py
@@ -33,6 +33,7 @@
size_categories_parser,
)
from distilabel.utils.files import list_files_in_dir
from distilabel.utils.huggingface import get_hf_token

DISTISET_CONFIG_FOLDER: Final[str] = "distiset_configs"
PIPELINE_CONFIG_FILENAME: Final[str] = "pipeline.yaml"
@@ -81,7 +82,13 @@ def push_to_hub(
Whether to generate a dataset card or not. Defaults to True.
**kwargs:
Additional keyword arguments to pass to the `push_to_hub` method of the `datasets.Dataset` object.
Raises:
ValueError: If no token is provided and couldn't be retrieved automatically.
"""
if token is None:
token = get_hf_token(self.__class__.__name__, "token")

for name, dataset in self.items():
dataset.push_to_hub(
repo_id=repo_id,
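For context, a minimal, hedged usage sketch of the new token fallback in `Distiset.push_to_hub` (not part of the commit): the repository id is a placeholder, and constructing a `Distiset` directly from a mapping of config names to datasets is an assumption based on the `self.items()` loop shown above.

```python
# Hypothetical sketch, not part of this commit.
from datasets import Dataset
from distilabel.distiset import Distiset

# Assumption: a Distiset can be built as a mapping of config names to datasets,
# mirroring the `for name, dataset in self.items()` loop in the diff above.
distiset = Distiset({"default": Dataset.from_dict({"instruction": ["Hello world!"]})})

# Explicit token, as before.
distiset.push_to_hub(repo_id="username/my-distiset", token="hf_...")

# With this change the token may be omitted: it is resolved via `get_hf_token`,
# and a ValueError is raised if it cannot be retrieved automatically.
distiset.push_to_hub(repo_id="username/my-distiset")
```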
44 changes: 44 additions & 0 deletions src/distilabel/llms/anthropic.py
@@ -73,6 +73,50 @@ class AnthropicLLM(AsyncLLM):
- `timeout`: the maximum time in seconds to wait for a response. Defaults to `600.0`.
- `max_retries`: the maximum number of times to retry the request before failing.
Defaults to `6`.
Examples:
Generate text:
```python
from distilabel.llms import AnthropicLLM
llm = AnthropicLLM(model="claude-3-opus-20240229", api_key="api.key")
llm.load()
# Synchronous request
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
# Asynchronous request
output = await llm.agenerate(input=[{"role": "user", "content": "Hello world!"}])
```
Generate structured data:
```python
from pydantic import BaseModel
from distilabel.llms import AnthropicLLM
class User(BaseModel):
name: str
last_name: str
id: int
llm = AnthropicLLM(
model="claude-3-opus-20240229",
api_key="api.key",
structured_output={"schema": User}
)
llm.load()
# Synchronous request
output = llm.generate(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
# Asynchronous request
output = await llm.agenerate(input=[{"role": "user", "content": "Create a user profile for the following marathon"}])
```
"""

model: str
18 changes: 18 additions & 0 deletions src/distilabel/llms/anyscale.py
@@ -38,6 +38,24 @@ class AnyscaleLLM(OpenAILLM):
`None` if not set.
_api_key_env_var: the name of the environment variable to use for the API key.
It is meant to be used internally.
Examples:
Generate text:
```python
from distilabel.llms import AnyscaleLLM
llm = AnyscaleLLM(model="google/gemma-7b-it", api_key="api.key")
llm.load()
# Synchronous request
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
# Asynchronous request
output = await llm.agenerate(input=[{"role": "user", "content": "Hello world!"}])
```
"""

base_url: Optional[RuntimeParameter[str]] = Field(
63 changes: 63 additions & 0 deletions src/distilabel/llms/azure.py
@@ -46,6 +46,69 @@ class AzureOpenAILLM(OpenAILLM):
Icon:
`:simple-microsoftazure:`
Examples:
Generate text:
```python
from distilabel.llms import AzureOpenAILLM
llm = AzureOpenAILLM(model="gpt-4-turbo", api_key="api.key")
llm.load()
# Synchronous request
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
# Asynchronous request
output = await llm.agenerate(input=[{"role": "user", "content": "Hello world!"}])
```
Generate text from a custom endpoint following the OpenAI API:
```python
from distilabel.llms import AzureOpenAILLM
llm = AzureOpenAILLM(
model="prometheus-eval/prometheus-7b-v2.0",
base_url=r"http://localhost:8080/v1"
)
llm.load()
# Synchronous request
output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
# Asynchronous request
output = await llm.agenerate(input=[{"role": "user", "content": "Hello world!"}])
```
Generate structured data:
```python
from pydantic import BaseModel
from distilabel.llms import AzureOpenAILLM
class User(BaseModel):
name: str
last_name: str
id: int
llm = AzureOpenAILLM(
model="gpt-4-turbo",
api_key="api.key",
structured_output={"schema": User}
)
llm.load()
# Synchronous request
output = llm.generate(inputs=[[{"role": "user", "content": "Create a user profile for the following marathon"}]])
# Asynchronous request
output = await llm.agenerate(input=[{"role": "user", "content": "Create a user profile for the following marathon"}])
```
"""

base_url: Optional[RuntimeParameter[str]] = Field(