Distilabel 1.1.0

Two new tasks implemented!

Genstruct task (#600)

You can now use the Genstruct task, as described in https://huggingface.co/NousResearch/Genstruct-7B, to generate synthetic instruction fine-tuning datasets from raw documents:

from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromDicts
from distilabel.steps.tasks import Genstruct

with Pipeline(name="harry-potter-genstruct") as pipeline:
    load_hub_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "title": "Harry Potter and the Sorcerer's Stone",
                "content": "An orphaned boy enrolls in a school of wizardry, where he learns the truth about himself, his family and the terrible evil that haunts the magical world.",
            },
            {
                "title": "Harry Potter and the Chamber of Secrets",
                "content": "Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.",
            },
        ],
    )

    task = Genstruct(
        name="task",
        llm=TransformersLLM(
            model="NousResearch/Genstruct-7B",
            torch_dtype="float16",
            chat_template="{{ messages[0]['content'] }}",
            device="cuda:0",
        ),
        num_generations=2,
        group_generations=False,
        output_mappings={"model_name": "model"},
    )

    keep_columns = KeepColumns(
        name="keep_columns",
        # `user` and `assistant` are the instruction/response pairs generated by `Genstruct`.
        columns=["title", "content", "user", "assistant", "model"],
    )

    load_hub_dataset >> task >> keep_columns
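
The pipeline can then be executed as usual; a minimal sketch of running it and inspecting the resulting distiset:

if __name__ == "__main__":
    # Runs the whole pipeline and returns a `Distiset` with the generated rows.
    distiset = pipeline.run()
    print(distiset)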

PrometheusEval task (#610)

A new PrometheusEval task has been added, based on the recently published paper "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models":

from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import PrometheusEval

with Pipeline(name="prometheus") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id="HuggingFaceH4/instruction-dataset",
        split="test",
        output_mappings={"prompt": "instruction", "completion": "generation"},
    )

    task = PrometheusEval(
        name="task",
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="factual-validity",
        reference=False,
        num_generations=1,
        group_generations=False,
    )
    
    load_dataset >> task
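
The task also supports pairwise rankings via mode="relative". A minimal variation of the config above; the expected input column for the candidate generations is an assumption here:

    task = PrometheusEval(
        name="task",
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="relative",  # ranks two candidates against each other (assumed input column: `generations`)
        rubric="factual-validity",
    )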

Connect the steps in the pipeline with >> (#490)

Now you can connect the steps of your pipeline using Python's right-shift operator (>>):

from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import EvolInstruct

with Pipeline(name="Pipe name") as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)
    evol_instruction_complexity_1 = EvolInstruct(
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )
    evol_instruction_complexity_2 = EvolInstruct(
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )

    combine_columns = CombineColumns(
        columns=["response"],
        output_columns=["candidates"],
    )

    (
        load_hub_dataset 
        >> [evol_instruction_complexity_1, evol_instruction_complexity_2] 
        >> combine_columns
    )

Routing batch function (#595)

Thanks to the new routing_batch_function, each batch of an upstream step can be conditionally routed to a subset of its downstream steps. In addition, we have included a sample_n_steps routing batch function, making it easier to replicate the setup of the original UltraFeedback paper (see the sketch after the example below):

import random
from distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import TextGeneration

@routing_batch_function()
def sample_two_steps(steps: list[str]) -> list[str]:
    return random.sample(steps, 2)

with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    tasks = []
    for llm in (
        OpenAILLM(model="gpt-4-0125-preview"),
        MistralLLM(model="mistral-large-2402"),
        VertexAILLM(model="gemini-1.0-pro"),
    ):
        tasks.append(
            TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
        )

    combine_generations = CombineColumns(
        name="combine_generations",
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    load_dataset >> sample_two_steps >> tasks >> combine_generations
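
The bundled sample_n_steps helper covers the same random-sampling scenario, so the custom function above could be swapped for it; a minimal sketch, assuming it is exported from distilabel.pipeline and takes the number of downstream steps to sample:

from distilabel.pipeline import sample_n_steps

# Route every batch to 2 randomly chosen downstream tasks.
sample_two_steps = sample_n_steps(2)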

Generate structured outputs using outlines (#601)

You can now generate text constrained by a JSON schema or a regular expression with TransformersLLM, LlamaCppLLM or vLLM, thanks to the integration with [outlines](https://github.com/outlines-dev/outlines):

from enum import Enum

from distilabel.llms import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, StringConstraints, conint
from typing_extensions import Annotated

class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"

class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"

class Character(BaseModel):
    name: Annotated[str, StringConstraints(max_length=30)]
    age: conint(gt=1, lt=3000)
    armor: Armor
    weapon: Weapon

with Pipeline("RPG-characters") as pipeline:
    system_prompt = (
        "You are a leading role play gamer. You have seen thousands of different characters and their attributes."
        " Please return a JSON object with common attributes of an RPG character."
    )

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": system_prompt,
                "instruction": f"Give me a character description for a {char}",
            }
            for char in ["dwarf", "elf", "human", "ork"]
        ],
    )

    text_generation = TextGeneration(
        name="text_generation_rpg",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "json", "schema": Character},
        ),
    )
    load_dataset >> text_generation
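
Regex-constrained generation follows the same pattern, passing the regular expression as the schema; a minimal sketch (the pattern itself is illustrative):

    # Constrain the generations to an HH:MM time-like pattern.
    text_generation = TextGeneration(
        name="text_generation_time",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "regex", "schema": r"(?:[01]\d|2[0-3]):[0-5]\d"},
        ),
    )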

New GroqLLM (#583)

New integration with groq, with a special mention to @kcentric, who did the initial work prior to the refactor for 1.0.0:

from distilabel.llms.groq import GroqLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-groq") as pipeline:
    ...
    text_generation_with_groq = TextGeneration(
        llm=GroqLLM(model="llama3-70b-8192"),
    )
    ...
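
As with the other API-based LLMs, the client needs credentials; a minimal sketch, assuming GroqLLM honors the GROQ_API_KEY environment variable (or an explicit api_key argument):

import os

# Hypothetical key for illustration; alternatively pass `api_key=...` to `GroqLLM`.
os.environ["GROQ_API_KEY"] = "gsk_..."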

Easily test your pipeline with a dry_run (#635)

with Pipeline(...) as pipeline:
    ...
    distiset = pipeline.dry_run(
        parameters=...,  # The same arguments as `Pipeline.run`
        batch_size=1,  # Optional, defaults to 1.
    )
[05/13/24 16:22:30] INFO     ['distilabel.pipeline.local'] 🌵 Dry run mode            local.py:103
                    INFO     ['distilabel.pipeline.local'] 📝 Pipeline data will be ...  local.py:125

Pipeline.log file is dumped to the Hugging Face repository (#568)

From now on, when you call distiset.push_to_hub, the pipeline.log file will be automatically uploaded to your dataset repository along with pipeline.yaml, to keep track of the execution.
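
Nothing changes on the caller's side; a minimal sketch, with an illustrative repository id:

distiset = pipeline.run()
# Uploads the dataset together with `pipeline.yaml` and, now, `pipeline.log`.
distiset.push_to_hub("my-username/my-synthetic-dataset")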

New distilabel_metadata column to store internal data (#586)

You can now optionally enable the addition of a metadata column. This column may store other things in the future, but for the moment it is really handy for keeping the raw output of the LLM: if a task does some post-processing via format_output, the original output is preserved, so nothing is lost.

You can enable it at the task level:

TextGeneration(..., add_raw_output=True|False)

And directly determine whether you want this column in your final Distiset:

with Pipeline(..., enable_metadata=True|False):
    ...

This way you can decide whether to keep the column or remove it altogether.
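
As a rough sketch of inspecting the new column after a pipeline has run (both the Distiset key and the exact metadata layout below are assumptions):

# `distiset` maps leaf step names to datasets; the key below is illustrative.
dataset = distiset["text_generation"]["train"]
# The raw LLM output is kept inside `distilabel_metadata`, e.g. under a
# task-specific key such as "raw_output_<task_name>" (assumed naming).
print(dataset[0]["distilabel_metadata"])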

All the changes in this release:

Full Changelog: 1.0.3...1.1.0