Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Refactor of structured generation and use schemas defined in a dataset #688

Merged
merged 24 commits into from
Jun 13, 2024

Conversation

plaguss
Copy link
Contributor

@plaguss plaguss commented May 31, 2024

Description

This PR updates the code to generate structured outputs either from a dataset with schemas via StructuredGeneration or via a single structured output schema via TextGeneration:

The following pipeline works with the 3 different types of LLMs (llamacpp with outlines, inference endpoints which uses outlines but abstracted, and mistral that works with instructor):

    with Pipeline(name="structured-generation") as pipeline:
        load_data = LoadDataFromDicts(
            name="load_data",
            data=[
                {
                    "instruction": "Generate a character from a RPG game.",
                    "structured_output": {
                        "format": "json",
                        "schema": Character.model_json_schema(),
                    },
                },
                {
                    "instruction": "Generate an animal from a zoo.",
                    "structured_output": {
                        "format": "json",
                        "schema": Animal.model_json_schema(),
                    },
                },
                # regex doesn't work with instructor, it's related to outlines based engines
                # {
                #     "instruction": "What's the weather like today in Seattle in Celsius degrees?",
                #     "structured_output": {
                #         "format": "regex",
                #         "schema": "(\\d{1,2})°C",
                #     },
                # },
            ],
        )
        task = StructuredGeneration(
            # llm = LlamaCppLLM(
            #     model_path=str(Path.home() / model_path),  # type: ignore
            #     n_gpu_layers=-1,
            #     n_ctx=1024,
            # ),
            # llm=InferenceEndpointsLLM(
            #     model_id="CohereForAI/c4ai-command-r-plus",
            #     tokenizer_id="CohereForAI/c4ai-command-r-plus",
            #     api_key=os.getenv("HF_API_TOKEN"),  # type: ignore
            # ),
            llm=MistralLLM(
                model="open-mixtral-8x22b",
                api_key=os.getenv("MISTRAL_API_KEY"),  # type: ignore
            ),
            use_system_prompt=False,
            output_mappings={"model_name": "generation_model"},
        )

        load_data >> task

alvarobartt and others added 10 commits May 29, 2024 08:30
- Now the `generate` method in the `LLM` can receive either a chat or a tuple with the chat and the grammar for that chat
- `grammar` is an arg at `LLM` level
- The `grammar` can be specified per row via the `StructuredGeneration`, while when specifying a global `grammar` then the `grammar` arg within the `LLM` can be used via the `TextGeneration` task instead
…to inference-endpoints-structured-gen-grammar
@plaguss plaguss marked this pull request as draft May 31, 2024 11:59
@plaguss plaguss changed the base branch from inference-endpoints-structured-gen to develop May 31, 2024 12:34
Copy link

codspeed-hq bot commented Jun 4, 2024

CodSpeed Performance Report

Merging #688 will not alter performance

Comparing inference-endpoints-structured-gen-grammar (170066a) with develop (ce8dde8)

Summary

✅ 1 untouched benchmarks

@plaguss plaguss changed the title Inference endpoints structured gen grammar [FEATURE] Inference endpoints structured gen grammar Jun 6, 2024
@plaguss plaguss marked this pull request as ready for review June 6, 2024 10:48
@plaguss plaguss added enhancement New feature or request refactor labels Jun 10, 2024
@plaguss plaguss self-assigned this Jun 10, 2024
@plaguss plaguss added this to the 1.2.0 milestone Jun 10, 2024
@plaguss plaguss requested review from alvarobartt and gabrielmbmb and removed request for alvarobartt June 11, 2024 19:21
@plaguss plaguss changed the title [FEATURE] Inference endpoints structured gen grammar [FEATURE] Refactor of structured generation and use schemas defined in a dataset Jun 11, 2024
@plaguss plaguss requested a review from alvarobartt June 11, 2024 19:23
@plaguss plaguss merged commit 2f245c6 into develop Jun 13, 2024
7 checks passed
@plaguss plaguss deleted the inference-endpoints-structured-gen-grammar branch June 13, 2024 10:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request refactor
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[FEATURE] Add more flexibility to generate structured data from multiple schemas
2 participants