
Implement "Improving Text Embeddings with LLMs" #683

Merged · 19 commits · Jun 12, 2024

Conversation

@alvarobartt (Member) commented on May 30, 2024

Description

This PR implements all the tasks mentioned in the paper Improving Text Embeddings with Large Language Models, so that one can reproduce the data generation process for training embedding models with sentence-transformers.

Closes #682

Example

Below is a complete example showing all the implemented tasks and how to connect them:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks.improving_text_embeddings import (
    BitextRetrievalGenerator,
    EmbeddingTaskGenerator,
    GenerateLongTextMatchingData,
    GenerateShortTextMatchingData,
    GenerateTextClassificationData,
    GenerateTextRetrievalData,
    MonolingualTripletGenerator,
)

with Pipeline(name="improving-text-embeddings-with-llms") as pipeline:
    brainstorm_retrieval = EmbeddingTaskGenerator(
        category="text-retrieval",
        flatten_tasks=True,
        llm=InferenceEndpointsLLM(
            model_id="CohereForAI/c4ai-command-r-plus",
            tokenizer_id="CohereForAI/c4ai-command-r-plus",
        ),
        num_generations=1,
        group_generations=True,
        output_mappings={"model_name": "brainstorm_model"},
    )

    generate_retrieval = GenerateTextRetrievalData(
        add_raw_output=True,
        llm=InferenceEndpointsLLM(
            model_id="CohereForAI/c4ai-command-r-plus",
            tokenizer_id="CohereForAI/c4ai-command-r-plus",
        ),
        output_mappings={"model_name": "generation_model"},
    )

    brainstorm_retrieval >> generate_retrieval  # type: ignore

    brainstorm_classification = EmbeddingTaskGenerator(
        category="text-classification",
        flatten_tasks=True,
        llm=InferenceEndpointsLLM(
            model_id="CohereForAI/c4ai-command-r-plus",
            tokenizer_id="CohereForAI/c4ai-command-r-plus",
        ),
        num_generations=1,
        group_generations=True,
        output_mappings={"model_name": "brainstorm_model"},
    )

    generate_classification = GenerateTextClassificationData(
        add_raw_output=True,
        llm=InferenceEndpointsLLM(
            model_id="CohereForAI/c4ai-command-r-plus",
            tokenizer_id="CohereForAI/c4ai-command-r-plus",
        ),
        output_mappings={"model_name": "generation_model"},
    )

    brainstorm_classification >> generate_classification  # type: ignore

    brainstorm_matching_short = EmbeddingTaskGenerator(
        category="text-matching-short",
        flatten_tasks=True,
        llm=InferenceEndpointsLLM(
            model_id="CohereForAI/c4ai-command-r-plus",
            tokenizer_id="CohereForAI/c4ai-command-r-plus",
        ),
        num_generations=1,
        group_generations=True,
        output_mappings={"model_name": "brainstorm_model"},
    )

    generate_matching_short = GenerateShortTextMatchingData(
        add_raw_output=True,
        llm=InferenceEndpointsLLM(
            model_id="CohereForAI/c4ai-command-r-plus",
            tokenizer_id="CohereForAI/c4ai-command-r-plus",
        ),
        output_mappings={"model_name": "generation_model"},
    )

    brainstorm_matching_short >> generate_matching_short  # type: ignore

    brainstorm_matching_long = EmbeddingTaskGenerator(
        category="text-matching-long",
        flatten_tasks=True,
        llm=InferenceEndpointsLLM(
            model_id="CohereForAI/c4ai-command-r-plus",
            tokenizer_id="CohereForAI/c4ai-command-r-plus",
        ),
        num_generations=1,
        group_generations=True,
        output_mappings={"model_name": "brainstorm_model"},
    )

    generate_matching_long = GenerateLongTextMatchingData(
        add_raw_output=True,
        llm=InferenceEndpointsLLM(
            model_id="CohereForAI/c4ai-command-r-plus",
            tokenizer_id="CohereForAI/c4ai-command-r-plus",
        ),
        output_mappings={"model_name": "generation_model"},
    )

    brainstorm_matching_long >> generate_matching_long  # type: ignore

    bitext_retrieval_generator = BitextRetrievalGenerator(
        llm=InferenceEndpointsLLM(
            model_id="CohereForAI/c4ai-command-r-plus",
            tokenizer_id="CohereForAI/c4ai-command-r-plus",
        ),
        output_mappings={"model_name": "bitext_model"},
    )

    monolingual_triplet_generator = MonolingualTripletGenerator(
        llm=InferenceEndpointsLLM(
            model_id="CohereForAI/c4ai-command-r-plus",
            tokenizer_id="CohereForAI/c4ai-command-r-plus",
        ),
        output_mappings={"model_name": "monolingual_model"},
    )


if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            step_name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 4096,
                        "stop_sequences": ["<EOS_TOKEN>", "<|END_OF_TURN_TOKEN|>"],
                    }
                }
            }
            for step_name in pipeline.dag
        },
    )
    if distiset is not None:
        distiset.push_to_hub(
            "distilabel-internal-testing/alvarobartt-improving-text-embeddings-with-llms-full",
        )
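The example above relies on two patterns: connecting steps into a DAG with the `>>` operator, and building the per-step runtime `parameters` dict with a comprehension over the step names. As a minimal, self-contained sketch of both ideas (this is NOT distilabel's actual implementation — the `Step` class and its attributes here are hypothetical):

```python
# Hypothetical mini-model of distilabel's step chaining and run parameters.
# Only the two patterns from the example above are illustrated.

class Step:
    def __init__(self, name: str):
        self.name = name
        self.successors: list["Step"] = []

    def __rshift__(self, other: "Step") -> "Step":
        # "a >> b" registers b as a successor of a, mirroring how steps
        # are connected into a DAG inside a Pipeline context.
        self.successors.append(other)
        return other


brainstorm = Step("brainstorm_retrieval")
generate = Step("generate_retrieval")
brainstorm >> generate  # generate now runs downstream of brainstorm

# Per-step runtime parameters, built with a dict comprehension over all
# step names, analogous to the pipeline.run(...) call above.
all_steps = [brainstorm, generate]
parameters = {
    step.name: {"llm": {"generation_kwargs": {"temperature": 0.7}}}
    for step in all_steps
}
```

In distilabel itself, the iteration is over `pipeline.dag`, so every step in the pipeline receives the same `generation_kwargs` without listing the steps by hand.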

What's missing?

  • Add docstrings for the implemented tasks
  • Add unit tests for the implemented tasks
  • Improve structuring to avoid code duplication
  • Move the templates to separate files rather than having them as plain strings
  • Fix the naming (cc @gabrielmbmb @plaguss for help)
  • Run a couple more experiments using the structured_output arg within the InferenceEndpointsLLM as an example

@alvarobartt alvarobartt added this to the 1.2.0 milestone May 30, 2024
@alvarobartt alvarobartt self-assigned this May 30, 2024
codspeed-hq bot commented on May 31, 2024

CodSpeed Performance Report

Merging #683 will not alter performance

Comparing improving-text-embeddings-with-llms (3c97218) with develop (a0d7e93)

Summary

✅ 1 untouched benchmarks

@alvarobartt alvarobartt marked this pull request as ready for review June 4, 2024 08:47
@plaguss (Contributor) left a comment


@alvarobartt, I'm fine with the naming. Maybe I would prefer moving the prompts to jinja templates as we have with other cases, but looks good to me anyway!

@alvarobartt (Member, Author) replied:

> @alvarobartt, I'm fine with the naming. Maybe I would prefer moving the prompts to jinja templates as we have with other cases, but looks good to me anyway!

Yes, see the "What's missing?" section in the PR description for what remains beyond the naming 🙂

@alvarobartt alvarobartt changed the title [WIP] Implement "Improving Text Embeddings with LLMs" Implement "Improving Text Embeddings with LLMs" Jun 7, 2024
@alvarobartt alvarobartt linked an issue Jun 11, 2024 that may be closed by this pull request
@alvarobartt alvarobartt merged commit 0e8c752 into develop Jun 12, 2024
7 checks passed
@alvarobartt alvarobartt deleted the improving-text-embeddings-with-llms branch June 12, 2024 08:09
Projects: Done

Successfully merging this pull request may close these issues.

[FEATURE] Implement "Improving Text Embeddings with LLMs"
2 participants