Add `GenerateSentencePair` task #689

gabrielmbmb · 2024-05-31T15:15:17Z

Description

This PR adds a new task, `GenerateSentencePair`, for building datasets that can be used to train embedding models. Given an anchor sentence, the task generates a positive sentence and, if the `triplet` attribute is `True`, a negative sentence as well. The task can paraphrase the anchor, generate content that is semantically similar to the anchor, or generate a query for the anchor.
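To make the shape of the generated rows concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: the action keys, the prompt wording, the `## Positive` / `## Negative` output format, and the `fake_llm` stub are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical instructions, one per use case named in the description.
ACTIONS = {
    "paraphrase": "Paraphrase the anchor sentence.",
    "semantically-similar": "Write a sentence semantically similar to the anchor.",
    "query": "Write a query that the anchor sentence would answer.",
}


def build_prompt(anchor: str, action: str, triplet: bool) -> str:
    """Build a single prompt asking for a positive (and optionally a negative)."""
    wanted = "a positive and a negative sentence" if triplet else "a positive sentence"
    return f"{ACTIONS[action]} Return {wanted}.\n\nAnchor: {anchor}"


def parse_output(text: str, triplet: bool) -> Dict[str, Optional[str]]:
    """Extract the sentences from an (assumed) '## Positive' / '## Negative' layout."""
    positive = re.search(r"## Positive\s+(.+?)(?=\n## Negative|\Z)", text, re.DOTALL)
    row: Dict[str, Optional[str]] = {
        "positive": positive.group(1).strip() if positive else None
    }
    if triplet:
        negative = re.search(r"## Negative\s+(.+)", text, re.DOTALL)
        row["negative"] = negative.group(1).strip() if negative else None
    return row


def generate_sentence_pair(
    anchor: str,
    llm: Callable[[str], str],
    action: str = "paraphrase",
    triplet: bool = False,
) -> Dict[str, Optional[str]]:
    """Return an (anchor, positive[, negative]) row for one anchor sentence."""
    raw = llm(build_prompt(anchor, action, triplet))
    return {"anchor": anchor, **parse_output(raw, triplet)}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "## Positive\nA cat sits on the mat.\n## Negative\nStock prices fell sharply."


row = generate_sentence_pair("The cat is on the mat.", fake_llm, triplet=True)
```

With `triplet=False` the row only carries `anchor` and `positive`, which is the pair format; with `triplet=True` it also carries `negative`.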

In addition, this PR turns the `add_raw_output` attribute into a `RuntimeParameter` that defaults to `True`, so the raw outputs of the LLMs are stored in the final dataset by default.
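As a rough illustration of what storing the raw output by default means for a row, consider the sketch below. The `format_row` helper, the `metadata` column, and the `raw_output_<task name>` key are hypothetical names chosen for this example, not necessarily what the library uses.

```python
from typing import Any, Dict


def format_row(
    parsed: Dict[str, Any],
    raw_output: str,
    task_name: str,
    add_raw_output: bool = True,  # mirrors the new default described above
) -> Dict[str, Any]:
    """Attach the unparsed LLM output to the row unless the flag is disabled."""
    row = dict(parsed)  # copy so the caller's dict is not mutated
    if add_raw_output:
        # Hypothetical column/key names, for illustration only.
        row["metadata"] = {f"raw_output_{task_name}": raw_output}
    return row


with_raw = format_row({"positive": "A cat sits on the mat."}, "## Positive\n...", "pair_0")
without_raw = format_row({"positive": "A cat sits on the mat."}, "## Positive\n...", "pair_0", add_raw_output=False)
```

Keeping the raw text next to the parsed columns makes it easier to debug rows where parsing failed.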

codspeed-hq · 2024-05-31T15:22:00Z

CodSpeed Performance Report

Merging #689 will not alter performance

Comparing `embedding-dataset-tasks` (6cd3ca6) with `develop` (e61b598)

Summary

✅ 1 untouched benchmarks

src/distilabel/steps/tasks/sentence_transformers.py

docs/sections/learn/tutorial/task/index.md

alvarobartt · 2024-06-04T10:36:03Z

src/distilabel/steps/tasks/__init__.py

@@ -30,17 +30,15 @@
 from distilabel.steps.tasks.prometheus_eval import PrometheusEval
 from distilabel.steps.tasks.quality_scorer import QualityScorer
 from distilabel.steps.tasks.self_instruct import SelfInstruct
+from distilabel.steps.tasks.sentence_transformers import GenerateSentencePair
 from distilabel.steps.tasks.structured_generation import StructuredGeneration
 from distilabel.steps.tasks.text_generation import ChatGeneration, TextGeneration
 from distilabel.steps.tasks.typing import ChatItem, ChatType
 from distilabel.steps.tasks.ultrafeedback import UltraFeedback

 __all__ = [


The import order here was alphabetical; what's the rationale behind this change? Maybe we should change this in other places too to make sure we're aligned?

The rationale was to order the entries in `__all__` in the order in which they were imported, which is more common than ordering them alphabetically.

src/distilabel/steps/tasks/base.py

src/distilabel/steps/tasks/sentence_transformers.py

alvarobartt

LGTM! Just read the comments above, and note that some docstrings are missing!

Co-authored-by: alvarobartt <[email protected]>

gabrielmbmb force-pushed the embedding-dataset-tasks branch from 05b3886 to fd9d64a Compare May 31, 2024 15:16

Add GenerateSentencePair task

2e5d9f8

gabrielmbmb force-pushed the embedding-dataset-tasks branch from fd9d64a to 2e5d9f8 Compare May 31, 2024 15:37

alvarobartt reviewed Jun 3, 2024

src/distilabel/steps/tasks/sentence_transformers.py (outdated, resolved)

gabrielmbmb added 6 commits June 3, 2024 13:18

Merge branch 'develop' into embedding-dataset-tasks

b7cd785

Update task to use system prompt

53019bc

Fix setup_logging file location

9be1e29

Update add_raw_output to be RuntimeParamater and True by default

e619f82

Fix system prompt for negative sentences

e43a984

Add GenerateSentencePair unit tests

e950db8

gabrielmbmb requested a review from plaguss June 4, 2024 10:09

gabrielmbmb self-assigned this Jun 4, 2024

gabrielmbmb added the enhancement New feature or request label Jun 4, 2024

gabrielmbmb added this to the 1.2.0 milestone Jun 4, 2024

Fix unit tests after updating add_raw_output

06924a9

gabrielmbmb force-pushed the embedding-dataset-tasks branch from d1e00be to 06924a9 Compare June 4, 2024 10:22

gabrielmbmb marked this pull request as ready for review June 4, 2024 10:22

Update docs to mention add_raw_output attribute

578ccb2

alvarobartt reviewed Jun 4, 2024

docs/sections/learn/tutorial/task/index.md (resolved)

alvarobartt reviewed Jun 4, 2024

src/distilabel/steps/tasks/base.py (outdated, resolved)

alvarobartt reviewed Jun 4, 2024

src/distilabel/steps/tasks/sentence_transformers.py (outdated, resolved)

alvarobartt reviewed Jun 4, 2024

Update add_raw_output description

3a1be83

Co-authored-by: alvarobartt <[email protected]>

gabrielmbmb force-pushed the embedding-dataset-tasks branch from 03aa740 to 3a1be83 Compare June 4, 2024 10:46

gabrielmbmb and others added 4 commits June 4, 2024 13:04

Fix columns

3e30f16

Co-authored-by: alvarobartt <[email protected]>

Add missing docstrings

74064ff

Fix tests

aac845c

Add answer generation action

ff2b025

gabrielmbmb added 2 commits June 4, 2024 14:52

Fix examples not being correctly rendered

1ce6b93

Add examples

6cd3ca6

gabrielmbmb merged commit e4a9609 into develop Jun 4, 2024
7 checks passed

gabrielmbmb deleted the embedding-dataset-tasks branch June 4, 2024 13:00
