Add GenerateSentencePair task #689

Merged
gabrielmbmb merged 16 commits into develop from embedding-dataset-tasks
Jun 4, 2024

gabrielmbmb (Member) commented May 31, 2024

Description

This PR adds a new task, GenerateSentencePair, for building datasets that can be used to train embedding models. Given an anchor sentence, the task generates a positive sentence; if the triplet attribute is True, it also generates a negative sentence. The task can paraphrase the anchor, generate content that is semantically similar to the anchor, or generate a query for the anchor.

In addition, this PR turns the add_raw_output attribute into a RuntimeParameter and changes its default value to True, so the raw outputs of the LLMs are now stored in the final dataset by default.
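
The behaviour described above can be sketched without the library. This is a minimal illustration, not the distilabel implementation: the function names, the action keys, the "## Positive" / "## Negative" output layout, and the raw_output column name are all assumptions made for the example, and the LLM is replaced by a stub function.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical action names mirroring the three uses listed in the description.
ACTIONS = {
    "paraphrase": "Paraphrase the anchor sentence.",
    "semantically-similar": "Write a sentence semantically similar to the anchor.",
    "query": "Write a query that the anchor sentence would answer.",
}


def build_prompt(anchor: str, action: str, triplet: bool) -> str:
    """Build an instruction asking for a positive and, optionally, a negative."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    wanted = "## Positive and ## Negative sections" if triplet else "a ## Positive section"
    return f"{ACTIONS[action]} Answer with {wanted}.\n\nAnchor: {anchor}"


def parse_output(text: str, triplet: bool) -> Dict[str, Optional[str]]:
    """Extract the sentences from the assumed '## Positive' / '## Negative' layout."""
    positive = re.search(r"## Positive\s+(.*?)(?=\s*## Negative|\Z)", text, re.DOTALL)
    row: Dict[str, Optional[str]] = {
        "positive": positive.group(1).strip() if positive else None
    }
    if triplet:
        negative = re.search(r"## Negative\s+(.*)", text, re.DOTALL)
        row["negative"] = negative.group(1).strip() if negative else None
    return row


def generate_sentence_pair(
    anchor: str,
    llm: Callable[[str], str],
    action: str = "paraphrase",
    triplet: bool = False,
    add_raw_output: bool = True,  # mirrors the new default described in the PR
) -> Dict[str, Optional[str]]:
    """Return one dataset row: anchor, positive, and (if triplet) negative."""
    raw = llm(build_prompt(anchor, action, triplet))
    row: Dict[str, Optional[str]] = {"anchor": anchor, **parse_output(raw, triplet)}
    if add_raw_output:
        row["raw_output"] = raw  # hypothetical column name
    return row


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return "## Positive\n\nA cat sat on the rug.\n\n## Negative\n\nA dog ran in the park."


row = generate_sentence_pair("The cat sat on the mat.", fake_llm, triplet=True)
pair = generate_sentence_pair(
    "The cat sat on the mat.", fake_llm, triplet=False, add_raw_output=False
)
```

With triplet=True the row carries anchor, positive, and negative columns; with triplet=False the negative column is omitted, and add_raw_output=False drops the raw LLM text from the row.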

codspeed-hq bot commented May 31, 2024

CodSpeed Performance Report

Merging #689 will not alter performance

Comparing embedding-dataset-tasks (6cd3ca6) with develop (e61b598)

Summary

✅ 1 untouched benchmarks

@gabrielmbmb gabrielmbmb requested a review from plaguss June 4, 2024 10:09
@gabrielmbmb gabrielmbmb self-assigned this Jun 4, 2024
@gabrielmbmb gabrielmbmb added the enhancement New feature or request label Jun 4, 2024
@gabrielmbmb gabrielmbmb added this to the 1.2.0 milestone Jun 4, 2024
@gabrielmbmb gabrielmbmb marked this pull request as ready for review June 4, 2024 10:22
@@ -30,17 +30,15 @@
from distilabel.steps.tasks.prometheus_eval import PrometheusEval
from distilabel.steps.tasks.quality_scorer import QualityScorer
from distilabel.steps.tasks.self_instruct import SelfInstruct
from distilabel.steps.tasks.sentence_transformers import GenerateSentencePair
from distilabel.steps.tasks.structured_generation import StructuredGeneration
from distilabel.steps.tasks.text_generation import ChatGeneration, TextGeneration
from distilabel.steps.tasks.typing import ChatItem, ChatType
from distilabel.steps.tasks.ultrafeedback import UltraFeedback

__all__ = [
Member

The import order here was alphabetical. What's the rationale behind this change? Maybe we should change it in other places too, to make sure we're aligned.

Member Author

The rationale was to list the names in __all__ in the same order in which they were imported, which is more common than listing them alphabetically.

Member

@alvarobartt alvarobartt left a comment

LGTM! Just read the comments above, and note that some docstrings are missing!

@gabrielmbmb gabrielmbmb merged commit e4a9609 into develop Jun 4, 2024
7 checks passed
@gabrielmbmb gabrielmbmb deleted the embedding-dataset-tasks branch June 4, 2024 13:00
Labels
enhancement New feature or request
Projects
Status: Done
2 participants