[FEATURE] Include new GeneratorStep classes to load datasets from different formats #691

Merged
plaguss merged 9 commits into develop from dataloaders
Jun 7, 2024

plaguss
Contributor

@plaguss plaguss commented Jun 3, 2024

Description

This PR adds a new GeneratorStep to load data from files on disk, and renames the existing generator classes for consistency.

Step renames:

LoadDataFromDicts -> LoadFromBuffer
LoadHubDataset -> LoadFromHub

The old names still work but emit a DeprecationWarning; they will be removed in 1.3.0.
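One common way to keep an old class name alive during a deprecation window is sketched below. This is an illustration only, not the code in this PR: the two classes are stand-ins that borrow the names above, and the `data` argument is an assumption made for the example.

```python
import warnings


class LoadFromBuffer:
    """Stand-in for the renamed step (not the real distilabel class)."""

    def __init__(self, data):
        # `data` is a hypothetical argument used only for this sketch.
        self.data = data


class LoadDataFromDicts(LoadFromBuffer):
    """Old name kept as a thin subclass that warns when instantiated."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "LoadDataFromDicts is deprecated and will be removed in 1.3.0; "
            "use LoadFromBuffer instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        super().__init__(*args, **kwargs)
```

With this pattern, existing pipelines that still import the old name keep running, and `isinstance` checks against the new class continue to pass.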

Two new classes:

  • LoadFromFileSystem:
from distilabel.steps import LoadFromFileSystem

load_dataset = LoadFromFileSystem(data_files="path/to/my_dataset.jsonl")

The idea for this new Step is to read data from files on disk, much like you would with load_dataset from the datasets library:

from datasets import load_dataset

ds = load_dataset("csv", data_files="path/to/my_dataset.csv")

The format (the leading "csv" argument) is inferred internally from the file extension when possible. The step also works with files in remote filesystems (S3, GCS, etc.).

For example, to read a remote file from GCS (assuming the credentials are already configured):

load_dataset = LoadFromFileSystem(
    data_files="gcs://bucket/my_dataset.jsonl",
    storage_options={"project": "your_project_name"}
)
  • LoadFromDisk:

Use this step to read a Distiset that was saved to disk:

from distilabel.steps import LoadFromDisk

load_dataset = LoadFromDisk(dataset_path="path/to/my_dataset")
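The extension-based inference mentioned for LoadFromFileSystem could look roughly like the sketch below. It is not the implementation from this PR: `infer_format` and the `_SUFFIX_TO_FORMAT` table are hypothetical names, and the mapping only covers a few formats that `datasets.load_dataset` accepts as its first argument.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping from file suffix to a `datasets` format name.
_SUFFIX_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".txt": "text",
}


def infer_format(data_files: str) -> Optional[str]:
    """Return the format implied by the file suffix, or None if unknown."""
    # Drop a URL scheme such as "gcs://" or "s3://" before parsing the path.
    path = data_files.split("://", 1)[-1]
    suffix = PurePosixPath(path).suffix.lower()
    return _SUFFIX_TO_FORMAT.get(suffix)
```

Returning None for an unknown or missing suffix leaves room for the caller to require an explicit format, which matches the "if possible" wording above.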

@plaguss plaguss self-assigned this Jun 3, 2024
@plaguss plaguss added the enhancement (New feature or request) and improvement labels Jun 3, 2024
@plaguss plaguss added this to the 1.2.0 milestone Jun 3, 2024
@plaguss plaguss requested review from alvarobartt and gabrielmbmb and removed request for alvarobartt June 3, 2024 14:02
@plaguss plaguss marked this pull request as ready for review June 4, 2024 09:29
@plaguss plaguss changed the title from "Dataloaders" to "[FEATURE] Include new GeneratorStep classes to load datasets from different formats" Jun 4, 2024
Member

@gabrielmbmb gabrielmbmb left a comment

LGTM! The only thing I'm missing is updating how we get the dataset info.

Review comment on src/distilabel/steps/generators/data.py (outdated, resolved)
codspeed-hq bot commented Jun 6, 2024

CodSpeed Performance Report

Merging #691 will not alter performance

Comparing dataloaders (122abe2) with develop (20aa24e)

Summary

✅ 1 untouched benchmarks

@plaguss plaguss merged commit 34ac772 into develop Jun 7, 2024
7 checks passed
@plaguss plaguss deleted the dataloaders branch June 7, 2024 08:13