[FEATURE] Make LoadHubDataset more general to read local files #687

Closed
plaguss opened this issue May 31, 2024 · 3 comments

Comments

@plaguss
Contributor

plaguss commented May 31, 2024

Is your feature request related to a problem? Please describe.
I created a custom step to read data from a JSON Lines file in the local filesystem, but noticed that load_dataset already has this functionality; we only need to expose it.

Describe the solution you'd like
Allow reading local files using LoadHubDataset, with something similar to the following:

load_dataset = LoadHubDataset(
    filetype="json",
    filename="path/to/dataset.jsonl"
)

Describe alternatives you've considered
Creating a custom step when needed.
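
For context, the reading logic such a custom step needs is small. The following is only a minimal sketch using the standard library, not the step from this issue and not distilabel code; the function name `read_jsonl_batches` and the `batch_size` parameter are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, List


def read_jsonl_batches(
    path: str, batch_size: int = 50
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most `batch_size` rows parsed from a JSON Lines file.

    Hypothetical helper for illustration; not part of distilabel.
    """
    batch: List[Dict[str, Any]] = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines between records
                continue
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

Calling `list(read_jsonl_batches("path/to/dataset.jsonl", batch_size=2))` would return the rows grouped into lists of at most two dictionaries each. Exposing the existing load_dataset support would make this kind of hand-written reader unnecessary.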

Additional context
Ref: https://huggingface.co/docs/datasets/loading#local-and-remote-files

@plaguss plaguss added this to the 1.3.0 milestone May 31, 2024
@alvarobartt
Member

I also wanted to bring up that we now have both LoadDataFromDicts and LoadHubDataset; most of the time I, as a user, end up writing LoadDataFromHub, assuming the standard is LoadDataFrom...; maybe this is a good time to unify everything under the same naming pattern? I believe this will be easier and will avoid conflicts in the future (we can emit a deprecation warning on LoadHubDataset for the next couple of releases to avoid breaking any existing pipelines).
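
One common way to do such a rename in Python is to keep the old name as a thin subclass that warns on construction. This is a rough sketch of that general pattern only, not the change that was made in distilabel; both classes are stand-ins, and the `repo_id` and `split` parameters are assumptions for illustration.

```python
import warnings


class LoadDataFromHub:
    """Stand-in for the step under its new name (not distilabel's class)."""

    def __init__(self, repo_id: str, split: str = "train") -> None:
        # `repo_id` and `split` are illustrative parameters only.
        self.repo_id = repo_id
        self.split = split


class LoadHubDataset(LoadDataFromHub):
    """Old name kept as an alias that warns whenever it is instantiated."""

    def __init__(self, *args, **kwargs) -> None:
        warnings.warn(
            "LoadHubDataset is deprecated; use LoadDataFromHub instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        super().__init__(*args, **kwargs)
```

Existing pipelines that still construct `LoadHubDataset(...)` keep working and get a `DeprecationWarning`, while new code can use `LoadDataFromHub` directly; the alias can then be dropped after a few releases.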

@rasdani
Contributor

rasdani commented May 31, 2024

After the merge of #673, doesn't there also need to be a LoadS3Dataset?

Like this:
#665 (comment)

@plaguss plaguss self-assigned this Jun 7, 2024
@plaguss plaguss modified the milestones: 1.3.0, 1.2.0 Jun 7, 2024
@plaguss
Contributor Author

plaguss commented Jun 7, 2024

Closing this with #691. @rasdani will let you know once we add the docs, but you could use the new LoadDataFromDisk to work with the S3 datasets saved to disk.

@plaguss plaguss closed this as completed Jun 7, 2024