[BUG] 501 (Not Implemented) Response when trying to load HF dataset #684

Closed
jphme opened this issue May 30, 2024 · 7 comments

Comments

@jphme
Contributor

jphme commented May 30, 2024

Distilabel Version 1.1.1.

Trying to load a private HF Dataset with LoadHubDataset.

I get

│ ╭─────────────────────────────────── locals ───────────────────────────────────╮                 │
│ │   config = None                                                              │                 │
│ │  headers = {'Authorization': 'Bearer hf_correct_token'} │                 │
│ │   params = {'dataset': 'xyz/private_dataset'}                             │                 │
│ │  repo_id = 'xyz/private_dataset'                                         │                 │
│ │ response = <Response [501]>                                                  │                 │
│ ╰──────────────────────────────────────────────────────────────────────────────╯                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: Failed to get 'xyz/private_dataset' dataset info. Make sure you have set the HF_TOKEN environment variable if it is a private dataset.

I have HF_TOKEN in my env, the Bearer token shown is correct, and I can load the dataset just fine with load_dataset.

I REALLY think the hacky custom loading implementation should be disabled by default; it has caused us so many headaches already... Why not add opt-in streaming HF Dataset loading and just use load_dataset for everything else?

@alvarobartt
Member

Hi @jphme, sorry for the inconvenience! I believe that may be because your dataset is private, and Hugging Face now enables the Datasets Server on private repositories for Pro users only, I'm afraid :/

@gabrielmbmb
Member

Hi @jphme, we will work on something to avoid using the API.

@gabrielmbmb gabrielmbmb added this to the 1.2.0 milestone May 30, 2024
@jphme
Contributor Author

jphme commented May 30, 2024

> Hi @jphme, sorry for the inconvenience! I believe that may be because your dataset is private, and Hugging Face now enables the Datasets Server on private repositories for Pro users only, I'm afraid :/

Ah, that might be it, thanks 👍 .

We are even Corporate Pro users. But don't you think the current implementation is flawed when
a) the library can't be used properly by the 90%+ of non-Pro users and
b) the error message is completely nondescript

and all this for a nice-to-have feature (not having to load the dataset in full) that doesn't affect ~95%+ (for us so far 100%) of use cases? ;-)

Sorry, I don't want to sound too negative; I just spent the better part of an hour trying to figure out why this isn't working.
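
To illustrate point b), here is a minimal sketch of an error message that names the HTTP status and a likely cause. The helper `describe_dataset_info_failure` and its hint table are hypothetical and not part of distilabel; the 501 hint only reflects the explanation suggested in this thread, and the 401/404 hints follow general HTTP semantics.

```python
def describe_dataset_info_failure(repo_id: str, status_code: int) -> str:
    """Build an error message that includes the HTTP status and a likely cause.

    Hypothetical helper for illustration only; not distilabel's actual code.
    """
    hints = {
        401: "the token is missing or invalid; check the HF_TOKEN environment variable",
        404: "the dataset does not exist or the token cannot see it",
        # Assumption taken from this thread: 501 shows up when the dataset
        # info API is not available for a private repository.
        501: (
            "the dataset info API may not be available for this private dataset; "
            "loading it with load_dataset may still work"
        ),
    }
    hint = hints.get(status_code, "unexpected response from the Hub")
    return f"Failed to get '{repo_id}' dataset info (HTTP {status_code}): {hint}."
```

With a message like this, the status code from the traceback above would have been visible without inspecting the locals.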

@alvarobartt
Member

> We are even Corporate Pro users. But don't you think the current implementation is flawed when
> a) the library can't be used properly by the 90%+ of non-Pro users and
> b) the error message is completely nondescript

Fair. The message has already been updated, as we noticed this a couple of days ago, and the change should roll out in the next distilabel release. But yes, we'll try to do better and offer a better solution (less hacky and, as a consequence, most likely less efficient).

@rasdani
Contributor

rasdani commented May 30, 2024

One solution could be to load the dataset in streaming mode and fetch a single row.
This way you get the features and don't have to load the whole dataset in advance.

Maybe you can get the features even without fetching a row.
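
A rough sketch of that idea, assuming the `datasets` library's `load_dataset(..., streaming=True)` and the `features` attribute of the returned iterable dataset (which can be `None` when the schema is not known up front). The function name `infer_column_names` and the injectable `loader` parameter are made up for illustration, so the logic can be tried without network access.

```python
from typing import Any, Callable, List, Optional


def infer_column_names(
    repo_id: str,
    split: str = "train",
    loader: Optional[Callable[..., Any]] = None,
) -> List[str]:
    """Return the column names of a dataset without downloading it in full."""
    if loader is None:
        # Imported lazily so the sketch can be exercised with a stand-in loader.
        from datasets import load_dataset

        loader = load_dataset

    # In streaming mode, rows are fetched on demand instead of all at once.
    dataset = loader(repo_id, split=split, streaming=True)

    # Preferred path: the schema is already known, so no row has to be fetched.
    features = getattr(dataset, "features", None)
    if features is not None:
        return list(features)

    # Fallback: fetch a single row and read the column names from its keys.
    first_row = next(iter(dataset))
    return list(first_row)
```

For a private dataset this would still rely on a valid token being available to `load_dataset`, for example through the HF_TOKEN environment variable.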

@jphme
Contributor Author

jphme commented May 30, 2024

> > We are even Corporate Pro users. But don't you think the current implementation is flawed when
> > a) the library can't be used properly by the 90%+ of non-Pro users and
> > b) the error message is completely nondescript
>
> Fair. The message has already been updated, as we noticed this a couple of days ago, and the change should roll out in the next distilabel release. But yes, we'll try to do better and offer a better solution (less hacky and, as a consequence, most likely less efficient).

Sounds great, many thanks! I think the tradeoff between optimizing performance and optimizing usability should sometimes lean more in the direction of usability, but this is often very hard to know in advance...

@plaguss
Contributor

plaguss commented Jun 7, 2024

Closing this with #691; it still needs some more examples in the docs.
The idea now is to try downloading the data using get_dataset_info and, if that fails, to call load_dataset.
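
The try-then-fall-back pattern described here could look roughly like the sketch below. It is not the code from #691: `load_with_fallback`, `fast_path`, and `slow_path` are hypothetical names, and the two callables stand in for a `get_dataset_info`-based loader and a `load_dataset`-based loader, whose real signatures are not shown in this thread.

```python
import warnings
from typing import Any, Callable


def load_with_fallback(
    repo_id: str,
    fast_path: Callable[[str], Any],
    slow_path: Callable[[str], Any],
) -> Any:
    """Try the fast loader first and fall back to the slow one if it fails."""
    try:
        return fast_path(repo_id)
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        warnings.warn(
            f"Fast path failed for '{repo_id}' ({exc!r}); falling back to the slow path."
        )
        return slow_path(repo_id)
```

Emitting a warning instead of failing keeps the pipeline usable for private datasets while still making it visible that the slower route was taken.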

@plaguss plaguss closed this as completed Jun 7, 2024

5 participants