Fix docs of saving/loading distiset from disk (#679)
plaguss authored May 29, 2024
1 parent 01b4292 commit 37f970e
Showing 1 changed file with 17 additions and 4 deletions.
21 changes: 17 additions & 4 deletions docs/sections/learn/advanced/distiset.md
@@ -83,17 +83,30 @@ distiset.save_to_disk(
 )
 ```
 
-And load a [`Distiset`][distilabel.distiset.Distiset] that was saved using [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk] from disk just the same way:
+And load a [`Distiset`][distilabel.distiset.Distiset] that was saved using [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk] just the same way:
 
 ```python
 from distilabel.distiset import Distiset
 
-distiset = Distiset.save_to_disk("my-dataset")
+distiset = Distiset.load_from_disk("my-dataset")
 ```
 
-Take into account that these methods pass work as `datasets.load_from_disk` and `datasets.Dataset.save_to_disk` so the arguments are directly passed to those methods. This means you can also make use of `storage_options` argument to save your [`Distiset`][distilabel.distiset.Distiset] in your cloud provider, including the distilabel artifacts (`pipeline.yaml`, `pipeline.log` and the `README.md` with the dataset card), you can read more in `datasets` documentation [here](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets).
+or from your cloud provider if that's where it was stored:
 
-Take a look at the remaining arguments at [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk].
+```python
+distiset = Distiset.load_from_disk(
+    "s3://path/to/my_dataset",  # gcs:// or any filesystem tolerated by fsspec
+    storage_options={
+        "key": os.environ["S3_ACCESS_KEY"],
+        "secret": os.environ["S3_SECRET_KEY"],
+        ...
+    }
+)
+```
+
+Take into account that these methods work as `datasets.load_from_disk` and `datasets.Dataset.save_to_disk` so the arguments are directly passed to those methods. This means you can also make use of `storage_options` argument to save your [`Distiset`][distilabel.distiset.Distiset] in your cloud provider, including the distilabel artifacts (`pipeline.yaml`, `pipeline.log` and the `README.md` with the dataset card). You can read more in `datasets` documentation [here](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets).
+
+Take a look at the remaining arguments at [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk] and [`Distiset.load_from_disk`][distilabel.distiset.Distiset.load_from_disk].
 
 ## Dataset card
 

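The cloud-storage snippet added in this commit reads credentials from `os.environ` without showing `import os`, and a missing variable would only surface as a `KeyError` naming a single key. Below is a minimal sketch of one way to build `storage_options` up front. The helper names `s3_storage_options` and `load_distiset_from_s3` are hypothetical and not part of distilabel. The `key`/`secret` entries and the `S3_ACCESS_KEY`/`S3_SECRET_KEY` variable names are taken from the diff, and the sketch assumes an S3-compatible fsspec backend is installed.

```python
import os
from typing import Mapping, Optional

# Variable names as used in the docs snippet from this commit.
REQUIRED_VARS = ("S3_ACCESS_KEY", "S3_SECRET_KEY")


def s3_storage_options(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build a `storage_options` dict from environment variables.

    Hypothetical helper for illustration; not part of distilabel.
    """
    env = os.environ if env is None else env
    # Report every missing variable at once instead of failing on the first.
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {"key": env["S3_ACCESS_KEY"], "secret": env["S3_SECRET_KEY"]}


def load_distiset_from_s3(path: str):
    """Load a Distiset from an fsspec-style path such as s3://path/to/my_dataset."""
    # Imported lazily so s3_storage_options can be used without distilabel installed.
    from distilabel.distiset import Distiset

    return Distiset.load_from_disk(path, storage_options=s3_storage_options())
```

Calling `load_distiset_from_s3("s3://path/to/my_dataset")` should then behave like the snippet in the diff, with the credential check happening before any remote access is attempted.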