Replace "DataSet" with "Dataset" in Markdown files #2735

Merged: 22 commits into `main` from `docs/rename-datasets` on Aug 18, 2023

Changes from all commits (22 commits):
- 74a98f5 LambdaDataSet->LambdaDataset in .md files (deepyaman, Jun 27, 2023)
- 4a172cd MemoryDataSet->MemoryDataset in .md files (deepyaman, Jun 27, 2023)
- c881154 PartitionedDataSet->PartitionedDataset in .md files (deepyaman, Jun 27, 2023)
- 8136be0 IncrementalDataSet->IncrementalDataset in .md files (deepyaman, Jun 27, 2023)
- 35f3e78 CachedDataSet->CachedDataset in .md files (deepyaman, Jun 27, 2023)
- 942d7cd DataSetError->DatasetError in .md files (deepyaman, Jun 27, 2023)
- 7ebe6c3 DataSetNotFoundError->DatasetNotFoundError in .md files (deepyaman, Jun 27, 2023)
- 4156fef Replace "DataSet" with "Dataset" in Markdown files (deepyaman, Jun 27, 2023)
- eff96f4 Update RELEASE.md (deepyaman, Jun 27, 2023)
- e46fc4f Merge branch 'main' into docs/rename-datasets (deepyaman, Jun 28, 2023)
- 082c3de Merge branch 'main' into docs/rename-datasets (deepyaman, Jun 29, 2023)
- 7565519 Merge branch 'main' into docs/rename-datasets (deepyaman, Jun 30, 2023)
- 8a6e502 Merge branch 'main' into docs/rename-datasets (deepyaman, Jul 3, 2023)
- 212b7d9 Merge branch 'main' into docs/rename-datasets (deepyaman, Aug 14, 2023)
- 715d140 Fix remaining instance of "*DataSet*"->"*Dataset*" (deepyaman, Aug 16, 2023)
- e9dee82 `find . -name '*.md' -print0 | xargs -0 sed -i "" "s/\([^A-Za-z]\)Dat… (deepyaman, Aug 16, 2023)
- ad2c7b4 Change non-class instances of Dataset to dataset (deepyaman, Aug 16, 2023)
- 94dd319 Merge branch 'main' into docs/rename-datasets (stichbury, Aug 18, 2023)
- dba7503 Merge branch 'main' into docs/rename-datasets (deepyaman, Aug 18, 2023)
- 76732ea Replace any remaining instances of DataSet in docs (deepyaman, Aug 18, 2023)
- 65f84c2 Fix a broken link to docs for `PartitionedDataset` (deepyaman, Aug 18, 2023)
- aefbe54 Merge branch 'main' into docs/rename-datasets (astrojuanlu, Aug 18, 2023)
`docs/source/configuration/advanced_configuration.md` (6 changes: 3 additions & 3 deletions)

````diff
@@ -176,7 +176,7 @@ From version 0.17.0, `TemplatedConfigLoader` also supports the [Jinja2](https://
 ```
 {% for speed in ['fast', 'slow'] %}
 {{ speed }}-trains:
-    type: MemoryDataSet
+    type: MemoryDataset
 
 {{ speed }}-cars:
     type: pandas.CSVDataSet
````
````diff
@@ -197,13 +197,13 @@ The output Python dictionary will look as follows:
 
 ```python
 {
-    "fast-trains": {"type": "MemoryDataSet"},
+    "fast-trains": {"type": "MemoryDataset"},
     "fast-cars": {
         "type": "pandas.CSVDataSet",
         "filepath": "s3://my_s3_bucket/fast-cars.csv",
         "save_args": {"index": True},
     },
-    "slow-trains": {"type": "MemoryDataSet"},
+    "slow-trains": {"type": "MemoryDataset"},
     "slow-cars": {
         "type": "pandas.CSVDataSet",
         "filepath": "s3://my_s3_bucket/slow-cars.csv",
````
`docs/source/configuration/parameters.md` (2 changes: 1 addition & 1 deletion)

````diff
@@ -66,7 +66,7 @@ node(
 )
 ```
 
-In both cases, under the hood parameters are added to the Data Catalog through the method `add_feed_dict()` in [`DataCatalog`](/kedro.io.DataCatalog), where they live as `MemoryDataSet`s. This method is also what the `KedroContext` class uses when instantiating the catalog.
+In both cases, under the hood parameters are added to the Data Catalog through the method `add_feed_dict()` in [`DataCatalog`](/kedro.io.DataCatalog), where they live as `MemoryDataset`s. This method is also what the `KedroContext` class uses when instantiating the catalog.
 
 ```{note}
 You can use `add_feed_dict()` to inject any other entries into your `DataCatalog` as per your use case.
````
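
For context, a minimal sketch of the `add_feed_dict()` behaviour the renamed sentence describes (entry names are illustrative; assumes Kedro 0.18+, where plain values are wrapped in `MemoryDataset`):

```python
from kedro.io import DataCatalog

catalog = DataCatalog()

# Values that are not already datasets get wrapped in MemoryDataset instances;
# this is also how KedroContext injects `parameters` and `params:...` entries.
catalog.add_feed_dict({"params:test_size": 0.2, "parameters": {"test_size": 0.2}})

assert catalog.load("params:test_size") == 0.2
```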
`docs/source/data/advanced_data_catalog_usage.md` (10 changes: 5 additions & 5 deletions)

````diff
@@ -55,7 +55,7 @@ gear = cars["gear"].values
 The following steps happened behind the scenes when `load` was called:
 
 - The value `cars` was located in the Data Catalog
-- The corresponding `AbstractDataSet` object was retrieved
+- The corresponding `AbstractDataset` object was retrieved
 - The `load` method of this dataset was called
 - This `load` method delegated the loading to the underlying pandas `read_csv` function
 
````
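
A short sketch of those four steps end to end, assuming a local `data/01_raw/cars.csv` and the `pandas.CSVDataSet` type from kedro-datasets:

```python
from kedro.io import DataCatalog

io = DataCatalog.from_config(
    {
        "cars": {
            "type": "pandas.CSVDataSet",
            "filepath": "data/01_raw/cars.csv",
        }
    }
)

# Locates "cars", retrieves its AbstractDataset object, and calls its load(),
# which delegates to pandas.read_csv under the hood.
cars = io.load("cars")
gear = cars["gear"].values
```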
````diff
@@ -70,9 +70,9 @@ This pattern is not recommended unless you are using platform notebook environme
 To save data using an API similar to that used to load data:
 
 ```python
-from kedro.io import MemoryDataSet
+from kedro.io import MemoryDataset
 
-memory = MemoryDataSet(data=None)
+memory = MemoryDataset(data=None)
 io.add("cars_cache", memory)
 io.save("cars_cache", "Memory can store anything.")
 io.load("cars_cache")
````
````diff
@@ -190,7 +190,7 @@ io.save("test_data_set", data1)
 reloaded = io.load("test_data_set")
 assert data1.equals(reloaded)
 
-# raises DataSetError since the path
+# raises DatasetError since the path
 # data/01_raw/test.csv/my_exact_version/test.csv already exists
 io.save("test_data_set", data2)
 ```
````
````diff
@@ -219,7 +219,7 @@ io = DataCatalog({"test_data_set": test_data_set})
 
 io.save("test_data_set", data1) # emits a UserWarning due to version inconsistency
 
-# raises DataSetError since the data/01_raw/test.csv/exact_load_version/test.csv
+# raises DatasetError since the data/01_raw/test.csv/exact_load_version/test.csv
 # file does not exist
 reloaded = io.load("test_data_set")
 ```
````
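
The setup collapsed above these two hunks looks roughly like the following sketch (a reconstruction for readability, not the verbatim docs; assumes kedro-datasets' `pandas.CSVDataSet`):

```python
import pandas as pd

from kedro.io import DataCatalog, Version
from kedro_datasets.pandas import CSVDataSet

data1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5]})
data2 = pd.DataFrame({"col1": [6], "col2": [7]})

# Pin load and save to the same exact version string.
version = Version(load="my_exact_version", save="my_exact_version")
test_data_set = CSVDataSet(filepath="data/01_raw/test.csv", version=version)
io = DataCatalog({"test_data_set": test_data_set})

io.save("test_data_set", data1)  # writes data/01_raw/test.csv/my_exact_version/test.csv
reloaded = io.load("test_data_set")

io.save("test_data_set", data2)  # raises DatasetError: that versioned path already exists
```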
`docs/source/data/data_catalog.md` (5 changes: 3 additions & 2 deletions)

````diff
@@ -126,6 +126,7 @@ In the example above, the `catalog.yml` file contains references to credentials
 
 ### Dataset versioning
 
+
 Kedro enables dataset and ML model versioning through the `versioned` definition. For example:
 
 ```yaml
````
````diff
@@ -144,9 +145,9 @@ kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ
 ```
 where `--load-version` is dataset name and version timestamp separated by `:`.
 
-A dataset offers versioning support if it extends the [`AbstractVersionedDataSet`](/kedro.io.AbstractVersionedDataset) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively.
+A dataset offers versioning support if it extends the [`AbstractVersionedDataset`](/kedro.io.AbstractVersionedDataset) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively.
 
-To verify whether a dataset can undergo versioning, you should examine the dataset class code to inspect its inheritance [(you can find contributed datasets within the `kedro-datasets` repository)](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets). Check if the dataset class inherits from the `AbstractVersionedDataSet`. For instance, if you encounter a class like `CSVDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame])`, this indicates that the dataset is set up to support versioning.
+To verify whether a dataset can undergo versioning, you should examine the dataset class code to inspect its inheritance [(you can find contributed datasets within the `kedro-datasets` repository)](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets). Check if the dataset class inherits from the `AbstractVersionedDataset`. For instance, if you encounter a class like `CSVDataSet(AbstractVersionedDataset[pd.DataFrame, pd.DataFrame])`, this indicates that the dataset is set up to support versioning.
 
 ```{note}
 Note that HTTP(S) is a supported file system in the dataset implementations, but if you use it, you can't also use versioning.
````
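
The inheritance check described in that hunk can also be done programmatically; a quick sketch (assumes kedro-datasets is installed):

```python
from kedro.io import AbstractVersionedDataset
from kedro_datasets.pandas import CSVDataSet

# True: CSVDataSet extends AbstractVersionedDataset, so it supports `versioned: true`.
print(issubclass(CSVDataSet, AbstractVersionedDataset))
```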
6 changes: 3 additions & 3 deletions docs/source/data/data_catalog_yaml_examples.md
Original file line number Diff line number Diff line change
Expand Up @@ -397,12 +397,12 @@ for loading, so the first node outputs a `pyspark.sql.DataFrame`, while the seco

You can use the [`kedro catalog create` command to create a Data Catalog YAML configuration](../development/commands_reference.md#create-a-data-catalog-yaml-configuration-file).

This creates a `<conf_root>/<env>/catalog/<pipeline_name>.yml` configuration file with `MemoryDataSet` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`.
This creates a `<conf_root>/<env>/catalog/<pipeline_name>.yml` configuration file with `MemoryDataset` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`.

```yaml
# <conf_root>/<env>/catalog/<pipeline_name>.yml
rockets:
type: MemoryDataSet
type: MemoryDataset
scooters:
type: MemoryDataSet
type: MemoryDataset
```
`docs/source/data/how_to_create_a_custom_dataset.md` (16 changes: 8 additions & 8 deletions)

````diff
@@ -2,9 +2,9 @@
 
 [Kedro supports many datasets](/kedro_datasets) out of the box, but you may find that you need to create a custom dataset. For example, you may need to handle a proprietary data format or filesystem in your pipeline, or perhaps you have found a particular use case for a dataset that Kedro does not support. This tutorial explains how to create a custom dataset to read and save image data.
 
-## AbstractDataSet
+## AbstractDataset
 
-For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataSet` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataSet` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation.
+For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataset` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataset` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
 
 
 ## Scenario
````
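
As a reference for what that contract looks like in practice, here is a deliberately simplified, local-filesystem-only sketch of the tutorial's `ImageDataSet` (the full version in the doc also handles fsspec-backed paths; Pillow is an assumed dependency):

```python
from pathlib import Path
from typing import Any

import numpy as np
from PIL import Image

from kedro.io import AbstractDataset


class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> np.ndarray:
        # Delegate the actual reading to Pillow, mirroring how built-in
        # datasets delegate to pandas, Spark, etc.
        with Image.open(self._filepath) as image:
            return np.asarray(image)

    def _save(self, data: np.ndarray) -> None:
        Image.fromarray(data).save(self._filepath)

    def _describe(self) -> dict[str, Any]:
        # Used by Kedro when logging information about this dataset instance.
        return {"filepath": str(self._filepath)}
```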
````diff
@@ -267,19 +267,19 @@ class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
 ```
 </details>
 
-## Integration with `PartitionedDataSet`
+## Integration with `PartitionedDataset`
 
 Currently, the `ImageDataSet` only works with a single image, but this example needs to load all Pokemon images from the raw data directory for further processing.
 
-Kedro's [`PartitionedDataSet`](./partitioned_and_incremental_datasets.md) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory.
+Kedro's [`PartitionedDataset`](/kedro.io.PartitionedDataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory.
 
-To use `PartitionedDataSet` with `ImageDataSet` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataSet` loads all PNG files from the data directory using `ImageDataSet`:
+To use `PartitionedDataset` with `ImageDataSet` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataset` loads all PNG files from the data directory using `ImageDataSet`:
 
 ```yaml
 # in conf/base/catalog.yml
 
 pokemon:
-  type: PartitionedDataSet
+  type: PartitionedDataset
   dataset: kedro_pokemon.extras.datasets.image_dataset.ImageDataSet
   path: data/01_raw/pokemon-images-and-types/images/images
   filename_suffix: ".png"
````
````diff
@@ -305,11 +305,11 @@ $ ls -la data/01_raw/pokemon-images-and-types/images/images/*.png | wc -l
 ### How to implement versioning in your dataset
 
 ```{note}
-Versioning doesn't work with `PartitionedDataSet`. You can't use both of them at the same time.
+Versioning doesn't work with `PartitionedDataset`. You can't use both of them at the same time.
 ```
 
 To add versioning support to the new dataset we need to extend the
-[AbstractVersionedDataSet](/kedro.io.AbstractVersionedDataset) to:
+[AbstractVersionedDataset](/kedro.io.AbstractVersionedDataset) to:
 
 * Accept a `version` keyword argument as part of the constructor
 * Adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively
````
`docs/source/data/kedro_dataset_factories.md` (2 changes: 1 addition & 1 deletion)

````diff
@@ -215,7 +215,7 @@ The matches are ranked according to the following criteria:
 
 ## How to override the default dataset creation with dataset factories
 
-You can use dataset factories to define a catch-all pattern which will overwrite the default [`MemoryDataSet`](/kedro.io.MemoryDataset) creation.
+You can use dataset factories to define a catch-all pattern which will overwrite the default [`MemoryDataset`](/kedro.io.MemoryDataset) creation.
 
 ```yaml
 "{default_dataset}":
````
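
The YAML body beneath the `"{default_dataset}"` key is collapsed in this view; a hypothetical catch-all entry in the dataset-factory style would look like this (the type and filepath are illustrative, not the verbatim docs):

```yaml
"{default_dataset}":
  type: pandas.CSVDataSet
  filepath: data/{default_dataset}.csv
```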