
Remove PartitionedDataset and IncrementalDataset from kedro.io #3187

Merged
merged 16 commits on Oct 24, 2023
Changes from 3 commits

docs/source/data/how_to_create_a_custom_dataset.md (2 changes: 1 addition & 1 deletion)

@@ -271,7 +271,7 @@

Currently, the `ImageDataset` only works with a single image, but this example needs to load all Pokemon images from the raw data directory for further processing.

-Kedro's [`PartitionedDataset`](/kedro.io.PartitionedDataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory.
+Kedro's [`PartitionedDataset`](/kedro_datasets.partitions.PartitionedDataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory.

To use `PartitionedDataset` with `ImageDataset` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataset` loads all PNG files from the data directory using `ImageDataset`:

…
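
The catalog YAML that this sentence introduces is collapsed in the diff. As a rough sketch of the same pattern in Python, using the import location this PR points the docs at (the data path and the `kedro_pokemon` module are illustrative, borrowed from the custom-dataset guide):

```python
# A minimal sketch, not the docs' exact snippet: ImageDataset is the custom
# dataset built in the guide, and the paths below are illustrative.
from kedro_datasets.partitions import PartitionedDataset

from kedro_pokemon.datasets.image_dataset import ImageDataset

dataset = PartitionedDataset(
    path="data/01_raw/pokemon-images-and-types/images/images",
    dataset=ImageDataset,
    filename_suffix=".png",  # pick up PNG files only
)

# load() returns {partition_id: callable}; each image is read lazily, only
# when its callable is invoked.
for partition_id, load_partition in dataset.load().items():
    image = load_partition()
```
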

docs/source/data/partitioned_and_incremental_datasets.md (6 changes: 3 additions & 3 deletions)

@@ -4,7 +4,7 @@

Distributed systems play an increasingly important role in ETL data pipelines. They increase the processing throughput, enabling us to work with much larger volumes of input data. A situation may arise where your Kedro node needs to read the data from a directory full of uniform files of the same type like JSON or CSV. Tools like `PySpark` and the corresponding [SparkDataset](/kedro_datasets.spark.SparkDataset) cater for such use cases but may not always be possible.

-This is why Kedro provides a built-in [PartitionedDataset](/kedro.io.PartitionedDataset), with the following features:
+This is why Kedro provides a built-in [PartitionedDataset](/kedro_datasets.partitions.PartitionedDataset), with the following features:

* `PartitionedDataset` can recursively load/save all or specific files from a given location.
* It is platform agnostic, and can work with any filesystem implementation supported by [fsspec](https://filesystem-spec.readthedocs.io/) including local, S3, GCS, and many more.
…

@@ -240,7 +240,7 @@

## Incremental datasets

-[IncrementalDataset](/kedro.io.IncrementalDataset) is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataset` addresses the use case when partitions have to be processed incrementally, i.e. each subsequent pipeline run should only process the partitions which were not processed by the previous runs.
+[IncrementalDataset](/kedro_datasets.partitions.IncrementalDataset) is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataset` addresses the use case when partitions have to be processed incrementally, i.e. each subsequent pipeline run should only process the partitions which were not processed by the previous runs.

This checkpoint, by default, is persisted to the location of the data partitions. For example, for `IncrementalDataset` instantiated with path `s3://my-bucket-name/path/to/folder`, the checkpoint will be saved to `s3://my-bucket-name/path/to/folder/CHECKPOINT`, unless [the checkpoint configuration is explicitly overwritten](#checkpoint-configuration).

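As a sketch of the checkpoint behaviour described above (the bucket path comes from the example; the underlying CSV dataset type is illustrative):

```python
# A minimal sketch, assuming the post-PR import locations.
from kedro_datasets.pandas import CSVDataset
from kedro_datasets.partitions import IncrementalDataset

dataset = IncrementalDataset(
    path="s3://my-bucket-name/path/to/folder",
    dataset=CSVDataset,
    # By default the checkpoint is persisted to
    # s3://my-bucket-name/path/to/folder/CHECKPOINT; the `checkpoint`
    # argument can override that location and format.
)

# load() returns {partition_id: data} for partitions not yet covered by the
# checkpoint (loaded eagerly, unlike PartitionedDataset's lazy callables) ...
new_data = dataset.load()

# ... and confirm() advances the checkpoint so the next run skips them.
dataset.confirm()
```
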
…

@@ -309,7 +309,7 @@

Important notes about the confirmation operation:

-* Confirming a partitioned dataset does not affect any subsequent loads within the same run. All downstream nodes that input the same partitioned dataset as input will all receive the _same_ partitions. Partitions that are created externally during the run will also not affect the dataset loads and won't appear in the list of loaded partitions until the next run or until the [`release()`](/kedro.io.IncrementalDataset) method is called on the dataset object.
+* Confirming a partitioned dataset does not affect any subsequent loads within the same run. All downstream nodes that input the same partitioned dataset as input will all receive the _same_ partitions. Partitions that are created externally during the run will also not affect the dataset loads and won't appear in the list of loaded partitions until the next run or until the [`release()`](/kedro_datasets.partitions.IncrementalDataset) method is called on the dataset object.

* A pipeline cannot contain more than one node confirming the same dataset.


…
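
In a pipeline, confirmation is typically requested through the `confirms` argument of a node rather than by calling `confirm()` directly; a minimal sketch, with `my_partitioned_dataset` standing in for a real catalog entry:

```python
from kedro.pipeline import node, pipeline


def process_partitions(partitions: dict) -> None:
    # `partitions` maps partition id -> loaded data for unprocessed partitions.
    for partition_id, data in partitions.items():
        ...


# `confirms` names the dataset whose checkpoint is advanced once the node has
# run successfully; as noted above, only one node may confirm a given dataset.
incremental_pipeline = pipeline(
    [
        node(
            process_partitions,
            inputs="my_partitioned_dataset",
            outputs=None,
            confirms="my_partitioned_dataset",
        )
    ]
)
```
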

docs/source/kedro.io.rst (2 changes: 0 additions & 2 deletions)

@@ -15,10 +15,8 @@ kedro.io
kedro.io.AbstractVersionedDataset
kedro.io.CachedDataset
kedro.io.DataCatalog
-kedro.io.IncrementalDataset
kedro.io.LambdaDataset
kedro.io.MemoryDataset
-kedro.io.PartitionedDataset
kedro.io.Version

.. rubric:: Exceptions
…

kedro/io/__init__.py (6 changes: 0 additions & 6 deletions)

@@ -15,10 +15,6 @@
from .data_catalog import DataCatalog
from .lambda_dataset import LambdaDataset
from .memory_dataset import MemoryDataset
-from .partitioned_dataset import (
-    IncrementalDataset,
-    PartitionedDataset,
-)

__all__ = [
    "AbstractDataset",
…

@@ -28,9 +28,7 @@
"DatasetAlreadyExistsError",
"DatasetError",
"DatasetNotFoundError",
"IncrementalDataset",
"LambdaDataset",
"MemoryDataset",
"PartitionedDataset",
"Version",
]
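
For downstream code, the migration implied by this diff is a one-line import change, matching the documentation links updated above:

```python
# Before this PR:
# from kedro.io import IncrementalDataset, PartitionedDataset

# After this PR:
from kedro_datasets.partitions import IncrementalDataset, PartitionedDataset
```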