From 74a98f5f7e280fc2758db3e4970438a5dc314ba4 Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Tue, 27 Jun 2023 10:52:09 -0500 Subject: [PATCH 01/14] LambdaDataSet->LambdaDataset in .md files Signed-off-by: Deepyaman Datta --- docs/source/nodes_and_pipelines/run_a_pipeline.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/nodes_and_pipelines/run_a_pipeline.md b/docs/source/nodes_and_pipelines/run_a_pipeline.md index c9a9790302..76ea1afc77 100644 --- a/docs/source/nodes_and_pipelines/run_a_pipeline.md +++ b/docs/source/nodes_and_pipelines/run_a_pipeline.md @@ -243,7 +243,7 @@ Out[11]: {'v': 0.666666666666667} ## Output to a file -We can also use IO to save outputs to a file. In this example, we define a custom `LambdaDataSet` that would serialise the output to a file locally: +We can also use IO to save outputs to a file. In this example, we define a custom `LambdaDataset` that would serialise the output to a file locally:
Click to expand @@ -260,14 +260,14 @@ def load(): return pickle.load(f) -pickler = LambdaDataSet(load=load, save=save) +pickler = LambdaDataset(load=load, save=save) io.add("v", pickler) ```
It is important to make sure that the data catalog variable name `v` matches the name `v` in the pipeline definition. -Next we can confirm that this `LambdaDataSet` behaves correctly: +Next we can confirm that this `LambdaDataset` behaves correctly:
Click to expand From 4a172cd954a14e7752486d9e6a62c06c6ff6e28d Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Tue, 27 Jun 2023 10:53:53 -0500 Subject: [PATCH 02/14] MemoryDataSet->MemoryDataset in .md files Signed-off-by: Deepyaman Datta --- RELEASE.md | 8 +-- .../configuration/advanced_configuration.md | 6 +- docs/source/configuration/parameters.md | 2 +- docs/source/data/data_catalog.md | 10 +-- docs/source/deployment/argo.md | 2 +- docs/source/deployment/aws_step_functions.md | 2 +- docs/source/deployment/prefect.md | 4 +- docs/source/development/commands_reference.md | 2 +- docs/source/hooks/examples.md | 14 ++-- .../integrations/pyspark_integration.md | 12 ++-- docs/source/nodes_and_pipelines/nodes.md | 18 ++--- .../nodes_and_pipelines/run_a_pipeline.md | 10 +-- .../nodes_and_pipelines/slice_a_pipeline.md | 4 +- .../kedro_and_notebooks.md | 2 +- docs/source/tutorial/add_another_pipeline.md | 66 +++++++++---------- docs/source/tutorial/create_a_pipeline.md | 26 ++++---- 16 files changed, 94 insertions(+), 94 deletions(-) diff --git a/RELEASE.md b/RELEASE.md index 260ebe289c..773b163977 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -778,7 +778,7 @@ from kedro.framework.session import KedroSession * In a significant change, [we have introduced `KedroSession`](https://docs.kedro.org/en/0.17.0/04_kedro_project_setup/03_session.html) which is responsible for managing the lifecycle of a Kedro run. * Created a new Kedro Starter: `kedro new --starter=mini-kedro`. It is possible to [use the DataCatalog as a standalone component](https://github.com/kedro-org/kedro-starters/tree/master/mini-kedro) in a Jupyter notebook and transition into the rest of the Kedro framework. * Added `DatasetSpecs` with Hooks to run before and after datasets are loaded from/saved to the catalog. -* Added a command: `kedro catalog create`. For a registered pipeline, it creates a `//catalog/.yml` configuration file with `MemoryDataSet` datasets for each dataset that is missing from `DataCatalog`. +* Added a command: `kedro catalog create`. For a registered pipeline, it creates a `//catalog/.yml` configuration file with `MemoryDataset` datasets for each dataset that is missing from `DataCatalog`. * Added `settings.py` and `pyproject.toml` (to replace `.kedro.yml`) for project configuration, in line with Python best practice. * `ProjectContext` is no longer needed, unless for very complex customisations. `KedroContext`, `ProjectHooks` and `settings.py` together implement sensible default behaviour. As a result `context_path` is also now an _optional_ key in `pyproject.toml`. * Removed `ProjectContext` from `src//run.py`. @@ -1284,7 +1284,7 @@ You can also load data incrementally whenever it is dumped into a directory with - `kedro.io` - `kedro.extras.datasets` - Import path, specified in `type` -* Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataSet` to specify (`deepcopy`, `copy` or `assign`) the copy mode to use when loading and saving. +* Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataset` to specify (`deepcopy`, `copy` or `assign`) the copy mode to use when loading and saving. ### New Datasets @@ -1504,7 +1504,7 @@ You can also load data incrementally whenever it is dumped into a directory with * Documented the architecture of Kedro showing how we think about library, project and framework components. * `extras/kedro_project_loader.py` renamed to `extras/ipython_loader.py` and now runs any IPython startup scripts without relying on the Kedro project structure. 
* Fixed TypeError when validating partial function's signature. -* After a node failure during a pipeline run, a resume command will be suggested in the logs. This command will not work if the required inputs are MemoryDataSets. +* After a node failure during a pipeline run, a resume command will be suggested in the logs. This command will not work if the required inputs are MemoryDatasets. ## Breaking changes to the API @@ -1615,7 +1615,7 @@ These steps should have brought your project to Kedro 0.15.0. There might be som * Fix local project source not having priority over the same source installed as a package, leading to local updates not being recognised. ## Breaking changes to the API -* Remove the max_loads argument from the `MemoryDataSet` constructor and from the `AbstractRunner.create_default_data_set` method. +* Remove the max_loads argument from the `MemoryDataset` constructor and from the `AbstractRunner.create_default_data_set` method. ## Thanks for supporting contributions [Joel Schwarzmann](https://github.com/datajoely), [Alex Kalmikov](https://github.com/kalexqb) diff --git a/docs/source/configuration/advanced_configuration.md b/docs/source/configuration/advanced_configuration.md index efd71a8564..04b52f18b1 100644 --- a/docs/source/configuration/advanced_configuration.md +++ b/docs/source/configuration/advanced_configuration.md @@ -176,7 +176,7 @@ From version 0.17.0, `TemplatedConfigLoader` also supports the [Jinja2](https:// ``` {% for speed in ['fast', 'slow'] %} {{ speed }}-trains: - type: MemoryDataSet + type: MemoryDataset {{ speed }}-cars: type: pandas.CSVDataSet @@ -197,13 +197,13 @@ The output Python dictionary will look as follows: ```python { - "fast-trains": {"type": "MemoryDataSet"}, + "fast-trains": {"type": "MemoryDataset"}, "fast-cars": { "type": "pandas.CSVDataSet", "filepath": "s3://my_s3_bucket/fast-cars.csv", "save_args": {"index": True}, }, - "slow-trains": {"type": "MemoryDataSet"}, + "slow-trains": {"type": "MemoryDataset"}, "slow-cars": { "type": "pandas.CSVDataSet", "filepath": "s3://my_s3_bucket/slow-cars.csv", diff --git a/docs/source/configuration/parameters.md b/docs/source/configuration/parameters.md index 60de2d4da4..61c6ff0e9c 100644 --- a/docs/source/configuration/parameters.md +++ b/docs/source/configuration/parameters.md @@ -66,7 +66,7 @@ node( ) ``` -In both cases, under the hood parameters are added to the Data Catalog through the method `add_feed_dict()` in [`DataCatalog`](/kedro.io.DataCatalog), where they live as `MemoryDataSet`s. This method is also what the `KedroContext` class uses when instantiating the catalog. +In both cases, under the hood parameters are added to the Data Catalog through the method `add_feed_dict()` in [`DataCatalog`](/kedro.io.DataCatalog), where they live as `MemoryDataset`s. This method is also what the `KedroContext` class uses when instantiating the catalog. ```{note} You can use `add_feed_dict()` to inject any other entries into your `DataCatalog` as per your use case. diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index 3ea11b2a27..2cc5f4c995 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -359,14 +359,14 @@ The list of all available parameters is given in the [Paramiko documentation](ht You can use the [`kedro catalog create` command to create a Data Catalog YAML configuration](../development/commands_reference.md#create-a-data-catalog-yaml-configuration-file). 
-This creates a `//catalog/.yml` configuration file with `MemoryDataSet` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`. +This creates a `//catalog/.yml` configuration file with `MemoryDataset` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`. ```yaml # //catalog/.yml rockets: - type: MemoryDataSet + type: MemoryDataset scooters: - type: MemoryDataSet + type: MemoryDataset ``` ## Adding parameters @@ -601,9 +601,9 @@ This use is not recommended unless you are prototyping in notebooks. #### Save data to memory ```python -from kedro.io import MemoryDataSet +from kedro.io import MemoryDataset -memory = MemoryDataSet(data=None) +memory = MemoryDataset(data=None) io.add("cars_cache", memory) io.save("cars_cache", "Memory can store anything.") io.load("car_cache") diff --git a/docs/source/deployment/argo.md b/docs/source/deployment/argo.md index f66b809b0e..599ff819c0 100644 --- a/docs/source/deployment/argo.md +++ b/docs/source/deployment/argo.md @@ -24,7 +24,7 @@ To use Argo Workflows, ensure you have the following prerequisites in place: - [Argo Workflows is installed](https://github.com/argoproj/argo/blob/master/README.md#quickstart) on your Kubernetes cluster - [Argo CLI is installed](https://github.com/argoproj/argo/releases) on your machine - A `name` attribute is set for each [Kedro node](/kedro.pipeline.node) since it is used to build a DAG -- [All node input/output DataSets must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3); you cannot use the `MemoryDataSet` in your workflow +- [All node input/output DataSets must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3); you cannot use the `MemoryDataset` in your workflow ```{note} Each node will run in its own container. diff --git a/docs/source/deployment/aws_step_functions.md b/docs/source/deployment/aws_step_functions.md index 380f303067..7a08ba3416 100644 --- a/docs/source/deployment/aws_step_functions.md +++ b/docs/source/deployment/aws_step_functions.md @@ -40,7 +40,7 @@ $ cdk -h The deployment process for a Kedro pipeline on AWS Step Functions consists of the following steps: * Develop the Kedro pipeline locally as normal -* Create a new configuration environment in which we ensure all nodes' inputs and outputs have a persistent location on S3, since `MemoryDataSet` can't be shared between AWS Lambda functions +* Create a new configuration environment in which we ensure all nodes' inputs and outputs have a persistent location on S3, since `MemoryDataset` can't be shared between AWS Lambda functions * Package the Kedro pipeline as an [AWS Lambda-compliant Docker image](https://docs.aws.amazon.com/lambda/latest/dg/lambda-images.html) * Write a script to convert and deploy each Kedro node as an AWS Lambda function. Each function will use the same pipeline Docker image created in the previous step and run a single Kedro node associated with it. This follows the principles laid out in our [distributed deployment guide](distributed). * The script above will also convert and deploy the entire Kedro pipeline as an AWS Step Functions State Machine. 
diff --git a/docs/source/deployment/prefect.md b/docs/source/deployment/prefect.md index 556097faa6..0880691be6 100644 --- a/docs/source/deployment/prefect.md +++ b/docs/source/deployment/prefect.md @@ -39,7 +39,7 @@ from kedro.framework.hooks.manager import _create_hook_manager from kedro.framework.project import pipelines from kedro.framework.session import KedroSession from kedro.framework.startup import bootstrap_project -from kedro.io import DataCatalog, MemoryDataSet +from kedro.io import DataCatalog, MemoryDataset from kedro.pipeline.node import Node from kedro.runner import run_node from prefect import Client, Flow, Task @@ -133,7 +133,7 @@ class KedroInitTask(Task): catalog = context.catalog unregistered_ds = pipeline.data_sets() - set(catalog.list()) # NOQA for ds_name in unregistered_ds: - catalog.add(ds_name, MemoryDataSet()) + catalog.add(ds_name, MemoryDataset()) return {"catalog": catalog, "sess_id": session.session_id} diff --git a/docs/source/development/commands_reference.md b/docs/source/development/commands_reference.md index 1745aee8b9..b26357bdae 100644 --- a/docs/source/development/commands_reference.md +++ b/docs/source/development/commands_reference.md @@ -495,7 +495,7 @@ kedro catalog list --pipeline=ds,de ##### Create a Data Catalog YAML configuration file -The following command creates a Data Catalog YAML configuration file with `MemoryDataSet` datasets for each dataset in a registered pipeline, if it is missing from the `DataCatalog`. +The following command creates a Data Catalog YAML configuration file with `MemoryDataset` datasets for each dataset in a registered pipeline, if it is missing from the `DataCatalog`. ```bash kedro catalog create --pipeline= diff --git a/docs/source/hooks/examples.md b/docs/source/hooks/examples.md index f556879319..cdb9963157 100644 --- a/docs/source/hooks/examples.md +++ b/docs/source/hooks/examples.md @@ -78,17 +78,17 @@ The output should look similar to the following: ... [01/25/23 21:38:23] INFO Loading data from 'example_iris_data' (CSVDataSet)... data_catalog.py:343 INFO Loading example_iris_data consumed 0.99MiB memory hooks.py:67 - INFO Loading data from 'parameters' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'parameters' (MemoryDataset)... data_catalog.py:343 INFO Loading parameters consumed 0.48MiB memory hooks.py:67 INFO Running node: split: split_data([example_iris_data,parameters]) -> [X_train,X_test,y_train,y_test] node.py:327 - INFO Saving data to 'X_train' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'X_test' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'y_train' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'y_test' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'X_train' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'X_test' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'y_train' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'y_test' (MemoryDataset)... data_catalog.py:382 INFO Completed 1 out of 3 tasks sequential_runner.py:85 - INFO Loading data from 'X_train' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'X_train' (MemoryDataset)... data_catalog.py:343 INFO Loading X_train consumed 0.49MiB memory hooks.py:67 - INFO Loading data from 'X_test' (MemoryDataSet)... + INFO Loading data from 'X_test' (MemoryDataset)... ... 
``` diff --git a/docs/source/integrations/pyspark_integration.md b/docs/source/integrations/pyspark_integration.md index 3afaf084c7..c0e5cec08b 100644 --- a/docs/source/integrations/pyspark_integration.md +++ b/docs/source/integrations/pyspark_integration.md @@ -169,7 +169,7 @@ pipeline( ) ``` -`first_operation_complete` is a `MemoryDataSet` and it signals that any Delta operations which occur "outside" the Kedro DAG are complete. This can be used as input to a downstream node, to preserve the shape of the DAG. Otherwise, if no downstream nodes need to run after this, the node can simply not return anything: +`first_operation_complete` is a `MemoryDataset` and it signals that any Delta operations which occur "outside" the Kedro DAG are complete. This can be used as input to a downstream node, to preserve the shape of the DAG. Otherwise, if no downstream nodes need to run after this, the node can simply not return anything: ```python pipeline( @@ -188,11 +188,11 @@ The following diagram is the visual representation of the workflow explained abo This pattern of creating "dummy" datasets to preserve the data flow also applies to other "out of DAG" execution operations such as SQL operations within a node. ``` -## Use `MemoryDataSet` for intermediary `DataFrame` +## Use `MemoryDataset` for intermediary `DataFrame` -For nodes operating on `DataFrame` that doesn't need to perform Spark actions such as writing the `DataFrame` to storage, we recommend using the default `MemoryDataSet` to hold the `DataFrame`. In other words, there is no need to specify it in the `DataCatalog` or `catalog.yml`. This allows you to take advantage of Spark's optimiser and lazy evaluation. +For nodes operating on `DataFrame` that doesn't need to perform Spark actions such as writing the `DataFrame` to storage, we recommend using the default `MemoryDataset` to hold the `DataFrame`. In other words, there is no need to specify it in the `DataCatalog` or `catalog.yml`. This allows you to take advantage of Spark's optimiser and lazy evaluation. -## Use `MemoryDataSet` with `copy_mode="assign"` for non-`DataFrame` Spark objects +## Use `MemoryDataset` with `copy_mode="assign"` for non-`DataFrame` Spark objects Sometimes, you might want to use Spark objects that aren't `DataFrame` as inputs and outputs in your pipeline. For example, suppose you have a `train_model` node to train a classifier using Spark ML's [`RandomForrestClassifier`](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier) and a `predict` node to make predictions using this classifier. In this scenario, the `train_model` node will output a `RandomForestClassifier` object, which then becomes the input for the `predict` node. Below is the code for this pipeline: @@ -233,11 +233,11 @@ To make the pipeline work, you will need to specify `example_classifier` as foll ```yaml example_classifier: - type: MemoryDataSet + type: MemoryDataset copy_mode: assign ``` -The `assign` copy mode ensures that the `MemoryDataSet` will be assigned the Spark object itself, not a [deep copy](https://docs.python.org/3/library/copy.html) version of it, since deep copy doesn't work with Spark object generally. +The `assign` copy mode ensures that the `MemoryDataset` will be assigned the Spark object itself, not a [deep copy](https://docs.python.org/3/library/copy.html) version of it, since deep copy doesn't work with Spark object generally. 
## Tips for maximising concurrency using `ThreadRunner` diff --git a/docs/source/nodes_and_pipelines/nodes.md b/docs/source/nodes_and_pipelines/nodes.md index 1d11988a3b..9c81c344c8 100644 --- a/docs/source/nodes_and_pipelines/nodes.md +++ b/docs/source/nodes_and_pipelines/nodes.md @@ -325,17 +325,17 @@ We can now `kedro run` in the terminal. The output shows `X_train`, `X_test`, `y ``` ... [02/10/23 12:42:55] INFO Loading data from 'example_iris_data' (ChunkWiseCSVDataSet)... data_catalog.py:343 - INFO Loading data from 'parameters' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'parameters' (MemoryDataset)... data_catalog.py:343 INFO Running node: split: split_data([example_iris_data,parameters]) -> node.py:329 [X_train,X_test,y_train,y_test] - INFO Saving data to 'X_train' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'X_test' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'y_train' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'y_test' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'X_train' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'X_test' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'y_train' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'y_test' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'X_train' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'X_test' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'y_train' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'y_test' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'X_train' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'X_test' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'y_train' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'y_test' (MemoryDataset)... data_catalog.py:382 INFO Completed 1 out of 3 tasks sequential_runner.py:85 ... ``` diff --git a/docs/source/nodes_and_pipelines/run_a_pipeline.md b/docs/source/nodes_and_pipelines/run_a_pipeline.md index 76ea1afc77..66e2658986 100644 --- a/docs/source/nodes_and_pipelines/run_a_pipeline.md +++ b/docs/source/nodes_and_pipelines/run_a_pipeline.md @@ -57,13 +57,13 @@ If the built-in Kedro runners do not meet your requirements, you can also define ```python # in src//runner.py -from kedro.io import AbstractDataSet, DataCatalog, MemoryDataSet +from kedro.io import AbstractDataSet, DataCatalog, MemoryDataset from kedro.pipeline import Pipeline from kedro.runner.runner import AbstractRunner from pluggy import PluginManager -from kedro.io import AbstractDataSet, DataCatalog, MemoryDataSet +from kedro.io import AbstractDataSet, DataCatalog, MemoryDataset from kedro.pipeline import Pipeline from kedro.runner.runner import AbstractRunner @@ -84,7 +84,7 @@ class DryRunner(AbstractRunner): for all unregistered data sets. """ - return MemoryDataSet() + return MemoryDataset() def _run( self, @@ -204,14 +204,14 @@ By using `DataCatalog` from the IO module we are still able to write pure functi Through `DataCatalog`, we can control where inputs are loaded from, where intermediate variables get persisted and ultimately the location to which output variables are written. 
-In a simple example, we define a `MemoryDataSet` called `xs` to store our inputs, save our input list `[1, 2, 3]` into `xs`, then instantiate `SequentialRunner` and call its `run` method with the pipeline and data catalog instances: +In a simple example, we define a `MemoryDataset` called `xs` to store our inputs, save our input list `[1, 2, 3]` into `xs`, then instantiate `SequentialRunner` and call its `run` method with the pipeline and data catalog instances:
Click to expand ```python -io = DataCatalog(dict(xs=MemoryDataSet())) +io = DataCatalog(dict(xs=MemoryDataset())) ``` ```python diff --git a/docs/source/nodes_and_pipelines/slice_a_pipeline.md b/docs/source/nodes_and_pipelines/slice_a_pipeline.md index f4f4bccf0d..2ed8ee4b3a 100644 --- a/docs/source/nodes_and_pipelines/slice_a_pipeline.md +++ b/docs/source/nodes_and_pipelines/slice_a_pipeline.md @@ -303,10 +303,10 @@ To demonstrate this, let us save the intermediate output `n` using a `JSONDataSe ```python from kedro_datasets.pandas import JSONDataSet -from kedro.io import DataCatalog, MemoryDataSet +from kedro.io import DataCatalog, MemoryDataset n_json = JSONDataSet(filepath="./data/07_model_output/len.json") -io = DataCatalog(dict(xs=MemoryDataSet([1, 2, 3]), n=n_json)) +io = DataCatalog(dict(xs=MemoryDataset([1, 2, 3]), n=n_json)) ```
diff --git a/docs/source/notebooks_and_ipython/kedro_and_notebooks.md b/docs/source/notebooks_and_ipython/kedro_and_notebooks.md index d32139b2f8..ddf42e0f0e 100644 --- a/docs/source/notebooks_and_ipython/kedro_and_notebooks.md +++ b/docs/source/notebooks_and_ipython/kedro_and_notebooks.md @@ -93,7 +93,7 @@ catalog.load("parameters") You should see the following: ```ipython -INFO Loading data from 'parameters' (MemoryDataSet)... +INFO Loading data from 'parameters' (MemoryDataset)... {'example_test_data_ratio': 0.2, 'example_num_train_iter': 10000, diff --git a/docs/source/tutorial/add_another_pipeline.md b/docs/source/tutorial/add_another_pipeline.md index 3e4c0089e2..9a72c3b57d 100644 --- a/docs/source/tutorial/add_another_pipeline.md +++ b/docs/source/tutorial/add_another_pipeline.md @@ -187,40 +187,40 @@ You should see output similar to the following: INFO Loading data from 'companies' (CSVDataSet)... data_catalog.py:343 INFO Running node: preprocess_companies_node: node.py:327 preprocess_companies([companies]) -> [preprocessed_companies] - INFO Saving data to 'preprocessed_companies' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'preprocessed_companies' (MemoryDataset)... data_catalog.py:382 INFO Completed 1 out of 6 tasks sequential_runner.py:85 INFO Loading data from 'shuttles' (ExcelDataSet)... data_catalog.py:343 [08/09/22 16:56:15] INFO Running node: preprocess_shuttles_node: preprocess_shuttles([shuttles]) node.py:327 -> [preprocessed_shuttles] - INFO Saving data to 'preprocessed_shuttles' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'preprocessed_shuttles' (MemoryDataset)... data_catalog.py:382 INFO Completed 2 out of 6 tasks sequential_runner.py:85 - INFO Loading data from 'preprocessed_shuttles' (MemoryDataSet)... data_catalog.py:343 - INFO Loading data from 'preprocessed_companies' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'preprocessed_shuttles' (MemoryDataset)... data_catalog.py:343 + INFO Loading data from 'preprocessed_companies' (MemoryDataset)... data_catalog.py:343 INFO Loading data from 'reviews' (CSVDataSet)... data_catalog.py:343 INFO Running node: create_model_input_table_node: node.py:327 create_model_input_table([preprocessed_shuttles,preprocessed_companies, reviews]) -> [model_input_table] -[08/09/22 16:56:18] INFO Saving data to 'model_input_table' (MemoryDataSet)... data_catalog.py:382 +[08/09/22 16:56:18] INFO Saving data to 'model_input_table' (MemoryDataset)... data_catalog.py:382 [08/09/22 16:56:19] INFO Completed 3 out of 6 tasks sequential_runner.py:85 - INFO Loading data from 'model_input_table' (MemoryDataSet)... data_catalog.py:343 - INFO Loading data from 'params:model_options' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'model_input_table' (MemoryDataset)... data_catalog.py:343 + INFO Loading data from 'params:model_options' (MemoryDataset)... data_catalog.py:343 INFO Running node: split_data_node: node.py:327 split_data([model_input_table,params:model_options]) -> [X_train,X_test,y_train,y_test] - INFO Saving data to 'X_train' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'X_test' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'y_train' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'y_test' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'X_train' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'X_test' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'y_train' (MemoryDataset)... 
data_catalog.py:382 + INFO Saving data to 'y_test' (MemoryDataset)... data_catalog.py:382 INFO Completed 4 out of 6 tasks sequential_runner.py:85 - INFO Loading data from 'X_train' (MemoryDataSet)... data_catalog.py:343 - INFO Loading data from 'y_train' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'X_train' (MemoryDataset)... data_catalog.py:343 + INFO Loading data from 'y_train' (MemoryDataset)... data_catalog.py:343 INFO Running node: train_model_node: train_model([X_train,y_train]) -> node.py:327 [regressor] [08/09/22 16:56:20] INFO Saving data to 'regressor' (PickleDataSet)... data_catalog.py:382 INFO Completed 5 out of 6 tasks sequential_runner.py:85 INFO Loading data from 'regressor' (PickleDataSet)... data_catalog.py:343 - INFO Loading data from 'X_test' (MemoryDataSet)... data_catalog.py:343 - INFO Loading data from 'y_test' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'X_test' (MemoryDataset)... data_catalog.py:343 + INFO Loading data from 'y_test' (MemoryDataset)... data_catalog.py:343 INFO Running node: evaluate_model_node: node.py:327 evaluate_model([regressor,X_test,y_test]) -> None INFO Model has a coefficient R^2 of 0.462 on test data. nodes.py:55 @@ -384,52 +384,52 @@ def create_pipeline(**kwargs) -> Pipeline: ^[[B[11/02/22 10:41:14] INFO Saving data to 'model_input_table' (ParquetDataSet)... data_catalog.py:382 [11/02/22 10:41:15] INFO Completed 3 out of 9 tasks sequential_runner.py:85 INFO Loading data from 'model_input_table' (ParquetDataSet)... data_catalog.py:343 - INFO Loading data from 'params:active_modelling_pipeline.model_options' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'params:active_modelling_pipeline.model_options' (MemoryDataset)... data_catalog.py:343 INFO Running node: split_data_node: node.py:327 split_data([model_input_table,params:active_modelling_pipeline.model_options]) -> [active_modelling_pipeline.X_train,active_modelling_pipeline.X_test,active_modelling_pipeline.y_t rain,active_modelling_pipeline.y_test] - INFO Saving data to 'active_modelling_pipeline.X_train' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'active_modelling_pipeline.X_test' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'active_modelling_pipeline.y_train' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'active_modelling_pipeline.y_test' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'active_modelling_pipeline.X_train' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'active_modelling_pipeline.X_test' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'active_modelling_pipeline.y_train' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'active_modelling_pipeline.y_test' (MemoryDataset)... data_catalog.py:382 INFO Completed 4 out of 9 tasks sequential_runner.py:85 INFO Loading data from 'model_input_table' (ParquetDataSet)... data_catalog.py:343 - INFO Loading data from 'params:candidate_modelling_pipeline.model_options' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'params:candidate_modelling_pipeline.model_options' (MemoryDataset)... data_catalog.py:343 INFO Running node: split_data_node: node.py:327 split_data([model_input_table,params:candidate_modelling_pipeline.model_options]) -> [candidate_modelling_pipeline.X_train,candidate_modelling_pipeline.X_test,candidate_modelling_pip eline.y_train,candidate_modelling_pipeline.y_test] - INFO Saving data to 'candidate_modelling_pipeline.X_train' (MemoryDataSet)... 
data_catalog.py:382 - INFO Saving data to 'candidate_modelling_pipeline.X_test' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'candidate_modelling_pipeline.y_train' (MemoryDataSet)... data_catalog.py:382 - INFO Saving data to 'candidate_modelling_pipeline.y_test' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'candidate_modelling_pipeline.X_train' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'candidate_modelling_pipeline.X_test' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'candidate_modelling_pipeline.y_train' (MemoryDataset)... data_catalog.py:382 + INFO Saving data to 'candidate_modelling_pipeline.y_test' (MemoryDataset)... data_catalog.py:382 INFO Completed 5 out of 9 tasks sequential_runner.py:85 - INFO Loading data from 'active_modelling_pipeline.X_train' (MemoryDataSet)... data_catalog.py:343 - INFO Loading data from 'active_modelling_pipeline.y_train' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'active_modelling_pipeline.X_train' (MemoryDataset)... data_catalog.py:343 + INFO Loading data from 'active_modelling_pipeline.y_train' (MemoryDataset)... data_catalog.py:343 INFO Running node: train_model_node: node.py:327 train_model([active_modelling_pipeline.X_train,active_modelling_pipeline.y_train]) -> [active_modelling_pipeline.regressor] INFO Saving data to 'active_modelling_pipeline.regressor' (PickleDataSet)... data_catalog.py:382 INFO Completed 6 out of 9 tasks sequential_runner.py:85 - INFO Loading data from 'candidate_modelling_pipeline.X_train' (MemoryDataSet)... data_catalog.py:343 - INFO Loading data from 'candidate_modelling_pipeline.y_train' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'candidate_modelling_pipeline.X_train' (MemoryDataset)... data_catalog.py:343 + INFO Loading data from 'candidate_modelling_pipeline.y_train' (MemoryDataset)... data_catalog.py:343 INFO Running node: train_model_node: node.py:327 train_model([candidate_modelling_pipeline.X_train,candidate_modelling_pipeline.y_train]) -> [candidate_modelling_pipeline.regressor] INFO Saving data to 'candidate_modelling_pipeline.regressor' (PickleDataSet)... data_catalog.py:382 INFO Completed 7 out of 9 tasks sequential_runner.py:85 INFO Loading data from 'active_modelling_pipeline.regressor' (PickleDataSet)... data_catalog.py:343 - INFO Loading data from 'active_modelling_pipeline.X_test' (MemoryDataSet)... data_catalog.py:343 - INFO Loading data from 'active_modelling_pipeline.y_test' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'active_modelling_pipeline.X_test' (MemoryDataset)... data_catalog.py:343 + INFO Loading data from 'active_modelling_pipeline.y_test' (MemoryDataset)... data_catalog.py:343 INFO Running node: evaluate_model_node: node.py:327 evaluate_model([active_modelling_pipeline.regressor,active_modelling_pipeline.X_test,active_model ling_pipeline.y_test]) -> None INFO Model has a coefficient R^2 of 0.462 on test data. nodes.py:60 INFO Completed 8 out of 9 tasks sequential_runner.py:85 INFO Loading data from 'candidate_modelling_pipeline.regressor' (PickleDataSet)... data_catalog.py:343 - INFO Loading data from 'candidate_modelling_pipeline.X_test' (MemoryDataSet)... data_catalog.py:343 - INFO Loading data from 'candidate_modelling_pipeline.y_test' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'candidate_modelling_pipeline.X_test' (MemoryDataset)... data_catalog.py:343 + INFO Loading data from 'candidate_modelling_pipeline.y_test' (MemoryDataset)... 
data_catalog.py:343 INFO Running node: evaluate_model_node: node.py:327 evaluate_model([candidate_modelling_pipeline.regressor,candidate_modelling_pipeline.X_test,candid ate_modelling_pipeline.y_test]) -> None diff --git a/docs/source/tutorial/create_a_pipeline.md b/docs/source/tutorial/create_a_pipeline.md index d0173a1cc9..668835b073 100644 --- a/docs/source/tutorial/create_a_pipeline.md +++ b/docs/source/tutorial/create_a_pipeline.md @@ -138,10 +138,10 @@ You should see output similar to the below: [08/09/22 16:43:11] INFO Loading data from 'companies' (CSVDataSet)... data_catalog.py:343 INFO Running node: preprocess_companies_node: node.py:327 preprocess_companies([companies]) -> [preprocessed_companies] - INFO Saving data to 'preprocessed_companies' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'preprocessed_companies' (MemoryDataset)... data_catalog.py:382 INFO Completed 1 out of 1 tasks sequential_runner.py:85 INFO Pipeline execution completed successfully. runner.py:89 - INFO Loading data from 'preprocessed_companies' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'preprocessed_companies' (MemoryDataset)... data_catalog.py:343 ```
@@ -161,16 +161,16 @@ You should see output similar to the following: INFO Loading data from 'companies' (CSVDataSet)... data_catalog.py:343 INFO Running node: preprocess_companies_node: node.py:327 preprocess_companies([companies]) -> [preprocessed_companies] - INFO Saving data to 'preprocessed_companies' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'preprocessed_companies' (MemoryDataset)... data_catalog.py:382 INFO Completed 1 out of 2 tasks sequential_runner.py:85 INFO Loading data from 'shuttles' (ExcelDataSet)... data_catalog.py:343 [08/09/22 16:46:08] INFO Running node: preprocess_shuttles_node: preprocess_shuttles([shuttles]) node.py:327 -> [preprocessed_shuttles] - INFO Saving data to 'preprocessed_shuttles' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'preprocessed_shuttles' (MemoryDataset)... data_catalog.py:382 INFO Completed 2 out of 2 tasks sequential_runner.py:85 INFO Pipeline execution completed successfully. runner.py:89 - INFO Loading data from 'preprocessed_companies' (MemoryDataSet)... data_catalog.py:343 - INFO Loading data from 'preprocessed_shuttles' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'preprocessed_companies' (MemoryDataset)... data_catalog.py:343 + INFO Loading data from 'preprocessed_shuttles' (MemoryDataset)... data_catalog.py:343 ``` @@ -193,7 +193,7 @@ preprocessed_shuttles: ``` -If you remove these lines from `catalog.yml`, Kedro still runs the pipeline successfully and automatically stores the preprocessed data, in memory, as temporary Python objects of the [MemoryDataSet](/kedro.io.MemoryDataSet) class. Once all nodes that depend on a temporary dataset have executed, Kedro clears the dataset and the Python garbage collector releases the memory. +If you remove these lines from `catalog.yml`, Kedro still runs the pipeline successfully and automatically stores the preprocessed data, in memory, as temporary Python objects of the [MemoryDataset](/kedro.io.MemoryDataset) class. Once all nodes that depend on a temporary dataset have executed, Kedro clears the dataset and the Python garbage collector releases the memory. ## Create a table for model input @@ -295,24 +295,24 @@ You should see output similar to the following: INFO Loading data from 'companies' (CSVDataSet)... data_catalog.py:343 INFO Running node: preprocess_companies_node: node.py:327 preprocess_companies([companies]) -> [preprocessed_companies] - INFO Saving data to 'preprocessed_companies' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'preprocessed_companies' (MemoryDataset)... data_catalog.py:382 INFO Completed 1 out of 3 tasks sequential_runner.py:85 INFO Loading data from 'shuttles' (ExcelDataSet)... data_catalog.py:343 [08/09/22 17:01:25] INFO Running node: preprocess_shuttles_node: preprocess_shuttles([shuttles]) node.py:327 -> [preprocessed_shuttles] - INFO Saving data to 'preprocessed_shuttles' (MemoryDataSet)... data_catalog.py:382 + INFO Saving data to 'preprocessed_shuttles' (MemoryDataset)... data_catalog.py:382 INFO Completed 2 out of 3 tasks sequential_runner.py:85 - INFO Loading data from 'preprocessed_shuttles' (MemoryDataSet)... data_catalog.py:343 - INFO Loading data from 'preprocessed_companies' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'preprocessed_shuttles' (MemoryDataset)... data_catalog.py:343 + INFO Loading data from 'preprocessed_companies' (MemoryDataset)... data_catalog.py:343 INFO Loading data from 'reviews' (CSVDataSet)... 
data_catalog.py:343 INFO Running node: create_model_input_table_node: node.py:327 create_model_input_table([preprocessed_shuttles,preprocessed_companies, reviews]) -> [model_input_table] -[08/09/22 17:01:28] INFO Saving data to 'model_input_table' (MemoryDataSet)... data_catalog.py:382 +[08/09/22 17:01:28] INFO Saving data to 'model_input_table' (MemoryDataset)... data_catalog.py:382 [08/09/22 17:01:29] INFO Completed 3 out of 3 tasks sequential_runner.py:85 INFO Pipeline execution completed successfully. runner.py:89 - INFO Loading data from 'model_input_table' (MemoryDataSet)... data_catalog.py:343 + INFO Loading data from 'model_input_table' (MemoryDataset)... data_catalog.py:343 ``` From c88115432ad9c79175c3f3d5221b898c1569ea4f Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Tue, 27 Jun 2023 10:55:19 -0500 Subject: [PATCH 03/14] PartitionedDataSet->PartitionedDataset in .md files Signed-off-by: Deepyaman Datta --- RELEASE.md | 22 ++++---- docs/source/data/kedro_io.md | 60 ++++++++++----------- docs/source/extend_kedro/custom_datasets.md | 10 ++-- 3 files changed, 46 insertions(+), 46 deletions(-) diff --git a/RELEASE.md b/RELEASE.md index 773b163977..3e80ab79a6 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -3,7 +3,7 @@ ## Major features and improvements ## Bug fixes and other changes -* Compare for protocol and delimiter in `PartitionedDataSet` to be able to pass the protocol to partitions which paths starts with the same characters as the protocol (e.g. `s3://s3-my-bucket`). +* Compare for protocol and delimiter in `PartitionedDataset` to be able to pass the protocol to partitions which paths starts with the same characters as the protocol (e.g. `s3://s3-my-bucket`). ## Breaking changes to the API @@ -540,7 +540,7 @@ The parameters should look like this: * Upgraded `pip-tools`, which is used by `kedro build-reqs`, to 6.4. This `pip-tools` version requires `pip>=21.2` while [adding support for `pip>=21.3`](https://github.com/jazzband/pip-tools/pull/1501). To upgrade `pip`, please refer to [their documentation](https://pip.pypa.io/en/stable/installing/#upgrading-pip). * Relaxed the bounds on the `plotly` requirement for `plotly.PlotlyDataSet` and the `pyarrow` requirement for `pandas.ParquetDataSet`. * `kedro pipeline package ` now raises an error if the `` argument doesn't look like a valid Python module path (e.g. has `/` instead of `.`). -* Added new `overwrite` argument to `PartitionedDataSet` and `MatplotlibWriter` to enable deletion of existing partitions and plots on dataset `save`. +* Added new `overwrite` argument to `PartitionedDataset` and `MatplotlibWriter` to enable deletion of existing partitions and plots on dataset `save`. * `kedro pipeline pull` now works when the project requirements contains entries such as `-r`, `--extra-index-url` and local wheel files ([Issue #913](https://github.com/kedro-org/kedro/issues/913)). * Fixed slow startup because of catalog processing by reducing the exponential growth of extra processing during `_FrozenDatasets` creations. * Removed `.coveragerc` from the Kedro project template. `coverage` settings are now given in `pyproject.toml`. @@ -620,7 +620,7 @@ The parameters should look like this: * Fixed a bug where `kedro ipython` and `kedro jupyter notebook` didn't work if the `PYTHONPATH` was already set. * Update the IPython extension to allow passing `env` and `extra_params` to `reload_kedro` similar to how the IPython script works. * `kedro info` now outputs if a plugin has any `hooks` or `cli_hooks` implemented. 
-* `PartitionedDataSet` now supports lazily materializing data on save. +* `PartitionedDataset` now supports lazily materializing data on save. * `kedro pipeline describe` now defaults to the `__default__` pipeline when no pipeline name is provided and also shows the namespace the nodes belong to. * Fixed an issue where spark.SparkDataSet with enabled versioning would throw a VersionNotFoundError when using databricks-connect from a remote machine and saving to dbfs filesystem. * `EmailMessageDataSet` added to doctree. @@ -805,7 +805,7 @@ from kedro.framework.session import KedroSession * The pipeline-specific `catalog.yml` file is no longer automatically created for modular pipelines when running `kedro pipeline create`. Use `kedro catalog create` to replace this functionality. * Removed `include_examples` prompt from `kedro new`. To generate boilerplate example code, you should use a Kedro starter. * Changed the `--verbose` flag from a global command to a project-specific command flag (e.g `kedro --verbose new` becomes `kedro new --verbose`). -* Dropped support of the `dataset_credentials` key in credentials in `PartitionedDataSet`. +* Dropped support of the `dataset_credentials` key in credentials in `PartitionedDataset`. * `get_source_dir()` was removed from `kedro/framework/cli/utils.py`. * Dropped support of `get_config`, `create_catalog`, `create_pipeline`, `template_version`, `project_name` and `project_path` keys by `get_project_context()` function (`kedro/framework/cli/cli.py`). * `kedro new --starter` now defaults to fetching the starter template matching the installed Kedro version. @@ -908,7 +908,7 @@ Check your source directory. If you defined a different source directory (`sourc ## Bug fixes and other changes * Fixed `TypeError` when converting dict inputs to a node made from a wrapped `partial` function. -* `PartitionedDataSet` improvements: +* `PartitionedDataset` improvements: - Supported passing arguments to the underlying filesystem. * Improved handling of non-ASCII word characters in dataset names. - For example, a dataset named `jalapeño` will be accessible as `DataCatalog.datasets.jalapeño` rather than `DataCatalog.datasets.jalape__o`. @@ -1122,9 +1122,9 @@ Even though this release ships a fix for project generated with `kedro==0.16.2`, * Updated contribution process in `CONTRIBUTING.md` - added Developer Workflow. * Documented installation of development version of Kedro in the [FAQ section](https://docs.kedro.org/en/0.16.0/06_resources/01_faq.html#how-can-i-use-development-version-of-kedro). * Added missing `_exists` method to `MyOwnDataSet` example in 04_user_guide/08_advanced_io. -* Fixed a bug where `PartitionedDataSet` and `IncrementalDataSet` were not working with `s3a` or `s3n` protocol. +* Fixed a bug where `PartitionedDataset` and `IncrementalDataSet` were not working with `s3a` or `s3n` protocol. * Added ability to read partitioned parquet file from a directory in `pandas.ParquetDataSet`. -* Replaced `functools.lru_cache` with `cachetools.cachedmethod` in `PartitionedDataSet` and `IncrementalDataSet` for per-instance cache invalidation. +* Replaced `functools.lru_cache` with `cachetools.cachedmethod` in `PartitionedDataset` and `IncrementalDataSet` for per-instance cache invalidation. * Implemented custom glob function for `SparkDataSet` when running on Databricks. * Fixed a bug in `SparkDataSet` not allowing for loading data from DBFS in a Windows machine using Databricks-connect. 
* Improved the error message for `DataSetNotFoundError` to suggest possible dataset names user meant to type. @@ -1141,7 +1141,7 @@ Even though this release ships a fix for project generated with `kedro==0.16.2`, * `get_last_load_version` and `get_last_save_version` have been renamed to `resolve_load_version` and `resolve_save_version` on ``AbstractVersionedDataSet``, the results of which are cached. * The `release()` method on datasets extending ``AbstractVersionedDataSet`` clears the cached load and save version. All custom datasets must call `super()._release()` inside `_release()`. * ``TextDataSet`` no longer has `load_args` and `save_args`. These can instead be specified under `open_args_load` or `open_args_save` in `fs_args`. -* `PartitionedDataSet` and `IncrementalDataSet` method `invalidate_cache` was made private: `_invalidate_caches`. +* `PartitionedDataset` and `IncrementalDataSet` method `invalidate_cache` was made private: `_invalidate_caches`. ### Other * Removed `KEDRO_ENV_VAR` from `kedro.context` to speed up the CLI run time. @@ -1302,7 +1302,7 @@ You can also load data incrementally whenever it is dumped into a directory with | `biosequence.BioSequenceDataSet` | Work with bio-sequence objects using [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to communicate with the underlying filesystem | `kedro.extras.datasets.biosequence` | | `pandas.GBQTableDataSet` | Work with Google BigQuery | `kedro.extras.datasets.pandas` | | `pandas.FeatherDataSet` | Work with feather files using [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` | -| `IncrementalDataSet` | Inherit from `PartitionedDataSet` and remembers the last processed partition | `kedro.io` | +| `IncrementalDataSet` | Inherit from `PartitionedDataset` and remembers the last processed partition | `kedro.io` | ### Files with a new location @@ -1373,7 +1373,7 @@ You can also load data incrementally whenever it is dumped into a directory with * Bumped minimum required pandas version to 0.24.0 to make use of `pandas.DataFrame.to_numpy` (recommended alternative to `pandas.DataFrame.values`). * Docs improvements. * `Pipeline.transform` skips modifying node inputs/outputs containing `params:` or `parameters` keywords. -* Support for `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config. +* Support for `dataset_credentials` key in the credentials for `PartitionedDataset` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config. * Datasets can have a new `confirm` function which is called after a successful node function execution if the node contains `confirms` argument with such dataset name. * Make the resume prompt on pipeline run failure use `--from-nodes` instead of `--from-inputs` to avoid unnecessarily re-running nodes that had already executed. * When closed, Jupyter notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use `--idle-timeout` option to update it. @@ -1402,7 +1402,7 @@ You can also load data incrementally whenever it is dumped into a directory with - `ParquetGCSDataSet` dataset in `contrib` for working with Parquet files in Google Cloud Storage. - `JSONGCSDataSet` dataset in `contrib` for working with JSON files in Google Cloud Storage. - `MatplotlibS3Writer` dataset in `contrib` for saving Matplotlib images to S3. 
- - `PartitionedDataSet` for working with datasets split across multiple files. + - `PartitionedDataset` for working with datasets split across multiple files. - `JSONDataSet` dataset for working with JSON files that uses [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to communicate with the underlying filesystem. It doesn't support `http(s)` protocol for now. * Added `s3fs_args` to all S3 datasets. * Pipelines can be deducted with `pipeline1 - pipeline2`. diff --git a/docs/source/data/kedro_io.md b/docs/source/data/kedro_io.md index 6fdfefdd66..720120fbf2 100644 --- a/docs/source/data/kedro_io.md +++ b/docs/source/data/kedro_io.md @@ -241,9 +241,9 @@ Although HTTP(S) is a supported file system in the dataset implementations, it d These days, distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible. -This is why Kedro provides a built-in [PartitionedDataSet](/kedro.io.PartitionedDataSet), with the following features: +This is why Kedro provides a built-in [PartitionedDataset](/kedro.io.PartitionedDataset), with the following features: -* `PartitionedDataSet` can recursively load/save all or specific files from a given location. +* `PartitionedDataset` can recursively load/save all or specific files from a given location. * It is platform agnostic, and can work with any filesystem implementation supported by [fsspec](https://filesystem-spec.readthedocs.io/) including local, S3, GCS, and many more. * It implements a [lazy loading](https://en.wikipedia.org/wiki/Lazy_loading) approach, and does not attempt to load any partition data until a processing node explicitly requests it. * It supports lazy saving by using `Callable`s. @@ -254,13 +254,13 @@ In this section, each individual file inside a given location is called a partit ### Partitioned dataset definition -`PartitionedDataSet` definition can be put in your `catalog.yml` file like any other regular dataset definition. The definition represents the following structure: +`PartitionedDataset` definition can be put in your `catalog.yml` file like any other regular dataset definition. 
The definition represents the following structure: ```yaml # conf/base/catalog.yml my_partitioned_dataset: - type: PartitionedDataSet + type: PartitionedDataset path: s3://my-bucket-name/path/to/folder # path to the location of partitions dataset: pandas.CSVDataSet # shorthand notation for the dataset which will handle individual partitions credentials: my_credentials @@ -270,16 +270,16 @@ my_partitioned_dataset: ``` ```{note} -Like any other dataset, `PartitionedDataSet` can also be instantiated programmatically in Python: +Like any other dataset, `PartitionedDataset` can also be instantiated programmatically in Python: ``` ```python from kedro_datasets.pandas import CSVDataSet -from kedro.io import PartitionedDataSet +from kedro.io import PartitionedDataset my_credentials = {...} # credentials dictionary -my_partitioned_dataset = PartitionedDataSet( +my_partitioned_dataset = PartitionedDataset( path="s3://my-bucket-name/path/to/folder", dataset=CSVDataSet, credentials=my_credentials, @@ -293,7 +293,7 @@ Alternatively, if you need more granular configuration of the underlying dataset # conf/base/catalog.yml my_partitioned_dataset: - type: PartitionedDataSet + type: PartitionedDataset path: s3://my-bucket-name/path/to/folder dataset: # full dataset config notation type: pandas.CSVDataSet @@ -309,7 +309,7 @@ my_partitioned_dataset: filename_suffix: ".csv" ``` -Here is an exhaustive list of the arguments supported by `PartitionedDataSet`: +Here is an exhaustive list of the arguments supported by `PartitionedDataset`: | Argument | Required | Supported types | Description | | ----------------- | ------------------------------ | ------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -322,7 +322,7 @@ Here is an exhaustive list of the arguments supported by `PartitionedDataSet`: #### Dataset definition -Dataset definition should be passed into the `dataset` argument of the `PartitionedDataSet`. The dataset definition is used to instantiate a new dataset object for each individual partition, and use that dataset object for load and save operations. Dataset definition supports shorthand and full notations. +Dataset definition should be passed into the `dataset` argument of the `PartitionedDataset`. The dataset definition is used to instantiate a new dataset object for each individual partition, and use that dataset object for load and save operations. Dataset definition supports shorthand and full notations. 
##### Shorthand notation @@ -332,26 +332,26 @@ Requires you only to specify a class of the underlying dataset either as a strin Full notation allows you to specify a dictionary with the full underlying dataset definition _except_ the following arguments: * The argument that receives the partition path (`filepath` by default) - if specified, a `UserWarning` will be emitted stating that this value will be overridden by individual partition paths -* `credentials` key - specifying it will result in a `DataSetError` being raised; dataset credentials should be passed into the `credentials` argument of the `PartitionedDataSet` rather than the underlying dataset definition - see the section below on [partitioned dataset credentials](#partitioned-dataset-credentials) for details +* `credentials` key - specifying it will result in a `DataSetError` being raised; dataset credentials should be passed into the `credentials` argument of the `PartitionedDataset` rather than the underlying dataset definition - see the section below on [partitioned dataset credentials](#partitioned-dataset-credentials) for details * `versioned` flag - specifying it will result in a `DataSetError` being raised; versioning cannot be enabled for the underlying datasets #### Partitioned dataset credentials ```{note} -Support for `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config. +Support for `dataset_credentials` key in the credentials for `PartitionedDataset` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config. ``` -Credentials management for `PartitionedDataSet` is somewhat special, because it might contain credentials for both `PartitionedDataSet` itself _and_ the underlying dataset that is used for partition load and save. Top-level credentials are passed to the underlying dataset config (unless such config already has credentials configured), but not the other way around - dataset credentials are never propagated to the filesystem. +Credentials management for `PartitionedDataset` is somewhat special, because it might contain credentials for both `PartitionedDataset` itself _and_ the underlying dataset that is used for partition load and save. Top-level credentials are passed to the underlying dataset config (unless such config already has credentials configured), but not the other way around - dataset credentials are never propagated to the filesystem. 
Here is the full list of possible scenarios: -| Top-level credentials | Underlying dataset credentials | Example `PartitionedDataSet` definition | Description | +| Top-level credentials | Underlying dataset credentials | Example `PartitionedDataset` definition | Description | | --------------------- | ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Undefined | Undefined | `PartitionedDataSet(path="s3://bucket-name/path/to/folder", dataset="pandas.CSVDataSet")` | Credentials are not passed to the underlying dataset or the filesystem | -| Undefined | Specified | `PartitionedDataSet(path="s3://bucket-name/path/to/folder", dataset={"type": "pandas.CSVDataSet", "credentials": {"secret": True}})` | Underlying dataset credentials are passed to the `CSVDataSet` constructor, filesystem is instantiated without credentials | -| Specified | Undefined | `PartitionedDataSet(path="s3://bucket-name/path/to/folder", dataset="pandas.CSVDataSet", credentials={"secret": True})` | Top-level credentials are passed to the underlying `CSVDataSet` constructor and the filesystem | -| Specified | `None` | `PartitionedDataSet(path="s3://bucket-name/path/to/folder", dataset={"type": "pandas.CSVDataSet", "credentials": None}, credentials={"dataset_secret": True})` | Top-level credentials are passed to the filesystem, `CSVDataSet` is instantiated without credentials - this way you can stop the top-level credentials from propagating into the dataset config | -| Specified | Specified | `PartitionedDataSet(path="s3://bucket-name/path/to/folder", dataset={"type": "pandas.CSVDataSet", "credentials": {"dataset_secret": True}}, credentials={"secret": True})` | Top-level credentials are passed to the filesystem, underlying dataset credentials are passed to the `CSVDataSet` constructor | +| Undefined | Undefined | `PartitionedDataset(path="s3://bucket-name/path/to/folder", dataset="pandas.CSVDataSet")` | Credentials are not passed to the underlying dataset or the filesystem | +| Undefined | Specified | `PartitionedDataset(path="s3://bucket-name/path/to/folder", dataset={"type": "pandas.CSVDataSet", "credentials": {"secret": True}})` | Underlying dataset credentials are passed to the `CSVDataSet` constructor, filesystem is instantiated without credentials | +| Specified | Undefined | `PartitionedDataset(path="s3://bucket-name/path/to/folder", dataset="pandas.CSVDataSet", credentials={"secret": True})` | Top-level credentials are passed to the underlying `CSVDataSet` constructor and the filesystem | +| Specified | `None` | `PartitionedDataset(path="s3://bucket-name/path/to/folder", dataset={"type": "pandas.CSVDataSet", "credentials": None}, credentials={"dataset_secret": True})` | Top-level credentials are passed to the filesystem, `CSVDataSet` is instantiated without credentials - this way you can stop the top-level credentials from propagating into the dataset config | +| Specified | Specified | `PartitionedDataset(path="s3://bucket-name/path/to/folder", dataset={"type": "pandas.CSVDataSet", "credentials": {"dataset_secret": True}}, credentials={"secret": True})` | Top-level credentials are passed to the filesystem, underlying dataset credentials are passed to the `CSVDataSet` 
constructor | ### Partitioned dataset load @@ -389,7 +389,7 @@ def concat_partitions(partitioned_input: Dict[str, Callable[[], Any]]) -> pd.Dat return result ``` -As you can see from the above example, on load `PartitionedDataSet` _does not_ automatically load the data from the located partitions. Instead, `PartitionedDataSet` returns a dictionary with partition IDs as keys and the corresponding load functions as values. It allows the node that consumes the `PartitionedDataSet` to implement the logic that defines what partitions need to be loaded, and how this data is going to be processed. +As you can see from the above example, on load `PartitionedDataset` _does not_ automatically load the data from the located partitions. Instead, `PartitionedDataset` returns a dictionary with partition IDs as keys and the corresponding load functions as values. It allows the node that consumes the `PartitionedDataset` to implement the logic that defines what partitions need to be loaded, and how this data is going to be processed. Partition ID _does not_ represent the whole partition path, but only a part of it that is unique for a given partition _and_ filename suffix: @@ -398,17 +398,17 @@ Partition ID _does not_ represent the whole partition path, but only a part of i * Example 2: if `path=s3://my-bucket-name/folder` and `filename_suffix=".csv"` and partition is stored in `s3://my-bucket-name/folder/2019-12-04/data.csv`, then its Partition ID is `2019-12-04/data`. -`PartitionedDataSet` implements caching on load operation, which means that if multiple nodes consume the same `PartitionedDataSet`, they will all receive the same partition dictionary even if some new partitions were added to the folder after the first load has been completed. This is done deliberately to guarantee the consistency of load operations between the nodes and avoid race conditions. To reset the cache, call the `release()` method of the partitioned dataset object. +`PartitionedDataset` implements caching on load operation, which means that if multiple nodes consume the same `PartitionedDataset`, they will all receive the same partition dictionary even if some new partitions were added to the folder after the first load has been completed. This is done deliberately to guarantee the consistency of load operations between the nodes and avoid race conditions. To reset the cache, call the `release()` method of the partitioned dataset object. ### Partitioned dataset save -`PartitionedDataSet` also supports a save operation. Let's assume the following configuration: +`PartitionedDataset` also supports a save operation. Let's assume the following configuration: ```yaml # conf/base/catalog.yml new_partitioned_dataset: - type: PartitionedDataSet + type: PartitionedDataset path: s3://my-bucket-name dataset: pandas.CSVDataSet filename_suffix: ".csv" @@ -430,7 +430,7 @@ import pandas as pd def create_partitions() -> Dict[str, Any]: - """Create new partitions and save using PartitionedDataSet. + """Create new partitions and save using PartitionedDataset. Returns: Dictionary with the partitions to create. @@ -444,11 +444,11 @@ def create_partitions() -> Dict[str, Any]: ``` ```{note} -Writing to an existing partition may result in its data being overwritten, if this case is not specifically handled by the underlying dataset implementation. You should implement your own checks to ensure that no existing data is lost when writing to a `PartitionedDataSet`. 
The simplest safety mechanism could be to use partition IDs with a high chance of uniqueness: for example, the current timestamp. +Writing to an existing partition may result in its data being overwritten, if this case is not specifically handled by the underlying dataset implementation. You should implement your own checks to ensure that no existing data is lost when writing to a `PartitionedDataset`. The simplest safety mechanism could be to use partition IDs with a high chance of uniqueness: for example, the current timestamp. ``` ### Partitioned dataset lazy saving -`PartitionedDataSet` also supports lazy saving, where the partition's data is not materialised until it is time to write. +`PartitionedDataset` also supports lazy saving, where the partition's data is not materialised until it is time to write. To use this, simply return `Callable` types in the dictionary: ```python @@ -457,7 +457,7 @@ import pandas as pd def create_partitions() -> Dict[str, Callable[[], Any]]: - """Create new partitions and save using PartitionedDataSet. + """Create new partitions and save using PartitionedDataset. Returns: Dictionary of the partitions to create to a function that creates them. @@ -476,7 +476,7 @@ When using lazy saving, the dataset will be written _after_ the `after_node_run` ### Incremental loads with `IncrementalDataSet` -[IncrementalDataSet](/kedro.io.IncrementalDataSet) is a subclass of `PartitionedDataSet`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataSet` addresses the use case when partitions have to be processed incrementally, i.e. each subsequent pipeline run should only process the partitions which were not processed by the previous runs. +[IncrementalDataSet](/kedro.io.IncrementalDataSet) is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataSet` addresses the use case when partitions have to be processed incrementally, i.e. each subsequent pipeline run should only process the partitions which were not processed by the previous runs. This checkpoint, by default, is persisted to the location of the data partitions. For example, for `IncrementalDataSet` instantiated with path `s3://my-bucket-name/path/to/folder`, the checkpoint will be saved to `s3://my-bucket-name/path/to/folder/CHECKPOINT`, unless [the checkpoint configuration is explicitly overwritten](#checkpoint-configuration). @@ -484,13 +484,13 @@ The checkpoint file is only created _after_ [the partitioned dataset is explicit #### Incremental dataset load -Loading `IncrementalDataSet` works similarly to [`PartitionedDataSet`](#partitioned-dataset-load) with several exceptions: +Loading `IncrementalDataSet` works similarly to [`PartitionedDataset`](#partitioned-dataset-load) with several exceptions: 1. `IncrementalDataSet` loads the data _eagerly_, so the values in the returned dictionary represent the actual data stored in the corresponding partition, rather than a pointer to the load function. `IncrementalDataSet` considers a partition relevant for processing if its ID satisfies the comparison function, given the checkpoint value. 2. `IncrementalDataSet` _does not_ raise a `DataSetError` if load finds no partitions to return - an empty dictionary is returned instead. An empty list of available partitions is part of a normal workflow for `IncrementalDataSet`. 
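Taken together, these two points mean a consuming node receives plain, already-loaded data keyed by partition ID, and possibly an empty dictionary. A minimal, hedged sketch of such a node is shown below; the concatenation logic and the empty-input fallback are assumptions for illustration:

```python
from typing import Any, Dict

import pandas as pd


def process_partitions(partitioned_input: Dict[str, Any]) -> pd.DataFrame:
    """Values are already-loaded partition data, not load functions."""
    if not partitioned_input:
        # No unprocessed partitions since the last confirmed checkpoint.
        return pd.DataFrame()
    return pd.concat(list(partitioned_input.values()), ignore_index=True)
```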
#### Incremental dataset save -The `IncrementalDataSet` save operation is identical to the [save operation of the `PartitionedDataSet`](#partitioned-dataset-save). +The `IncrementalDataSet` save operation is identical to the [save operation of the `PartitionedDataset`](#partitioned-dataset-save). #### Incremental dataset confirm diff --git a/docs/source/extend_kedro/custom_datasets.md b/docs/source/extend_kedro/custom_datasets.md index 9e4b0713eb..6d9341d3bc 100644 --- a/docs/source/extend_kedro/custom_datasets.md +++ b/docs/source/extend_kedro/custom_datasets.md @@ -262,19 +262,19 @@ class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]): ``` -## Integration with `PartitionedDataSet` +## Integration with `PartitionedDataset` Currently, the `ImageDataSet` only works with a single image, but this example needs to load all Pokemon images from the raw data directory for further processing. -Kedro's [`PartitionedDataSet`](../data/kedro_io.md#partitioned-dataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory. +Kedro's [`PartitionedDataset`](../data/kedro_io.md#partitioned-dataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory. -To use `PartitionedDataSet` with `ImageDataSet` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataSet` loads all PNG files from the data directory using `ImageDataSet`: +To use `PartitionedDataset` with `ImageDataSet` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataset` loads all PNG files from the data directory using `ImageDataSet`: ```yaml # in conf/base/catalog.yml pokemon: - type: PartitionedDataSet + type: PartitionedDataset dataset: kedro_pokemon.extras.datasets.image_dataset.ImageDataSet path: data/01_raw/pokemon-images-and-types/images/images filename_suffix: ".png" @@ -298,7 +298,7 @@ $ ls -la data/01_raw/pokemon-images-and-types/images/images/*.png | wc -l ## Versioning ```{note} -Versioning doesn't work with `PartitionedDataSet`. You can't use both of them at the same time. +Versioning doesn't work with `PartitionedDataset`. You can't use both of them at the same time. ``` To add [Versioning](../data/kedro_io.md#versioning) support to the new dataset we need to extend the [AbstractVersionedDataSet](/kedro.io.AbstractVersionedDataSet) to: From 8136be0a7705be7468622b00c395198adef94723 Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Tue, 27 Jun 2023 11:07:43 -0500 Subject: [PATCH 04/14] IncrementalDataSet->IncrementalDataset in .md files Signed-off-by: Deepyaman Datta --- RELEASE.md | 10 +++++----- docs/source/data/kedro_io.md | 30 +++++++++++++++--------------- 2 files changed, 20 insertions(+), 20 deletions(-) diff --git a/RELEASE.md b/RELEASE.md index 3e80ab79a6..c291e0c644 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1122,9 +1122,9 @@ Even though this release ships a fix for project generated with `kedro==0.16.2`, * Updated contribution process in `CONTRIBUTING.md` - added Developer Workflow. * Documented installation of development version of Kedro in the [FAQ section](https://docs.kedro.org/en/0.16.0/06_resources/01_faq.html#how-can-i-use-development-version-of-kedro). * Added missing `_exists` method to `MyOwnDataSet` example in 04_user_guide/08_advanced_io. -* Fixed a bug where `PartitionedDataset` and `IncrementalDataSet` were not working with `s3a` or `s3n` protocol. 
+* Fixed a bug where `PartitionedDataset` and `IncrementalDataset` were not working with `s3a` or `s3n` protocol. * Added ability to read partitioned parquet file from a directory in `pandas.ParquetDataSet`. -* Replaced `functools.lru_cache` with `cachetools.cachedmethod` in `PartitionedDataset` and `IncrementalDataSet` for per-instance cache invalidation. +* Replaced `functools.lru_cache` with `cachetools.cachedmethod` in `PartitionedDataset` and `IncrementalDataset` for per-instance cache invalidation. * Implemented custom glob function for `SparkDataSet` when running on Databricks. * Fixed a bug in `SparkDataSet` not allowing for loading data from DBFS in a Windows machine using Databricks-connect. * Improved the error message for `DataSetNotFoundError` to suggest possible dataset names user meant to type. @@ -1141,7 +1141,7 @@ Even though this release ships a fix for project generated with `kedro==0.16.2`, * `get_last_load_version` and `get_last_save_version` have been renamed to `resolve_load_version` and `resolve_save_version` on ``AbstractVersionedDataSet``, the results of which are cached. * The `release()` method on datasets extending ``AbstractVersionedDataSet`` clears the cached load and save version. All custom datasets must call `super()._release()` inside `_release()`. * ``TextDataSet`` no longer has `load_args` and `save_args`. These can instead be specified under `open_args_load` or `open_args_save` in `fs_args`. -* `PartitionedDataset` and `IncrementalDataSet` method `invalidate_cache` was made private: `_invalidate_caches`. +* `PartitionedDataset` and `IncrementalDataset` method `invalidate_cache` was made private: `_invalidate_caches`. ### Other * Removed `KEDRO_ENV_VAR` from `kedro.context` to speed up the CLI run time. @@ -1272,7 +1272,7 @@ weather: file_format: csv ``` -You can also load data incrementally whenever it is dumped into a directory with the extension to [`PartionedDataSet`](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#partitioned-dataset), a feature that allows you to load a directory of files. The [`IncrementalDataSet`](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#incremental-loads-with-incrementaldataset) stores the information about the last processed partition in a `checkpoint`, read more about this feature [**here**](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#incremental-loads-with-incrementaldataset). +You can also load data incrementally whenever it is dumped into a directory with the extension to [`PartionedDataSet`](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#partitioned-dataset), a feature that allows you to load a directory of files. The [`IncrementalDataset`](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#incremental-loads-with-incrementaldataset) stores the information about the last processed partition in a `checkpoint`, read more about this feature [**here**](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#incremental-loads-with-incrementaldataset). 
### New features @@ -1302,7 +1302,7 @@ You can also load data incrementally whenever it is dumped into a directory with | `biosequence.BioSequenceDataSet` | Work with bio-sequence objects using [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to communicate with the underlying filesystem | `kedro.extras.datasets.biosequence` | | `pandas.GBQTableDataSet` | Work with Google BigQuery | `kedro.extras.datasets.pandas` | | `pandas.FeatherDataSet` | Work with feather files using [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` | -| `IncrementalDataSet` | Inherit from `PartitionedDataset` and remembers the last processed partition | `kedro.io` | +| `IncrementalDataset` | Inherit from `PartitionedDataset` and remembers the last processed partition | `kedro.io` | ### Files with a new location diff --git a/docs/source/data/kedro_io.md b/docs/source/data/kedro_io.md index 720120fbf2..9cb74d7e06 100644 --- a/docs/source/data/kedro_io.md +++ b/docs/source/data/kedro_io.md @@ -474,23 +474,23 @@ def create_partitions() -> Dict[str, Callable[[], Any]]: When using lazy saving, the dataset will be written _after_ the `after_node_run` [hook](../hooks/introduction). ``` -### Incremental loads with `IncrementalDataSet` +### Incremental loads with `IncrementalDataset` -[IncrementalDataSet](/kedro.io.IncrementalDataSet) is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataSet` addresses the use case when partitions have to be processed incrementally, i.e. each subsequent pipeline run should only process the partitions which were not processed by the previous runs. +[IncrementalDataset](/kedro.io.IncrementalDataset) is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataset` addresses the use case when partitions have to be processed incrementally, i.e. each subsequent pipeline run should only process the partitions which were not processed by the previous runs. -This checkpoint, by default, is persisted to the location of the data partitions. For example, for `IncrementalDataSet` instantiated with path `s3://my-bucket-name/path/to/folder`, the checkpoint will be saved to `s3://my-bucket-name/path/to/folder/CHECKPOINT`, unless [the checkpoint configuration is explicitly overwritten](#checkpoint-configuration). +This checkpoint, by default, is persisted to the location of the data partitions. For example, for `IncrementalDataset` instantiated with path `s3://my-bucket-name/path/to/folder`, the checkpoint will be saved to `s3://my-bucket-name/path/to/folder/CHECKPOINT`, unless [the checkpoint configuration is explicitly overwritten](#checkpoint-configuration). The checkpoint file is only created _after_ [the partitioned dataset is explicitly confirmed](#incremental-dataset-confirm). #### Incremental dataset load -Loading `IncrementalDataSet` works similarly to [`PartitionedDataset`](#partitioned-dataset-load) with several exceptions: -1. `IncrementalDataSet` loads the data _eagerly_, so the values in the returned dictionary represent the actual data stored in the corresponding partition, rather than a pointer to the load function. `IncrementalDataSet` considers a partition relevant for processing if its ID satisfies the comparison function, given the checkpoint value. -2. 
`IncrementalDataSet` _does not_ raise a `DataSetError` if load finds no partitions to return - an empty dictionary is returned instead. An empty list of available partitions is part of a normal workflow for `IncrementalDataSet`. +Loading `IncrementalDataset` works similarly to [`PartitionedDataset`](#partitioned-dataset-load) with several exceptions: +1. `IncrementalDataset` loads the data _eagerly_, so the values in the returned dictionary represent the actual data stored in the corresponding partition, rather than a pointer to the load function. `IncrementalDataset` considers a partition relevant for processing if its ID satisfies the comparison function, given the checkpoint value. +2. `IncrementalDataset` _does not_ raise a `DataSetError` if load finds no partitions to return - an empty dictionary is returned instead. An empty list of available partitions is part of a normal workflow for `IncrementalDataset`. #### Incremental dataset save -The `IncrementalDataSet` save operation is identical to the [save operation of the `PartitionedDataset`](#partitioned-dataset-save). +The `IncrementalDataset` save operation is identical to the [save operation of the `PartitionedDataset`](#partitioned-dataset-save). #### Incremental dataset confirm @@ -503,7 +503,7 @@ Partitioned dataset checkpoint update is triggered by an explicit `confirms` ins ```python from kedro.pipeline import node -# process and then confirm `IncrementalDataSet` within the same node +# process and then confirm `IncrementalDataset` within the same node node( process_partitions, inputs="my_partitioned_dataset", @@ -545,17 +545,17 @@ pipeline( Important notes about the confirmation operation: -* Confirming a partitioned dataset does not affect any subsequent loads within the same run. All downstream nodes that input the same partitioned dataset as input will all receive the _same_ partitions. Partitions that are created externally during the run will also not affect the dataset loads and won't appear in the list of loaded partitions until the next run or until the [`release()`](/kedro.io.IncrementalDataSet) method is called on the dataset object. +* Confirming a partitioned dataset does not affect any subsequent loads within the same run. All downstream nodes that input the same partitioned dataset as input will all receive the _same_ partitions. Partitions that are created externally during the run will also not affect the dataset loads and won't appear in the list of loaded partitions until the next run or until the [`release()`](/kedro.io.IncrementalDataset) method is called on the dataset object. * A pipeline cannot contain more than one node confirming the same dataset. #### Checkpoint configuration -`IncrementalDataSet` does not require explicit configuration of the checkpoint unless there is a need to deviate from the defaults. To update the checkpoint configuration, add a `checkpoint` key containing the valid dataset configuration. This may be required if, say, the pipeline has read-only permissions to the location of partitions (or write operations are undesirable for any other reason). In such cases, `IncrementalDataSet` can be configured to save the checkpoint elsewhere. The `checkpoint` key also supports partial config updates where only some checkpoint attributes are overwritten, while the defaults are kept for the rest: +`IncrementalDataset` does not require explicit configuration of the checkpoint unless there is a need to deviate from the defaults. 
To update the checkpoint configuration, add a `checkpoint` key containing the valid dataset configuration. This may be required if, say, the pipeline has read-only permissions to the location of partitions (or write operations are undesirable for any other reason). In such cases, `IncrementalDataset` can be configured to save the checkpoint elsewhere. The `checkpoint` key also supports partial config updates where only some checkpoint attributes are overwritten, while the defaults are kept for the rest: ```yaml my_partitioned_dataset: - type: IncrementalDataSet + type: IncrementalDataset path: s3://my-bucket-name/path/to/folder dataset: pandas.CSVDataSet checkpoint: @@ -572,7 +572,7 @@ Along with the standard dataset attributes, `checkpoint` config also accepts two ```yaml my_partitioned_dataset: - type: IncrementalDataSet + type: IncrementalDataset path: s3://my-bucket-name/path/to/folder dataset: pandas.CSVDataSet checkpoint: @@ -583,7 +583,7 @@ my_partitioned_dataset: ```yaml my_partitioned_dataset: - type: IncrementalDataSet + type: IncrementalDataset path: s3://my-bucket-name/path/to/folder dataset: pandas.CSVDataSet checkpoint: @@ -596,7 +596,7 @@ Specification of `force_checkpoint` is also supported via the shorthand notation ```yaml my_partitioned_dataset: - type: IncrementalDataSet + type: IncrementalDataset path: s3://my-bucket-name/path/to/folder dataset: pandas.CSVDataSet checkpoint: 2020-01-01/data.csv @@ -608,7 +608,7 @@ If you need to force the partitioned dataset to load all available partitions, s ```yaml my_partitioned_dataset: - type: IncrementalDataSet + type: IncrementalDataset path: s3://my-bucket-name/path/to/folder dataset: pandas.CSVDataSet checkpoint: "" From 35f3e782132b2471264bbe2b7275fde9b6dd9e91 Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Tue, 27 Jun 2023 11:09:05 -0500 Subject: [PATCH 05/14] CachedDataSet->CachedDataset in .md files Signed-off-by: Deepyaman Datta --- RELEASE.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/RELEASE.md b/RELEASE.md index c291e0c644..f2a175e4c4 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1284,7 +1284,7 @@ You can also load data incrementally whenever it is dumped into a directory with - `kedro.io` - `kedro.extras.datasets` - Import path, specified in `type` -* Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataset` to specify (`deepcopy`, `copy` or `assign`) the copy mode to use when loading and saving. +* Added an optional `copy_mode` flag to `CachedDataset` and `MemoryDataset` to specify (`deepcopy`, `copy` or `assign`) the copy mode to use when loading and saving. ### New Datasets @@ -1340,7 +1340,7 @@ You can also load data incrementally whenever it is dumped into a directory with | | `JSONLocalDataSet` | | | `HDFLocalDataSet` | | | `HDFS3DataSet` | -| | `kedro.contrib.io.cached.CachedDataSet` | +| | `kedro.contrib.io.cached.CachedDataset` | | | `kedro.contrib.io.catalog_with_default.DataCatalogWithDefault` | | | `MatplotlibLocalWriter` | | | `MatplotlibS3Writer` | @@ -1526,7 +1526,7 @@ You can also load data incrementally whenever it is dumped into a directory with - `CSVHTTPDataSet` to load CSV using HTTP(s) links. - `JSONBlobDataSet` to load json (-delimited) files from Azure Blob Storage. - `ParquetS3DataSet` in `contrib` for usage with pandas. (by [@mmchougule](https://github.com/mmchougule)) - - `CachedDataSet` in `contrib` which will cache data in memory to avoid io/network operations. It will clear the cache once a dataset is no longer needed by a pipeline. 
(by [@tsanikgr](https://github.com/tsanikgr)) + - `CachedDataset` in `contrib` which will cache data in memory to avoid io/network operations. It will clear the cache once a dataset is no longer needed by a pipeline. (by [@tsanikgr](https://github.com/tsanikgr)) - `YAMLLocalDataSet` in `contrib` to load and save local YAML files. (by [@Minyus](https://github.com/Minyus)) ## Bug fixes and other changes From 942d7cdfda1279cf23171d3c5275afccbd283df3 Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Tue, 27 Jun 2023 11:10:52 -0500 Subject: [PATCH 06/14] DataSetError->DatasetError in .md files Signed-off-by: Deepyaman Datta --- docs/source/data/kedro_io.md | 14 +++++++------- docs/source/tutorial/spaceflights_tutorial_faqs.md | 12 ++++++------ 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/source/data/kedro_io.md b/docs/source/data/kedro_io.md index 9cb74d7e06..e39ffa3b96 100644 --- a/docs/source/data/kedro_io.md +++ b/docs/source/data/kedro_io.md @@ -1,7 +1,7 @@ # Kedro IO -In this tutorial, we cover advanced uses of [the Kedro IO module](/kedro.io) to understand the underlying implementation. The relevant API documentation is [kedro.io.AbstractDataSet](/kedro.io.AbstractDataSet) and [kedro.io.DataSetError](/kedro.io.DataSetError). +In this tutorial, we cover advanced uses of [the Kedro IO module](/kedro.io) to understand the underlying implementation. The relevant API documentation is [kedro.io.AbstractDataSet](/kedro.io.AbstractDataSet) and [kedro.io.DatasetError](/kedro.io.DatasetError). ## Error handling @@ -16,7 +16,7 @@ io = DataCatalog(data_sets=dict()) # empty catalog try: cars_df = io.load("cars") -except DataSetError: +except DatasetError: print("Error raised.") ``` @@ -184,7 +184,7 @@ io.save("test_data_set", data1) reloaded = io.load("test_data_set") assert data1.equals(reloaded) -# raises DataSetError since the path +# raises DatasetError since the path # data/01_raw/test.csv/my_exact_version/test.csv already exists io.save("test_data_set", data2) ``` @@ -206,7 +206,7 @@ io = DataCatalog({"test_data_set": test_data_set}) io.save("test_data_set", data1) # emits a UserWarning due to version inconsistency -# raises DataSetError since the data/01_raw/test.csv/exact_load_version/test.csv +# raises DatasetError since the data/01_raw/test.csv/exact_load_version/test.csv # file does not exist reloaded = io.load("test_data_set") ``` @@ -332,8 +332,8 @@ Requires you only to specify a class of the underlying dataset either as a strin Full notation allows you to specify a dictionary with the full underlying dataset definition _except_ the following arguments: * The argument that receives the partition path (`filepath` by default) - if specified, a `UserWarning` will be emitted stating that this value will be overridden by individual partition paths -* `credentials` key - specifying it will result in a `DataSetError` being raised; dataset credentials should be passed into the `credentials` argument of the `PartitionedDataset` rather than the underlying dataset definition - see the section below on [partitioned dataset credentials](#partitioned-dataset-credentials) for details -* `versioned` flag - specifying it will result in a `DataSetError` being raised; versioning cannot be enabled for the underlying datasets +* `credentials` key - specifying it will result in a `DatasetError` being raised; dataset credentials should be passed into the `credentials` argument of the `PartitionedDataset` rather than the underlying dataset definition - see the section below on 
[partitioned dataset credentials](#partitioned-dataset-credentials) for details +* `versioned` flag - specifying it will result in a `DatasetError` being raised; versioning cannot be enabled for the underlying datasets #### Partitioned dataset credentials @@ -486,7 +486,7 @@ The checkpoint file is only created _after_ [the partitioned dataset is explicit Loading `IncrementalDataset` works similarly to [`PartitionedDataset`](#partitioned-dataset-load) with several exceptions: 1. `IncrementalDataset` loads the data _eagerly_, so the values in the returned dictionary represent the actual data stored in the corresponding partition, rather than a pointer to the load function. `IncrementalDataset` considers a partition relevant for processing if its ID satisfies the comparison function, given the checkpoint value. -2. `IncrementalDataset` _does not_ raise a `DataSetError` if load finds no partitions to return - an empty dictionary is returned instead. An empty list of available partitions is part of a normal workflow for `IncrementalDataset`. +2. `IncrementalDataset` _does not_ raise a `DatasetError` if load finds no partitions to return - an empty dictionary is returned instead. An empty list of available partitions is part of a normal workflow for `IncrementalDataset`. #### Incremental dataset save diff --git a/docs/source/tutorial/spaceflights_tutorial_faqs.md b/docs/source/tutorial/spaceflights_tutorial_faqs.md index 92d873dcb9..dcfa4d7f98 100644 --- a/docs/source/tutorial/spaceflights_tutorial_faqs.md +++ b/docs/source/tutorial/spaceflights_tutorial_faqs.md @@ -7,11 +7,11 @@ If you can't find the answer you need here, [ask the Kedro community for help](h ## How do I resolve these common errors? ### DataSet errors -#### DataSetError: Failed while loading data from data set +#### DatasetError: Failed while loading data from data set You're [testing whether Kedro can load the raw test data](./set_up_data.md#test-that-kedro-can-load-the-data) and see the following: ```python -DataSetError: Failed while loading data from data set +DatasetError: Failed while loading data from data set CSVDataSet(filepath=...). [Errno 2] No such file or directory: '.../companies.csv' ``` @@ -34,12 +34,12 @@ Has something changed in your `catalog.yml` from the version generated by the sp Call `exit()` within the IPython session and restart `kedro ipython` (or type `@kedro_reload` into the IPython console to reload Kedro into the session without restarting). Then try again. -#### DataSetError: An exception occurred when parsing config for DataSet +#### DatasetError: An exception occurred when parsing config for DataSet Are you seeing a message saying that an exception occurred? ```bash -DataSetError: An exception occurred when parsing config for DataSet +DatasetError: An exception occurred when parsing config for DataSet 'data_processing.preprocessed_companies': Object 'ParquetDataSet' cannot be loaded from 'kedro_datasets.pandas'. Please see the documentation on how to install relevant dependencies for kedro_datasets.pandas.ParquetDataSet: @@ -70,7 +70,7 @@ The above exception was the direct cause of the following exception: Traceback (most recent call last): ... - raise DataSetError(message) from exc -kedro.io.core.DataSetError: Failed while loading data from data set CSVDataSet(filepath=data/03_primary/model_input_table.csv, save_args={'index': False}). 
+ raise DatasetError(message) from exc +kedro.io.core.DatasetError: Failed while loading data from data set CSVDataSet(filepath=data/03_primary/model_input_table.csv, save_args={'index': False}). [Errno 2] File b'data/03_primary/model_input_table.csv' does not exist: b'data/03_primary/model_input_table.csv' ``` From 7ebe6c32a220cc69140edd5e4cc37dc5e1ed5693 Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Tue, 27 Jun 2023 11:11:43 -0500 Subject: [PATCH 07/14] DataSetNotFoundError->DatasetNotFoundError in .md files Signed-off-by: Deepyaman Datta --- RELEASE.md | 2 +- docs/source/tutorial/spaceflights_tutorial_faqs.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/RELEASE.md b/RELEASE.md index f2a175e4c4..95cb485e2b 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1127,7 +1127,7 @@ Even though this release ships a fix for project generated with `kedro==0.16.2`, * Replaced `functools.lru_cache` with `cachetools.cachedmethod` in `PartitionedDataset` and `IncrementalDataset` for per-instance cache invalidation. * Implemented custom glob function for `SparkDataSet` when running on Databricks. * Fixed a bug in `SparkDataSet` not allowing for loading data from DBFS in a Windows machine using Databricks-connect. -* Improved the error message for `DataSetNotFoundError` to suggest possible dataset names user meant to type. +* Improved the error message for `DatasetNotFoundError` to suggest possible dataset names user meant to type. * Added the option for contributors to run Kedro tests locally without Spark installation with `make test-no-spark`. * Added option to lint the project without applying the formatting changes (`kedro lint --check-only`). diff --git a/docs/source/tutorial/spaceflights_tutorial_faqs.md b/docs/source/tutorial/spaceflights_tutorial_faqs.md index dcfa4d7f98..739a34b398 100644 --- a/docs/source/tutorial/spaceflights_tutorial_faqs.md +++ b/docs/source/tutorial/spaceflights_tutorial_faqs.md @@ -20,12 +20,12 @@ or a similar error for the `shuttles` or `reviews` data. Are the [three sample data files](./set_up_data.md#project-datasets) stored in the `data/raw` folder? -#### DataSetNotFoundError: DataSet not found in the catalog +#### DatasetNotFoundError: DataSet not found in the catalog You see an error such as the following: ```python -DataSetNotFoundError: DataSet 'companies' not found in the catalog +DatasetNotFoundError: DataSet 'companies' not found in the catalog ``` Has something changed in your `catalog.yml` from the version generated by the spaceflights starter? Take a look at the [data specification](./set_up_data.md#dataset-registration) to ensure it is valid. From 4156fef724fda93dd188d97d7def3c9994e94441 Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Tue, 27 Jun 2023 11:13:13 -0500 Subject: [PATCH 08/14] Replace "DataSet" with "Dataset" in Markdown files --- RELEASE.md | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/RELEASE.md b/RELEASE.md index 95cb485e2b..260ebe289c 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -3,7 +3,7 @@ ## Major features and improvements ## Bug fixes and other changes -* Compare for protocol and delimiter in `PartitionedDataset` to be able to pass the protocol to partitions which paths starts with the same characters as the protocol (e.g. `s3://s3-my-bucket`). +* Compare for protocol and delimiter in `PartitionedDataSet` to be able to pass the protocol to partitions which paths starts with the same characters as the protocol (e.g. `s3://s3-my-bucket`). 
## Breaking changes to the API @@ -540,7 +540,7 @@ The parameters should look like this: * Upgraded `pip-tools`, which is used by `kedro build-reqs`, to 6.4. This `pip-tools` version requires `pip>=21.2` while [adding support for `pip>=21.3`](https://github.com/jazzband/pip-tools/pull/1501). To upgrade `pip`, please refer to [their documentation](https://pip.pypa.io/en/stable/installing/#upgrading-pip). * Relaxed the bounds on the `plotly` requirement for `plotly.PlotlyDataSet` and the `pyarrow` requirement for `pandas.ParquetDataSet`. * `kedro pipeline package ` now raises an error if the `` argument doesn't look like a valid Python module path (e.g. has `/` instead of `.`). -* Added new `overwrite` argument to `PartitionedDataset` and `MatplotlibWriter` to enable deletion of existing partitions and plots on dataset `save`. +* Added new `overwrite` argument to `PartitionedDataSet` and `MatplotlibWriter` to enable deletion of existing partitions and plots on dataset `save`. * `kedro pipeline pull` now works when the project requirements contains entries such as `-r`, `--extra-index-url` and local wheel files ([Issue #913](https://github.com/kedro-org/kedro/issues/913)). * Fixed slow startup because of catalog processing by reducing the exponential growth of extra processing during `_FrozenDatasets` creations. * Removed `.coveragerc` from the Kedro project template. `coverage` settings are now given in `pyproject.toml`. @@ -620,7 +620,7 @@ The parameters should look like this: * Fixed a bug where `kedro ipython` and `kedro jupyter notebook` didn't work if the `PYTHONPATH` was already set. * Update the IPython extension to allow passing `env` and `extra_params` to `reload_kedro` similar to how the IPython script works. * `kedro info` now outputs if a plugin has any `hooks` or `cli_hooks` implemented. -* `PartitionedDataset` now supports lazily materializing data on save. +* `PartitionedDataSet` now supports lazily materializing data on save. * `kedro pipeline describe` now defaults to the `__default__` pipeline when no pipeline name is provided and also shows the namespace the nodes belong to. * Fixed an issue where spark.SparkDataSet with enabled versioning would throw a VersionNotFoundError when using databricks-connect from a remote machine and saving to dbfs filesystem. * `EmailMessageDataSet` added to doctree. @@ -778,7 +778,7 @@ from kedro.framework.session import KedroSession * In a significant change, [we have introduced `KedroSession`](https://docs.kedro.org/en/0.17.0/04_kedro_project_setup/03_session.html) which is responsible for managing the lifecycle of a Kedro run. * Created a new Kedro Starter: `kedro new --starter=mini-kedro`. It is possible to [use the DataCatalog as a standalone component](https://github.com/kedro-org/kedro-starters/tree/master/mini-kedro) in a Jupyter notebook and transition into the rest of the Kedro framework. * Added `DatasetSpecs` with Hooks to run before and after datasets are loaded from/saved to the catalog. -* Added a command: `kedro catalog create`. For a registered pipeline, it creates a `//catalog/.yml` configuration file with `MemoryDataset` datasets for each dataset that is missing from `DataCatalog`. +* Added a command: `kedro catalog create`. For a registered pipeline, it creates a `//catalog/.yml` configuration file with `MemoryDataSet` datasets for each dataset that is missing from `DataCatalog`. * Added `settings.py` and `pyproject.toml` (to replace `.kedro.yml`) for project configuration, in line with Python best practice. 
* `ProjectContext` is no longer needed, unless for very complex customisations. `KedroContext`, `ProjectHooks` and `settings.py` together implement sensible default behaviour. As a result `context_path` is also now an _optional_ key in `pyproject.toml`. * Removed `ProjectContext` from `src//run.py`. @@ -805,7 +805,7 @@ from kedro.framework.session import KedroSession * The pipeline-specific `catalog.yml` file is no longer automatically created for modular pipelines when running `kedro pipeline create`. Use `kedro catalog create` to replace this functionality. * Removed `include_examples` prompt from `kedro new`. To generate boilerplate example code, you should use a Kedro starter. * Changed the `--verbose` flag from a global command to a project-specific command flag (e.g `kedro --verbose new` becomes `kedro new --verbose`). -* Dropped support of the `dataset_credentials` key in credentials in `PartitionedDataset`. +* Dropped support of the `dataset_credentials` key in credentials in `PartitionedDataSet`. * `get_source_dir()` was removed from `kedro/framework/cli/utils.py`. * Dropped support of `get_config`, `create_catalog`, `create_pipeline`, `template_version`, `project_name` and `project_path` keys by `get_project_context()` function (`kedro/framework/cli/cli.py`). * `kedro new --starter` now defaults to fetching the starter template matching the installed Kedro version. @@ -908,7 +908,7 @@ Check your source directory. If you defined a different source directory (`sourc ## Bug fixes and other changes * Fixed `TypeError` when converting dict inputs to a node made from a wrapped `partial` function. -* `PartitionedDataset` improvements: +* `PartitionedDataSet` improvements: - Supported passing arguments to the underlying filesystem. * Improved handling of non-ASCII word characters in dataset names. - For example, a dataset named `jalapeño` will be accessible as `DataCatalog.datasets.jalapeño` rather than `DataCatalog.datasets.jalape__o`. @@ -1122,12 +1122,12 @@ Even though this release ships a fix for project generated with `kedro==0.16.2`, * Updated contribution process in `CONTRIBUTING.md` - added Developer Workflow. * Documented installation of development version of Kedro in the [FAQ section](https://docs.kedro.org/en/0.16.0/06_resources/01_faq.html#how-can-i-use-development-version-of-kedro). * Added missing `_exists` method to `MyOwnDataSet` example in 04_user_guide/08_advanced_io. -* Fixed a bug where `PartitionedDataset` and `IncrementalDataset` were not working with `s3a` or `s3n` protocol. +* Fixed a bug where `PartitionedDataSet` and `IncrementalDataSet` were not working with `s3a` or `s3n` protocol. * Added ability to read partitioned parquet file from a directory in `pandas.ParquetDataSet`. -* Replaced `functools.lru_cache` with `cachetools.cachedmethod` in `PartitionedDataset` and `IncrementalDataset` for per-instance cache invalidation. +* Replaced `functools.lru_cache` with `cachetools.cachedmethod` in `PartitionedDataSet` and `IncrementalDataSet` for per-instance cache invalidation. * Implemented custom glob function for `SparkDataSet` when running on Databricks. * Fixed a bug in `SparkDataSet` not allowing for loading data from DBFS in a Windows machine using Databricks-connect. -* Improved the error message for `DatasetNotFoundError` to suggest possible dataset names user meant to type. +* Improved the error message for `DataSetNotFoundError` to suggest possible dataset names user meant to type. 
* Added the option for contributors to run Kedro tests locally without Spark installation with `make test-no-spark`. * Added option to lint the project without applying the formatting changes (`kedro lint --check-only`). @@ -1141,7 +1141,7 @@ Even though this release ships a fix for project generated with `kedro==0.16.2`, * `get_last_load_version` and `get_last_save_version` have been renamed to `resolve_load_version` and `resolve_save_version` on ``AbstractVersionedDataSet``, the results of which are cached. * The `release()` method on datasets extending ``AbstractVersionedDataSet`` clears the cached load and save version. All custom datasets must call `super()._release()` inside `_release()`. * ``TextDataSet`` no longer has `load_args` and `save_args`. These can instead be specified under `open_args_load` or `open_args_save` in `fs_args`. -* `PartitionedDataset` and `IncrementalDataset` method `invalidate_cache` was made private: `_invalidate_caches`. +* `PartitionedDataSet` and `IncrementalDataSet` method `invalidate_cache` was made private: `_invalidate_caches`. ### Other * Removed `KEDRO_ENV_VAR` from `kedro.context` to speed up the CLI run time. @@ -1272,7 +1272,7 @@ weather: file_format: csv ``` -You can also load data incrementally whenever it is dumped into a directory with the extension to [`PartionedDataSet`](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#partitioned-dataset), a feature that allows you to load a directory of files. The [`IncrementalDataset`](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#incremental-loads-with-incrementaldataset) stores the information about the last processed partition in a `checkpoint`, read more about this feature [**here**](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#incremental-loads-with-incrementaldataset). +You can also load data incrementally whenever it is dumped into a directory with the extension to [`PartionedDataSet`](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#partitioned-dataset), a feature that allows you to load a directory of files. The [`IncrementalDataSet`](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#incremental-loads-with-incrementaldataset) stores the information about the last processed partition in a `checkpoint`, read more about this feature [**here**](https://docs.kedro.org/en/0.15.6/04_user_guide/08_advanced_io.html#incremental-loads-with-incrementaldataset). ### New features @@ -1284,7 +1284,7 @@ You can also load data incrementally whenever it is dumped into a directory with - `kedro.io` - `kedro.extras.datasets` - Import path, specified in `type` -* Added an optional `copy_mode` flag to `CachedDataset` and `MemoryDataset` to specify (`deepcopy`, `copy` or `assign`) the copy mode to use when loading and saving. +* Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataSet` to specify (`deepcopy`, `copy` or `assign`) the copy mode to use when loading and saving. 
### New Datasets @@ -1302,7 +1302,7 @@ You can also load data incrementally whenever it is dumped into a directory with | `biosequence.BioSequenceDataSet` | Work with bio-sequence objects using [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to communicate with the underlying filesystem | `kedro.extras.datasets.biosequence` | | `pandas.GBQTableDataSet` | Work with Google BigQuery | `kedro.extras.datasets.pandas` | | `pandas.FeatherDataSet` | Work with feather files using [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` | -| `IncrementalDataset` | Inherit from `PartitionedDataset` and remembers the last processed partition | `kedro.io` | +| `IncrementalDataSet` | Inherit from `PartitionedDataSet` and remembers the last processed partition | `kedro.io` | ### Files with a new location @@ -1340,7 +1340,7 @@ You can also load data incrementally whenever it is dumped into a directory with | | `JSONLocalDataSet` | | | `HDFLocalDataSet` | | | `HDFS3DataSet` | -| | `kedro.contrib.io.cached.CachedDataset` | +| | `kedro.contrib.io.cached.CachedDataSet` | | | `kedro.contrib.io.catalog_with_default.DataCatalogWithDefault` | | | `MatplotlibLocalWriter` | | | `MatplotlibS3Writer` | @@ -1373,7 +1373,7 @@ You can also load data incrementally whenever it is dumped into a directory with * Bumped minimum required pandas version to 0.24.0 to make use of `pandas.DataFrame.to_numpy` (recommended alternative to `pandas.DataFrame.values`). * Docs improvements. * `Pipeline.transform` skips modifying node inputs/outputs containing `params:` or `parameters` keywords. -* Support for `dataset_credentials` key in the credentials for `PartitionedDataset` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config. +* Support for `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config. * Datasets can have a new `confirm` function which is called after a successful node function execution if the node contains `confirms` argument with such dataset name. * Make the resume prompt on pipeline run failure use `--from-nodes` instead of `--from-inputs` to avoid unnecessarily re-running nodes that had already executed. * When closed, Jupyter notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use `--idle-timeout` option to update it. @@ -1402,7 +1402,7 @@ You can also load data incrementally whenever it is dumped into a directory with - `ParquetGCSDataSet` dataset in `contrib` for working with Parquet files in Google Cloud Storage. - `JSONGCSDataSet` dataset in `contrib` for working with JSON files in Google Cloud Storage. - `MatplotlibS3Writer` dataset in `contrib` for saving Matplotlib images to S3. - - `PartitionedDataset` for working with datasets split across multiple files. + - `PartitionedDataSet` for working with datasets split across multiple files. - `JSONDataSet` dataset for working with JSON files that uses [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to communicate with the underlying filesystem. It doesn't support `http(s)` protocol for now. * Added `s3fs_args` to all S3 datasets. * Pipelines can be deducted with `pipeline1 - pipeline2`. 
@@ -1504,7 +1504,7 @@ You can also load data incrementally whenever it is dumped into a directory with * Documented the architecture of Kedro showing how we think about library, project and framework components. * `extras/kedro_project_loader.py` renamed to `extras/ipython_loader.py` and now runs any IPython startup scripts without relying on the Kedro project structure. * Fixed TypeError when validating partial function's signature. -* After a node failure during a pipeline run, a resume command will be suggested in the logs. This command will not work if the required inputs are MemoryDatasets. +* After a node failure during a pipeline run, a resume command will be suggested in the logs. This command will not work if the required inputs are MemoryDataSets. ## Breaking changes to the API @@ -1526,7 +1526,7 @@ You can also load data incrementally whenever it is dumped into a directory with - `CSVHTTPDataSet` to load CSV using HTTP(s) links. - `JSONBlobDataSet` to load json (-delimited) files from Azure Blob Storage. - `ParquetS3DataSet` in `contrib` for usage with pandas. (by [@mmchougule](https://github.com/mmchougule)) - - `CachedDataset` in `contrib` which will cache data in memory to avoid io/network operations. It will clear the cache once a dataset is no longer needed by a pipeline. (by [@tsanikgr](https://github.com/tsanikgr)) + - `CachedDataSet` in `contrib` which will cache data in memory to avoid io/network operations. It will clear the cache once a dataset is no longer needed by a pipeline. (by [@tsanikgr](https://github.com/tsanikgr)) - `YAMLLocalDataSet` in `contrib` to load and save local YAML files. (by [@Minyus](https://github.com/Minyus)) ## Bug fixes and other changes @@ -1615,7 +1615,7 @@ These steps should have brought your project to Kedro 0.15.0. There might be som * Fix local project source not having priority over the same source installed as a package, leading to local updates not being recognised. ## Breaking changes to the API -* Remove the max_loads argument from the `MemoryDataset` constructor and from the `AbstractRunner.create_default_data_set` method. +* Remove the max_loads argument from the `MemoryDataSet` constructor and from the `AbstractRunner.create_default_data_set` method. ## Thanks for supporting contributions [Joel Schwarzmann](https://github.com/datajoely), [Alex Kalmikov](https://github.com/kalexqb) From eff96f4594f9fd76654a3460936a626ba86f5270 Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Tue, 27 Jun 2023 14:56:01 -0400 Subject: [PATCH 09/14] Update RELEASE.md --- RELEASE.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/RELEASE.md b/RELEASE.md index 260ebe289c..112b4ecab5 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -23,6 +23,8 @@ ## Breaking changes to the API ## Upcoming deprecations for Kedro 0.19.0 +* Renamed `CachedDataSet`, `LambdaDataSet`, `IncrementalDataSet`, `MemoryDataSet` and `PartitionedDataSet` to `CachedDataset`, `LambdaDataset`, `IncrementalDataset`, `MemoryDataset` and `PartitionedDataset`, respectively. +* Renamed `DataSetError`, `DataSetAlreadyExistsError` and `DataSetNotFoundError` to `DatasetError`, `DatasetAlreadyExistsError`, and `DatasetNotFoundError`, respectively. 
# Release 0.18.10 From 715d140454754c107409b67e6505b1be9e6a3c4c Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Wed, 16 Aug 2023 09:40:57 -0500 Subject: [PATCH 10/14] Fix remaining instance of "*DataSet*"->"*Dataset*" Signed-off-by: Deepyaman Datta --- docs/source/data/data_catalog.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index c08b2ec115..7178566831 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -656,8 +656,8 @@ The matches are ranked according to the following criteria : 2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`. 3. Alphabetical order -### Example 6: Generalise all datasets with a catch-all dataset factory to overwrite the default `MemoryDataSet` -You can use dataset factories to define a catch-all pattern which will overwrite the default `MemoryDataSet` creation. +### Example 6: Generalise all datasets with a catch-all dataset factory to overwrite the default `MemoryDataset` +You can use dataset factories to define a catch-all pattern which will overwrite the default `MemoryDataset` creation. ```yaml "{default_dataset}": type: pandas.CSVDataSet From e9dee828b61f48533e8d6611df5206342837117c Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Wed, 16 Aug 2023 10:14:08 -0500 Subject: [PATCH 11/14] `find . -name '*.md' -print0 | xargs -0 sed -i "" "s/\([^A-Za-z]\)DataSet/\1Dataset/g"` Signed-off-by: Deepyaman Datta --- docs/source/data/data_catalog.md | 2 +- docs/source/deployment/argo.md | 4 ++-- docs/source/deployment/aws_batch.md | 2 +- docs/source/extend_kedro/common_use_cases.md | 4 ++-- docs/source/hooks/examples.md | 2 +- docs/source/resources/glossary.md | 2 +- docs/source/tutorial/spaceflights_tutorial_faqs.md | 10 +++++----- 7 files changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index 7178566831..8454fca8f6 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -669,7 +669,7 @@ as `pandas.CSVDataSet`. ## Transcode datasets -You might come across a situation where you would like to read the same file using two different dataset implementations. Use transcoding when you want to load and save the same file, via its specified `filepath`, using different `DataSet` implementations. +You might come across a situation where you would like to read the same file using two different dataset implementations. Use transcoding when you want to load and save the same file, via its specified `filepath`, using different `Dataset` implementations. ### A typical example of transcoding diff --git a/docs/source/deployment/argo.md b/docs/source/deployment/argo.md index 599ff819c0..c31f757382 100644 --- a/docs/source/deployment/argo.md +++ b/docs/source/deployment/argo.md @@ -24,7 +24,7 @@ To use Argo Workflows, ensure you have the following prerequisites in place: - [Argo Workflows is installed](https://github.com/argoproj/argo/blob/master/README.md#quickstart) on your Kubernetes cluster - [Argo CLI is installed](https://github.com/argoproj/argo/releases) on your machine - A `name` attribute is set for each [Kedro node](/kedro.pipeline.node) since it is used to build a DAG -- [All node input/output DataSets must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. 
AWS S3); you cannot use the `MemoryDataset` in your workflow +- [All node input/output Datasets must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3); you cannot use the `MemoryDataset` in your workflow ```{note} Each node will run in its own container. @@ -174,7 +174,7 @@ spec: The Argo Workflows is defined as the dependencies between tasks using a directed-acyclic graph (DAG). ``` -For the purpose of this walk-through, we will use an AWS S3 bucket for DataSets; therefore `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables must be set to have an ability to communicate with S3. The `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` values should be stored in [Kubernetes Secrets](https://kubernetes.io/docs/concepts/configuration/secret/) (an example [Kubernetes Secrets spec is given below](#submit-argo-workflows-spec-to-kubernetes)). +For the purpose of this walk-through, we will use an AWS S3 bucket for Datasets; therefore `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables must be set to have an ability to communicate with S3. The `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` values should be stored in [Kubernetes Secrets](https://kubernetes.io/docs/concepts/configuration/secret/) (an example [Kubernetes Secrets spec is given below](#submit-argo-workflows-spec-to-kubernetes)). The spec template is written with the [Jinja templating language](https://jinja.palletsprojects.com/en/2.11.x/), so you must install the Jinja Python package: diff --git a/docs/source/deployment/aws_batch.md b/docs/source/deployment/aws_batch.md index 976d5e9e5a..7190a4b960 100644 --- a/docs/source/deployment/aws_batch.md +++ b/docs/source/deployment/aws_batch.md @@ -18,7 +18,7 @@ To use AWS Batch, ensure you have the following prerequisites in place: - An [AWS account set up](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/). - A `name` attribute is set for each [Kedro node](/kedro.pipeline.node). Each node will run in its own Batch job, so having sensible node names will make it easier to `kedro run --node=`. -- [All node input/output `DataSets` must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3). A clean way to do this is to create a new configuration environment `conf/aws_batch` containing a `catalog.yml` file with the appropriate configuration, as illustrated below. +- [All node input/output `Datasets` must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3). A clean way to do this is to create a new configuration environment `conf/aws_batch` containing a `catalog.yml` file with the appropriate configuration, as illustrated below.
Click to expand diff --git a/docs/source/extend_kedro/common_use_cases.md b/docs/source/extend_kedro/common_use_cases.md index 04b36d6ca5..741f877405 100644 --- a/docs/source/extend_kedro/common_use_cases.md +++ b/docs/source/extend_kedro/common_use_cases.md @@ -4,7 +4,7 @@ Kedro has a few built-in mechanisms for you to extend its behaviour. This docume ## Use Case 1: How to add extra behaviour to Kedro's execution timeline -The execution timeline of a Kedro pipeline can be thought of as a sequence of actions performed by various Kedro library components, such as the [DataSets](/kedro_datasets), [DataCatalog](/kedro.io.DataCatalog), [Pipeline](/kedro.pipeline.Pipeline), [Node](/kedro.pipeline.node.Node) and [KedroContext](/kedro.framework.context.KedroContext). +The execution timeline of a Kedro pipeline can be thought of as a sequence of actions performed by various Kedro library components, such as the [Datasets](/kedro_datasets), [DataCatalog](/kedro.io.DataCatalog), [Pipeline](/kedro.pipeline.Pipeline), [Node](/kedro.pipeline.node.Node) and [KedroContext](/kedro.framework.context.KedroContext). At different points in the lifecycle of these components, you might want to add extra behaviour: for example, you could add extra computation for profiling purposes _before_ and _after_ a node runs, or _before_ and _after_ the I/O actions of a dataset, namely the `load` and `save` actions. @@ -12,7 +12,7 @@ This can now achieved by using [Hooks](../hooks/introduction.md), to define the ## Use Case 2: How to integrate Kedro with additional data sources -You can use [DataSets](/kedro_datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](custom_datasets.md). +You can use [Datasets](/kedro_datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](custom_datasets.md). ## Use Case 3: How to add or modify CLI commands diff --git a/docs/source/hooks/examples.md b/docs/source/hooks/examples.md index cdb9963157..9a293e56a0 100644 --- a/docs/source/hooks/examples.md +++ b/docs/source/hooks/examples.md @@ -264,7 +264,7 @@ This example adds observability to your pipeline using [statsd](https://statsd.r pip install statsd ``` -* Implement `before_node_run` and `after_node_run` Hooks to collect metrics (DataSet size and node execution time): +* Implement `before_node_run` and `after_node_run` Hooks to collect metrics (Dataset size and node execution time): ```python # src//hooks.py diff --git a/docs/source/resources/glossary.md b/docs/source/resources/glossary.md index 55f841c8e7..445e3096c1 100644 --- a/docs/source/resources/glossary.md +++ b/docs/source/resources/glossary.md @@ -2,7 +2,7 @@ ## Data Catalog - The Data Catalog is Kedro's registry of all data sources available for use in the data pipeline. It manages loading and saving of data. The Data Catalog maps the names of node inputs and outputs as keys in a Kedro `DataSet`, which can be specialised for different types of data storage. + The Data Catalog is Kedro's registry of all data sources available for use in the data pipeline. It manages loading and saving of data. The Data Catalog maps the names of node inputs and outputs as keys in a Kedro `Dataset`, which can be specialised for different types of data storage. 
[Further information about the Data Catalog](../data/data_catalog.md) diff --git a/docs/source/tutorial/spaceflights_tutorial_faqs.md b/docs/source/tutorial/spaceflights_tutorial_faqs.md index 739a34b398..0645c40f11 100644 --- a/docs/source/tutorial/spaceflights_tutorial_faqs.md +++ b/docs/source/tutorial/spaceflights_tutorial_faqs.md @@ -6,7 +6,7 @@ If you can't find the answer you need here, [ask the Kedro community for help](h ## How do I resolve these common errors? -### DataSet errors +### Dataset errors #### DatasetError: Failed while loading data from data set You're [testing whether Kedro can load the raw test data](./set_up_data.md#test-that-kedro-can-load-the-data) and see the following: @@ -20,12 +20,12 @@ or a similar error for the `shuttles` or `reviews` data. Are the [three sample data files](./set_up_data.md#project-datasets) stored in the `data/raw` folder? -#### DatasetNotFoundError: DataSet not found in the catalog +#### DatasetNotFoundError: Dataset not found in the catalog You see an error such as the following: ```python -DatasetNotFoundError: DataSet 'companies' not found in the catalog +DatasetNotFoundError: Dataset 'companies' not found in the catalog ``` Has something changed in your `catalog.yml` from the version generated by the spaceflights starter? Take a look at the [data specification](./set_up_data.md#dataset-registration) to ensure it is valid. @@ -34,12 +34,12 @@ Has something changed in your `catalog.yml` from the version generated by the sp Call `exit()` within the IPython session and restart `kedro ipython` (or type `@kedro_reload` into the IPython console to reload Kedro into the session without restarting). Then try again. -#### DatasetError: An exception occurred when parsing config for DataSet +#### DatasetError: An exception occurred when parsing config for Dataset Are you seeing a message saying that an exception occurred? ```bash -DatasetError: An exception occurred when parsing config for DataSet +DatasetError: An exception occurred when parsing config for Dataset 'data_processing.preprocessed_companies': Object 'ParquetDataSet' cannot be loaded from 'kedro_datasets.pandas'. Please see the documentation on how to install relevant dependencies for kedro_datasets.pandas.ParquetDataSet: From ad2c7b4083e5562ef9b6a31a800472e69b6ba875 Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Wed, 16 Aug 2023 11:53:58 -0500 Subject: [PATCH 12/14] Change non-class instances of Dataset to dataset --- docs/source/deployment/argo.md | 4 ++-- docs/source/deployment/aws_batch.md | 2 +- docs/source/extend_kedro/common_use_cases.md | 4 ++-- docs/source/resources/glossary.md | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/source/deployment/argo.md b/docs/source/deployment/argo.md index c31f757382..73828bc8db 100644 --- a/docs/source/deployment/argo.md +++ b/docs/source/deployment/argo.md @@ -24,7 +24,7 @@ To use Argo Workflows, ensure you have the following prerequisites in place: - [Argo Workflows is installed](https://github.com/argoproj/argo/blob/master/README.md#quickstart) on your Kubernetes cluster - [Argo CLI is installed](https://github.com/argoproj/argo/releases) on your machine - A `name` attribute is set for each [Kedro node](/kedro.pipeline.node) since it is used to build a DAG -- [All node input/output Datasets must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. 
AWS S3); you cannot use the `MemoryDataset` in your workflow +- [All node input/output datasets must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3); you cannot use the `MemoryDataset` in your workflow ```{note} Each node will run in its own container. @@ -174,7 +174,7 @@ spec: The Argo Workflows is defined as the dependencies between tasks using a directed-acyclic graph (DAG). ``` -For the purpose of this walk-through, we will use an AWS S3 bucket for Datasets; therefore `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables must be set to have an ability to communicate with S3. The `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` values should be stored in [Kubernetes Secrets](https://kubernetes.io/docs/concepts/configuration/secret/) (an example [Kubernetes Secrets spec is given below](#submit-argo-workflows-spec-to-kubernetes)). +For the purpose of this walk-through, we will use an AWS S3 bucket for datasets; therefore `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables must be set to have an ability to communicate with S3. The `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` values should be stored in [Kubernetes Secrets](https://kubernetes.io/docs/concepts/configuration/secret/) (an example [Kubernetes Secrets spec is given below](#submit-argo-workflows-spec-to-kubernetes)). The spec template is written with the [Jinja templating language](https://jinja.palletsprojects.com/en/2.11.x/), so you must install the Jinja Python package: diff --git a/docs/source/deployment/aws_batch.md b/docs/source/deployment/aws_batch.md index 7190a4b960..cc92fcd485 100644 --- a/docs/source/deployment/aws_batch.md +++ b/docs/source/deployment/aws_batch.md @@ -18,7 +18,7 @@ To use AWS Batch, ensure you have the following prerequisites in place: - An [AWS account set up](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/). - A `name` attribute is set for each [Kedro node](/kedro.pipeline.node). Each node will run in its own Batch job, so having sensible node names will make it easier to `kedro run --node=`. -- [All node input/output `Datasets` must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3). A clean way to do this is to create a new configuration environment `conf/aws_batch` containing a `catalog.yml` file with the appropriate configuration, as illustrated below. +- [All node input/output datasets must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3). A clean way to do this is to create a new configuration environment `conf/aws_batch` containing a `catalog.yml` file with the appropriate configuration, as illustrated below.
Click to expand diff --git a/docs/source/extend_kedro/common_use_cases.md b/docs/source/extend_kedro/common_use_cases.md index 741f877405..7583879ad0 100644 --- a/docs/source/extend_kedro/common_use_cases.md +++ b/docs/source/extend_kedro/common_use_cases.md @@ -4,7 +4,7 @@ Kedro has a few built-in mechanisms for you to extend its behaviour. This docume ## Use Case 1: How to add extra behaviour to Kedro's execution timeline -The execution timeline of a Kedro pipeline can be thought of as a sequence of actions performed by various Kedro library components, such as the [Datasets](/kedro_datasets), [DataCatalog](/kedro.io.DataCatalog), [Pipeline](/kedro.pipeline.Pipeline), [Node](/kedro.pipeline.node.Node) and [KedroContext](/kedro.framework.context.KedroContext). +The execution timeline of a Kedro pipeline can be thought of as a sequence of actions performed by various Kedro library components, such as the [datasets](/kedro_datasets), [DataCatalog](/kedro.io.DataCatalog), [Pipeline](/kedro.pipeline.Pipeline), [Node](/kedro.pipeline.node.Node) and [KedroContext](/kedro.framework.context.KedroContext). At different points in the lifecycle of these components, you might want to add extra behaviour: for example, you could add extra computation for profiling purposes _before_ and _after_ a node runs, or _before_ and _after_ the I/O actions of a dataset, namely the `load` and `save` actions. @@ -12,7 +12,7 @@ This can now achieved by using [Hooks](../hooks/introduction.md), to define the ## Use Case 2: How to integrate Kedro with additional data sources -You can use [Datasets](/kedro_datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](custom_datasets.md). +You can use [datasets](/kedro_datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](custom_datasets.md). ## Use Case 3: How to add or modify CLI commands diff --git a/docs/source/resources/glossary.md b/docs/source/resources/glossary.md index 445e3096c1..4f382d9b78 100644 --- a/docs/source/resources/glossary.md +++ b/docs/source/resources/glossary.md @@ -2,7 +2,7 @@ ## Data Catalog - The Data Catalog is Kedro's registry of all data sources available for use in the data pipeline. It manages loading and saving of data. The Data Catalog maps the names of node inputs and outputs as keys in a Kedro `Dataset`, which can be specialised for different types of data storage. + The Data Catalog is Kedro's registry of all data sources available for use in the data pipeline. It manages loading and saving of data. The Data Catalog maps the names of node inputs and outputs as keys in a Kedro dataset, which can be specialised for different types of data storage. 
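As a small illustrative companion to the glossary entry above (a sketch, not an excerpt from the docs being patched), the snippet below registers one dataset under a name and then uses that name for saving and loading; the dataset name is made up for the example.

```python
from kedro.io import DataCatalog, MemoryDataset

# The catalog maps dataset *names* (the keys nodes use as inputs/outputs)
# to dataset implementations that know how to load and save the data.
catalog = DataCatalog({"preprocessed_companies": MemoryDataset()})

catalog.save("preprocessed_companies", {"rows": 3})  # e.g. written by one node
print(catalog.load("preprocessed_companies"))        # e.g. read by a downstream node
```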
[Further information about the Data Catalog](../data/data_catalog.md) From 76732ea5cee1ce2319bc369488c6c8ed4ce2ebb2 Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Fri, 18 Aug 2023 08:06:18 -0500 Subject: [PATCH 13/14] Replace any remaining instances of DataSet in docs --- docs/source/data/advanced_data_catalog_usage.md | 10 +++++----- docs/source/data/data_catalog.md | 4 ++-- docs/source/data/data_catalog_yaml_examples.md | 6 +++--- docs/source/data/how_to_create_a_custom_dataset.md | 6 +++--- docs/source/data/kedro_dataset_factories.md | 2 +- .../data/partitioned_and_incremental_datasets.md | 2 +- docs/source/deployment/argo.md | 2 +- 7 files changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md index 03670eaac7..1906500d35 100644 --- a/docs/source/data/advanced_data_catalog_usage.md +++ b/docs/source/data/advanced_data_catalog_usage.md @@ -55,7 +55,7 @@ gear = cars["gear"].values The following steps happened behind the scenes when `load` was called: - The value `cars` was located in the Data Catalog -- The corresponding `AbstractDataSet` object was retrieved +- The corresponding `AbstractDataset` object was retrieved - The `load` method of this dataset was called - This `load` method delegated the loading to the underlying pandas `read_csv` function @@ -70,9 +70,9 @@ This pattern is not recommended unless you are using platform notebook environme To save data using an API similar to that used to load data: ```python -from kedro.io import MemoryDataSet +from kedro.io import MemoryDataset -memory = MemoryDataSet(data=None) +memory = MemoryDataset(data=None) io.add("cars_cache", memory) io.save("cars_cache", "Memory can store anything.") io.load("cars_cache") @@ -190,7 +190,7 @@ io.save("test_data_set", data1) reloaded = io.load("test_data_set") assert data1.equals(reloaded) -# raises DataSetError since the path +# raises DatasetError since the path # data/01_raw/test.csv/my_exact_version/test.csv already exists io.save("test_data_set", data2) ``` @@ -219,7 +219,7 @@ io = DataCatalog({"test_data_set": test_data_set}) io.save("test_data_set", data1) # emits a UserWarning due to version inconsistency -# raises DataSetError since the data/01_raw/test.csv/exact_load_version/test.csv +# raises DatasetError since the data/01_raw/test.csv/exact_load_version/test.csv # file does not exist reloaded = io.load("test_data_set") ``` diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index 5e2a713616..241e339635 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -145,9 +145,9 @@ kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ ``` where `--load-version` is dataset name and version timestamp separated by `:`. -A dataset offers versioning support if it extends the [`AbstractVersionedDataSet`](/kedro.io.AbstractVersionedDataset) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively. +A dataset offers versioning support if it extends the [`AbstractVersionedDataset`](/kedro.io.AbstractVersionedDataset) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively. 
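A minimal sketch of that contract from the user's side, assuming `kedro.io.Version` and `kedro_datasets.pandas.CSVDataSet` as used elsewhere in these docs: a versioned dataset accepts a `version` argument and reads and writes under timestamped paths. The filepath is illustrative.

```python
import pandas as pd

from kedro.io import Version
from kedro_datasets.pandas import CSVDataSet

# Version(load=None, save=None) means: load the latest available version,
# save under a newly generated timestamp.
cars = CSVDataSet(
    filepath="data/01_raw/cars.csv",
    version=Version(load=None, save=None),
)

cars.save(pd.DataFrame({"gear": [4, 5]}))  # e.g. data/01_raw/cars.csv/<timestamp>/cars.csv
print(cars.load())                         # loads the most recent version back
```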
-To verify whether a dataset can undergo versioning, you should examine the dataset class code to inspect its inheritance [(you can find contributed datasets within the `kedro-datasets` repository)](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets). Check if the dataset class inherits from the `AbstractVersionedDataSet`. For instance, if you encounter a class like `CSVDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame])`, this indicates that the dataset is set up to support versioning. +To verify whether a dataset can undergo versioning, you should examine the dataset class code to inspect its inheritance [(you can find contributed datasets within the `kedro-datasets` repository)](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets). Check if the dataset class inherits from the `AbstractVersionedDataset`. For instance, if you encounter a class like `CSVDataSet(AbstractVersionedDataset[pd.DataFrame, pd.DataFrame])`, this indicates that the dataset is set up to support versioning. ```{note} Note that HTTP(S) is a supported file system in the dataset implementations, but if you it, you can't also use versioning. diff --git a/docs/source/data/data_catalog_yaml_examples.md b/docs/source/data/data_catalog_yaml_examples.md index 0570aa0f2c..f27981600d 100644 --- a/docs/source/data/data_catalog_yaml_examples.md +++ b/docs/source/data/data_catalog_yaml_examples.md @@ -397,12 +397,12 @@ for loading, so the first node outputs a `pyspark.sql.DataFrame`, while the seco You can use the [`kedro catalog create` command to create a Data Catalog YAML configuration](../development/commands_reference.md#create-a-data-catalog-yaml-configuration-file). -This creates a `//catalog/.yml` configuration file with `MemoryDataSet` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`. +This creates a `//catalog/.yml` configuration file with `MemoryDataset` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`. ```yaml # //catalog/.yml rockets: - type: MemoryDataSet + type: MemoryDataset scooters: - type: MemoryDataSet + type: MemoryDataset ``` diff --git a/docs/source/data/how_to_create_a_custom_dataset.md b/docs/source/data/how_to_create_a_custom_dataset.md index eee2a832a5..548d442f9b 100644 --- a/docs/source/data/how_to_create_a_custom_dataset.md +++ b/docs/source/data/how_to_create_a_custom_dataset.md @@ -2,9 +2,9 @@ [Kedro supports many datasets](/kedro_datasets) out of the box, but you may find that you need to create a custom dataset. For example, you may need to handle a proprietary data format or filesystem in your pipeline, or perhaps you have found a particular use case for a dataset that Kedro does not support. This tutorial explains how to create a custom dataset to read and save image data. -## AbstractDataSet +## AbstractDataset -For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataSet` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataSet` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation. 
+For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataset` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataset` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation. ## Scenario @@ -309,7 +309,7 @@ Versioning doesn't work with `PartitionedDataset`. You can't use both of them at ``` To add versioning support to the new dataset we need to extend the - [AbstractVersionedDataSet](/kedro.io.AbstractVersionedDataset) to: + [AbstractVersionedDataset](/kedro.io.AbstractVersionedDataset) to: * Accept a `version` keyword argument as part of the constructor * Adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively diff --git a/docs/source/data/kedro_dataset_factories.md b/docs/source/data/kedro_dataset_factories.md index 693272c013..2a65b4359e 100644 --- a/docs/source/data/kedro_dataset_factories.md +++ b/docs/source/data/kedro_dataset_factories.md @@ -215,7 +215,7 @@ The matches are ranked according to the following criteria: ## How to override the default dataset creation with dataset factories -You can use dataset factories to define a catch-all pattern which will overwrite the default [`MemoryDataSet`](/kedro.io.MemoryDataset) creation. +You can use dataset factories to define a catch-all pattern which will overwrite the default [`MemoryDataset`](/kedro.io.MemoryDataset) creation. ```yaml "{default_dataset}": diff --git a/docs/source/data/partitioned_and_incremental_datasets.md b/docs/source/data/partitioned_and_incremental_datasets.md index 207170a0a1..fde9dfd90a 100644 --- a/docs/source/data/partitioned_and_incremental_datasets.md +++ b/docs/source/data/partitioned_and_incremental_datasets.md @@ -15,7 +15,7 @@ This is why Kedro provides a built-in [PartitionedDataset](/kedro.io.Partitioned In this section, each individual file inside a given location is called a partition. ``` -### How to use `PartitionedDataSet` +### How to use `PartitionedDataset` You can use a `PartitionedDataset` in `catalog.yml` file like any other regular dataset definition: diff --git a/docs/source/deployment/argo.md b/docs/source/deployment/argo.md index a274968ed3..3aa86b8213 100644 --- a/docs/source/deployment/argo.md +++ b/docs/source/deployment/argo.md @@ -24,7 +24,7 @@ To use Argo Workflows, ensure you have the following prerequisites in place: - [Argo Workflows is installed](https://github.com/argoproj/argo/blob/master/README.md#quickstart) on your Kubernetes cluster - [Argo CLI is installed](https://github.com/argoproj/argo/releases) on your machine - A `name` attribute is set for each [Kedro node](/kedro.pipeline.node) since it is used to build a DAG -- [All node input/output DataSets must be configured in `catalog.yml`](../data/data_catalog_yaml_examples.md) and refer to an external location (e.g. AWS S3); you cannot use the `MemoryDataset` in your workflow +- [All node input/output datasets must be configured in `catalog.yml`](../data/data_catalog_yaml_examples.md) and refer to an external location (e.g. 
AWS S3); you cannot use the `MemoryDataset` in your workflow ```{note} Each node will run in its own container. From 65f84c26f1fcdf821a0097488539b7f8a0812271 Mon Sep 17 00:00:00 2001 From: Deepyaman Datta Date: Fri, 18 Aug 2023 09:19:50 -0500 Subject: [PATCH 14/14] Fix a broken link to docs for `PartitionedDataset` --- docs/source/data/how_to_create_a_custom_dataset.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data/how_to_create_a_custom_dataset.md b/docs/source/data/how_to_create_a_custom_dataset.md index 548d442f9b..46364031a0 100644 --- a/docs/source/data/how_to_create_a_custom_dataset.md +++ b/docs/source/data/how_to_create_a_custom_dataset.md @@ -271,7 +271,7 @@ class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]): Currently, the `ImageDataSet` only works with a single image, but this example needs to load all Pokemon images from the raw data directory for further processing. -Kedro's [`PartitionedDataset`](../data/kedro_io.md#partitioned-dataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory. +Kedro's [`PartitionedDataset`](/kedro.io.PartitionedDataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory. To use `PartitionedDataset` with `ImageDataSet` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataset` loads all PNG files from the data directory using `ImageDataSet`:
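Before the catalog YAML entry itself, a rough programmatic sketch of the same wiring may help. The import path for `ImageDataSet` and the data location below are assumptions for illustration, not necessarily the tutorial's exact values.

```python
from kedro.io import PartitionedDataset

# The custom dataset built earlier on this page; the import path is assumed.
from kedro_pokemon.extras.datasets.image_dataset import ImageDataSet

pokemon = PartitionedDataset(
    path="data/01_raw/pokemon-images-and-types/images/images",  # assumed location
    dataset=ImageDataSet,
    filename_suffix=".png",
)

# Loading returns a dict mapping partition ids to callables; each callable
# loads one PNG via ImageDataSet when invoked.
partitions = pokemon.load()
partition_id, load_partition = next(iter(partitions.items()))
image = load_partition()
```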