Releases: kedro-org/kedro
0.15.9
0.15.8
Major features and improvements
- Added the additional libraries to our `requirements.txt` so the `pandas.CSVDataSet` class works out of the box with `pip install kedro`.
- Added `pandas` to our `extras_require` in `setup.py`.
- Improved the error message when dependencies of a `DataSet` class are missing.
0.15.7
0.15.6
Major features and improvements
TL;DR We're launching `kedro.extras`, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in `kedro.extras.datasets` use `fsspec` to access a variety of data stores, including local file systems, network file systems, cloud object stores (including S3 and GCP) and Hadoop; read more about this here. The change will allow #178 to happen in the next major release of Kedro.
An example of this new system can be seen below, loading the CSV `SparkDataSet` from S3:
```yaml
weather:
  type: spark.SparkDataSet  # Observe the specified type, this affects all datasets
  filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
  credentials: dev_s3
  file_format: csv
```
You can also load data incrementally whenever it is dumped into a directory, thanks to the extension to `PartitionedDataSet`, a feature that allows you to load a directory of files. The `IncrementalDataSet` stores the information about the last processed partition in a `checkpoint`; read more about this feature here.
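As a rough sketch of incremental loading (the directory layout, dataset class and processing step are illustrative assumptions, not taken from the release notes):

```python
from kedro.io import IncrementalDataSet
from kedro.extras.datasets.pandas import CSVDataSet


def process(partitions):
    # Hypothetical processing step: report which partitions arrived.
    for partition_id in partitions:
        print("processing", partition_id)


# Hypothetical layout: CSV files are dumped into data/01_raw/events/ over time.
data_set = IncrementalDataSet(path="data/01_raw/events", dataset=CSVDataSet)

new_partitions = data_set.load()  # only partitions newer than the last checkpoint
process(new_partitions)

data_set.confirm()  # persist the checkpoint so these partitions are skipped next run
```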
New features
- Added a `layer` attribute for datasets in `kedro.extras.datasets` to specify the name of a layer according to data engineering convention; this feature will be passed to `kedro-viz` in future releases.
- Enabled loading a particular version of a dataset in Jupyter Notebooks and iPython, using `catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>")` (see the sketch after this list).
- Added a `run_id` property on `ProjectContext`, used for versioning with the `Journal`. To customise your journal `run_id` you can override the private method `_get_run_id()`.
- Added the ability to install all optional kedro dependencies via `pip install "kedro[all]"`.
- Modified the `DataCatalog`'s load order for datasets; datasets are now resolved in the following order:
  - `kedro.io`
  - `kedro.extras.datasets`
  - Import path, specified in `type`
- Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataSet` to specify the copy mode (`deepcopy`, `copy` or `assign`) to use when loading and saving.
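As a hedged illustration of the versioned load and the new `copy_mode` flag, the snippet below assumes a Kedro Jupyter session where `catalog` is pre-populated; the dataset name and timestamp are placeholders.

```python
from kedro.io import MemoryDataSet

# Load an exact historical version of a versioned dataset
# (placeholder name and timestamp):
df = catalog.load("dataset_name", version="2019-12-13T15.08.09.255Z")

# A MemoryDataSet that hands back the stored object instead of copying it:
ds = MemoryDataSet(data=df, copy_mode="assign")
assert ds.load() is df  # "assign" skips deepcopy/copy entirely
```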
New Datasets
Type | Description | Location |
---|---|---|
`ParquetDataSet` | Handles Parquet datasets using Dask | `kedro.extras.datasets.dask` |
`PickleDataSet` | Works with Pickle files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pickle` |
`CSVDataSet` | Works with CSV files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`TextDataSet` | Works with text files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`ExcelDataSet` | Works with Excel files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`HDFDataSet` | Works with HDF files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`YAMLDataSet` | Works with YAML files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.yaml` |
`MatplotlibWriter` | Saves Matplotlib images using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.matplotlib` |
`NetworkXDataSet` | Works with NetworkX files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.networkx` |
`BioSequenceDataSet` | Works with bio-sequence objects using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.biosequence` |
`GBQTableDataSet` | Works with Google BigQuery | `kedro.extras.datasets.pandas` |
`FeatherDataSet` | Works with Feather files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`IncrementalDataSet` | Inherits from `PartitionedDataSet` and remembers the last processed partition | `kedro.io` |
Files with a new location
Type | New Location |
---|---|
`JSONDataSet` | `kedro.extras.datasets.pandas` |
`CSVBlobDataSet` | `kedro.extras.datasets.pandas` |
`JSONBlobDataSet` | `kedro.extras.datasets.pandas` |
`SQLTableDataSet` | `kedro.extras.datasets.pandas` |
`SQLQueryDataSet` | `kedro.extras.datasets.pandas` |
`SparkDataSet` | `kedro.extras.datasets.spark` |
`SparkHiveDataSet` | `kedro.extras.datasets.spark` |
`SparkJDBCDataSet` | `kedro.extras.datasets.spark` |
`kedro/contrib/decorators/retry.py` | `kedro/extras/decorators/retry_node.py` |
`kedro/contrib/decorators/memory_profiler.py` | `kedro/extras/decorators/memory_profiler.py` |
`kedro/contrib/io/transformers/transformers.py` | `kedro/extras/transformers/time_profiler.py` |
`kedro/contrib/colors/logging/color_logger.py` | `kedro/extras/logging/color_logger.py` |
`extras/ipython_loader.py` | `tools/ipython/ipython_loader.py` |
`kedro/contrib/io/cached/cached_dataset.py` | `kedro/io/cached_dataset.py` |
`kedro/contrib/io/catalog_with_default/data_catalog_with_default.py` | `kedro/io/data_catalog_with_default.py` |
`kedro/contrib/config/templated_config.py` | `kedro/config/templated_config.py` |
Upcoming deprecations
Category | Type |
---|---|
Datasets | `BioSequenceLocalDataSet` |
| `CSVGCSDataSet` |
| `CSVHTTPDataSet` |
| `CSVLocalDataSet` |
| `CSVS3DataSet` |
| `ExcelLocalDataSet` |
| `FeatherLocalDataSet` |
| `JSONGCSDataSet` |
| `JSONLo...` |
0.15.5
Major features and improvements
- New CLI commands and command flags:
  - Load multiple `kedro run` CLI flags from a configuration file with the `--config` flag (e.g. `kedro run --config run_config.yml`).
  - Run parametrised pipeline runs with the `--params` flag (e.g. `kedro run --params param1:value1,param2:value2`).
  - Lint your project code using the `kedro lint` command; your project is linted with `black` (Python 3.6+), `flake8` and `isort`.
- Load specific environments with Jupyter notebooks using `KEDRO_ENV`, an environment variable that globally sets the environment for the `run`, `jupyter notebook` and `jupyter lab` commands.
- Added the following datasets:
  - `CSVGCSDataSet` dataset in `contrib` for working with CSV files in Google Cloud Storage.
  - `ParquetGCSDataSet` dataset in `contrib` for working with Parquet files in Google Cloud Storage.
  - `JSONGCSDataSet` dataset in `contrib` for working with JSON files in Google Cloud Storage.
  - `MatplotlibS3Writer` dataset in `contrib` for saving Matplotlib images to S3.
  - `PartitionedDataSet` for working with datasets split across multiple files.
  - `JSONDataSet` dataset for working with JSON files that uses `fsspec` to communicate with the underlying filesystem. It doesn't support `http(s)` protocol for now.
- Added `s3fs_args` to all S3 datasets.
- Pipelines can be subtracted with `pipeline1 - pipeline2` (see the sketch below).
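A hedged sketch of pipeline subtraction; the node function and names are invented for illustration.

```python
from kedro.pipeline import Pipeline, node

def identity(x):
    return x

full = Pipeline([
    node(identity, "raw", "clean", name="clean_step"),
    node(identity, "clean", "features", name="feature_step"),
])
done = Pipeline([node(identity, "raw", "clean", name="clean_step")])

# Subtraction removes the nodes of `done` from `full`,
# leaving a pipeline containing only `feature_step`.
remaining = full - done
print(remaining.nodes)
```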
Bug fixes and other changes
- `ParallelRunner` now works with `SparkDataSet`.
- Allowed the use of nulls in `parameters.yml`.
- Fixed an issue where `%reload_kedro` wasn't reloading all user modules.
- Fixed the `pandas_to_spark` and `spark_to_pandas` decorators to work with functions with kwargs.
- Fixed a bug where `kedro jupyter notebook` and `kedro jupyter lab` would run a different Jupyter installation to the one in the local environment.
- Implemented Databricks-compatible dataset versioning for `SparkDataSet`.
- Fixed a bug where `kedro package` would fail in certain situations where `kedro build-reqs` was used to generate `requirements.txt`.
- Made the `bucket_name` argument optional for the following datasets: `CSVS3DataSet`, `HDFS3DataSet`, `PickleS3DataSet`, `contrib.io.parquet.ParquetS3DataSet`, `contrib.io.gcs.JSONGCSDataSet`. The bucket name can now be included in the filepath along with the filesystem protocol (e.g. `s3://bucket-name/path/to/key.csv`), as sketched below.
- Documentation improvements and fixes.
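A minimal sketch of the now-optional `bucket_name`, assuming `CSVS3DataSet` is imported from `kedro.io` as in other 0.15.x examples; actually loading would require valid S3 credentials.

```python
from kedro.io import CSVS3DataSet

# Previously: CSVS3DataSet(filepath="path/to/key.csv", bucket_name="my-bucket")
# Now the bucket name can live inside the filepath, protocol included:
data_set = CSVS3DataSet(filepath="s3://my-bucket/path/to/key.csv")
df = data_set.load()  # requires valid AWS credentials in the environment
```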
Breaking changes to the API
- Renamed the entry point for running pip-installed projects to `run_package()` instead of `main()` in `src/<package>/run.py`.
- The `bucket_name` key has been removed from the string representation of the following datasets: `CSVS3DataSet`, `HDFS3DataSet`, `PickleS3DataSet`, `contrib.io.parquet.ParquetS3DataSet`, `contrib.io.gcs.JSONGCSDataSet`.
- Moved the `mem_profiler` decorator to `contrib` and separated the `contrib` decorators so that dependencies are modular. You may need to update your import paths; for example, the pyspark decorators should be imported as `from kedro.contrib.decorators.pyspark import <pyspark_decorator>` instead of `from kedro.contrib.decorators import <pyspark_decorator>`.
Thanks for supporting contributions
Sheldon Tsen, @roumail, Karlson Lee, Waylon Walker, Deepyaman Datta, Giovanni, Zain Patel
0.15.4
Major features and improvements
- `kedro jupyter` now gives the default kernel a sensible name.
- `Pipeline.name` has been deprecated in favour of `Pipeline.tags`.
- Reuse pipelines within a Kedro project using `Pipeline.transform`; it simplifies dataset and node renaming (see the sketch below).
- Added a Jupyter Notebook line magic (`%run_viz`) to run `kedro viz` in a Notebook cell (requires `kedro-viz` version 3.0.0 or later).
- Added the following datasets:
  - `NetworkXLocalDataSet` in `kedro.contrib.io.networkx` to load and save local graphs (JSON format) via NetworkX. (by @josephhaaga)
  - `SparkHiveDataSet` in `kedro.contrib.io.pyspark.SparkHiveDataSet` allowing usage of Spark and insert/upsert on non-transactional Hive tables.
- `kedro.contrib.config.TemplatedConfigLoader` now supports name/dict key templating and default values.
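A hedged sketch of pipeline reuse with `Pipeline.transform`; the `datasets` and `prefix` keyword names are recalled from the 0.15.x API and should be checked against your installed version.

```python
from kedro.pipeline import Pipeline, node

def train(data):
    return "model"  # placeholder training logic

base = Pipeline([node(train, "input_data", "model", name="train")])

# Reuse the pipeline twice, remapping the input dataset and prefixing
# node and dataset names so the two copies do not collide.
france = base.transform(datasets={"input_data": "france_data"}, prefix="france")
germany = base.transform(datasets={"input_data": "germany_data"}, prefix="germany")

combined = france + germany
```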
Bug fixes and other changes
- The `get_last_load_version()` method for versioned datasets now returns the exact last load version if the dataset has been loaded at least once, and `None` otherwise.
- Fixed a bug in the `_exists` method for versioned `SparkDataSet`.
- Enabled customisation of the ExcelWriter in `ExcelLocalDataSet` by specifying options under the `writer` key in `save_args`.
- Fixed a bug in the IPython startup script that attempted to load the context from the incorrect location.
- Removed capping of the length of a dataset's string representation.
- Fixed the `kedro install` command failing on Windows if `src/requirements.txt` contains a different version of Kedro.
- Enabled passing a single tag into a node or a pipeline without having to wrap it in a list (i.e. `tags="my_tag"`), as sketched below.
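A small sketch of the single-tag shorthand; the function and dataset names are placeholders.

```python
from kedro.pipeline import Pipeline, node

def preprocess(df):
    return df  # placeholder

# Previously a list was required (tags=["my_tag"]); a bare string now works.
n = node(preprocess, "raw_data", "clean_data", tags="my_tag")
pipeline = Pipeline([n])
```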
Breaking changes to the API
- Removed the `_check_paths_consistency()` method from `AbstractVersionedDataSet`. Version consistency checking is now done in `AbstractVersionedDataSet.save()`. Custom versioned datasets should modify their `save()` method implementation accordingly.
Thanks for supporting contributions
Joseph Haaga, Deepyaman Datta, Joost Duisters, Zain Patel, Tom Vigrass
0.15.3
0.15.2
Major features and improvements
- Added `--load-version`, a `kedro run` argument that allows you to run the pipeline with a particular load version of a dataset.
- Support for modular pipelines in `src/`: break the pipeline into isolated parts with reusability in mind.
- Support for multiple pipelines: the ability to have multiple entry-point pipelines and choose one with `kedro run --pipeline NAME`.
- Added a `MatplotlibWriter` dataset in `contrib` for saving Matplotlib images.
- The ability to template/parameterize configuration files with `kedro.contrib.config.TemplatedConfigLoader`.
- Parameters are exposed as a context property for ease of access in iPython / Jupyter Notebooks with `context.params` (see the sketch after this list).
- Added a `max_workers` parameter for `ParallelRunner`.
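A brief hedged sketch of the last two items; `context` is the variable a Kedro Jupyter session creates, and the parameter key is a placeholder.

```python
from kedro.runner import ParallelRunner

# `context.params` exposes the dictionary built from parameters.yml:
learning_rate = context.params["learning_rate"]  # placeholder key

# Cap the number of worker processes used by the parallel runner:
runner = ParallelRunner(max_workers=4)
```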
Bug fixes and other changes
- Users now override the `_get_pipeline` abstract method in `ProjectContext(KedroContext)` in `run.py`, rather than the `pipeline` abstract property. The `pipeline` property is no longer abstract.
- Improved the error message shown when a versioned local dataset is saved and an unversioned path already exists.
- Added a `catalog` global variable to `00-kedro-init.py`, allowing you to load datasets with `catalog.load()`.
- Enabled tuples to be returned from a node (see the sketch after this list).
- Disallowed the `ConfigLoader` from loading the same file more than once, and deduplicated the `conf_paths` passed in.
- Added an `--open` flag to `kedro build-docs` that opens the documentation on build.
- Updated the `Pipeline` representation to include the name of the pipeline, also making it readable as a context property.
- `kedro.contrib.io.pyspark.SparkDataSet` and `kedro.contrib.io.azure.CSVBlobDataSet` now support versioning.
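A sketch of tuple outputs from a node; the splitting logic is a placeholder.

```python
from kedro.pipeline import node

def split(df):
    # A node function may now return a tuple, one element per declared output.
    return df[:100], df[100:]

split_node = node(split, inputs="full_data", outputs=["train", "test"])
```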
Breaking changes to the API
- `KedroContext.run()` no longer accepts `catalog` and `pipeline` arguments.
- `node.inputs` now returns the node's inputs in the order required to bind them properly to the node's function.
Thanks for supporting contributions
Deepyaman Datta, Luciano Issoe, Joost Duisters, Zain Patel, William Ashford, Karlson Lee
0.15.1
Major features and improvements
- Extended `versioning` support to cover the tracking of environment setup, code and datasets.
- Added the following datasets:
  - `FeatherLocalDataSet` in `contrib` for usage with pandas. (by @mdomarsaleem)
- Added `get_last_load_version` and `get_last_save_version` to `AbstractVersionedDataSet`.
- Implemented a `__call__` method on `Node` to allow users to execute `my_node(input1=1, input2=2)` as an alternative to `my_node.run(dict(input1=1, input2=2))` (see the sketch after this list).
- Added a new `--from-inputs` run argument.
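A sketch of the new `__call__` shorthand, using the example from the note above.

```python
from kedro.pipeline import node

def add(input1, input2):
    return input1 + input2

my_node = node(add, inputs=["input1", "input2"], outputs="sum")

# Equivalent invocations; __call__ forwards keyword arguments to run():
outputs = my_node(input1=1, input2=2)
outputs = my_node.run(dict(input1=1, input2=2))
```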
Bug fixes and other changes
- Fixed a bug in `load_context()` not loading the context in non-Kedro Jupyter Notebooks.
- Fixed a bug in `ConfigLoader.get()` not listing nested files for `**`-ending glob patterns.
- Fixed a logging config error in Jupyter Notebook.
- Updated documentation in `03_configuration` regarding how to modify the configuration path.
- Documented the architecture of Kedro, showing how we think about library, project and framework components.
- `extras/kedro_project_loader.py` was renamed to `extras/ipython_loader.py` and now runs any IPython startup scripts without relying on the Kedro project structure.
- Fixed a TypeError when validating a partial function's signature.
- After a node failure during a pipeline run, a resume command will be suggested in the logs. This command will not work if the required inputs are `MemoryDataSet`s.
Breaking changes to the API
None
Thanks for supporting contributions
0.15.0
Major features and improvements
- Added a `KedroContext` base class which holds the configuration and Kedro's main functionality (catalog, pipeline, config, runner).
- Added a new CLI command, `kedro jupyter convert`, to facilitate converting Jupyter Notebook cells into Kedro nodes.
- Added support for `pip-compile` and a new Kedro command, `kedro build-reqs`, that generates `requirements.txt` based on `requirements.in`.
- Running `kedro install` will install packages to a conda environment if `src/environment.yml` exists in your project.
- Added a new `--node` flag to `kedro run`, allowing users to run only the nodes with the specified names.
- Added new `--from-nodes` and `--to-nodes` run arguments, allowing users to run a range of nodes from the pipeline.
- Added the prefix `params:` to the parameters specified in `parameters.yml`, which allows users to differentiate between their different parameter node inputs and outputs (see the sketch below the dataset list).
- Jupyter Lab/Notebook now starts with only one kernel by default.
- Added the following datasets:
  - `CSVHTTPDataSet` to load CSV using HTTP(s) links.
  - `JSONBlobDataSet` to load json (-delimited) files from Azure Blob Storage.
  - `ParquetS3DataSet` in `contrib` for usage with pandas. (by @mmchougule)
  - `CachedDataSet` in `contrib` which will cache data in memory to avoid io/network operations. It will clear the cache once a dataset is no longer needed by a pipeline. (by @tsanikgr)
  - `YAMLLocalDataSet` in `contrib` to load and save local YAML files. (by @Minyus)
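A sketch of the `params:` prefix in a node definition; the parameter key and function are placeholders.

```python
from kedro.pipeline import node

def train(data, learning_rate):
    ...  # placeholder training logic

# "params:learning_rate" is resolved from parameters.yml at run time,
# which distinguishes parameters from regular catalog datasets.
train_node = node(
    train,
    inputs=["train_data", "params:learning_rate"],
    outputs="model",
)
```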
Bug fixes and other changes
- Documentation improvements, including instructions on how to initialise a Spark session using YAML configuration.
- The `anyconfig` default log level changed from `INFO` to `WARNING`.
- Added information on installed plugins to `kedro info`.
- Added style sheets for project documentation, so the output of `kedro build-docs` will resemble the style of `kedro docs`.
Breaking changes to the API
- Simplified the Kedro template in `run.py` with the introduction of the `KedroContext` class.
- Merged `FilepathVersionMixIn` and `S3VersionMixIn` under one abstract class, `AbstractVersionedDataSet`, which extends `AbstractDataSet`.
- `name` changed to be a keyword-only argument for `Pipeline`.
- `CSVLocalDataSet` no longer supports URLs. `CSVHTTPDataSet` supports URLs.
Migration guide from Kedro 0.14.X to Kedro 0.15.0
Migration for Kedro project template
This guide assumes that:
- The framework-specific code has not been altered significantly.
- Your project-specific code is stored in the dedicated python package under `src/`.
The breaking changes were introduced in the following project template files:
- `<project-name>/.ipython/profile_default/startup/00-kedro-init.py`
- `<project-name>/kedro_cli.py`
- `<project-name>/src/tests/test_run.py`
- `<project-name>/src/<package-name>/run.py`
- `<project-name>/.kedro.yml` (new file)
The easiest way to migrate your project from Kedro 0.14.* to Kedro 0.15.0 is to create a new project (by using `kedro new`) and move code and files bit by bit, as suggested in the detailed guide below:
- Create a new project with the same name by running `kedro new`.
- Copy the following folders to the new project:
  - `results/`
  - `references/`
  - `notebooks/`
  - `logs/`
  - `data/`
  - `conf/`
- If you customised your `src/<package>/run.py`, make sure you apply the same customisations to `src/<package>/run.py` in the new project:
  - If you customised `get_config()`, you can override the `config_loader` property in your `ProjectContext`-derived class.
  - If you customised `create_catalog()`, you can override the `catalog()` property in your `ProjectContext`-derived class.
  - If you customised `run()`, you can override the `run()` method in your `ProjectContext`-derived class.
  - If you customised the default `env`, you can override it in your `ProjectContext`-derived class or pass it at construction. By default, `env` is `local`.
  - If you customised the default `root_conf`, you can override the `CONF_ROOT` attribute in your `ProjectContext`-derived class. By default, the `KedroContext` base class has the `CONF_ROOT` attribute set to `conf`.
- The following syntax changes are introduced in ipython or Jupyter notebook/labs (an example follows this list):
  - `proj_dir` -> `context.project_path`
  - `proj_name` -> `context.project_name`
  - `conf` -> `context.config_loader`
  - `io` -> `context.catalog` (e.g. `io.load()` -> `context.catalog.load()`)
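For instance, a notebook cell that previously called `io.load()` would now read as follows (the dataset name is a placeholder):

```python
# 0.14.x:  df = io.load("example_dataset")
# 0.15.0:
df = context.catalog.load("example_dataset")
```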
- If you customised your `kedro_cli.py`, you need to apply the same customisations to your `kedro_cli.py` in the new project.
- Copy the contents of the old project's `src/requirements.txt` into the new project's `src/requirements.in` and, from the project root directory, run the `kedro build-reqs` command in your terminal window.
Migration for versioning custom dataset classes
If you defined any custom dataset classes which support versioning in your project, you need to apply the following changes:
- Make sure your dataset inherits from `AbstractVersionedDataSet` only.
- Call `super().__init__()` with the appropriate arguments in the dataset's `__init__`. If storing on the local filesystem, providing the filepath and the version is enough. Otherwise, you should also pass in an `exists_function` and a `glob_function` that emulate `exists` and `glob` in a different filesystem (see `CSVS3DataSet` as an example).
- Remove the setting of the `_filepath` and `_version` attributes in the dataset's `__init__`, as this is taken care of in the base abstract class.
- Any calls to the `_get_load_path` and `_get_save_path` methods should take no arguments.
- Ensure you convert the output of `_get_load_path` and `_get_save_path` appropriately, as these now return `PurePath`s instead of strings.
- Make sure `_check_paths_consistency` is called with `PurePath`s as input arguments, instead of strings.
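Below is a hedged sketch of what a migrated local-filesystem dataset might look like; the class name and CSV logic are illustrative, and the base-class signature should be verified against your 0.15.0 installation.

```python
from pathlib import Path, PurePosixPath

import pandas as pd

from kedro.io import AbstractVersionedDataSet, Version


class MyCSVDataSet(AbstractVersionedDataSet):
    """Illustrative versioned CSV dataset for the local filesystem."""

    def __init__(self, filepath: str, version: Version = None):
        # Local filesystem: filepath and version are enough. A remote
        # filesystem would also pass exists_function= and glob_function=.
        super().__init__(PurePosixPath(filepath), version)

    def _load(self) -> pd.DataFrame:
        load_path = self._get_load_path()  # takes no arguments, returns a PurePath
        return pd.read_csv(str(load_path))  # convert the PurePath before use

    def _save(self, data: pd.DataFrame) -> None:
        save_path = Path(self._get_save_path())
        save_path.parent.mkdir(parents=True, exist_ok=True)
        data.to_csv(str(save_path), index=False)

    def _describe(self):
        # _filepath and _version are set by the base class
        return dict(filepath=self._filepath, version=self._version)
```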
These steps should have brought your project to Kedro 0.15.0. There might be some more minor tweaks needed as every project is unique, but now you have a pretty solid base to work with. If you run into any problems, please consult the Kedro documentation.
Thanks for supporting contributions
Dmitry Vukolov, Jo Stichbury, Angus Williams, Deepyaman Datta, Mayur Chougule, Marat Kopytjuk, Evan Miller, Yusuke Minami