Merge pull request #105 from quantumblacklabs/release/0.14.3
Release 0.14.3
nakhan98 authored Jun 26, 2019
2 parents a1fc18a + 4fb3930 commit d080ead
Showing 160 changed files with 1,979 additions and 547 deletions.
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
## Notice

- [ ] I acknowledge and agree that, by checking this box and clicking Submit Pull Request:
- [ ] I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

- I submit this contribution under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0.txt) and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
- I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.
28 changes: 14 additions & 14 deletions LICENSE.md
@@ -6,22 +6,22 @@ You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND
NONINFRINGEMENT. IN NO EVENT WILL THE LICENSOR OR OTHER CONTRIBUTORS
BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF, OR IN
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND
NONINFRINGEMENT. IN NO EVENT WILL THE LICENSOR OR OTHER CONTRIBUTORS
BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF, OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

The QuantumBlack Visual Analytics Limited (QuantumBlack) name and logo
(either separately or in combination, QuantumBlack Trademarks) are
trademarks of QuantumBlack. The License does not grant you any right or
license to the QuantumBlack Trademarks. You may not use the QuantumBlack
Trademarks or any confusingly similar mark as a trademark for your product,
or use the QuantumBlack Trademarks in any other manner that might cause
confusion in the marketplace, including but not limited to in advertising,
The QuantumBlack Visual Analytics Limited ("QuantumBlack") name and logo
(either separately or in combination, "QuantumBlack Trademarks") are
trademarks of QuantumBlack. The License does not grant you any right or
license to the QuantumBlack Trademarks. You may not use the QuantumBlack
Trademarks or any confusingly similar mark as a trademark for your product,
or use the QuantumBlack Trademarks in any other manner that might cause
confusion in the marketplace, including but not limited to in advertising,
on websites, or on software.

See the License for the specific language governing permissions and
See the License for the specific language governing permissions and
limitations under the License.
3 changes: 2 additions & 1 deletion Makefile
@@ -18,7 +18,8 @@ lint:
pylint -j 0 --disable=unnecessary-pass kedro
pylint -j 0 --disable=missing-docstring,redefined-outer-name,no-self-use,invalid-name tests
pylint -j 0 --disable=missing-docstring,no-name-in-module features
flake8 kedro tests features --exclude kedro/template*
pylint -j 0 extras
flake8 kedro tests features extras --exclude kedro/template*

test:
pytest tests
9 changes: 5 additions & 4 deletions README.md
@@ -3,6 +3,7 @@
`develop` | `master`
----------|---------
[![CircleCI](https://circleci.com/gh/quantumblacklabs/kedro/tree/develop.svg?style=shield)](https://circleci.com/gh/quantumblacklabs/kedro/tree/develop) | [![CircleCI](https://circleci.com/gh/quantumblacklabs/kedro/tree/master.svg?style=shield)](https://circleci.com/gh/quantumblacklabs/kedro/tree/master)
[![Build status](https://ci.appveyor.com/api/projects/status/2u74p5g8fdc45wwh/branch/develop?svg=true)](https://ci.appveyor.com/project/QuantumBlack/kedro/branch/develop) | [![Build status](https://ci.appveyor.com/api/projects/status/2u74p5g8fdc45wwh/branch/master?svg=true)](https://ci.appveyor.com/project/QuantumBlack/kedro/branch/master)

[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python Version](https://img.shields.io/badge/python-3.5%20%7C%203.6%20%7C%203.7-blue.svg)](https://pypi.org/project/kedro/)
@@ -13,7 +14,7 @@

# What is Kedro?

> The centre of your data pipeline.
> "The centre of your data pipeline."
Kedro is a workflow development tool that helps you build data pipelines that are robust, scalable, deployable, reproducible and versioned. We provide a standard approach so that you can:
- spend more time building your data pipeline,
@@ -58,9 +59,9 @@ For more detailed installation instructions, including how to setup Python virtu
### 4. Feature extensibility

- A plugin system that injects commands into the Kedro command line interface (CLI)
- (_coming soon_) List of officially supported plugins:
- Kedro-Airflow, making it easy to prototype your data pipeline in Kedro before deploying to [Airflow](https://github.com/apache/airflow), a workflow scheduler
- Kedro-Docker, a tool for packing and shipping Kedro projects within containers
- List of officially supported plugins:
- (_coming soon_) Kedro-Airflow, making it easy to prototype your data pipeline in Kedro before deploying to [Airflow](https://github.com/apache/airflow), a workflow scheduler
- [Kedro-Docker](https://github.com/quantumblacklabs/kedro-docker), a tool for packaging and shipping Kedro projects within containers
- Kedro can be deployed locally, on-premise and cloud (AWS, Azure and GCP) servers, or clusters (EMR, Azure HDinsight, GCP and Databricks)

![Kedro-Viz Pipeline Visualisation](https://raw.githubusercontent.com/quantumblacklabs/kedro/master/img/pipeline_visualisation.png)
27 changes: 25 additions & 2 deletions RELEASE.md
@@ -1,10 +1,33 @@
# Release 0.14.3

## Major features and improvements
* Tab completion for catalog datasets in `ipython` or `jupyter` sessions. (Thank you [@datajoely](https://github.com/datajoely) and [@WaylonWalker](https://github.com/WaylonWalker))
* Added support for transcoding, an ability to decouple loading/saving mechanisms of a dataset from its storage location, denoted by adding '@' to the dataset name.
* Datasets have a new `release` function that instructs them to free any cached data. The runners will call this when the dataset is no longer needed downstream.
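
To make the new `release` behaviour concrete, here is a minimal, Kedro-free sketch of a dataset that caches on load and frees its cache on release (`CachedDataSet` is an illustrative stand-in, not Kedro's actual `AbstractDataSet`):

```python
class CachedDataSet:
    """Illustrative stand-in for a dataset that caches loaded data."""

    def __init__(self, data):
        self._data = data
        self._cache = None

    def load(self):
        # Cache on first load so repeated loads are cheap.
        if self._cache is None:
            self._cache = self._data
        return self._cache

    def release(self):
        # Free any cached data; runners call this once no downstream
        # node needs the dataset any more.
        self._cache = None


ds = CachedDataSet([1, 2, 3])
assert ds.load() == [1, 2, 3]
ds.release()
assert ds._cache is None
```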

## Bug fixes and other changes
* Add support for pipeline nodes made up from partial functions.
* Expand user home directory `~` for TextLocalDataSet (see issue #19).
* Add a `short_name` property to `Node`s for a display-friendly (but not necessarily unique) name.
* Add Kedro project loader for IPython: `extras/kedro_project_loader.py`.
* Fix source file encoding issues with Python 3.5 on Windows.
* Fix local project source not having priority over the same source installed as a package, leading to local updates not being recognised.
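
For the partial-functions fix above, a sketch of the kind of callable that is now accepted (the `node(...)` usage in the comment is hypothetical and assumes Kedro's node API; the runnable part below is plain Python):

```python
from functools import partial

def add(increment, value):
    return increment + value

# A partial object has no __name__ of its own, which previously broke
# node naming; such callables can now be used as node functions, e.g.:
#   node(partial(add, 1), inputs="input_data", outputs="output_data")
add_one = partial(add, 1)

assert add_one(41) == 42
```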

## Breaking changes to the API
* Remove the `max_loads` argument from the `MemoryDataSet` constructor and from the `AbstractRunner.create_default_data_set` method.

## Thanks for supporting contributions
[Nikolaos Tsaousis](https://github.com/tsanikgr), [Ivan Danov](https://github.com/idanov), [Gordon Wrigley](https://github.com/tolomea), [Yetunde Dada](https://github.com/yetudada), [Kiyohito Kunii](https://github.com/921kiyo), [Lorena Balan](https://github.com/lorenabalan), [Richard Westenra](https://github.com/richardwestenra), [Dmitrii Deriabin](https://github.com/DmitryDeryabin), [Joel Schwarzmann](https://github.com/datajoely), [Alex Kalmikov](https://github.com/kalexqb)

# Release 0.14.2

## Major features and improvements
* Added Data Set transformer support in the form of AbstractTransformer and DataCatalog.add_transformer
* Added Data Set transformer support in the form of AbstractTransformer and DataCatalog.add_transformer.

## Breaking changes to the API
* Merged the ExistsMixin into AbstractDataSet
* Merged the `ExistsMixin` into `AbstractDataSet`.
* `Pipeline.node_dependencies` returns a dictionary keyed by node, with sets of parent nodes as values; `Pipeline` and `ParallelRunner` were refactored to make use of this for topological sort for node dependency resolution and running pipelines respectively.
* `Pipeline.grouped_nodes` returns a list of sets, rather than a list of lists.

## Thanks for supporting contributions

4 changes: 2 additions & 2 deletions docs/build-docs.sh
@@ -16,8 +16,8 @@
# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF, OR IN
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
# The QuantumBlack Visual Analytics Limited (QuantumBlack) name and logo
# (either separately or in combination, QuantumBlack Trademarks) are
# The QuantumBlack Visual Analytics Limited ("QuantumBlack") name and logo
# (either separately or in combination, "QuantumBlack Trademarks") are
# trademarks of QuantumBlack. The License does not grant you any right or
# license to the QuantumBlack Trademarks. You may not use the QuantumBlack
# Trademarks or any confusingly similar mark as a trademark for your product,
28 changes: 28 additions & 0 deletions docs/source/02_getting_started/01_prerequisites.md
@@ -6,6 +6,34 @@ Kedro supports macOS, Linux and Windows (7 / 8 / 10 and Windows Server 2016+). I

In order to work effectively with Kedro projects, we highly recommend you download and install [Anaconda](https://www.anaconda.com/download/#macos) (Python 3.x version) and [Java](https://www.oracle.com/technetwork/java/javase/downloads/index.html) (if using PySpark).

### Build tools

On Unix-like operating systems, you will need to install a C compiler and related build tools for your platform. This is due to the inclusion of the [memory-profiler](https://pypi.org/project/memory-profiler/) library in our dependencies. If your operating system is not mentioned, please refer to its documentation.

#### macOS
To install Command Line Tools for Xcode, run the following from the terminal:

```bash
xcode-select --install
```

#### GNU/Linux

##### Debian/Ubuntu

The following command (run with root permissions) will install the `build-essential` metapackage for Debian-based distributions:

```bash
apt-get update && apt-get install build-essential
```

##### Red Hat Enterprise Linux / CentOS
The following command (run with root permissions) will install the "Development Tools" group of packages on RHEL/CentOS:

```bash
yum groupinstall 'Development Tools'
```

### Windows

You will require admin rights to complete the installation of the following tools on your machine:
8 changes: 4 additions & 4 deletions docs/source/03_tutorial/05_package_a_project.md
@@ -1,13 +1,12 @@
# Packaging a project

In this section, you will learn how to build your project documentation, as well as how to bundle your project into a Python package for handover.

In this section, you will learn how to build your project documentation, as well as how to bundle your project into a Python package for handover.

## Add documentation to your project

While Kedro documentation can be found by running `kedro docs` from the command line, project-specific documentation can be generated by running `kedro build-docs` in the project's root directory.
While Kedro documentation can be found by running `kedro docs` from the command line, project-specific documentation can be generated by running `kedro build-docs` in the project's root directory.

This will create documentation based on the code structure of your project. Documentation will also include the [`docstrings`](https://www.datacamp.com/community/tutorials/docstrings-python) defined in the project code. The resulting HTML files can be found in `docs/build/html/`.
This will create documentation based on the code structure of your project. Documentation will also include the [`docstrings`](https://www.datacamp.com/community/tutorials/docstrings-python) defined in the project code. The resulting HTML files can be found in `docs/build/html/`.

`kedro build-docs` uses the [Sphinx](https://www.sphinx-doc.org) framework to build your project documentation, so if you want to customise it, please refer to `docs/source/conf.py` and the [corresponding section](http://www.sphinx-doc.org/en/master/usage/configuration.html) of the Sphinx documentation.

@@ -16,6 +15,7 @@ This will create documentation based on the code structure of your project. Docu

You can package your project by running `kedro package` from the command line. This will create one `.egg` file and one `.whl` file within the `src/dist/` folder of your project, which are Python packaging formats. For further information about packaging for Python, documentation is provided [here](https://packaging.python.org/overview/).

You can also check out [Kedro-Docker](https://github.com/quantumblacklabs/kedro-docker), an officially supported Kedro plugin for packaging and shipping Kedro projects within [Docker](https://www.docker.com/) containers.

## What is next?

2 changes: 1 addition & 1 deletion docs/source/04_user_guide/01_setting_up_vscode.md
@@ -1,6 +1,6 @@
# Setting up Visual Studio Code

> *Note:* This documentation is based on `Kedro 0.14.2`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
> *Note:* This documentation is based on `Kedro 0.14.3`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
Start by opening a new project directory in VS Code and installing the Python plugin under **Tools and languages**:

2 changes: 1 addition & 1 deletion docs/source/04_user_guide/02_setting_up_pycharm.md
@@ -1,6 +1,6 @@
# Setting up PyCharm

> *Note:* This documentation is based on `Kedro 0.14.2`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
> *Note:* This documentation is based on `Kedro 0.14.3`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
This section will present a quick guide on how to configure [PyCharm](https://www.jetbrains.com/pycharm/) as a development environment for working on Kedro projects.

2 changes: 1 addition & 1 deletion docs/source/04_user_guide/03_configuration.md
@@ -1,6 +1,6 @@
# Configuration

> *Note:* This documentation is based on `Kedro 0.14.2`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
> *Note:* This documentation is based on `Kedro 0.14.3`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
This section contains detailed information about configuration. You may also want to consult the relevant API documentation on [kedro.config](/kedro.config.rst).

28 changes: 23 additions & 5 deletions docs/source/04_user_guide/04_data_catalog.md
@@ -1,6 +1,6 @@
# The Data Catalog

> *Note:* This documentation is based on `Kedro 0.14.2`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
> *Note:* This documentation is based on `Kedro 0.14.3`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
This section introduces `catalog.yml`, the project-shareable Data Catalog. The file is located in `conf/base` and is a registry of all data sources available for use by a project; it manages loading and saving of data.

@@ -175,6 +175,24 @@ airplanes:

In this example the default `csv` configuration is inserted into `airplanes` and then the `load_args` block is overridden. Normally that would replace the whole dictionary. In order to extend `load_args` the defaults for that block are then re-inserted.
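
A sketch of what such a configuration might look like, using YAML anchors and merge keys (dataset names and arguments here are illustrative, not taken from the collapsed example above):

```yaml
_csv: &csv
  type: CSVLocalDataSet
  load_args: &csv_load_args
    sep: ','
    index_col: 0

airplanes:
  <<: *csv
  filepath: data/01_raw/airplanes.csv
  load_args:
    <<: *csv_load_args   # re-insert the default load_args...
    sep: ';'             # ...then override a single key
```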


### Transcoding datasets

You may come across a situation where you would like to read the same file using two different dataset implementations. For instance, `parquet` files can be loaded not only via the `ParquetLocalDataSet` using `pandas`, but also directly by the `SparkDataSet`. To do this, you can define your `catalog.yml` as follows:

```yaml
mydata@pandas:
type: ParquetLocalDataSet
filepath: data/01_raw/data.parquet
mydata@spark:
type: kedro.contrib.io.pyspark.SparkDataSet
filepath: data/01_raw/data.parquet
```

In your pipeline, you may refer to either dataset as an input or an output; Kedro will ensure the dependencies point to the single dataset `mydata`, both when running the pipeline and in the visualisation.
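
Conceptually, names on either side of the `@` resolve to the same underlying dataset — a simplified sketch of the idea (not Kedro's actual implementation):

```python
def underlying_dataset_name(name):
    """Strip a transcoding suffix such as '@pandas' or '@spark'."""
    return name.split("@", 1)[0]

assert underlying_dataset_name("mydata@pandas") == "mydata"
assert underlying_dataset_name("mydata@spark") == "mydata"
assert underlying_dataset_name("plain_data") == "plain_data"
```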


### Transforming datasets

If you need to augment the loading and / or saving of one or more datasets you can use the transformer API. To do this create a subclass of `AbstractTransformer` that implements your changes and then apply it to your catalog with `DataCatalog.add_transformer`. For example to print the runtimes of load and save operations you could do this:
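
A minimal, self-contained sketch of that timing idea (the real base class is Kedro's `AbstractTransformer` and its exact method signatures may differ; the class below is an illustrative stand-in showing only the wrapping pattern):

```python
import time

class ProfileTimeTransformer:
    """Illustrative stand-in that times the load/save callables it wraps."""

    def load(self, data_set_name, load):
        start = time.time()
        data = load()                      # call the wrapped loader
        print(f"Loading {data_set_name} took {time.time() - start:.3f}s")
        return data

    def save(self, data_set_name, save, data):
        start = time.time()
        save(data)                         # call the wrapped saver
        print(f"Saving {data_set_name} took {time.time() - start:.3f}s")


transformer = ProfileTimeTransformer()
cars = transformer.load("cars", lambda: [1, 2, 3])
assert cars == [1, 2, 3]
```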
@@ -231,9 +249,9 @@ In a file like `catalog.py`, you can generate the Data Catalog. This will allow
io = DataCatalog({
'bikes': CSVLocalDataSet(filepath='../data/01_raw/bikes.csv'),
'cars': CSVLocalDataSet(filepath='../data/01_raw/cars.csv', load_args=dict(sep=',')), # additional arguments
'scooters': SQLTableDataSet(table_name="scooters", credentials=dict(con="sqlite:///kedro.db")),
'cars_table': SQLTableDataSet(table_name="cars", credentials=dict(con="sqlite:///kedro.db")),
'scooters_query': SQLQueryDataSet(sql="select * from cars where gear=4", credentials=dict(con="sqlite:///kedro.db")),
'trucks': ParquetLocalDataSet(filepath="trucks.parquet")
'ranked': ParquetLocalDataSet(filepath="ranked.parquet")
})
```

@@ -280,7 +298,7 @@ io.load('car_cache')

#### Saving data to a SQL database for querying

At this point we may want to put the data in a SQLite database to run queries on it. Let's use that to rank cars by their mpg.
At this point we may want to put the data in a SQLite database to run queries on it. Let's use that to rank scooters by their mpg.

```python
# This cleans up the database in case it exists at this point
@@ -291,7 +309,7 @@ except FileNotFoundError:
pass
io.save('cars_table', cars)
ranked = io.load('cars_query')[['brand', 'mpg']]
ranked = io.load('scooters_query')[['brand', 'mpg']]
```

#### Saving data in parquet
2 changes: 1 addition & 1 deletion docs/source/04_user_guide/05_nodes_and_pipelines.md
@@ -1,6 +1,6 @@
# Nodes and pipelines

> *Note:* This documentation is based on `Kedro 0.14.2`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
> *Note:* This documentation is based on `Kedro 0.14.3`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
In this section we introduce pipelines and nodes.

Relevant API documentation:
2 changes: 1 addition & 1 deletion docs/source/04_user_guide/06_logging.md
@@ -1,6 +1,6 @@
# Logging

> *Note:* This documentation is based on `Kedro 0.14.2`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
> *Note:* This documentation is based on `Kedro 0.14.3`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
Kedro uses, and facilitates the use of Python’s `logging` library, by providing a default logging configuration. This can be found in `conf/base/logging.yml` in every project generated using Kedro’s CLI `kedro new` command.

2 changes: 1 addition & 1 deletion docs/source/04_user_guide/07_advanced_io.md
@@ -1,6 +1,6 @@
# Advanced IO

> *Note:* This documentation is based on `Kedro 0.14.2`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
> *Note:* This documentation is based on `Kedro 0.14.3`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
In this tutorial, you will learn about advanced uses of the [Kedro IO](/kedro.io.rst) module and understand the underlying implementation.

2 changes: 1 addition & 1 deletion docs/source/04_user_guide/08_pyspark.md
@@ -1,6 +1,6 @@
# Working with PySpark

> *Note:* This documentation is based on `Kedro 0.14.2`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
> *Note:* This documentation is based on `Kedro 0.14.3`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
In this tutorial we explain how to work with `PySpark` in a Kedro pipeline.
