Commit

Merge branch 'main' into feat/datasets/delay-connection
deepyaman authored Oct 10, 2023
2 parents 2be42aa + 527706d commit 46dcd92
Showing 47 changed files with 253 additions and 195 deletions.
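
Nearly every hunk below follows the same two-part pattern: docstring examples are reformatted with Black (via the newly added `blacken-docs` hook), and bare `::` literal blocks are upgraded to an explicit directive so Sphinx highlights them as console sessions. A generic before/after sketch of the directive change (`SomeDataset` is a placeholder, not a real class):

```rst
.. Before: a bare literal block introduced by "::"
.. After: an explicit pycon code block, e.g.

.. code-block:: pycon

   >>> dataset = SomeDataset(filepath="test.file")
   >>> dataset.save(data)
```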
9 changes: 9 additions & 0 deletions .pre-commit-config.yaml
@@ -15,6 +15,15 @@ repos:
- id: check-merge-conflict # Check for files that contain merge conflict strings.
- id: debug-statements # Check for debugger imports and py37+ `breakpoint()` calls in python source.

- repo: https://github.com/adamchainz/blacken-docs
rev: 1.16.0
hooks:
- id: blacken-docs
args:
- "--rst-literal-blocks"
additional_dependencies:
- black==22.12.0

- repo: local
hooks:
- id: ruff-kedro-datasets
7 changes: 5 additions & 2 deletions kedro-datasets/RELEASE.md
@@ -1,17 +1,20 @@
# Upcoming Release
## Major features and improvements
## Bug fixes and other changes
* Updated `PickleDataset` to explicitly mention `cloudpickle` support.
## Upcoming deprecations for Kedro-Datasets 2.0.0

## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
* [Felix Wittmann](https://github.com/hfwittmann)

# Release 1.7.1
## Bug fixes and other changes
* Pin `tables` version on `kedro-datasets` for Python < 3.8.

## Upcoming deprecations for Kedro-Datasets 2.0.0
* Renamed dataset and error classes, in accordance with the [Kedro lexicon](https://github.com/kedro-org/kedro/wiki/Kedro-documentation-style-guide#kedro-lexicon). Dataset classes ending with "DataSet" are deprecated and will be removed in 2.0.0.

## Community contributions

# Release 1.7.0:
## Major features and improvements
* Added `polars.GenericDataSet`, a `GenericDataSet` backed by [polars](https://www.pola.rs/), a lightning fast dataframe package built entirely using Rust.
14 changes: 7 additions & 7 deletions kedro-datasets/kedro_datasets/api/api_dataset.py
@@ -37,7 +37,8 @@ class APIDataset(AbstractDataset[None, requests.Response]):
Example usage for the
`Python API <https://kedro.readthedocs.io/en/stable/data/\
advanced_data_catalog_usage.html>`_:
::
.. code-block:: pycon
>>> from kedro_datasets.api import APIDataset
>>>
@@ -51,23 +52,22 @@ class APIDataset(AbstractDataset[None, requests.Response]):
... "commodity_desc": "CORN",
... "statisticcat_des": "YIELD",
... "agg_level_desc": "STATE",
... "year": 2000
... "year": 2000,
... }
... },
... credentials=("username", "password")
... credentials=("username", "password"),
... )
>>> data = dataset.load()
``APIDataset`` can also be used to save output on a remote server using HTTP(S)
methods.
::
.. code-block:: pycon
>>> example_table = '{"col1":["val1", "val2"], "col2":["val3", "val4"]}'
>>>
>>> dataset = APIDataset(
... method = "POST",
... url = "url_of_remote_server",
... save_args = {"chunk_size":1}
... method="POST", url="url_of_remote_server", save_args={"chunk_size": 1}
... )
>>> dataset.save(example_table)
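
The converted examples use `pycon` (interactive console) syntax, which the standard-library `doctest` module can execute and check. A minimal, self-contained sketch of that workflow; the snippet inside `example` is a hypothetical stand-in and does not call `kedro_datasets`:

```python
import doctest

# A pycon-style example in the same format as the docstrings above.
# The content is a stand-in; it does not exercise any kedro_datasets API.
example = '''
>>> data = {"col1": [1, 2], "col2": [4, 5]}
>>> sorted(data)
['col1', 'col2']
'''

# Parse the console session and run it, then confirm nothing failed.
parser = doctest.DocTestParser()
test = parser.get_doctest(example, {}, "example", "<example>", 0)
runner = doctest.DocTestRunner(verbose=False)
runner.run(test)
assert runner.failures == 0
```

This is one practical payoff of switching to explicit `pycon` blocks: the examples stay in a machine-checkable format.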
12 changes: 8 additions & 4 deletions kedro-datasets/kedro_datasets/biosequence/biosequence_dataset.py
@@ -18,7 +18,8 @@ class BioSequenceDataset(AbstractDataset[List, List]):
r"""``BioSequenceDataset`` loads and saves data to a sequence file.
Example:
::
.. code-block:: pycon
>>> from kedro_datasets.biosequence import BioSequenceDataset
>>> from io import StringIO
@@ -28,10 +29,13 @@ class BioSequenceDataset(AbstractDataset[List, List]):
>>> raw_data = []
>>> for record in SeqIO.parse(StringIO(data), "fasta"):
... raw_data.append(record)
...
>>>
>>> dataset = BioSequenceDataset(filepath="ls_orchid.fasta",
... load_args={"format": "fasta"},
... save_args={"format": "fasta"})
>>> dataset = BioSequenceDataset(
... filepath="ls_orchid.fasta",
... load_args={"format": "fasta"},
... save_args={"format": "fasta"},
... )
>>> dataset.save(raw_data)
>>> sequence_list = dataset.load()
>>>
14 changes: 7 additions & 7 deletions kedro-datasets/kedro_datasets/dask/parquet_dataset.py
@@ -37,25 +37,25 @@ class ParquetDataset(AbstractDataset[dd.DataFrame, dd.DataFrame]):
Example usage for the
`Python API <https://kedro.readthedocs.io/en/stable/data/\
advanced_data_catalog_usage.html>`_:
::
.. code-block:: pycon
>>> from kedro.extras.datasets.dask import ParquetDataset
>>> import pandas as pd
>>> import dask.dataframe as dd
>>>
>>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
... 'col3': [[5, 6], [7, 8]]})
>>> data = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [[5, 6], [7, 8]]})
>>> ddf = dd.from_pandas(data, npartitions=2)
>>>
>>> dataset = ParquetDataset(
... filepath="s3://bucket_name/path/to/folder",
... credentials={
... 'client_kwargs':{
... 'aws_access_key_id': 'YOUR_KEY',
... 'aws_secret_access_key': 'YOUR SECRET',
... "client_kwargs": {
... "aws_access_key_id": "YOUR_KEY",
... "aws_secret_access_key": "YOUR SECRET",
... }
... },
... save_args={"compression": "GZIP"}
... save_args={"compression": "GZIP"},
... )
>>> dataset.save(ddf)
>>> reloaded = dataset.load()
@@ -176,12 +176,13 @@ class ManagedTableDataset(AbstractVersionedDataset):
.. code-block:: python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructField, StringType,
IntegerType, StructType)
from pyspark.sql.types import StructField, StringType, IntegerType, StructType
from kedro_datasets.databricks import ManagedTableDataset
schema = StructType([StructField("name", StringType(), True),
StructField("age", IntegerType(), True)])
data = [('Alex', 31), ('Bob', 12), ('Clarke', 65), ('Dave', 29)]
schema = StructType(
[StructField("name", StringType(), True), StructField("age", IntegerType(), True)]
)
data = [("Alex", 31), ("Bob", 12), ("Clarke", 65), ("Dave", 29)]
spark_df = SparkSession.builder.getOrCreate().createDataFrame(data, schema)
dataset = ManagedTableDataset(table="names_and_ages")
dataset.save(spark_df)
3 changes: 2 additions & 1 deletion kedro-datasets/kedro_datasets/email/message_dataset.py
@@ -26,7 +26,8 @@ class EmailMessageDataset(AbstractVersionedDataset[Message, Message]):
Note that ``EmailMessageDataset`` doesn't handle sending email messages.
Example:
::
.. code-block:: pycon
>>> from email.message import EmailMessage
>>>
9 changes: 6 additions & 3 deletions kedro-datasets/kedro_datasets/geopandas/geojson_dataset.py
@@ -26,14 +26,17 @@ class GeoJSONDataset(
allowed geopandas (pandas) options for loading and saving GeoJSON files.
Example:
::
.. code-block:: pycon
>>> import geopandas as gpd
>>> from shapely.geometry import Point
>>> from kedro_datasets.geopandas import GeoJSONDataset
>>>
>>> data = gpd.GeoDataFrame({'col1': [1, 2], 'col2': [4, 5],
... 'col3': [5, 6]}, geometry=[Point(1,1), Point(2,4)])
>>> data = gpd.GeoDataFrame(
... {"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]},
... geometry=[Point(1, 1), Point(2, 4)],
... )
>>> dataset = GeoJSONDataset(filepath="test.geojson", save_args=None)
>>> dataset.save(data)
>>> reloaded = dataset.load()
3 changes: 2 additions & 1 deletion kedro-datasets/kedro_datasets/holoviews/holoviews_writer.py
@@ -21,7 +21,8 @@ class HoloviewsWriter(AbstractVersionedDataset[HoloViews, NoReturn]):
filesystem (e.g. local, S3, GCS).
Example:
::
.. code-block:: pycon
>>> import holoviews as hv
>>> from kedro_datasets.holoviews import HoloviewsWriter
5 changes: 3 additions & 2 deletions kedro-datasets/kedro_datasets/json/json_dataset.py
@@ -34,11 +34,12 @@ class JSONDataset(AbstractVersionedDataset[Any, Any]):
Example usage for the
`Python API <https://kedro.readthedocs.io/en/stable/data/\
advanced_data_catalog_usage.html>`_:
::
.. code-block:: pycon
>>> from kedro_datasets.json import JSONDataset
>>>
>>> data = {'col1': [1, 2], 'col2': [4, 5], 'col3': [5, 6]}
>>> data = {"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]}
>>>
>>> dataset = JSONDataset(filepath="test.json")
>>> dataset.save(data)
24 changes: 11 additions & 13 deletions kedro-datasets/kedro_datasets/matplotlib/matplotlib_writer.py
@@ -37,21 +37,21 @@ class MatplotlibWriter(
Example usage for the
`Python API <https://kedro.readthedocs.io/en/stable/data/\
advanced_data_catalog_usage.html>`_:
::
.. code-block:: pycon
>>> import matplotlib.pyplot as plt
>>> from kedro_datasets.matplotlib import MatplotlibWriter
>>>
>>> fig = plt.figure()
>>> plt.plot([1, 2, 3])
>>> plot_writer = MatplotlibWriter(
... filepath="data/08_reporting/output_plot.png"
... )
>>> plot_writer = MatplotlibWriter(filepath="data/08_reporting/output_plot.png")
>>> plt.close()
>>> plot_writer.save(fig)
Example saving a plot as a PDF file:
::
.. code-block:: pycon
>>> import matplotlib.pyplot as plt
>>> from kedro_datasets.matplotlib import MatplotlibWriter
@@ -66,7 +66,8 @@ class MatplotlibWriter(
>>> pdf_plot_writer.save(fig)
Example saving multiple plots in a folder, using a dictionary:
::
.. code-block:: pycon
>>> import matplotlib.pyplot as plt
>>> from kedro_datasets.matplotlib import MatplotlibWriter
@@ -77,13 +78,12 @@ class MatplotlibWriter(
... plt.plot([1, 2, 3], color=colour)
...
>>> plt.close("all")
>>> dict_plot_writer = MatplotlibWriter(
... filepath="data/08_reporting/plots"
... )
>>> dict_plot_writer = MatplotlibWriter(filepath="data/08_reporting/plots")
>>> dict_plot_writer.save(plots_dict)
Example saving multiple plots in a folder, using a list:
::
.. code-block:: pycon
>>> import matplotlib.pyplot as plt
>>> from kedro_datasets.matplotlib import MatplotlibWriter
@@ -94,9 +94,7 @@ class MatplotlibWriter(
... plt.plot([i, i + 1, i + 2])
...
>>> plt.close("all")
>>> list_plot_writer = MatplotlibWriter(
... filepath="data/08_reporting/plots"
... )
>>> list_plot_writer = MatplotlibWriter(filepath="data/08_reporting/plots")
>>> list_plot_writer.save(plots_list)
"""
3 changes: 2 additions & 1 deletion kedro-datasets/kedro_datasets/networkx/gml_dataset.py
@@ -22,7 +22,8 @@ class GMLDataset(AbstractVersionedDataset[networkx.Graph, networkx.Graph]):
See https://networkx.org/documentation/stable/tutorial.html for details.
Example:
::
.. code-block:: pycon
>>> from kedro_datasets.networkx import GMLDataset
>>> import networkx as nx
3 changes: 2 additions & 1 deletion kedro-datasets/kedro_datasets/networkx/graphml_dataset.py
@@ -21,7 +21,8 @@ class GraphMLDataset(AbstractVersionedDataset[networkx.Graph, networkx.Graph]):
See https://networkx.org/documentation/stable/tutorial.html for details.
Example:
::
.. code-block:: pycon
>>> from kedro_datasets.networkx import GraphMLDataset
>>> import networkx as nx
3 changes: 2 additions & 1 deletion kedro-datasets/kedro_datasets/networkx/json_dataset.py
@@ -22,7 +22,8 @@ class JSONDataset(AbstractVersionedDataset[networkx.Graph, networkx.Graph]):
See https://networkx.org/documentation/stable/tutorial.html for details.
Example:
::
.. code-block:: pycon
>>> from kedro_datasets.networkx import JSONDataset
>>> import networkx as nx
6 changes: 3 additions & 3 deletions kedro-datasets/kedro_datasets/pandas/csv_dataset.py
@@ -52,13 +52,13 @@ class CSVDataset(AbstractVersionedDataset[pd.DataFrame, pd.DataFrame]):
Example usage for the
`Python API <https://kedro.readthedocs.io/en/stable/data/\
advanced_data_catalog_usage.html>`_:
::
.. code-block:: pycon
>>> from kedro_datasets.pandas import CSVDataset
>>> import pandas as pd
>>>
>>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
... 'col3': [5, 6]})
>>> data = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
>>>
>>> dataset = CSVDataset(filepath="test.csv")
>>> dataset.save(data)
7 changes: 4 additions & 3 deletions kedro-datasets/kedro_datasets/pandas/deltatable_dataset.py
@@ -61,19 +61,20 @@ class DeltaTableDataset(AbstractDataset):
Example usage for the
`Python API <https://kedro.readthedocs.io/en/stable/data/\
advanced_data_catalog_usage.html>`_:
::
.. code-block:: pycon
>>> from kedro_datasets.pandas import DeltaTableDataset
>>> import pandas as pd
>>>
>>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5], 'col3': [5, 6]})
>>> data = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
>>> dataset = DeltaTableDataset(filepath="test")
>>>
>>> dataset.save(data)
>>> reloaded = dataset.load()
>>> assert data.equals(reloaded)
>>>
>>> new_data = pd.DataFrame({'col1': [7, 8], 'col2': [9, 10], 'col3': [11, 12]})
>>> new_data = pd.DataFrame({"col1": [7, 8], "col2": [9, 10], "col3": [11, 12]})
>>> dataset.save(new_data)
>>> dataset.get_loaded_version()
14 changes: 7 additions & 7 deletions kedro-datasets/kedro_datasets/pandas/excel_dataset.py
@@ -56,13 +56,13 @@ class ExcelDataset(
Example usage for the
`Python API <https://kedro.readthedocs.io/en/stable/data/\
advanced_data_catalog_usage.html>`_:
::
.. code-block:: pycon
>>> from kedro_datasets.pandas import ExcelDataset
>>> import pandas as pd
>>>
>>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
... 'col3': [5, 6]})
>>> data = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
>>>
>>> dataset = ExcelDataset(filepath="test.xlsx")
>>> dataset.save(data)
@@ -90,16 +90,16 @@ class ExcelDataset(
`Python API <https://kedro.readthedocs.io/en/stable/data/\
advanced_data_catalog_usage.html>`_
for a multi-sheet Excel file:
::
.. code-block:: pycon
>>> from kedro_datasets.pandas import ExcelDataset
>>> import pandas as pd
>>>
>>> dataframe = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
... 'col3': [5, 6]})
>>> dataframe = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
>>> another_dataframe = pd.DataFrame({"x": [10, 20], "y": ["hello", "world"]})
>>> multiframe = {"Sheet1": dataframe, "Sheet2": another_dataframe}
>>> dataset = ExcelDataset(filepath="test.xlsx", load_args = {"sheet_name": None})
>>> dataset = ExcelDataset(filepath="test.xlsx", load_args={"sheet_name": None})
>>> dataset.save(multiframe)
>>> reloaded = dataset.load()
>>> assert multiframe["Sheet1"].equals(reloaded["Sheet1"])