feat(datasets): Improved compatibility, functionality and testing for SnowflakeTableDataset #881

Merged: 97 commits merged into kedro-org:main from feature/save-pd-to-snowflaketable on Oct 28, 2024

Conversation

@tdhooghe (Contributor) commented Oct 10, 2024

Description

  • Enable saving a Pandas DataFrame directly into a Snowflake table, i.e. ingest a .csv into Snowflake from within Kedro (see the usage sketch after this list)
  • Update tests to use Snowpark's local testing framework
  • Update dependencies
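
For illustration, a hedged usage sketch of the new save path, assuming the dataset class is the Snowpark table dataset shipped in kedro_datasets.snowflake; the import path, class name and constructor arguments below are assumptions rather than code taken from this PR:

# Hedged usage sketch: load a .csv with pandas and save it straight into a
# Snowflake table through the dataset. Import path, class name and constructor
# arguments are assumptions, not taken from this PR.
import pandas as pd
from kedro_datasets.snowflake import SnowparkTableDataset  # assumed import path

dataset = SnowparkTableDataset(
    table_name="WEATHER",          # assumed parameter name
    credentials={                  # assumed credential keys
        "account": "my_account",
        "user": "my_user",
        "password": "my_password",
        "database": "MY_DB",
        "schema": "PUBLIC",
        "warehouse": "MY_WH",
    },
)

df = pd.read_csv("weather.csv")    # a plain pandas DataFrame
dataset.save(df)                   # converted to a Snowpark DataFrame on save

Under the hood, a pandas DataFrame passed to save() is converted to a Snowpark DataFrame via the session, as discussed in the review below.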

Development notes

Skips pytest for Python versions >3.11.

A bump of cloudpickle is required to allow snowflake-snowpark-python >= 1.23.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@tdhooghe changed the title from "(feature)Datasets: save pandas df directly to SnowflakeTable" to "feat(datasets): save pandas df directly to SnowflakeTable" on Oct 10, 2024
@tdhooghe tdhooghe force-pushed the feature/save-pd-to-snowflaketable branch 2 times, most recently from 0e281c2 to c9c1872 Compare October 10, 2024 19:47
@tdhooghe tdhooghe force-pushed the feature/save-pd-to-snowflaketable branch from dc88a84 to 0a14bb7 Compare October 21, 2024 14:39
tdhooghe and others added 23 commits October 21, 2024 17:22
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
* feat(datasets): create separate `ibis.FileDataset`

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): deprecate `TableDataset` file I/O

Signed-off-by: Deepyaman Datta <[email protected]>

* feat(datasets): implement `FileDataset` versioning

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): try `os.path.exists`, for Windows

Signed-off-by: Deepyaman Datta <[email protected]>

* revert(datasets): use pathlib, ignore Windows test

Refs: b7ff0c7

Signed-off-by: Deepyaman Datta <[email protected]>

* docs(datasets): add `ibis.FileDataset` to contents

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): add docstring for `hashable` func

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): add docstring for `hashable` func

Signed-off-by: Deepyaman Datta <[email protected]>

* feat(datasets)!: expose `load` and `save` publicly

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): remove second filepath assignment

Signed-off-by: Deepyaman Datta <[email protected]>

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Update error code in e2e test

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
…kedro-org#891)

* Update PR template with checkbox for core dataset contribution

Signed-off-by: Merel Theisen <[email protected]>

* Update .github/PULL_REQUEST_TEMPLATE.md

Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>

* Fix lint

Signed-off-by: Merel Theisen <[email protected]>

---------

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
* fix(datasets): default to DuckDB in in-memory mode

Signed-off-by: Deepyaman Datta <[email protected]>

* test(datasets): use `object()` sentinel as default

Signed-off-by: Deepyaman Datta <[email protected]>

* docs(datasets): add default database to RELEASE.md

Signed-off-by: Deepyaman Datta <[email protected]>

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
…ro-org#896)

* Add GH action to check for TSC votes on core dataset changes
* Ignore TSC vote action in gatekeeper
* Trigger TSC vote action only on changes in core dataset

---------

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
@merelcht merelcht requested a review from DimedS October 23, 2024 15:13
@merelcht (Member) left a comment

Left some minor comments, but otherwise this looks great! ⭐ Thank you so much @tdhooghe for getting this dataset in better shape and adding tests that can now run as part of the test suite for Python versions < 3.12!

kedro-datasets/tests/snowflake/README.md (outdated comment, resolved)
kedro-datasets/tests/snowflake/test_snowpark_dataset.py (outdated comment, resolved)
Comment on lines 7 to 11
if sys.version_info >= (3, 12):
    pytest.mark.xfail(
        "Snowpark is not supported in Python versions higher than 3.11",
        allow_module_level=True,
    )
Member

I did some more testing, and it looks like you can replace this with snowpark = pytest.importorskip("snowpark") and it will pass the builds.
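
For reference, a minimal sketch of that pattern as it might appear at the top of the test module; the module name passed to importorskip is an assumption (the test module imports from snowflake.snowpark, per the traceback quoted later in this thread):

# Hedged sketch: module-level import guard. If snowflake-snowpark-python is
# not installed (e.g. on Python >= 3.12), pytest skips this whole module.
import pytest

snowpark = pytest.importorskip("snowflake.snowpark")  # assumed module name
Session = snowpark.Session  # use the imported module as usual afterwards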

Contributor Author (@tdhooghe)

No way, that's awesome! Let me test :)

@tdhooghe (Contributor, Author), Oct 24, 2024

Doesn't help, unfortunately, as it still includes the tests in the coverage.

Member

Yeah, unfortunately coverage still needs to be skipped, but now at least the tests themselves pass. Before, the builds were showing:

__________ ERROR collecting tests/snowflake/test_snowpark_dataset.py ___________
ImportError while importing test module '/home/runner/work/kedro-plugins/kedro-plugins/kedro-datasets/tests/snowflake/test_snowpark_dataset.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/snowflake/test_snowpark_dataset.py:15: in <module>
    from snowflake.snowpark import DataFrame, Session
E   ModuleNotFoundError: No module named 'snowflake'
=========================== short test summary info ============================
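
As an aside, pytest can also drop a whole directory at collection time from a conftest.py. A minimal sketch of that approach, offered only as an assumption for illustration (it is not what this PR does, and it does not by itself change what the coverage report measures):

# Hypothetical conftest.py snippet: skip collecting the Snowflake tests
# entirely on Python >= 3.12, where snowflake-snowpark-python is unavailable.
import sys

collect_ignore_glob = ["snowflake/*"] if sys.version_info >= (3, 12) else []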

tdhooghe and others added 3 commits October 24, 2024 14:20
* Bump matplotlib test dependency

Signed-off-by: Merel Theisen <[email protected]>

* Fix matplotlib test

Signed-off-by: Merel Theisen <[email protected]>

---------

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
@tdhooghe tdhooghe force-pushed the feature/save-pd-to-snowflaketable branch from 3c3dcee to 217e60b Compare October 24, 2024 12:20
@DimedS (Contributor) left a comment

Many thanks for the PR, @tdhooghe! I really appreciate your proposal to allow saving Pandas DataFrames in the _save() method - it's a very useful feature. I implemented a similar approach in a custom version of the Snowpark dataset for an ETL project (https://github.com/DimedS/kedro-pypi-to-snowflake/tree/main), so I’m glad to see it being incorporated into the official dataset now! I left one small proposal.

            data (pd.DataFrame | DataFrame): The data to save.
        """
        if isinstance(data, pd.DataFrame):
            data = self._session.create_dataframe(data)
Contributor (@DimedS)

Maybe we should add an additional check and raise an error here, like this:

# Check if the input is a Pandas DataFrame and convert it to Snowpark DataFrame
if isinstance(data, pd.DataFrame):
    # Convert the Pandas DataFrame to a Snowpark DataFrame
    snowpark_df = self._session.create_dataframe(data)
elif isinstance(data, DataFrame):
    # If it's already a Snowpark DataFrame, use it as is
    snowpark_df = data
else:
    raise DatasetError(f"Data of type {type(data)} is not supported for saving.")

This ensures we handle different types of DataFrames appropriately, as we're currently allowing not only Snowpark DataFrames but other DataFrame types as well.
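
For context, a hedged sketch of how this check could sit inside the dataset's _save() method; the class wrapper, the self._session and self._table_name attribute names, the DatasetError import location, and the write path via DataFrameWriter.save_as_table are all assumptions for illustration, not code from this PR:

# Hedged sketch (not the PR's exact code): how the type check could sit inside
# the dataset's _save(). `self._session` and `self._table_name` are assumed
# attribute names; the write path via DataFrameWriter.save_as_table is assumed.
from __future__ import annotations

import pandas as pd
from kedro.io.core import DatasetError  # assumed import location
from snowflake.snowpark import DataFrame


class SnowparkTableDatasetSketch:  # illustrative stand-in, not the real class
    def __init__(self, session, table_name: str):
        self._session = session
        self._table_name = table_name

    def _save(self, data: pd.DataFrame | DataFrame) -> None:
        if isinstance(data, pd.DataFrame):
            # Convert pandas input to a Snowpark DataFrame first.
            data = self._session.create_dataframe(data)
        elif not isinstance(data, DataFrame):
            raise DatasetError(
                f"Data of type {type(data)} is not supported for saving."
            )
        # Assumed write path: Snowpark's DataFrameWriter.save_as_table.
        data.write.mode("overwrite").save_as_table(self._table_name)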

Contributor Author (@tdhooghe)

Thank you for your kind words @DimedS! 🙂

I like your proposal, I will include it!

@merelcht (Member) left a comment

Left some final comments, but otherwise let's get this merged!

kedro-datasets/pyproject.toml (outdated comment, resolved)
kedro-datasets/tests/snowflake/README.md (outdated comment, resolved)
@merelcht merelcht enabled auto-merge (squash) October 28, 2024 15:45
@merelcht merelcht merged commit 59dcf50 into kedro-org:main Oct 28, 2024
12 of 13 checks passed