feat(datasets): Improved compatibility, functionality and testing for SnowflakeTableDataset #881

Merged: 97 commits merged into kedro-org:main from feature/save-pd-to-snowflaketable on Oct 28, 2024

Conversation

@tdhooghe (Contributor) commented Oct 10, 2024

Description

  • Enable saving a Pandas DataFrame directly into a Snowflake table, i.e. ingest a .csv into Snowflake from within Kedro (see the usage sketch after this list)
  • Update tests to use Snowpark's local testing framework
  • Update dependencies
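
For illustration, a hedged usage sketch of the new save path, assuming the dataset class is the Snowpark table dataset shipped in kedro_datasets.snowflake; the import path, class name and constructor arguments below are assumptions rather than code taken from this PR:

# Hedged usage sketch: load a .csv with pandas and save it straight into a
# Snowflake table through the dataset. Import path, class name and constructor
# arguments are assumptions, not taken from this PR.
import pandas as pd
from kedro_datasets.snowflake import SnowparkTableDataset  # assumed import path

dataset = SnowparkTableDataset(
    table_name="WEATHER",          # assumed parameter name
    credentials={                  # assumed credential keys
        "account": "my_account",
        "user": "my_user",
        "password": "my_password",
        "database": "MY_DB",
        "schema": "PUBLIC",
        "warehouse": "MY_WH",
    },
)

df = pd.read_csv("weather.csv")    # a plain pandas DataFrame
dataset.save(df)                   # converted to a Snowpark DataFrame on save

Under the hood, a pandas DataFrame passed to save() is converted to a Snowpark DataFrame via the session, as discussed in the review below.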

Development notes

Skips pytest for Python versions >3.11.

A bump of cloudpickle is required to allow snowflake-snowpark-python >= 1.23.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@tdhooghe changed the title from "(feature)Datasets: save pandas df directly to SnowflakeTable" to "feat(datasets): save pandas df directly to SnowflakeTable" on Oct 10, 2024
@tdhooghe tdhooghe force-pushed the feature/save-pd-to-snowflaketable branch 2 times, most recently from 0e281c2 to c9c1872 Compare October 10, 2024 19:47
@tdhooghe tdhooghe force-pushed the feature/save-pd-to-snowflaketable branch from dc88a84 to 0a14bb7 Compare October 21, 2024 14:39
tdhooghe and others added 23 commits October 21, 2024 17:22
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
* feat(datasets): create separate `ibis.FileDataset`

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): deprecate `TableDataset` file I/O

Signed-off-by: Deepyaman Datta <[email protected]>

* feat(datasets): implement `FileDataset` versioning

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): try `os.path.exists`, for Windows

Signed-off-by: Deepyaman Datta <[email protected]>

* revert(datasets): use pathlib, ignore Windows test

Refs: b7ff0c7

Signed-off-by: Deepyaman Datta <[email protected]>

* docs(datasets): add `ibis.FileDataset` to contents

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): add docstring for `hashable` func

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): add docstring for `hashable` func

Signed-off-by: Deepyaman Datta <[email protected]>

* feat(datasets)!: expose `load` and `save` publicly

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): remove second filepath assignment

Signed-off-by: Deepyaman Datta <[email protected]>

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Update error code in e2e test

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
…kedro-org#891)

* Update PR template with checkbox for core dataset contribution

Signed-off-by: Merel Theisen <[email protected]>

* Update .github/PULL_REQUEST_TEMPLATE.md

Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>

* Fix lint

Signed-off-by: Merel Theisen <[email protected]>

---------

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
* fix(datasets): default to DuckDB in in-memory mode

Signed-off-by: Deepyaman Datta <[email protected]>

* test(datasets): use `object()` sentinel as default

Signed-off-by: Deepyaman Datta <[email protected]>

* docs(datasets): add default database to RELEASE.md

Signed-off-by: Deepyaman Datta <[email protected]>

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
…ro-org#896)

* Add GH action to check for TSC votes on core dataset changes
* Ignore TSC vote action in gatekeeper
* Trigger TSC vote action only on changes in core dataset

---------

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
@merelcht merelcht requested a review from DimedS October 23, 2024 15:13
@merelcht (Member) left a comment

Left some minor comments, but otherwise this looks great! ⭐ Thank you so much @tdhooghe for getting this dataset in better shape and adding tests that can now run as part of the test suite for Python versions < 3.12!

kedro-datasets/tests/snowflake/README.md (outdated comment, resolved)
kedro-datasets/tests/snowflake/test_snowpark_dataset.py (outdated comment, resolved)
Comment on lines 7 to 11
if sys.version_info >= (3, 12):
    pytest.mark.xfail(
        "Snowpark is not supported in Python versions higher than 3.11",
        allow_module_level=True,
    )
Member

I did some more testing, and it looks like you can replace this with snowpark = pytest.importorskip("snowpark") and it will pass the builds.
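
For reference, a minimal sketch of that pattern as it might appear at the top of the test module; the module name passed to importorskip is an assumption (the test module imports from snowflake.snowpark, per the traceback quoted later in this thread):

# Hedged sketch: module-level import guard. If snowflake-snowpark-python is
# not installed (e.g. on Python >= 3.12), pytest skips this whole module.
import pytest

snowpark = pytest.importorskip("snowflake.snowpark")  # assumed module name
Session = snowpark.Session  # use the imported module as usual afterwards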

Contributor Author (@tdhooghe)

No way, that's awesome! Let me test :)

@tdhooghe (Contributor, Author), Oct 24, 2024

Doesn't help, unfortunately, as it still includes the tests in the coverage.

Member

Yeah, unfortunately coverage still needs to be skipped, but now at least the tests themselves pass. Before, the builds were showing:

__________ ERROR collecting tests/snowflake/test_snowpark_dataset.py ___________
ImportError while importing test module '/home/runner/work/kedro-plugins/kedro-plugins/kedro-datasets/tests/snowflake/test_snowpark_dataset.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/snowflake/test_snowpark_dataset.py:15: in <module>
    from snowflake.snowpark import DataFrame, Session
E   ModuleNotFoundError: No module named 'snowflake'
=========================== short test summary info ============================
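
As an aside, pytest can also drop a whole directory at collection time from a conftest.py. A minimal sketch of that approach, offered only as an assumption for illustration (it is not what this PR does, and it does not by itself change what the coverage report measures):

# Hypothetical conftest.py snippet: skip collecting the Snowflake tests
# entirely on Python >= 3.12, where snowflake-snowpark-python is unavailable.
import sys

collect_ignore_glob = ["snowflake/*"] if sys.version_info >= (3, 12) else []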

tdhooghe and others added 3 commits October 24, 2024 14:20
* Bump matplotlib test dependency

Signed-off-by: Merel Theisen <[email protected]>

* Fix matplotlib test

Signed-off-by: Merel Theisen <[email protected]>

---------

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
@tdhooghe tdhooghe force-pushed the feature/save-pd-to-snowflaketable branch from 3c3dcee to 217e60b Compare October 24, 2024 12:20
@DimedS (Contributor) left a comment

Many thanks for the PR, @tdhooghe! I really appreciate your proposal to allow saving Pandas DataFrames in the _save() method - it's a very useful feature. I implemented a similar approach in a custom version of the Snowpark dataset for an ETL project (https://github.com/DimedS/kedro-pypi-to-snowflake/tree/main), so I’m glad to see it being incorporated into the official dataset now! I left one small proposal.

            data (pd.DataFrame | DataFrame): The data to save.
        """
        if isinstance(data, pd.DataFrame):
            data = self._session.create_dataframe(data)
Contributor (@DimedS)

Maybe we should add an additional check and raise an error here, like this:

# Check if the input is a Pandas DataFrame and convert it to Snowpark DataFrame
if isinstance(data, pd.DataFrame):
    # Convert the Pandas DataFrame to a Snowpark DataFrame
    snowpark_df = self._session.create_dataframe(data)
elif isinstance(data, DataFrame):
    # If it's already a Snowpark DataFrame, use it as is
    snowpark_df = data
else:
    raise DatasetError(f"Data of type {type(data)} is not supported for saving.")

This ensures we handle different types of DataFrames appropriately, as we're currently allowing not only Snowpark DataFrames but other DataFrame types as well.
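
For context, a hedged sketch of how this check could sit inside the dataset's _save() method; the class wrapper, the self._session and self._table_name attribute names, the DatasetError import location, and the write path via DataFrameWriter.save_as_table are all assumptions for illustration, not code from this PR:

# Hedged sketch (not the PR's exact code): how the type check could sit inside
# the dataset's _save(). `self._session` and `self._table_name` are assumed
# attribute names; the write path via DataFrameWriter.save_as_table is assumed.
from __future__ import annotations

import pandas as pd
from kedro.io.core import DatasetError  # assumed import location
from snowflake.snowpark import DataFrame


class SnowparkTableDatasetSketch:  # illustrative stand-in, not the real class
    def __init__(self, session, table_name: str):
        self._session = session
        self._table_name = table_name

    def _save(self, data: pd.DataFrame | DataFrame) -> None:
        if isinstance(data, pd.DataFrame):
            # Convert pandas input to a Snowpark DataFrame first.
            data = self._session.create_dataframe(data)
        elif not isinstance(data, DataFrame):
            raise DatasetError(
                f"Data of type {type(data)} is not supported for saving."
            )
        # Assumed write path: Snowpark's DataFrameWriter.save_as_table.
        data.write.mode("overwrite").save_as_table(self._table_name)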

Contributor Author (@tdhooghe)

Thank you for your kind words @DimedS! 🙂

I like your proposal, I will include it!

@merelcht (Member) left a comment

Left some final comments, but otherwise let's get this merged!

kedro-datasets/pyproject.toml (outdated comment, resolved)
kedro-datasets/tests/snowflake/README.md (outdated comment, resolved)
@merelcht merelcht enabled auto-merge (squash) October 28, 2024 15:45
@merelcht merelcht merged commit 59dcf50 into kedro-org:main Oct 28, 2024
12 of 13 checks passed