feat(datasets): Improved compatibility, functionality and testing for SnowflakeTableDataset #881
Conversation
Signed-off-by: tdhooghe <[email protected]>
* feat(datasets): create separate `ibis.FileDataset`
* chore(datasets): deprecate `TableDataset` file I/O
* feat(datasets): implement `FileDataset` versioning
* chore(datasets): try `os.path.exists`, for Windows
* revert(datasets): use pathlib, ignore Windows test (Refs: b7ff0c7)
* docs(datasets): add `ibis.FileDataset` to contents
* chore(datasets): add docstring for `hashable` func
* feat(datasets)!: expose `load` and `save` publicly
* chore(datasets): remove second filepath assignment

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Update error code in e2e test Signed-off-by: Ankita Katiyar <[email protected]> Signed-off-by: tdhooghe <[email protected]>
…kedro-org#891)
* Update PR template with checkbox for core dataset contribution
* Update .github/PULL_REQUEST_TEMPLATE.md
* Fix lint

Signed-off-by: Merel Theisen <[email protected]>
Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
* fix(datasets): default to DuckDB in in-memory mode
* test(datasets): use `object()` sentinel as default
* docs(datasets): add default database to RELEASE.md

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: tdhooghe <[email protected]>
…ro-org#896)
* Add GH action to check for TSC votes on core dataset changes
* Ignore TSC vote action in gatekeeper
* Trigger TSC vote action only on changes in core dataset

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Signed-off-by: Thomas <[email protected]> Signed-off-by: tdhooghe <[email protected]>
…cd fails on coverage Signed-off-by: tdhooghe <[email protected]>
Left some minor comments, but otherwise this looks great! ⭐ Thank you so much @tdhooghe for getting this dataset into better shape and adding tests that can now run as part of the test suite for Python versions < 3.12!
if sys.version_info >= (3, 12):
    pytest.mark.xfail(
        "Snowpark is not supported in Python versions higher than 3.11",
        allow_module_level=True,
    )
I did some more testing, and it looks like you can replace this with snowpark = pytest.importorskip("snowpark")
and it will pass the builds.
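A minimal sketch of the suggested pattern. Note one assumption: since the test file imports `from snowflake.snowpark import DataFrame, Session`, the argument to `importorskip` may need to be "snowflake.snowpark" rather than "snowpark". The example below demonstrates the mechanism with a stdlib module so it runs anywhere:

```python
import pytest

# pytest.importorskip imports the named module and returns it; if the import
# fails, it raises pytest's Skipped outcome, so the whole test module is
# skipped at collection time instead of erroring with ModuleNotFoundError.
# In the PR this would likely be pytest.importorskip("snowflake.snowpark").
mod = pytest.importorskip("json")  # stdlib stand-in; always importable
print(mod.dumps({"collected": True}))
```

This replaces the version check entirely: on Python >= 3.12, where Snowpark cannot be installed, the import fails and the module is skipped with no further guard needed.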
No way, that's awesome! Let me test :)
Yeah, unfortunately coverage still needs to be skipped, but now at least the tests themselves pass. Before, the builds were showing:
__________ ERROR collecting tests/snowflake/test_snowpark_dataset.py ___________
ImportError while importing test module '/home/runner/work/kedro-plugins/kedro-plugins/kedro-datasets/tests/snowflake/test_snowpark_dataset.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/importlib/__init__.py:90: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/snowflake/test_snowpark_dataset.py:15: in <module>
from snowflake.snowpark import DataFrame, Session
E ModuleNotFoundError: No module named 'snowflake'
=========================== short test summary info ============================
* Bump matplotlib test dependency
* Fix matplotlib test

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: tdhooghe <[email protected]>
Many thanks for the PR, @tdhooghe! I really appreciate your proposal to allow saving pandas DataFrames in the _save() method - it's a very useful feature. I implemented a similar approach in a custom version of the Snowpark dataset for an ETL project (https://github.com/DimedS/kedro-pypi-to-snowflake/tree/main), so I'm glad to see it being incorporated into the official dataset now! I left one small proposal.
        data (pd.DataFrame | DataFrame): The data to save.
    """
    if isinstance(data, pd.DataFrame):
        data = self._session.create_dataframe(data)
Maybe we should add an additional check and raise an error here, like this:
# Check if the input is a Pandas DataFrame and convert it to Snowpark DataFrame
if isinstance(data, pd.DataFrame):
# Convert the Pandas DataFrame to a Snowpark DataFrame
snowpark_df = self._session.create_dataframe(data)
elif isinstance(data, DataFrame):
# If it's already a Snowpark DataFrame, use it as is
snowpark_df = data
else:
raise DatasetError(f"Data of type {type(data)} is not supported for saving.")
This ensures we handle unsupported types explicitly: the current code converts pandas DataFrames but silently accepts any other type as well, not only Snowpark DataFrames.
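The proposed dispatch can be exercised in isolation with a small self-contained sketch. The session and frame classes below are hypothetical stand-ins for the Snowpark API, not the real `snowflake.snowpark` types, and `DatasetError` is a local stand-in for kedro's exception:

```python
import pandas as pd

class DatasetError(Exception):
    """Stand-in for kedro's DatasetError (keeps the sketch self-contained)."""

class FakeSnowparkDataFrame:
    """Hypothetical stand-in for snowflake.snowpark.DataFrame."""
    def __init__(self, pandas_df: pd.DataFrame):
        self.pandas_df = pandas_df

class FakeSession:
    """Hypothetical stand-in for snowflake.snowpark.Session."""
    def create_dataframe(self, df: pd.DataFrame) -> FakeSnowparkDataFrame:
        return FakeSnowparkDataFrame(df)

def coerce_for_save(session: FakeSession, data) -> FakeSnowparkDataFrame:
    # The proposed check: convert pandas frames, pass Snowpark frames
    # through unchanged, and reject everything else with a clear error.
    if isinstance(data, pd.DataFrame):
        return session.create_dataframe(data)
    if isinstance(data, FakeSnowparkDataFrame):
        return data
    raise DatasetError(f"Data of type {type(data)} is not supported for saving.")
```

With the real types, the two `isinstance` branches would check `pd.DataFrame` and `snowflake.snowpark.DataFrame` respectively, exactly as in the suggestion above.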
Thank you for your kind words @DimedS! 🙂
I like your proposal; I'll include it!
…frame Signed-off-by: tdhooghe <[email protected]>
…cd fails on coverage Signed-off-by: tdhooghe <[email protected]>
Left some final comments, but otherwise let's get this merged!
…m/tdhooghe/kedro-plugins into feature/save-pd-to-snowflaketable
Signed-off-by: Merel Theisen <[email protected]>
Description
Development notes
Skips pytest tests for Python versions > 3.11, since Snowpark does not support them.
Bumps cloudpickle, required to allow snowflake-snowpark-python >= 1.23.
Checklist
RELEASE.md file