-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support interchange protocol #3340
Support interchange protocol #3340
Conversation
Yes, definitely still open to this, but I don't think that adding it to every interface function is the right approach. Most functions call down to one (of a couple, with gradual standardization) internal functions for parsing the data specification.
I'm not sure I see the argument for gating it by pandas version. If someone passes a DataFrame that pandas 2.0.0 can handle, why not let it? (Also I thought that the exchange protocol was introduced in 1.5?) |
@@ -889,3 +891,13 @@ def _disable_autolayout(): | |||
def _version_predates(lib: ModuleType, version: str) -> bool: | |||
"""Helper function for checking version compatibility.""" | |||
return Version(lib.__version__) < Version(version) | |||
|
|||
|
|||
def try_convert_to_pandas(data: object | None) -> pd.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why object | None
and not Any
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just because Any
turns off the type checker so I tend to avoid using it unless I have to, whereas as object
prevents me from making assumptions about what properties the variable might have
def try_convert_to_pandas(data: object | None) -> pd.DataFrame: | ||
if data is None: | ||
return None | ||
elif isinstance(data, pd.DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this guard important? will passing a pandas.DataFrame
to pd.api.interchange.from_dataframe
be costly? I'd expect it to be a no-op without looking closer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right, thanks, it is in-fact a no-op
However, seaborn still supports versions of pandas older than those which support the interchange protocol, so I introduced this to keep it a no-op in such cases
Also is there an analogous solution for |
Looks like the Lines 49 to 50 in 54c36b7
For example, the following just works |
|
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #3340 +/- ##
==========================================
- Coverage 98.24% 98.23% -0.01%
==========================================
Files 77 77
Lines 24791 24824 +33
==========================================
+ Hits 24357 24387 +30
- Misses 434 437 +3
|
I tried amending I've opened a PR to my own fork demonstrating this - it's easiest to see the diff, and the outputs, with reviewnb: https://app.reviewnb.com/MarcoGorelli/seaborn/pull/2/ I had to use the latest pandas nightly (in order to get some recent fixes which will be in version 2.0.2), which I installed with:
To execute all the notebooks, I just did
Testing-wise, what would you like? OK to just "trust" the interchange protocol, as is done here, or would you expect another CI job re-running all the tests with a non-pandas dataframe library? |
@@ -922,7 +924,7 @@ def _assign_variables_longform(self, data=None, **kwargs): | |||
val in data | |||
or (isinstance(val, (str, bytes)) and val in index) | |||
) | |||
except (KeyError, TypeError): | |||
except (KeyError, TypeError, ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if val
is a non-pandas Series, then checking val in data
will throw ValueError
tests/conftest.py
Outdated
import numpy as np | ||
import pandas as pd | ||
|
||
import pytest | ||
|
||
|
||
def maybe_convert_to_polars(df): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updating several fixtures to return polars.DataFrame
instead of pandas.DataFrame
I've added a CI job to test this, in which the fixtures are amended to return a polars DataFrame instead of a pandas one (where possible) There are currently some CI failures in that job, which also appear when using the latest pandas nightly - might have to wait for pandas 2.0.2 to come out then, to use that in this CI workflow. Fortunately, that's expected to be quite soon, on Monday EDIT: turns out the failures were from numpy nightly, not pandas nightly. For now I've downgraded to 2.0.1, which doesn't pull the numpy nightly EDIT2: I've added a test which fails with pandas 2.0.1, but will pass with 2.0.2. Will update on Monday, just checking CI actually passes like this |
Hi @MarcoGorelli this is a very helpful proof of concept but just a heads up that I don't think I want to maintain this sort of test-everything-on-polars approach on an ongoing basis, so no need to kill yourself tying up every edge case in the way that tests are written. |
😄 sure, I'll see if there's a simpler way round this, thanks |
closing in favour of #3369 |
In #3277 (comment), it was mentioned
so I figured I'd open a PR to drive the conversation forwards
Just wanted to check:
2.0.2
(probably out mid-May) as the minimum pandas version to try interchanging from?Trying this out with the latest pandas nightly, the
docs/tutorial
notebooks all seem to "just work" with non-pandas DataFrames - here's an example: