Support interchange protocol #3340

Closed

MarcoGorelli
Contributor

@MarcoGorelli MarcoGorelli commented Apr 25, 2023

In #3277 (comment), it was mentioned

That [api.interchange.from_dataframe] is indeed on the roadmap for seaborn v0.13 now that it's landed in pandas stable

so I figured I'd open a PR to drive the conversation forwards

Just wanted to check:

  • would you still be open to this?
  • given that pandas 2.0.0 and 2.0.1 have some small bugs in the interchange protocol (for categorical columns, and those with missing values - the rest should already work) would it be OK to set 2.0.2 (probably out mid-May) as the minimum pandas version to try interchanging from?

Trying this out with the latest pandas nightly, the docs/tutorial notebooks all seem to "just work" with non-pandas DataFrames - here's an example:

image

@mwaskom
Owner

mwaskom commented Apr 25, 2023

Yes, definitely still open to this, but I don't think that adding it to every interface function is the right approach. Most functions call down to one of a couple of internal functions (which are gradually being standardized) for parsing the data specification.

given that pandas 2.0.0 and 2.0.1 have some small bugs in the interchange protocol (for categorical columns, and those with missing values - the rest should already work) would it be OK to set 2.0.2 (probably out mid-May) as the minimum pandas version to try interchanging from?

I'm not sure I see the argument for gating it by pandas version. If someone passes a DataFrame that pandas 2.0.0 can handle, why not let it? (Also, I thought that the interchange protocol was introduced in 1.5?)
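For reference, the version gate under discussion could be expressed roughly as below. This is a minimal sketch, not seaborn's code: `release_tuple`, `version_predates`, and `interchange_policy` are hypothetical stand-ins that use a simplified numeric parse (pre-release suffixes are ignored), and the 1.5.0 and 2.0.2 thresholds are simply the versions mentioned in this thread.

```python
import re
from types import SimpleNamespace


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def version_predates(lib, version: str) -> bool:
    """Simplified stand-in for a version check (ignores pre-release suffixes)."""
    have, want = release_tuple(lib.__version__), release_tuple(version)
    width = max(len(have), len(want))
    # Pad with zeros so that "1.5" and "1.5.0" compare equal.
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have < want


def interchange_policy(lib) -> str:
    """Classify a pandas-like module by the versions discussed in the thread."""
    if version_predates(lib, "1.5.0"):
        return "unsupported"   # no interchange support to call
    if version_predates(lib, "2.0.2"):
        return "best-effort"   # may fail on categorical or missing data
    return "supported"


# Hypothetical stand-ins for `import pandas as pd` at different versions.
for v in ["1.4.4", "2.0.1", "2.0.2"]:
    print(v, interchange_policy(SimpleNamespace(__version__=v)))
```

The middle branch reflects the suggestion to let earlier versions attempt the conversion instead of refusing outright.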

@@ -889,3 +891,13 @@ def _disable_autolayout():
def _version_predates(lib: ModuleType, version: str) -> bool:
    """Helper function for checking version compatibility."""
    return Version(lib.__version__) < Version(version)


def try_convert_to_pandas(data: object | None) -> pd.DataFrame:
Owner

Why object | None and not Any?

Contributor Author

just because Any turns off the type checker, so I tend to avoid using it unless I have to, whereas object prevents me from making assumptions about what properties the variable might have
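The difference between the two annotations can be illustrated with a small sketch. The function names and the `FakeFrame` class are hypothetical and exist only for this example.

```python
from typing import Any


def n_columns_any(data: Any) -> int:
    # A type checker accepts any attribute access on `Any`, so a misspelled or
    # missing attribute here would only be discovered at runtime.
    return len(data.columns)


def n_columns_object(data: object) -> int:
    # With `object`, a checker such as mypy rejects `data.columns` directly,
    # so the code has to establish what `data` is before using it.
    columns = getattr(data, "columns", None)
    if columns is None:
        raise TypeError(
            f"Expected a dataframe-like object, got {type(data).__name__}"
        )
    return len(columns)


class FakeFrame:
    """Hypothetical stand-in for a dataframe with a `columns` attribute."""

    columns = ["x", "y"]


print(n_columns_any(FakeFrame()))     # 2
print(n_columns_object(FakeFrame()))  # 2
```

Passing an unrelated value such as `42` fails in both functions at runtime, but only the `object` version forces that case to be handled explicitly while the code is being written.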

def try_convert_to_pandas(data: object | None) -> pd.DataFrame:
    if data is None:
        return None
    elif isinstance(data, pd.DataFrame):
Owner

Is this guard important? Will passing a pandas.DataFrame to pd.api.interchange.from_dataframe be costly? I'd expect it to be a no-op without looking closer.

Contributor Author

you're right, thanks, it is in fact a no-op

https://github.com/pandas-dev/pandas/blob/360bf218d68c703911731aec58a52b6501b2f4ce/pandas/core/interchange/from_dataframe.py#L48-L49

However, seaborn still supports versions of pandas older than those which support the interchange protocol, so I introduced this to keep it a no-op in such cases
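The guard being discussed can be sketched as follows. This is an illustration under assumptions, not the code in this PR: it treats any object exposing `__dataframe__` as interchange-capable, passes everything else through untouched, and imports pandas lazily only so that the non-DataFrame paths of the sketch run without it.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    import pandas as pd


def try_convert_to_pandas(data: Optional[object]) -> "pd.DataFrame | object | None":
    """Sketch: convert interchange-capable dataframes to pandas, else pass through."""
    if data is None:
        return None
    if not hasattr(data, "__dataframe__"):
        # Dicts, arrays, and DataFrames from pandas versions that predate the
        # protocol end up here and are left for the existing code paths.
        return data

    import pandas as pd  # imported lazily; an assumption made for this sketch

    if isinstance(data, pd.DataFrame):
        # Already pandas: skip the round trip.
        return data
    if not hasattr(pd.api, "interchange"):
        raise TypeError(
            "Converting non-pandas dataframes requires a pandas version "
            "that provides `pd.api.interchange`."
        )
    return pd.api.interchange.from_dataframe(data)


print(try_convert_to_pandas(None))           # None
print(try_convert_to_pandas({"x": [1, 2]}))  # {'x': [1, 2]}
```

With this shape, the explicit `isinstance` check is what keeps the function a no-op for pandas input regardless of which pandas version is installed.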

@mwaskom mwaskom added the api label Apr 25, 2023
@mwaskom
Owner

mwaskom commented Apr 25, 2023

Also, is there an analogous solution for pd.Series? I guess many other DataFrame libraries don't have a concept of an "indexed" column, so maybe not; should those go through the array protocol? I ask because most seaborn functions can accept vector data passed directly to x/y/etc. and I'd want to be similarly flexible.

@MarcoGorelli
Contributor Author

Looks like the x=<vector>, y=<vector> syntax already works for non-pandas Series, as they're converted to numpy arrays anyway

if vector is not None and vector.shape != (1,):
    vector = np.squeeze(vector)

For example, the following just works

image

@mwaskom
Owner

mwaskom commented May 4, 2023

regplot is a very old function and much less pandas-oriented than most so it's not where I'd start to test.

@MarcoGorelli
Contributor Author

Thanks - what's a more modern function you'd recommend testing? The notebooks from doc/_tutorial seem to work, e.g.

image

@codecov

codecov bot commented May 7, 2023

Codecov Report

Merging #3340 (991c343) into master (129ce70) will decrease coverage by 0.01%.
The diff coverage is 91.66%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3340      +/-   ##
==========================================
- Coverage   98.24%   98.23%   -0.01%     
==========================================
  Files          77       77              
  Lines       24791    24824      +33     
==========================================
+ Hits        24357    24387      +30     
- Misses        434      437       +3     
Impacted Files              Coverage Δ
tests/test_utils.py         98.40% <77.77%> (-0.51%) ⬇️
seaborn/utils.py            93.97% <91.66%> (-0.08%) ⬇️
seaborn/_core/data.py       98.93% <100.00%> (+0.02%) ⬆️
seaborn/_core/plot.py       99.38% <100.00%> (+<0.01%) ⬆️
seaborn/_oldcore.py         97.54% <100.00%> (ø)
seaborn/axisgrid.py         97.20% <100.00%> (+<0.01%) ⬆️
seaborn/categorical.py      95.52% <100.00%> (+<0.01%) ⬆️
seaborn/distributions.py    96.36% <100.00%> (+<0.01%) ⬆️
seaborn/regression.py       98.48% <100.00%> (+<0.01%) ⬆️
seaborn/relational.py       99.69% <100.00%> (+<0.01%) ⬆️

@MarcoGorelli
Contributor Author

MarcoGorelli commented May 7, 2023

I tried amending load_dataset to return a non-pandas (polars) dataframe, and executing all the notebooks in doc/_tutorial, and it all worked

I've opened a PR to my own fork demonstrating this - it's easiest to see the diff, and the outputs, with reviewnb: https://app.reviewnb.com/MarcoGorelli/seaborn/pull/2/

I had to use the latest pandas nightly (in order to get some recent fixes which will be in version 2.0.2), which I installed with:

pip uninstall pandas -y
pip install --pre --extra-index-url https://pypi.anaconda.org/scipy-wheels-nightly/simple pandas

To execute all the notebooks, I just did

git ls-files doc/_tutorial/*.ipynb | xargs jupyter nbconvert --to notebook --execute --inplace

Testing-wise, what would you like? OK to just "trust" the interchange protocol, as is done here, or would you expect another CI job re-running all the tests with a non-pandas dataframe library?

@MarcoGorelli MarcoGorelli marked this pull request as draft May 19, 2023 13:43
@@ -922,7 +924,7 @@ def _assign_variables_longform(self, data=None, **kwargs):
                     val in data
                     or (isinstance(val, (str, bytes)) and val in index)
                 )
-            except (KeyError, TypeError):
+            except (KeyError, TypeError, ValueError):
Contributor Author

if val is a non-pandas Series, then checking val in data will throw ValueError
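One general way a membership test can raise ValueError is when `==` is elementwise and the result has no single truth value. The sketch below shows that mechanism with a hypothetical `ElementwiseSeries` stand-in. It is not the actual pandas or polars code path, which is not shown in this thread, and `is_column_reference` is likewise an illustrative helper, not seaborn's function.

```python
class ElementwiseSeries:
    """Hypothetical stand-in for a Series whose `==` is elementwise."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Elementwise comparison returns another series, not a single bool.
        return ElementwiseSeries(v == other for v in self.values)

    def __bool__(self):
        raise ValueError("The truth value of a series is ambiguous")


def is_column_reference(val, columns) -> bool:
    """Return True if `val` names one of `columns`, tolerating vector input."""
    try:
        return val in columns
    except (KeyError, TypeError, ValueError):
        # Vector-like values cannot be column names; treat them as data.
        return False


columns = ["a", "b"]
print(is_column_reference("a", columns))                        # True
print(is_column_reference(ElementwiseSeries([1, 2]), columns))  # False
```

Without ValueError in the `except` clause, the second call would propagate the error instead of falling back to treating the value as vector data.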

import numpy as np
import pandas as pd

import pytest


def maybe_convert_to_polars(df):
Contributor Author

updating several fixtures to return polars.DataFrame instead of pandas.DataFrame

@MarcoGorelli
Contributor Author

MarcoGorelli commented May 19, 2023

I've added a CI job to test this, in which the fixtures are amended to return a polars DataFrame instead of a pandas one (where possible)

There are currently some CI failures in that job, which also appear when using the latest pandas nightly - might have to wait for pandas 2.0.2 to come out then, to use that in this CI workflow. Fortunately, that's expected to be quite soon, on Monday

EDIT: turns out the failures were from numpy nightly, not pandas nightly. For now I've downgraded to 2.0.1, which doesn't pull the numpy nightly

EDIT2: I've added a test which fails with pandas 2.0.1, but will pass with 2.0.2. Will update on Monday, just checking CI actually passes like this

@mwaskom
Owner

mwaskom commented May 20, 2023

Hi @MarcoGorelli, this is a very helpful proof of concept, but just a heads up that I don't think I want to maintain this sort of test-everything-on-polars approach on an ongoing basis, so no need to kill yourself tying up every edge case in the way that tests are written.

@MarcoGorelli
Contributor Author

😄 sure, I'll see if there's a simpler way round this, thanks

@MarcoGorelli
Contributor Author

closing in favour of #3369
