Support interchange protocol #3340

MarcoGorelli · 2023-04-25T10:21:40Z

That [api.interchange.from_dataframe] is indeed on the roadmap for seaborn v0.13 now that it's landed in pandas stable

so I figured I'd open a PR to drive the conversation forwards

Just wanted to check:

- would you still be open to this?
- given that pandas 2.0.0 and 2.0.1 have some small bugs in the interchange protocol (for categorical columns, and those with missing values - the rest should already work), would it be OK to set 2.0.2 (probably out mid-May) as the minimum pandas version to try interchanging from?

Trying this out with the latest pandas nightly, the docs/tutorial notebooks all seem to "just work" with non-pandas DataFrames - here's an example:

mwaskom · 2023-04-25T23:05:41Z

Yes, definitely still open to this, but I don't think that adding it to every interface function is the right approach. Most functions call down to one of a couple of internal functions (which are gradually being standardized) for parsing the data specification.
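To make the design point concrete, here is a rough sketch of converting once in a shared parser instead of in every public function. All names (`_parse_data_spec`, `scatter_like`, `line_like`, `convert`) are hypothetical and plain dicts stand in for DataFrames; this is not seaborn's internal API.

```python
def _parse_data_spec(data, variables, convert):
    """Hypothetical shared parser: convert `data` once, then resolve variables."""
    table = convert(data) if data is not None else {}
    resolved = {}
    for name, ref in variables.items():
        # A string naming a column is looked up; anything else is treated
        # as vector data and passed through untouched.
        if isinstance(ref, str) and ref in table:
            resolved[name] = table[ref]
        else:
            resolved[name] = ref
    return resolved


# Public functions stay thin and never touch the conversion logic directly.
def scatter_like(data=None, *, x=None, y=None, convert=dict):
    return _parse_data_spec(data, {"x": x, "y": y}, convert)


def line_like(data=None, *, x=None, y=None, convert=dict):
    return _parse_data_spec(data, {"x": x, "y": y}, convert)


calls = []


def counting_convert(data):
    # Records each conversion so the single call site is visible.
    calls.append(type(data).__name__)
    return dict(data)


out = scatter_like({"a": [1, 2], "b": [3, 4]}, x="a", y="b", convert=counting_convert)
mixed = line_like({"a": [1, 2]}, x="a", y=[9, 9])
```

With this shape, supporting a new input type means changing only the converter handed to the shared parser.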

> given that pandas 2.0.0 and 2.0.1 have some small bugs in the interchange protocol (for categorical columns, and those with missing values - the rest should already work) would it be OK to set 2.0.2 (probably out mid-May) as the minimum pandas version to try interchanging from?

I'm not sure I see the argument for gating it by pandas version. If someone passes a DataFrame that pandas 2.0.0 can handle, why not let it? (Also I thought that the interchange protocol was introduced in 1.5?)
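For reference, the kind of version gate being debated could look roughly like the sketch below. It is a simplified, stdlib-only stand-in for the `Version`-based `_version_predates` helper in the diff: it compares only the leading numeric release segments, so pre-release tags such as `rc1` are ignored. The names `predates` and `can_interchange` are illustrative, not part of this PR.

```python
import re
from typing import Tuple


def _release(version: str) -> Tuple[int, ...]:
    """Leading numeric release segments, padded to three: '2.0' -> (2, 0, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    parts = [int(p) for p in match.group().split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def predates(installed: str, minimum: str) -> bool:
    """True when `installed` is an older release than `minimum`."""
    return _release(installed) < _release(minimum)


def can_interchange(pandas_version: str, minimum: str = "2.0.2") -> bool:
    # Assumption: the gate is a hard cut-off at `minimum`, as proposed above.
    return not predates(pandas_version, minimum)
```

A hard gate like this also rejects inputs that 2.0.0 and 2.0.1 would have handled correctly, which is the trade-off raised in the comment above.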

mwaskom · 2023-04-25T23:06:10Z

seaborn/utils.py

@@ -889,3 +891,13 @@ def _disable_autolayout():
 def _version_predates(lib: ModuleType, version: str) -> bool:
    """Helper function for checking version compatibility."""
    return Version(lib.__version__) < Version(version)
+
+
+def try_convert_to_pandas(data: object | None) -> pd.DataFrame:


Why object | None and not Any?

Just because Any turns off the type checker, so I tend to avoid using it unless I have to, whereas object prevents me from making assumptions about what properties the variable might have

mwaskom · 2023-04-25T23:07:25Z

seaborn/utils.py

+def try_convert_to_pandas(data: object | None) -> pd.DataFrame:
+    if data is None:
+        return None
+    elif isinstance(data, pd.DataFrame):


Is this guard important? will passing a pandas.DataFrame to pd.api.interchange.from_dataframe be costly? I'd expect it to be a no-op without looking closer.

you're right, thanks, it is in fact a no-op

https://github.com/pandas-dev/pandas/blob/360bf218d68c703911731aec58a52b6501b2f4ce/pandas/core/interchange/from_dataframe.py#L48-L49

However, seaborn still supports versions of pandas older than those which support the interchange protocol, so I introduced this to keep it a no-op in such cases
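A minimal sketch of the helper as discussed in this thread follows; it is not necessarily the code in the diff. The pass-through for objects that are neither pandas DataFrames nor interchange-capable is an assumption, made because seaborn also accepts dicts and arrays as `data`.

```python
import pandas as pd


def try_convert_to_pandas(data):
    """Sketch: return a pandas DataFrame when `data` can be interchanged."""
    if data is None:
        return None
    if isinstance(data, pd.DataFrame):
        # Guard first, so that pandas versions without pd.api.interchange
        # never reach the interchange call for their own DataFrames.
        return data
    if hasattr(data, "__dataframe__"):
        # The object advertises the interchange protocol (needs pandas >= 1.5).
        return pd.api.interchange.from_dataframe(data)
    # Assumption: anything else (dict, array, ...) is passed through unchanged.
    return data


df = pd.DataFrame({"x": [1, 2, 3]})
same = try_convert_to_pandas(df)
```

Placing the `isinstance` check before the `__dataframe__` check is what keeps the function a no-op on older pandas.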

mwaskom · 2023-04-25T23:13:18Z

Also is there an analogous solution for pd.Series? I guess many other DataFrame libraries don't have a concept of an "indexed" column so maybe not, and those should go through the array protocol? I ask because most seaborn functions can accept vector data passed directly to x/y/etc. and I'd want to be similarly flexible.

MarcoGorelli · 2023-04-26T15:06:29Z

Looks like the x=<vector>, y=<vector> syntax already works for non-pandas Series, as they're converted to numpy arrays anyway

seaborn/seaborn/regression.py

Lines 49 to 50 in 54c36b7

    
        if vector is not None and vector.shape != (1,):
            vector = np.squeeze(vector)
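A rough sketch of why that snippet copes with non-pandas vectors: `np.squeeze` falls back to array conversion for any object that implements `__array__`. `FakeSeries` below is a hypothetical stand-in for a third-party Series type, not a real library class.

```python
import numpy as np


class FakeSeries:
    """Hypothetical non-pandas column that only speaks the array protocol."""

    def __init__(self, values):
        self._values = list(values)

    @property
    def shape(self):
        return (len(self._values),)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this when it needs an ndarray view of the object.
        return np.array(self._values, dtype=dtype)


vector = FakeSeries([1.0, 2.0, 3.0])
if vector is not None and vector.shape != (1,):
    # No `squeeze` method on FakeSeries, so NumPy converts it to an ndarray.
    vector = np.squeeze(vector)
```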

For example, the following just works

mwaskom · 2023-05-04T11:35:06Z

regplot is a very old function and much less pandas-oriented than most so it's not where I'd start to test.

MarcoGorelli · 2023-05-04T15:41:53Z

Thanks - what's a more modern function you'd recommend testing? The notebooks from doc/_tutorial seem to work, e.g.

codecov · 2023-05-07T14:47:47Z

Codecov Report

Merging #3340 (991c343) into master (129ce70) will decrease coverage by 0.01%.
The diff coverage is 91.66%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3340      +/-   ##
==========================================
- Coverage   98.24%   98.23%   -0.01%     
==========================================
  Files          77       77              
  Lines       24791    24824      +33     
==========================================
+ Hits        24357    24387      +30     
- Misses        434      437       +3

Impacted Files	Coverage Δ
tests/test_utils.py	`98.40% <77.77%> (-0.51%)`	⬇️
seaborn/utils.py	`93.97% <91.66%> (-0.08%)`	⬇️
seaborn/_core/data.py	`98.93% <100.00%> (+0.02%)`	⬆️
seaborn/_core/plot.py	`99.38% <100.00%> (+<0.01%)`	⬆️
seaborn/_oldcore.py	`97.54% <100.00%> (ø)`
seaborn/axisgrid.py	`97.20% <100.00%> (+<0.01%)`	⬆️
seaborn/categorical.py	`95.52% <100.00%> (+<0.01%)`	⬆️
seaborn/distributions.py	`96.36% <100.00%> (+<0.01%)`	⬆️
seaborn/regression.py	`98.48% <100.00%> (+<0.01%)`	⬆️
seaborn/relational.py	`99.69% <100.00%> (+<0.01%)`	⬆️

MarcoGorelli · 2023-05-07T14:59:08Z

I tried amending load_dataset to return a non-pandas (polars) dataframe, and executing all the notebooks in doc/_tutorial, and it all worked

I've opened a PR to my own fork demonstrating this - it's easiest to see the diff, and the outputs, with reviewnb: https://app.reviewnb.com/MarcoGorelli/seaborn/pull/2/

I had to use the latest pandas nightly (in order to get some recent fixes which will be in version 2.0.2), which I installed with:

pip uninstall pandas -y
pip install --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple pandas

To execute all the notebooks, I just did

git ls-files doc/_tutorial/*.ipynb | xargs jupyter nbconvert --to notebook --execute --inplace

Testing-wise, what would you like? OK to just "trust" the interchange protocol, as is done here, or would you expect another CI job re-running all the tests with a non-pandas dataframe library?

MarcoGorelli · 2023-05-19T14:36:38Z

seaborn/_oldcore.py

@@ -922,7 +924,7 @@ def _assign_variables_longform(self, data=None, **kwargs):
                    val in data
                    or (isinstance(val, (str, bytes)) and val in index)
                )
-            except (KeyError, TypeError):
+            except (KeyError, TypeError, ValueError):


if val is a non-pandas Series, then checking val in data will throw ValueError
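A stdlib-only sketch of the failure mode, assuming the vector type compares elementwise and refuses to be reduced to a single boolean. `ElementwiseVector` and `is_column_reference` are hypothetical stand-ins; the exact code path inside a real DataFrame library may differ.

```python
class ElementwiseVector:
    """Hypothetical vector whose `==` is elementwise, like many Series types."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        return ElementwiseVector(v == other for v in self.values)

    def __bool__(self):
        # Mirrors the "truth value is ambiguous" errors raised by array types.
        raise ValueError("the truth value of a vector is ambiguous")


def is_column_reference(val, columns):
    """Membership test that treats any lookup failure as 'not a column name'."""
    try:
        return val in columns
    except (KeyError, TypeError, ValueError):
        # `in` compares `val` with each entry and calls bool() on the result,
        # which raises ValueError for elementwise types.
        return False


named = is_column_reference("x", ["x", "y"])
vector = is_column_reference(ElementwiseVector([1, 2]), ["x", "y"])
```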

MarcoGorelli · 2023-05-19T14:38:00Z

tests/conftest.py

 import numpy as np
 import pandas as pd

 import pytest


+def maybe_convert_to_polars(df):


updating several fixtures to return polars.DataFrame instead of pandas.DataFrame

MarcoGorelli · 2023-05-19T14:43:33Z

I've added a CI job to test this, in which the fixtures are amended to return a polars DataFrame instead of a pandas one (where possible)

There are currently some CI failures in that job, which also appear when using the latest pandas nightly - might have to wait for pandas 2.0.2 to come out then, to use that in this CI workflow. Fortunately, that's expected to be quite soon, on Monday

EDIT: turns out the failures were from numpy nightly, not pandas nightly. For now I've downgraded to 2.0.1, which doesn't pull the numpy nightly

EDIT2: I've added a test which fails with pandas 2.0.1, but will pass with 2.0.2. Will update on Monday, just checking CI actually passes like this

mwaskom · 2023-05-20T14:34:44Z

Hi @MarcoGorelli this is a very helpful proof of concept but just a heads up that I don't think I want to maintain this sort of test-everything-on-polars approach on an ongoing basis, so no need to kill yourself tying up every edge case in the way that tests are written.

MarcoGorelli · 2023-05-20T15:00:16Z

😄 sure, I'll see if there's a simpler way round this, thanks

MarcoGorelli · 2023-05-21T18:40:08Z

closing in favour of #3369

support interchange protocol

142086a

mwaskom reviewed Apr 25, 2023

mwaskom added the api label Apr 25, 2023

raise if trying to interchange before pd 2.0.2

55df47b

MarcoGorelli mentioned this pull request May 4, 2023

pandas 2.0 compat: address is_categorical_dtype deprecation #3355

Merged

MarcoGorelli added 4 commits May 7, 2023 12:50

Merge remote-tracking branch 'upstream/master' into support-interchan…

194a564

…ge-protocol

revert temporary change

088cb08

simplify

03c1717

fixup

7dd9ff6

mwaskom mentioned this pull request May 18, 2023

seaborn.object: TypeError: x given by both name and position raised by polars and not by pandas #3368

Closed

MarcoGorelli added 7 commits May 19, 2023 11:43

Merge remote-tracking branch 'upstream/master' into support-interchan…

bbfdadb

…ge-protocol

try adding polars workflow

ad48c8a

3.10

a0bd3f7

try fixup;

8edcf14

include pyarrow install

22df733

pandas nightly

fa37b56

wip

b5c4ff8

MarcoGorelli marked this pull request as draft May 19, 2023 13:43

fixup

064b2e6

MarcoGorelli commented May 19, 2023

MarcoGorelli added 2 commits May 19, 2023 15:47

reduce dependency to pandas 2.0.1

3f32596

test that all load_dataset examples can actually interchange

f4f3317

MarcoGorelli added 13 commits May 20, 2023 09:14

better msg

1103aa9

coverage

338e119

pyarrow

5a44bed

fix deps

73f35d5

gotta remember pyarrow

63c21ee

wip

0e9586f

wip

9f1927f

wip

9c4aedb

wip

e7e84f5

wip

30e1002

increase test coverage even more

1985028

pre-commit run -a

b8584ee

skip estimateaggregator tests for the polars fixtures

5b2532e

MarcoGorelli added 4 commits May 21, 2023 16:59

simplify

4897344

convert as soon as possible

8117fe6

try convert in facetgrid

3812e5f

convert in pairgrid

6494ef4

mwaskom mentioned this pull request May 21, 2023

Support polars and other data libraries via dataframe interchange #3369

Merged

remove separate workflow;

991c343

MarcoGorelli closed this May 21, 2023
