Support polars and other data libraries via dataframe interchange #3369

Merged
merged 11 commits into from
Aug 23, 2023

Conversation

mwaskom
Owner

@mwaskom mwaskom commented May 21, 2023

This PR leverages the dataframe interchange protocol to let seaborn consume objects from other dataframe libraries, such as polars.

Dataframes are converted to pandas objects upon consumption, and pandas is what seaborn uses internally, so seaborn's statistical operations don't take advantage of the parallelism, out-of-core computation, or similar functionality offered by these libraries. While that would be ideal, I don't see it happening any time soon.

Nevertheless, this should make it easy to prep data in a library of choice and then pipe it to seaborn without thinking too much about the representation.
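
As a rough illustration of the conversion step described above (not the code in this PR), the sketch below passes pandas objects through unchanged and converts anything else that exposes `__dataframe__` using `pandas.api.interchange.from_dataframe`. The helper name `to_pandas` is hypothetical, and the sketch assumes pandas 1.5 or later, where the interchange module is available.

```python
import pandas as pd


def to_pandas(data):
    """Return a pandas DataFrame for an interchange-capable object.

    Hypothetical helper for illustration; not seaborn's implementation.
    """
    if isinstance(data, pd.DataFrame):
        # Already pandas: nothing to convert
        return data
    if not hasattr(data, "__dataframe__"):
        raise TypeError("object does not support the dataframe interchange protocol")
    # Build a pandas DataFrame from the interchange object
    return pd.api.interchange.from_dataframe(data)


frame = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})
same = to_pandas(frame)  # pandas input is returned as-is
```

With polars installed, the same helper should accept a `polars.DataFrame`, since polars implements `__dataframe__`.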

For testing, my approach is to use a simple mock object that is not (by inheritance) a pandas DataFrame, but that does "support" the interchange protocol. I think this is sufficient, with the assumption that pandas / other data libraries will together correctly implement the dataframe interchange itself. Testing whether that works correctly feels out of scope for seaborn's unit tests.
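
A minimal sketch of what such a mock could look like, assuming it simply wraps a pandas DataFrame and delegates `__dataframe__` to it. The class name and the round trip through `pandas.api.interchange.from_dataframe` are illustrative and not necessarily what the test suite in this PR contains.

```python
import pandas as pd


class MockInterchangeableDataFrame:
    """Hypothetical stand-in: not a pandas DataFrame, but exposes __dataframe__."""

    def __init__(self, data):
        self._data = data

    def __dataframe__(self, *args, **kwargs):
        # Delegate to the wrapped pandas object's own interchange implementation
        return self._data.__dataframe__(*args, **kwargs)


source = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})
mock = MockInterchangeableDataFrame(source)

# The mock is not a pandas object, so the consumer has to go through the protocol
restored = pd.api.interchange.from_dataframe(mock)
```

Because the mock does not inherit from `pd.DataFrame`, code that only checks `isinstance(data, pd.DataFrame)` will not take its pandas fast path, which is what makes the object useful for exercising the interchange branch.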

I wouldn't be surprised to learn of various edge cases as this rolls out to people using alternative dataframe libraries heavily (I only did some light testing with polars using toy datasets), so we'll address those as they happen.

Thanks to @MarcoGorelli for getting the ball rolling with #3340 and advising on the approach.
Closes #3368

@codecov

codecov bot commented May 21, 2023

Codecov Report

Merging #3369 (e298131) into master (af613f1) will decrease coverage by 0.01%.
The diff coverage is 97.97%.

Additional details and impacted files


@@            Coverage Diff             @@
##           master    #3369      +/-   ##
==========================================
- Coverage   98.33%   98.32%   -0.01%     
==========================================
  Files          77       77              
  Lines       24335    24381      +46     
==========================================
+ Hits        23929    23973      +44     
- Misses        406      408       +2     
Files Changed               Coverage Δ
seaborn/_core/data.py       97.41% <92.59%> (-1.50%) ⬇️
seaborn/_base.py            97.56% <100.00%> (-0.11%) ⬇️
seaborn/_core/plot.py       98.26% <100.00%> (ø)
seaborn/_core/typing.py     96.00% <100.00%> (ø)
seaborn/axisgrid.py         97.20% <100.00%> (+<0.01%) ⬆️
seaborn/categorical.py      98.90% <100.00%> (ø)
seaborn/distributions.py    96.23% <100.00%> (+<0.01%) ⬆️
tests/_core/test_data.py    100.00% <100.00%> (ø)
tests/_core/test_plot.py    98.80% <100.00%> (+<0.01%) ⬆️
tests/conftest.py           100.00% <100.00%> (ø)
... and 2 more

@MarcoGorelli
Contributor

thanks for the ping

I tried this out, but got a failure for the doc/_docstrings/FacetGrid.ipynb notebook:

import polars as pl
import seaborn as sns

tips = pl.from_pandas(sns.load_dataset('tips'))
g = sns.FacetGrid(tips, col="time", row="sex")
g.map(sns.scatterplot, "total_bill", "tip")

SchemaError                               Traceback (most recent call last)
Cell In[4], line 2
      1 g = sns.FacetGrid(tips, col="time",  row="sex")
----> 2 g.map(sns.scatterplot, "total_bill", "tip")

File ~/seaborn-dev/seaborn/axisgrid.py:720, in FacetGrid.map(self, func, *args, **kwargs)
    717         warnings.warn(warning)
    719 # Iterate over the data subsets
--> 720 for (row_i, col_j, hue_k), data_ijk in self.facet_data():
    721 
    722     # If this subset is null, move on
    723     if not data_ijk.values.size:
    724         continue

File ~/seaborn-dev/seaborn/axisgrid.py:674, in FacetGrid.facet_data(self)
    670 # Here is the main generator loop
    671 for (i, row), (j, col), (k, hue) in product(enumerate(row_masks),
    672                                             enumerate(col_masks),
    673                                             enumerate(hue_masks)):
--> 674     data_ijk = data[row & col & hue & self._not_na]
    675     yield (i, j, k), data_ijk

File ~/seaborn-dev/.venv/lib/python3.10/site-packages/polars/series/series.py:439, in Series.__and__(self, other)
    437 if not isinstance(other, Series):
    438     other = Series([other])
--> 439 return self._from_pyseries(self._s.bitand(other._s))

SchemaError: cannot unpack series of type `list[bool]` into `bool`

@mwaskom
Owner Author

mwaskom commented May 21, 2023

I tried this out, but got a failure for the doc/_docstrings/FacetGrid.ipynb notebook

Right, this just addresses the objects interface; the older code will need to be handled separately.

Are you seeing any errors in the objects.{class} docstring notebooks?

@MarcoGorelli
Contributor

I see, thanks!

I tried

git ls-files doc/_docstrings/objects*.ipynb | xargs jupyter nbconvert --to notebook --execute --inplace
git ls-files doc/_tutorial/objects_interface.ipynb | xargs jupyter nbconvert --to notebook --execute --inplace

and both commands pass without errors (in my demo-support-interchange-protocol-2, where I make load_dataset return a polars DataFrame)

@mwaskom mwaskom changed the title Support dataframe interchange in objects interface Support polars and other data libraries via dataframe interchange Aug 23, 2023
@mwaskom
Owner Author

mwaskom commented Aug 23, 2023

Alright, after a bit of additional work this should now support alternative dataframes throughout seaborn.

Wouldn't surprise me if there are still a few weird edge cases here and there but we'll need users to surface those as I don't do any actual work with these libraries.

Gonna merge over the failing codecov check which is on the pandas <2.0.2 warning. There's not an easy way to exercise that code, and it's just a warning.

Thanks @MarcoGorelli for getting the ball rolling and weighing in here.

@mwaskom mwaskom merged commit 58cf628 into master Aug 23, 2023
11 of 12 checks passed
@mwaskom mwaskom deleted the data_interchange branch August 23, 2023 10:46
Successfully merging this pull request may close these issues.

seaborn.object: TypeError: x given by both name and position raised by polars and not by pandas
2 participants