
fix: use PyCapsule Interface instead of Dataframe Interchange Protocol #3782

Open
wants to merge 1 commit into
base: master
from

Conversation

MarcoGorelli
Contributor

@MarcoGorelli MarcoGorelli commented Nov 9, 2024

closes #3756
closes #3533
I'm hoping that this can supersede #3534

This means that you get support for quite a few more dataframe libraries, e.g.:

  • DuckDB:

image

image

In addition, this has no effect on existing pandas users, as there's already an early return for pandas: https://github.com/MarcoGorelli/seaborn/blob/0bd85071284d45f38cbf419b8cf1efb2179eda24/seaborn/_core/data.py#L284-L285


I'm sorry for having introduced the Interchange Protocol in the first place. It's turned out to be fairly problematic, see pandas-dev/pandas#56732 (comment) as the associated discussion for more context
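For readers unfamiliar with the PyCapsule Interface, the conversion path described above might look roughly like the sketch below. This is not the code in seaborn/_core/data.py: the function name `to_pandas_via_pycapsule` is hypothetical, and the sketch assumes a recent PyArrow release in which `pyarrow.table` accepts any object exposing `__arrow_c_stream__`. The error message is the one quoted in the review excerpt further down.

```python
import sys
from typing import Any


def to_pandas_via_pycapsule(data: Any) -> Any:
    """Hypothetical helper: return a pandas DataFrame for `data`."""
    # Early return for pandas, without importing pandas ourselves:
    # if pandas was never imported, `data` cannot be a pandas DataFrame.
    pd = sys.modules.get("pandas")
    if pd is not None and isinstance(data, pd.DataFrame):
        return data

    # Objects that implement the Arrow PyCapsule Interface expose this method.
    if hasattr(data, "__arrow_c_stream__"):
        try:
            import pyarrow as pa
        except ImportError as err:
            msg = "PyArrow is required for non-pandas Dataframe support."
            raise RuntimeError(msg) from err
        # Assumption: pyarrow.table consumes the Arrow C stream directly.
        return pa.table(data).to_pandas()

    raise TypeError(f"Unsupported data type: {type(data).__name__}")
```

With this shape, pandas input never touches PyArrow, and anything that does not export an Arrow stream fails loudly instead of being converted incorrectly.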


cc @WillAyd for comments

@WillAyd

WillAyd commented Nov 9, 2024

Implementation wise I think this looks great. Nice work @MarcoGorelli

@MarcoGorelli MarcoGorelli force-pushed the pycapsule branch 3 times, most recently from 31146c8 to 9599662 Compare November 9, 2024 18:23
tests/_core/test_data.py (outdated, resolved)
@MarcoGorelli MarcoGorelli force-pushed the pycapsule branch 2 times, most recently from cf4ce2c to f516630 Compare November 9, 2024 18:39
Owner

@mwaskom mwaskom left a comment


Cool, thanks for this. I think my one question is about how compatible this will be for users that are currently benefitting from the (seemingly more-or-less built-in) interchange protocol. Do we need to provide backwards compatibility for them?

try:
    import pyarrow
except ImportError as err:
    msg = "PyArrow is required for non-pandas Dataframe support."
Owner


Is this generally a dependency of non-pandas dataframe libraries now? Or could this change introduce a regression for e.g. polars users who are currently leveraging the dataframe interchange protocol?

Contributor Author


Thanks for your review!

Polars doesn't depend on PyArrow, but polars.DataFrame.to_pandas always requires PyArrow. So, in practice, anyone working with both dataframe libraries may well have PyArrow installed already

To avoid requiring PyArrow for the cases when it's not necessary, one way could be to do something like:

  • try using the interchange protocol
  • if it raises, then fall back to the PyCapsule Interface (which currently requires PyArrow)

This has the upside of not requiring PyArrow in some cases, but the downside of hiding issues where the interchange protocol silently produces invalid results
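The two-step fallback described above could be sketched roughly as follows. All function names here are hypothetical, and the converters import their dependencies lazily so that PyArrow is only needed when the fallback path actually runs. The sketch assumes `pandas.api.interchange.from_dataframe` for the first step and `pyarrow.table` accepting PyCapsule-exporting objects for the second.

```python
from typing import Any, Callable


def via_interchange(data: Any) -> Any:
    """Convert using the dataframe interchange protocol."""
    import pandas as pd  # lazy import: only needed when this path runs

    return pd.api.interchange.from_dataframe(data)


def via_pycapsule(data: Any) -> Any:
    """Convert using the Arrow PyCapsule Interface (requires PyArrow)."""
    import pyarrow as pa  # lazy import: only needed on the fallback path

    return pa.table(data).to_pandas()


def to_pandas_with_fallback(
    data: Any,
    primary: Callable[[Any], Any] = via_interchange,
    fallback: Callable[[Any], Any] = via_pycapsule,
) -> Any:
    """Try `primary` first; if it raises, use `fallback` instead."""
    try:
        return primary(data)
    except Exception:
        # Only failures that raise reach this branch. If the interchange
        # protocol returns a wrong result without raising, it is passed
        # through unchanged, which is the downside noted above.
        return fallback(data)
```

Passing the converters as parameters is purely a convenience for testing the control flow with stand-in functions; it does not change the trade-off being discussed.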

It may be possible to do this PyCapsule Interface conversion in the future without PyArrow but with something lighter instead, like arro3 by @kylebarron (who I'm ccing in case he has comments too)

What would be your preference?


Some polars users may not have pyarrow installed. If seaborn needs to get pandas data, the only production-ready way to do Arrow -> pandas that I know of is using pyarrow.

As Marco mentions I'm working on arro3, which is a minimal library for Arrow in Python, but Pandas interop is not a primary concern, and it's not production-ready today.



FWIW pandas 3.x is going to strongly incentivize users to install PyArrow, although it stops short of outright requiring it. In theory, the only people who shouldn't have PyArrow installed are those operating in space- or resource-constrained environments, probably headless ones like AWS Lambda, where seaborn won't be used

Of course up to you how much you want to support non-PyArrow configurations, but the dataframe interchange protocol is relatively buggy and gets very little support, so you may find it easier altogether to force users towards PyArrow

def test_data_interchange(self, mock_long_df, long_df):
    pytest.importorskip(
Owner


Nice, TIL

4 participants