Use Arrow PyCapsule Interface instead of Dataframe Interchange Protocol #3756
Comments
Thanks for flagging. Just skimmed your link but it looks like it's operating at a very different level from the dataframe interchange protocol? The relevance to seaborn (i.e., is there a simple way to be more agnostic about input data structure types) isn't super obvious.
Apologies as I should have been more clear - the technical documentation I provided was just a reference, not something I'd expect seaborn to have to implement from scratch. The dataframe libraries that you would interact with should do most of the heavy lifting for that. @MarcoGorelli probably knows best here, but from a cursory glance of the seaborn source code, I think you could adopt the Arrow PyCapsule interface in a piece-wise fashion:
Step 1, I think, would be pretty easy, and would immediately open up seaborn for use from polars, excluding any data types that polars has which pandas does not (most likely Decimal / aggregate types).
Step 2 would take a little more time. I'm not sure if narwhals is even fully capable of abstracting all of the dataframe operations that seaborn needs today, but in theory this would make your dependencies more lightweight by dropping pandas.
Overall, rather than seaborn having to customize solutions towards the various dataframe type systems, the ecosystem would converge on just the Arrow type system. Assuming seaborn still requires NumPy types for interoperability with matplotlib, there will still be a gap where Arrow types don't have a plottable equivalent, but I think that's better than the status quo where seaborn is tied to the pandas type system, given Arrow is better documented and more stable.
Thanks for elaborating.
To be clear, this is already the case:
This is a complete non-starter.
I can't see that changing any time soon, but I don't know what specifically is on matplotlib's roadmap.
Thanks for the ping, and thanks both for comments! 🙏 It's true that Seaborn accepts Polars objects, but they fail if the object contains data types not recognised by the interchange protocol (#3533). (I think we all find this frustrating, and feel at least slightly let down by the interchange protocol, but that's a different story.) Seaborn currently uses pd.api.interchange.from_dataframe(data), and that's what fails when the interchange protocol falls short. But if in pandas we first tried using the (superior, better maintained, less fallible) PyCapsule interface, then Seaborn's current code could "just work".
😆 fair enough. So, in summary, there might not be anything actionable on Seaborn's side here (though I hope the fallback in #3534 makes it into the next release). Still, good to catch up and hear your opinion on the topic 🙌
This would require waiting for pandas 3.0, which might take quite some more time.
I've opened #3782 to suggest going via PyArrow's PyCapsule Interface, which is widely used and robust.
I'm hoping this can be considered. If not, I'm hoping we can discuss some other alternatives, because in any case, if Seaborn ends up being the only (!) project still using the Interchange Protocol, then that'll introduce further risk.
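For anyone following along, the ordering discussed in this thread (try the Arrow PyCapsule Interface first, keep the interchange protocol as a fallback) could look roughly like the sketch below. The helper names pick_conversion_path and convert_to_pandas are hypothetical and are not seaborn or pandas API; the sketch also assumes a pyarrow version whose pa.table() accepts objects exposing __arrow_c_stream__. It is an illustration of the idea, not what #3782 actually implements.

```python
from typing import Any


def pick_conversion_path(data: Any) -> str:
    """Name the conversion route to try for `data` (hypothetical helper)."""
    # Prefer the Arrow PyCapsule Interface when the object advertises it.
    if hasattr(data, "__arrow_c_stream__"):
        return "pycapsule"
    # Otherwise fall back to the dataframe interchange protocol.
    if hasattr(data, "__dataframe__"):
        return "interchange"
    return "unsupported"


def convert_to_pandas(data: Any):
    """Convert a dataframe-like object to pandas (hypothetical helper)."""
    path = pick_conversion_path(data)
    if path == "pycapsule":
        # Assumption: pyarrow is installed and pa.table() can consume
        # any object that implements __arrow_c_stream__.
        import pyarrow as pa

        return pa.table(data).to_pandas()
    if path == "interchange":
        # This is the call the thread says seaborn currently relies on.
        import pandas as pd

        return pd.api.interchange.from_dataframe(data)
    raise TypeError(
        f"{type(data).__name__} exposes neither __arrow_c_stream__ "
        "nor __dataframe__"
    )
```

The third-party imports are kept inside the branches so that pyarrow only needs to be installed when the PyCapsule route is actually taken.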
This is something I've chatted with @MarcoGorelli offline about. At the time it was implemented in seaborn, the Dataframe Interchange Protocol was the best option for exchanging dataframe-like data. However, since then, the Arrow PyCapsule Interface has come along and solved many of the issues that the Dataframe Interchange Protocol left open.
Without knowing the current state of the interchange implementation in seaborn, switching to the Arrow PyCapsule Interface should solve at least the following issues:
- to_pandas rather than erroring if interchanging to pandas doesn't work? (#3533)
- size parameter of scatterplot does not accept Float64 type (#3519)
The interface has been adopted by a good number of projects already, some of which are being tracked in apache/arrow#39195
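As a rough illustration of what "adopting the interface" means on the consumer side: the Arrow PyCapsule Interface is defined in terms of dunder methods (__arrow_c_schema__, __arrow_c_array__, __arrow_c_stream__), so support can be detected by duck typing. The sketch below uses typing.Protocol for that check. The class names and the arrow_exports helper are illustrative only and are not part of seaborn, pandas, or pyarrow.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class ArrowSchemaExportable(Protocol):
    def __arrow_c_schema__(self) -> Any: ...


@runtime_checkable
class ArrowArrayExportable(Protocol):
    def __arrow_c_array__(self, requested_schema: Any = None) -> Any: ...


@runtime_checkable
class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(self, requested_schema: Any = None) -> Any: ...


def arrow_exports(obj: Any) -> List[str]:
    """List which PyCapsule export hooks `obj` exposes (illustrative helper)."""
    checks = [
        ("schema", ArrowSchemaExportable),
        ("array", ArrowArrayExportable),
        ("stream", ArrowStreamExportable),
    ]
    # isinstance() on a runtime_checkable Protocol only checks that the
    # method exists; it does not validate what the method returns.
    return [name for name, proto in checks if isinstance(obj, proto)]
```

A tabular object such as a dataframe would typically be expected to show up under "stream", which is the hook a plotting library would care about when ingesting a whole table.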