Support for polars Categorical datatype #224

Open
AndrewZehrQC opened this issue Sep 19, 2024 · 2 comments
Labels
bug Something isn't working enhancement New feature or request good first issue Good for newcomers usability

Comments

@AndrewZehrQC

Currently, pipedag fails when trying to materialize a polars table that contains columns of type Categorical.

@windiana42 windiana42 added bug Something isn't working enhancement New feature or request usability labels Oct 4, 2024
@windiana42
Member

@AndrewZehrQC the question is what you expect from materialization in the case of categoricals. The ideal solution would preserve the speed of categoricals in SQL. That would mean storing the column as an integer column and writing an additional table that maps the integers back to strings. Dematerialization should automatically detect the additional table and map the data back to categoricals on the Python side. @finn-rudolph at some point it might be nice if pydiverse transform could deal with such categorical columns within SQL.
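To make the storage layout concrete, here is a minimal, library-agnostic sketch using only `sqlite3` from the standard library. It is not pipedag code: the table names (`main_tbl`, `main_tbl__color_categories`), the column names, and the helper functions are all hypothetical, and a real implementation would take the codes and categories from the polars column rather than compute them by hand.

```python
import sqlite3


def encode_categorical(values):
    """Return (codes, categories) for a list of strings; None stays None."""
    categories = []  # position in this list is the integer code
    index = {}       # label -> code
    codes = []
    for value in values:
        if value is None:
            codes.append(None)
            continue
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return codes, categories


def materialize(conn, ids, values):
    """Write the integer codes to the main table and the labels to a side table."""
    codes, categories = encode_categorical(values)
    conn.execute("CREATE TABLE main_tbl (id INTEGER, color_code INTEGER)")
    conn.execute(
        "CREATE TABLE main_tbl__color_categories "
        "(code INTEGER PRIMARY KEY, label TEXT)"
    )
    conn.executemany("INSERT INTO main_tbl VALUES (?, ?)", zip(ids, codes))
    conn.executemany(
        "INSERT INTO main_tbl__color_categories VALUES (?, ?)",
        enumerate(categories),
    )


def dematerialize(conn):
    """Map the integer codes back to their string labels via the side table."""
    return conn.execute(
        "SELECT m.id, c.label FROM main_tbl AS m "
        "LEFT JOIN main_tbl__color_categories AS c ON m.color_code = c.code "
        "ORDER BY m.id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
materialize(conn, [1, 2, 3, 4], ["red", "blue", "red", None])
result = dematerialize(conn)
```

The main table only carries small integers, so joins and group-bys on the categorical column stay cheap in SQL, and the strings are looked up only when they are actually needed.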

@AndrewZehrQC I could imagine you rather had the quick bug fix in mind, which would be converting a categorical column to a string column during materialization. It might also be possible to speed up this process by first writing a very short string per category and then replacing the short strings with the long ones in SQL. However, since the solution above would be so much nicer, I wouldn't invest too much in this transfer-time speedup.

@windiana42 windiana42 added the good first issue Good for newcomers label Oct 4, 2024
@windiana42
Member

When supporting categoricals in SQL, it might be convenient to dematerialize not just an sa.Table as an sa.Alias, but a join query that already resolves the categoricals. Even though sqlalchemy supports getting columns from a join query via prefixed column names, I don't like this option much. However, returning a tuple with multiple elements (main table, dictionary of categorical table aliases) might yield quite a complex user interface. With pydiverse transform we can make this really nice, even giving you access to both the joined string and the category integer.
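For illustration, such a join query could look roughly like this in plain SQLAlchemy Core (1.4+ style `select`). This is only a sketch of the idea, not how pipedag or pydiverse transform would expose it; the table and column names are hypothetical, and an in-memory SQLite database stands in for the real backend:

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# Hypothetical layout: integer codes in the main table, labels in a side table.
main = sa.Table(
    "main_tbl",
    metadata,
    sa.Column("id", sa.Integer),
    sa.Column("color_code", sa.Integer),
)
cats = sa.Table(
    "main_tbl__color_categories",
    metadata,
    sa.Column("code", sa.Integer, primary_key=True),
    sa.Column("label", sa.String),
)

m = main.alias("m")
c = cats.alias("c")

# Expose both the category integer and the resolved string, with explicit
# labels instead of relying on prefixed column names.
query = (
    sa.select(m.c.id, m.c.color_code, c.c.label.label("color"))
    .select_from(m.outerjoin(c, m.c.color_code == c.c.code))
    .order_by(m.c.id)
)

engine = sa.create_engine("sqlite:///:memory:")
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(
        sa.insert(main),
        [
            {"id": 1, "color_code": 0},
            {"id": 2, "color_code": 1},
            {"id": 3, "color_code": None},
        ],
    )
    conn.execute(
        sa.insert(cats),
        [{"code": 0, "label": "red"}, {"code": 1, "label": "blue"}],
    )
    rows = [tuple(row) for row in conn.execute(query).all()]
```

The outer join keeps rows whose code is NULL, so missing values survive the resolution step.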
