Fix pylibcudf to_arrow with multiple nested data types #17504

Merged 4 commits on Dec 19, 2024

Contributor

@mroeschke mroeschke commented Dec 4, 2024

Description

Fixes the following case

In [25]: import pyarrow as pa, pylibcudf as plc

In [26]: pa_array = pa.array([[{"a": 1}]])

In [27]: pa_array.type
Out[27]: ListType(list<item: struct<a: int64>>)

In [28]: plc_table = plc.Table([plc.interop.from_arrow(pa_array)])

In [29]: plc.interop.to_arrow(plc_table)
RuntimeError: CUDF failure at: cpp/src/interop/to_arrow_schema.cpp:146: Number of field names and number of children doesn't match
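
The mismatch behind this error can be sketched without a GPU. The snippet below is a toy model only: `ToyColumn`, `ToyMetadata`, `check_schema`, and `nested_metadata` are hypothetical names, not pylibcudf or libcudf APIs, and the list column is simplified to a single element child. It assumes that a struct needs one field name per child, so flat (childless) metadata fails once a struct sits below the top level, while metadata built by walking the column's nesting passes.

```python
from dataclasses import dataclass, field


# Toy stand-ins (NOT the real pylibcudf classes): a column is a kind plus
# child columns, and metadata is a name plus per-child metadata.
@dataclass
class ToyColumn:
    kind: str  # "int64", "list", or "struct"
    children: list = field(default_factory=list)


@dataclass
class ToyMetadata:
    name: str = ""
    children_meta: list = field(default_factory=list)


def check_schema(col, meta):
    """Raise if a struct has a different number of field names than children."""
    if col.kind == "struct":
        if len(meta.children_meta) != len(col.children):
            raise RuntimeError(
                "Number of field names and number of children doesn't match"
            )
        child_metas = meta.children_meta
    else:
        # Assumption of this toy model: non-struct parents hand empty
        # metadata to any child that has none of its own.
        missing = len(col.children) - len(meta.children_meta)
        child_metas = meta.children_meta + [ToyMetadata()] * missing
    for child, child_meta in zip(col.children, child_metas):
        check_schema(child, child_meta)


def nested_metadata(col, name=""):
    """Build metadata whose shape mirrors the column's nesting."""
    return ToyMetadata(name, [nested_metadata(c) for c in col.children])


# list<struct<a: int64>>, with the list's offsets child omitted for brevity
list_of_struct = ToyColumn("list", [ToyColumn("struct", [ToyColumn("int64")])])

try:
    check_schema(list_of_struct, ToyMetadata())  # flat metadata
    flat_ok = True
except RuntimeError:
    flat_ok = False

check_schema(list_of_struct, nested_metadata(list_of_struct))  # nested metadata
nested_ok = True
```

In this model the flat metadata reaches the inner struct with zero field names for one child, which is the same shape of failure as the report above.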

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added the bug (Something isn't working), non-breaking (Non-breaking change), and pylibcudf (Issues specific to the pylibcudf package) labels Dec 4, 2024
@mroeschke mroeschke self-assigned this Dec 4, 2024
@mroeschke mroeschke requested a review from a team as a code owner December 4, 2024 02:17
@github-actions github-actions bot added the Python (Affects Python cuDF API.) label Dec 4, 2024
 def _table_to_schema(Table tbl, metadata):
     if metadata is None:
-        metadata = [ColumnMetadata() for _ in range(len(tbl.columns()))]
+        metadata = [_maybe_create_nested_column_metadata(col) for col in tbl.columns()]
     metadata = [ColumnMetadata(m) if isinstance(m, str) else m for m in metadata]
Contributor Author

I'm not sure what metadata is here but I suspect this could use the same treatment? cc @vyasr

Contributor

I think it is Arrow metadata. Can we change this into an if/else block? Like:

if metadata is None:
    metadata = [_maybe_create_nested_column_metadata(col) for col in tbl.columns()]
else:
    metadata = [_maybe_create_nested_column_metadata(m) if isinstance(m, str) else m for m in metadata]

Contributor Author

Ah thanks. Changed it to use this if/else

Contributor Author

So this didn't end up working.

In test_quantiles.py, pylibcudf.interop.to_arrow is currently called on a pylibcudf.Table with metadata set to the associated column names (i.e. a list of strings), so I suppose the prior usage was OK.
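
One plausible reading of why the if/else variant failed, sketched with toy stand-ins: `ToyColumn`, `ToyMetadata`, `metadata_from_column`, and `normalize` are hypothetical names, not the pylibcudf implementation. The assumption is that the helper walks a column's children, so it cannot accept a plain column name, whereas the original two-step form wraps strings directly.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMetadata:  # hypothetical stand-in for ColumnMetadata
    name: str = ""
    children_meta: list = field(default_factory=list)


@dataclass
class ToyColumn:  # hypothetical stand-in for a column with children
    children: list = field(default_factory=list)


def metadata_from_column(col):
    """Stand-in for a helper that expects a column, not a column name."""
    return ToyMetadata("", [metadata_from_column(c) for c in col.children])


def normalize(columns, metadata):
    """Two-step form: derive metadata from columns only when none is given,
    then wrap any plain string names."""
    if metadata is None:
        metadata = [metadata_from_column(c) for c in columns]
    return [ToyMetadata(m) if isinstance(m, str) else m for m in metadata]


columns = [ToyColumn(), ToyColumn()]
from_names = normalize(columns, ["a", "b"])  # caller passes column names
from_none = normalize(columns, None)         # caller passes nothing

# Routing a name through the column-based helper fails in this toy model,
# because a str has no `children` attribute.
try:
    metadata_from_column("a")
    helper_accepts_str = True
except AttributeError:
    helper_accepts_str = False
```

Under these assumptions, keeping the string-wrapping step separate lets callers that pass column names keep working.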

Contributor

@vyasr vyasr Dec 20, 2024

Ugh I'm sorry, too many notifications. I missed this one.

Yes, the metadata is an instance of https://github.com/vyasr/cudf/blob/chore/update_ci_jobs/cpp/include/cudf/interop.hpp#L104, which we expose in Python here. I'm not sure if the question here is still relevant, but in general the reason that we need this parameter is to handle nested struct names. Almost every other form of metadata passed this way is redundant. I'll be working to get rid of this in the future using pyarrow APIs if possible so that we have parameter-free conversion to/from Arrow.

Contributor Author

No worries. Based on your answer, I don't think this change needs any further modifications.

@mroeschke
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit d8f469f into rapidsai:branch-25.02 Dec 19, 2024
106 checks passed
@mroeschke mroeschke deleted the bug/plc/nested_to_arrow branch December 19, 2024 22:05
Labels
bug (Something isn't working), non-breaking (Non-breaking change), pylibcudf (Issues specific to the pylibcudf package), Python (Affects Python cuDF API.)
Projects
Status: Done
3 participants