[Bug] Fix Arrow-FS parquet reader for larger files #17099
Description
Follow-up to #16684
There is currently a bug in `dask_cudf.read_parquet(..., filesystem="arrow")` when the files are larger than the `"dataframe.parquet.minimum-partition-size"` config. More specifically, when the files are not aggregated together, the output will be `pd.DataFrame` instead of `cudf.DataFrame`.
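
A minimal sketch of the failure mode, assuming a local Parquet file at a hypothetical path `large.parquet` and shrinking the partition-size threshold so the file counts as "large" (i.e., no file aggregation happens):

```python
import dask
import cudf
import dask_cudf

# Lower the threshold so even a modest file exceeds it and the reader
# skips file aggregation -- the code path that triggered the bug.
with dask.config.set({"dataframe.parquet.minimum-partition-size": "1 MiB"}):
    ddf = dask_cudf.read_parquet("large.parquet", filesystem="arrow")

# Before this fix, the computed partitions came back as pandas
# DataFrames; with the fix they are cudf DataFrames as expected.
assert isinstance(ddf.partitions[0].compute(), cudf.DataFrame)
```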
Checklist