
Implement to_cudf method for reading directly into GPU memory #19

Open
wants to merge 1 commit into master

Conversation


@weiji14 commented Oct 11, 2020

Adds a to_cudf method to allow reading parquet files into cudf.DataFrame objects. Fixes #17.

Let me know if I should add more docstrings or tests, this PR is fairly minimal at this point.

@martindurant
Member

Is it not the case that the engine="cudf" keyword is actually ignored here, and it's only calling to_cudf that's important?
Passing that engine and then naively calling read() I suppose would error?

Furthermore, I think that as coded, you would get different behaviour for read/to_dask depending on whether you had called to_cudf first or not.

@weiji14
Author

weiji14 commented Oct 13, 2020

Is it not the case that the engine="cudf" keyword is actually ignored here, and it's only calling to_cudf that's important?
Passing that engine and then naively calling read() I suppose would error?

Good point. Perhaps the engine argument should be passed in directly when reading, i.e. as to_cudf(engine='cudf') or to_cudf(engine='pyarrow'), instead of as a more global kwarg. The same goes for read and to_dask, but this would be a breaking change, albeit one that might be necessary to offer more fine-grained control over reading parquet files.
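A per-call engine override might look like the following sketch. This is a standalone function for illustration only, not the PR's actual code: `urlpath` stands in for the source's path, and the `import cudf` is deferred so cudf stays optional.

```python
def to_cudf(urlpath, engine="cudf", **kwargs):
    """Hypothetical per-call engine override (illustrative sketch).

    cudf.read_parquet accepts engine='cudf' or engine='pyarrow', so the
    value is validated up front, before the heavy import happens.
    """
    if engine not in ("cudf", "pyarrow"):
        raise ValueError(
            f"cudf.read_parquet supports engine='cudf' or 'pyarrow', got {engine!r}"
        )
    import cudf  # deferred so cudf is only required when this path is used
    return cudf.read_parquet(urlpath, engine=engine, **kwargs)
```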

Furthermore, I think that as coded, you would get different behaviour for read/to_dask depending on whether you had called to_cudf first or not.

I'll need to test this out. Do you mean the reuse of the self._df variable?

@martindurant
Member

It feels like you can have either the engine kwarg or the to_cudf method, not both. Either could satisfy the requirement, and I think the former may be the simpler option.

do you mean the reuse of the self._df variable

Exactly, which is why I have a slight preference for setting the engine in __init__.

@weiji14
Author

weiji14 commented Oct 14, 2020

It feels like you can either have the engine kwargs or the to_cudf method, not both. Either of these could solve the requirement, and I think the former may be the simpler.

Yes, the former implementation of using only engine is simpler. However, engine="pyarrow" is available for both the cudf.read_parquet and pd.read_parquet readers, so what happens when engine="pyarrow" is chosen: do we read using pandas or cudf? These are the engines supported by the two readers:

| engine | cudf.read_parquet | pd.read_parquet |
| --- | --- | --- |
| cudf | ✔️ | |
| pyarrow | ✔️ | ✔️ |
| fastparquet | | ✔️ |

Maybe we should deprecate the 'engine' kwarg and pass engine in at the to_cudf or to_dask readers instead? But again, this would be a backward-incompatible change, so we might need to think this over a bit more.

@martindurant
Member

I would have engine=

- fastparquet (implies pandas),
- (py)arrow (implies pandas),
- cudf (implies pyarrow)

and then any of these can work with read() and to_dask(); the engine= parameter already has this meaning for pandas and Dask.
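That proposal can be sketched as a small lookup from the user-facing engine= value to the dataframe library plus the underlying parquet backend. The function name and return shape here are illustrative, not intake-parquet API:

```python
def resolve_engine(engine):
    """Map engine= to (dataframe library, parquet backend) per the proposal.

    Illustrative sketch only; intake-parquet does not define this function.
    """
    mapping = {
        "fastparquet": ("pandas", "fastparquet"),
        "pyarrow": ("pandas", "pyarrow"),
        "arrow": ("pandas", "pyarrow"),  # the "(py)arrow" spelling variant
        "cudf": ("cudf", "pyarrow"),
    }
    try:
        return mapping[engine]
    except KeyError:
        raise ValueError(
            f"unknown engine {engine!r}; expected one of {sorted(mapping)}"
        )
```

With this shape, read() and to_dask() can both branch on the first element of the tuple while handing the second to the underlying reader.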

@weiji14
Author

weiji14 commented Oct 23, 2020

and then any of these can work with read() and to_dask(); the engine= parameter already has this meaning for pandas and Dask.

I'm ok with this. However, the implementation will be harder, and the codebase will need to change significantly to handle this. Currently, the parquet dataset is loaded lazily via dask first:

```python
if self._df is None:
    self._df = self._to_dask()
```

and .compute() is called when using read to return a pandas.DataFrame:

```python
def read(self):
    """
    Create single pandas dataframe from the whole data-set
    """
    self._load_metadata()
    return self._df.compute()
```

In the cudf world, however, this would imply that dask_cudf is always needed, yet dask_cudf isn't a dependency most users would want unless they have more than one GPU.

So yes, we could go with engine='cudf', but this would involve a significant refactoring effort on the backend.
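One way such a refactor could avoid a hard dask_cudf dependency is to resolve the reader lazily, importing the heavy library only when the returned callable is actually invoked. This is a hedged sketch under that assumption, not the eventual implementation:

```python
def get_reader(engine):
    """Return a parquet-reading callable for the given engine.

    Imports are deferred into the callable, so neither cudf nor pandas
    is required just to construct a source. Sketch only.
    """
    if engine == "cudf":
        def read(path, **kwargs):
            import cudf  # only imported if the cudf path is actually used
            return cudf.read_parquet(path, engine="pyarrow", **kwargs)
    else:
        def read(path, **kwargs):
            import pandas as pd
            return pd.read_parquet(path, engine=engine, **kwargs)
    return read
```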

@martindurant
Member

Please remind me next week, and I can try: I'm sure we don't need to introduce cudf as a dependency in order to support the engine= keyword.

@weiji14
Author

weiji14 commented Oct 23, 2020

Sure, let's check next week.

Oh, and we definitely don't need to introduce cudf as a dependency in intake-parquet. What I want to avoid is that people need to install both dask_cudf and cudf in order to use this new functionality.

@martindurant
Member

I see. I don't know how dask/cudf are implemented internally, but it should be possible to get the base information without dask and then pass it to dask-cudf when calling to_dask. This is also a shortcoming of the non-cudf branch: dask is assumed in many places for various drivers.

@martindurant
Member

Completely forgot about this from so long ago, sorry. #28 does some similar work to make dask optional, so passing the engine through will work now. If there is still interest, that is.
Separately, we are considering completely pulling apart the "file type" and "backend reader" logic in Intake generally, which would lead to far more but much simpler reader classes.
