Dask-ml datasets.py related changes #36

Open
PeterDSteinberg opened this issue Oct 25, 2017 · 4 comments

PeterDSteinberg (Contributor) commented Oct 25, 2017

See dask-ml issue 67. When that issue is addressed, fix the dask-ml imports in xarray_filters and their usage in this repo's datasets.py. We had intended to cover that in #20 but needed to address dask-ml issue 67 first.

gbrener (Contributor) commented Oct 25, 2017

cc @gpfreitas

gbrener (Contributor) commented Oct 26, 2017

I introduced some TODOs into @gpfreitas's PR. In order to get the tests passing, we needed to:

  • change the priorities of dask_ml and scikit-learn methods
  • remove the chunks keyword argument from some of the functions

Here is the commit where the TODO locations were introduced: 5e3c756

Comments from @gpfreitas on the subject:

[Assuming that] at some point, you will switch wholesale from the sklearn.datasets.make_* functions to the dask_ml.datasets.make_* functions. The big pain in the code is trying to support both at the same time, depending on which one is implemented where.
(I was using a simple priority rule: if make_whatever is implemented in dask_ml, use that; otherwise, look in sklearn. You could extend the lookup to more packages by adding them to the sampling_source_packages list.)
But that creates pain in the tests, because some keyword arguments (like chunks) are required by the dask backend but are not valid in sklearn.
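A minimal sketch of that priority rule (sampling_source_packages is the name from the comment above; the find_sampling_func helper is hypothetical, not the actual code):

import importlib

# Packages searched in priority order; dask_ml wins when it implements the function.
sampling_source_packages = ['dask_ml.datasets', 'sklearn.datasets']

def find_sampling_func(name):
    for pkg in sampling_source_packages:
        try:
            module = importlib.import_module(pkg)
        except ImportError:
            continue  # package not installed; try the next one
        func = getattr(module, name, None)
        if func is not None:
            return func
    raise ImportError('%s not found in %s' % (name, sampling_source_packages))

make_blobs = find_sampling_func('make_blobs')
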
I wanted something that would be used like this:

from sklearn.datasets import make_blobs
from xarray_filters.datasets import Convert  # currently NpXyTransformer
X, y = make_blobs(n_features=..., n_samples=...)
ds = Convert(X, y).to_mldataset(arg1=.., arg2=...)  # or to_frame, to_array, etc.; type-specific args are visible in the signature of each `to_*` method

and if you want the data to come from dask, everything would be the same except the first line, which would become:

from dask_ml.datasets import make_blobs

Peter wants it to be more like this:

from xarray_filters.datasets import make_blobs
# use_dask=False selects the sklearn backend; you have to figure out which
# kwargs are valid, and that depends on `astype` as well
ds = make_blobs(n_features=..., n_samples=..., use_dask=True, astype='mldataset', **kwargs)

gpfreitas (Contributor) commented Oct 26, 2017

In the end, it seemed to me the best way forward was to use the API I wanted as a "low-level" API that is more strict (no *args, **kwargs), and to use that to implement the higher-level API that Peter wanted (through a wrapping function called _make_base in the code, which tries to generate functions with useful signatures and docstrings, at least in Python 3).
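A rough sketch of that wrapping idea (illustrative only, not the actual _make_base implementation; the to_* dispatch on astype is an assumption based on the NpXyTransformer methods mentioned above):

import functools
from sklearn.datasets import make_blobs as _sk_make_blobs
from xarray_filters.datasets import NpXyTransformer

def _make_base_sketch(sampling_func):
    # Wrap a low-level sampling function into a high-level make_* function.
    @functools.wraps(sampling_func)  # keeps the name and docstring (Python 3)
    def maker(*args, astype='mldataset', **kwargs):
        # Real code would also split kwargs between sampling and conversion.
        X, y = sampling_func(*args, **kwargs)
        to_method = getattr(NpXyTransformer(X, y), 'to_' + astype)  # to_mldataset, to_frame, ...
        return to_method()
    return maker

make_blobs = _make_base_sketch(_sk_make_blobs)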

The main issue at the API level here is that the keyword arguments of the various xarray_filters.datasets.make_* functions depend on two different things (even one would be confusing): the type of the underlying X, y arrays (dask vs. numpy) and the data structure you are converting to (MLDataset, xarray.Dataset, pandas.DataFrame, or dask.dataframe, given by what you pass to the astype keyword when calling xarray_filters.datasets.make_*).

Example:

make_blobs(..., chunks=2, astype='frame', columns=...)
  • chunks can only be used because the underlying arrays are dask arrays. It would error with NumPy arrays (which is what sklearn returns).
  • columns can only be used because we are saying astype='frame'; it's a dataframe-specific keyword argument. Whether the frame is pandas or dask makes no difference.

This is not exclusive to the dataframe conversion. Converting to MLDataset or xarray.Dataset objects has the same issue (naming dimensions, etc.)
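One way to make that two-way dependence explicit is a small validation table. This is a hypothetical sketch (the helper and the kwarg sets are illustrative, not from the actual code; 'dims' in particular is invented here):

# Which extra kwargs are legal depends on both the array backend and astype.
BACKEND_KWARGS = {'dask': {'chunks'}, 'numpy': set()}
ASTYPE_KWARGS = {'frame': {'columns'}, 'mldataset': {'dims'}, 'dataset': {'dims'}}

def check_kwargs(backend, astype, kwargs):
    allowed = BACKEND_KWARGS[backend] | ASTYPE_KWARGS.get(astype, set())
    extra = set(kwargs) - allowed
    if extra:
        raise TypeError('invalid keyword(s) %s for backend=%r, astype=%r'
                        % (sorted(extra), backend, astype))

check_kwargs('dask', 'frame', {'chunks': 2, 'columns': list('ab')})  # OK
# check_kwargs('numpy', 'frame', {'chunks': 2})  -> TypeError: chunks is dask-only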

This may be the best we can do to provide a simple one-liner for generating data, but I'm not sure. I don't have any ideas that are strictly better than what we have.

gbrener (Contributor) commented Oct 27, 2017

After speaking with @gpfreitas, it might make sense for us to refactor our approach, so that Python 2 compatibility and integration with dask-ml become simpler. Here are some ideas:

  • Remove NpXyTransformer in favor of conditional branches on the keyword arguments in make_blobs, make_classification, make_regression, etc...
  • Use the newly-simplified kwarg-handling logic to introduce use_dask=True (or whatever makes sense from an API standpoint); see the sketch after this list
  • Have the scikit-learn methods fall back to dask-ml's API when possible, instead of returning the wrapper function
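A minimal sketch of that use_dask branching, assuming the make_blobs signature stays close to scikit-learn's (the dispatch and astype handling are illustrative only):

def make_blobs(n_samples=100, n_features=2, use_dask=False, astype=None, **kwargs):
    # Illustrative dispatch only; real code would validate which kwargs are
    # legal per backend (e.g. chunks is dask-only) and per astype.
    if use_dask:
        from dask_ml.datasets import make_blobs as backend_make_blobs
    else:
        from sklearn.datasets import make_blobs as backend_make_blobs
    X, y = backend_make_blobs(n_samples=n_samples, n_features=n_features, **kwargs)
    if astype is None:
        return X, y
    raise NotImplementedError('conversion to %r would happen here' % astype)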
