Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Major fixes to datasets.py #20

Merged
merged 22 commits into from
Oct 26, 2017
Merged

[WIP] Major fixes to datasets.py #20

merged 22 commits into from
Oct 26, 2017

Conversation

gpfreitas
Copy link
Contributor

Addressing issue #17.

MLDatasets now returned by default for the make_* functions in
datasets.py.

When shape and n_samples are passed to a make_* function, shape takes
precedence.

Added tests to check the above.
@gpfreitas gpfreitas changed the title Minor fixes to datasets.py [WIP] Minor fixes to datasets.py Oct 3, 2017
Guilherme Pereira de Freitas added 9 commits October 3, 2017 16:07
Also minor fix: shape was taking precedence over n_samples only if
n_samples was supplied. That is fixed now.
For now, installing it under ./src, from pip, but we can change that
later.
1. Made the mechanism more robust; now all functions from
sklearn.datasets have been wrapped. But some of them may still fail at
runtime (and we may want to disregard them, unless we want the code to
be much messier for just 3 `make_*` (sampling) functions.

2. Easy to use just sklearn or just dask_ml functions or any priority
over them; if we end up with more libraries with `datasets.py` files, we
can just put the modules in a priority list, like `[dask_ml.datsets,
sklearn.datasets, my_new_module.datasets]` (earlier packages have
priority here). This is related to a new function
utils.get_first_matching_attribute.

3. Pointing to dask_ml functions instead of dask-glm

4. Opened an issue on dask_ml about introspection:
dask/dask-ml#58
Pluse some code cleanup due to the following issue getting fixed:

dask/dask-ml#58
The one error causing problem right now is that you can't do

df['y'] = ...something

When df is a dask dataframe backed by a dask array. So we need a
workaround here.
@gpfreitas gpfreitas changed the title [WIP] Minor fixes to datasets.py [WIP] Major fixes to datasets.py Oct 24, 2017
@gpfreitas
Copy link
Contributor Author

gpfreitas commented Oct 24, 2017

Right now (commit 08e99e3) the test suite passes if we use only functions from sklearn. You can check that yourself, by changing the line

_sampling_source_packages = [dask_ml.datasets, sklearn.datasets] # give priority to packages that come first

to _sampling_source_packages = [dask_ml.datasets, sklearn.datasets][1:] and rerunning the test suite (just pytest from the root of the repo).

Using the dask_ml backends (so using the code from that commit as written), we get some failures in the test suite (including unit and doctests):

============================= test session starts ==============================
platform darwin -- Python 3.5.4, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/gpfreitas/gh/ContinuumIO/xarray_filters, inifile: pytest.ini
plugins: cov-2.3.1
collected 22 items

xarray_filters/datasets.py ..FF
xarray_filters/reshape.py ...
xarray_filters/tests/test_datasets.py .
xarray_filters/tests/test_pipeline.py ..
xarray_filters/tests/test_reshape.py ........
xarray_filters/tests/test_ts_grid_tools.py FFFF

The failures in datasets.py all have to do with the fact that you can't do

df['y'] = ...something

to add a column 'y' when df is a dask dataframe backed by a dask array. So we need a
workaround here.

Many of the other failures (in test_ts_grid_tools.py) come from a missing chunks argument (as dask_ml is the default backend, and its sampling functions in dask_ml.datasets require a chunks argument).

@gpfreitas
Copy link
Contributor Author

Also, the original code for datasets.py assumed only sklearn.datasets functions would be used. That assumption is reflected in docstrings and variable names (like skl_sampler_func). Now that we use multiple backends, that should be changed.

@gpfreitas
Copy link
Contributor Author

gpfreitas commented Oct 24, 2017

The name NpXyTransformer is also not great: it's weird, and it doens't communicate the right assumptions anymore. It was originally initialized by a pair of X, y numpy arrays, but we were planning on having the class be initialized by any number of arrays with the same size in the first dimension, be them NumPy or dask arrays (it already works with dask arrays, except for the to_frame method, as mentioned above). Something like "DataConverter" would be nicer.

@gpfreitas
Copy link
Contributor Author

gpfreitas commented Oct 24, 2017

@PeterDSteinberg

I'd suggest merging this because all the functionality related to MLDatasets seems to work.

The remaining problems listed above could be addressed in other issues.

If we want tests to pass before merging, we could do the little change that makes it support just the sklearn.datasets functions.

@gbrener
Copy link
Contributor

gbrener commented Oct 25, 2017

Just fixed the outstanding merge conflicts after speaking to @gpfreitas .

@PeterDSteinberg
Copy link
Contributor

I made a reminder issue for us to come back and fix any temporary dask-ml fixes we do here:

#36

@gbrener gbrener merged commit be662be into master Oct 26, 2017
@gbrener gbrener deleted the make_funcs branch October 26, 2017 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants