[WIP] Major fixes to datasets.py #20

gpfreitas · 2017-10-03T20:10:47Z

Addressing issue #17.

MLDatasets now returned by default for the make_* functions in datasets.py. When shape and n_samples are passed to a make_* function, shape takes precedence. Added tests to check the above.

Also minor fix: shape was taking precedence over n_samples only if n_samples was supplied. That is fixed now.
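The precedence rule described above can be sketched as follows. This is only an illustration: `resolve_n_samples` and its `default` argument are hypothetical names, and treating the sample count as the product of the `shape` dimensions is an assumption, not something confirmed by the code in this PR.

```python
from functools import reduce
from operator import mul


def resolve_n_samples(shape=None, n_samples=None, default=100):
    """Hypothetical helper: decide how many samples to draw.

    Assumption: the sample count implied by `shape` is the product
    of its dimensions.
    """
    if shape is not None:
        # shape wins whether or not n_samples was also supplied
        return reduce(mul, shape, 1)
    if n_samples is not None:
        return n_samples
    return default


print(resolve_n_samples(shape=(4, 5), n_samples=7))
print(resolve_n_samples(n_samples=7))
```

Checking `shape` first, independently of `n_samples`, is what avoids the earlier behaviour where shape only took precedence if n_samples had also been passed.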

For now, installing it under ./src, from pip, but we can change that later.

1. Made the mechanism more robust: all functions from sklearn.datasets are now wrapped. Some of them may still fail at runtime, and we may want to disregard those unless we are willing to make the code much messier for just 3 `make_*` (sampling) functions.
2. It is now easy to use only sklearn functions, only dask_ml functions, or any priority order between them. If we end up with more libraries that have `datasets.py` files, we can just put the modules in a priority list, like `[dask_ml.datasets, sklearn.datasets, my_new_module.datasets]` (earlier packages have priority). This relies on a new function, utils.get_first_matching_attribute.
3. Pointing to dask_ml functions instead of dask-glm.
4. Opened an issue on dask_ml about introspection: dask/dask-ml#58
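A minimal sketch of the priority-list lookup, for readers who have not opened the diff. The signature and body below are guesses at what utils.get_first_matching_attribute might look like, not the code in this PR, and the `SimpleNamespace` objects are stand-ins for the real `dask_ml.datasets` and `sklearn.datasets` modules.

```python
from types import SimpleNamespace


def get_first_matching_attribute(sources, name):
    """Return attribute `name` from the first source that defines it.

    Assumed signature; earlier entries in `sources` take priority.
    """
    for source in sources:
        if hasattr(source, name):
            return getattr(source, name)
    raise AttributeError("no source defines {!r}".format(name))


# Stand-ins for dask_ml.datasets and sklearn.datasets (hypothetical contents).
fake_dask_ml = SimpleNamespace(make_blobs=lambda: "dask_ml blobs")
fake_sklearn = SimpleNamespace(
    make_blobs=lambda: "sklearn blobs",
    make_moons=lambda: "sklearn moons",
)

priority = [fake_dask_ml, fake_sklearn]
print(get_first_matching_attribute(priority, "make_blobs")())
print(get_first_matching_attribute(priority, "make_moons")())
```

With this shape, restricting the backends to one library is just a matter of slicing or reordering the list.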

Plus some code cleanup due to the following issue getting fixed: dask/dask-ml#58

Regards datasets.py

The one error causing problems right now is that you can't do `df['y'] = ...something` when df is a dask dataframe backed by a dask array. So we need a workaround here.

gpfreitas · 2017-10-24T17:04:03Z

Right now (commit 08e99e3) the test suite passes if we use only functions from sklearn. You can check that yourself by changing this line (xarray_filters/xarray_filters/datasets.py, line 571 in 08e99e3):

    _sampling_source_packages = [dask_ml.datasets, sklearn.datasets]  # give priority to packages that come first

to `_sampling_source_packages = [dask_ml.datasets, sklearn.datasets][1:]` and rerunning the test suite (just run `pytest` from the root of the repo).

Using the dask_ml backends (so using the code from that commit as written), we get some failures in the test suite (including unit and doctests):

============================= test session starts ==============================
platform darwin -- Python 3.5.4, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/gpfreitas/gh/ContinuumIO/xarray_filters, inifile: pytest.ini
plugins: cov-2.3.1
collected 22 items

xarray_filters/datasets.py ..FF
xarray_filters/reshape.py ...
xarray_filters/tests/test_datasets.py .
xarray_filters/tests/test_pipeline.py ..
xarray_filters/tests/test_reshape.py ........
xarray_filters/tests/test_ts_grid_tools.py FFFF

The failures in datasets.py all have to do with the fact that you can't do

df['y'] = ...something

to add a column 'y' when df is a dask dataframe backed by a dask array. So we need a
workaround here.

Many of the other failures (in test_ts_grid_tools.py) come from a missing chunks argument (as dask_ml is the default backend, and its sampling functions in dask_ml.datasets require a chunks argument).
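One possible way to handle the missing chunks argument is to forward it only to backends that accept it. The sketch below assumes the backend functions expose an inspectable signature, which may not hold for every wrapped function. `call_sampler`, `default_chunks`, and the two sampler functions are hypothetical stand-ins, not names from this PR.

```python
import inspect


def call_sampler(func, default_chunks, **kwargs):
    """Hypothetical helper: call `func`, passing `chunks` only if it takes one."""
    params = inspect.signature(func).parameters
    if "chunks" in params:
        # Backend needs chunks: use the caller's value or fall back to a default.
        kwargs.setdefault("chunks", default_chunks)
    else:
        # Backend has no chunks parameter: drop it so the call does not fail.
        kwargs.pop("chunks", None)
    return func(**kwargs)


# Stand-ins for an sklearn-style and a dask_ml-style sampling function.
def eager_sampler(n_samples=10):
    return ("eager", n_samples)


def chunked_sampler(n_samples=10, *, chunks):
    return ("chunked", n_samples, chunks)


print(call_sampler(eager_sampler, 50, n_samples=5, chunks=2))
print(call_sampler(chunked_sampler, 50, n_samples=5))
```

This keeps existing call sites that never pass chunks working regardless of which backend comes first in the priority list.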

gpfreitas · 2017-10-24T17:06:40Z

Also, the original code for datasets.py assumed only sklearn.datasets functions would be used. That assumption is reflected in docstrings and variable names (like skl_sampler_func). Now that we use multiple backends, that should be changed.

gpfreitas · 2017-10-24T17:11:30Z

The name NpXyTransformer is also not great: it's weird, and it doesn't communicate the right assumptions anymore. The class was originally initialized with a pair of X, y NumPy arrays, but we were planning to have it initialized with any number of arrays of the same size in the first dimension, whether NumPy or dask arrays (it already works with dask arrays, except for the to_frame method, as mentioned above). Something like "DataConverter" would be nicer.
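To make the proposed contract concrete, here is a rough sketch of a constructor that accepts any number of arrays and only checks that they agree in the first dimension. Only the name "DataConverter" comes from the comment above; the attributes and error messages are illustrative assumptions, and plain lists stand in for NumPy or dask arrays.

```python
class DataConverter:
    """Sketch: hold any number of arrays that share their first dimension."""

    def __init__(self, *arrays):
        if not arrays:
            raise ValueError("at least one array is required")
        n_rows = len(arrays[0])
        for position, array in enumerate(arrays[1:], start=1):
            if len(array) != n_rows:
                raise ValueError(
                    "array {} has {} rows, expected {}".format(
                        position, len(array), n_rows
                    )
                )
        self.arrays = arrays
        self.n_rows = n_rows


# Plain lists stand in for array-likes here.
converter = DataConverter([[0, 1], [2, 3], [4, 5]], [0, 1, 0])
print(converter.n_rows)
```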

gpfreitas · 2017-10-24T17:18:40Z

@PeterDSteinberg

I'd suggest merging this because all the functionality related to MLDatasets seems to work.

The remaining problems listed above could be addressed in other issues.

If we want tests to pass before merging, we could make the small change that restricts it to just the sklearn.datasets functions.

gbrener · 2017-10-25T20:46:57Z

Just fixed the outstanding merge conflicts after speaking to @gpfreitas.

PeterDSteinberg · 2017-10-25T22:12:01Z

I made a reminder issue for us to come back and revisit any temporary dask-ml workarounds we add here:

#36

Minor fixes to datasets.py

04b4dd6

MLDatasets now returned by default for the make_* functions in datasets.py. When shape and n_samples are passed to a make_* function, shape takes precedence. Added tests to check the above.

gpfreitas changed the title ~~Minor fixes to datasets.py~~ [WIP] Minor fixes to datasets.py Oct 3, 2017

Guilherme Pereira de Freitas added 9 commits October 3, 2017 16:07

Add tests for datasets.py

2aff675

Also minor fix: shape was taking precedence over n_samples only if n_samples was supplied. That is fixed now.

Add dask_glm to environment files.

9804e65

For now, installing it under ./src, from pip, but we can change that later.

Minor doc fix.

539ce77

Merge master into make_funcs

e5cf899

Replace dask-glm with dask-ml

bcba23f

Better interop with dask_ml.

fa29222

Plus some code cleanup due to the following issue getting fixed: dask/dask-ml#58

Tests pass again when wrapping sklearn funcs

dcbb064

Regards datasets.py

datasets works but for to_frame with dask

08e99e3

The one error causing problems right now is that you can't do `df['y'] = ...something` when df is a dask dataframe backed by a dask array. So we need a workaround here.

gpfreitas changed the title ~~[WIP] Minor fixes to datasets.py~~ [WIP] Major fixes to datasets.py Oct 24, 2017

Guilherme Pereira de Freitas and others added 2 commits October 25, 2017 15:35

Remove some code that should not be there

3e3f3c7

Merge branch 'master' into make_funcs

fbfa304

gbrener added 4 commits October 25, 2017 16:23

Get tests passing

5e3c756

Get build passing for Python 3

0ca8eeb

Add recipe-fetch script, until dask-ml is merged into conda-forge

d02b705

Minor tweak to gh url, so that changes get picked up

4af3917

PeterDSteinberg mentioned this pull request Oct 25, 2017

Dask-ml datasets.py related changes #36

Open

gbrener added 5 commits October 25, 2017 17:23

Fix script permissions, make some cleanups

e675717

Add requests and jinja2 to before_script section

70da5ab

Update destination pathname so conda-build finds dask-ml recipe

709fdc0

Update .travis.yml

454e01a

Make fetch script Python 2.7 compatible

8950786

Fix py27 test failure

47fc255

gbrener merged commit be662be into master Oct 26, 2017

gbrener deleted the make_funcs branch October 26, 2017 14:48
