[WIP] Major fixes to datasets.py #20
Conversation
The `make_*` functions in `datasets.py` now return `MLDataset` objects by default. When both `shape` and `n_samples` are passed to a `make_*` function, `shape` takes precedence. Added tests to check the above.
Also a minor fix: previously, `shape` took precedence over `n_samples` only when `n_samples` was explicitly supplied. That is fixed now.
For now, it is installed under `./src` via pip, but we can change that later.
1. Made the mechanism more robust: all functions from `sklearn.datasets` are now wrapped. Some of them may still fail at runtime, and we may want to disregard those, unless we want the code to be much messier for just 3 `make_*` (sampling) functions.
2. It is now easy to use just `sklearn` functions, just `dask_ml` functions, or any priority order over them. If we end up with more libraries providing `datasets.py` modules, we can just put the modules in a priority list, like `[dask_ml.datasets, sklearn.datasets, my_new_module.datasets]` (earlier modules have priority here). This is handled by a new function, `utils.get_first_matching_attribute`.
3. Pointing to `dask_ml` functions instead of `dask-glm`.
4. Opened an issue on `dask_ml` about introspection: dask/dask-ml#58
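A minimal sketch of how such a priority lookup might work (an assumed implementation of `utils.get_first_matching_attribute`, not the actual code; the stdlib modules below merely stand in for `dask_ml.datasets` and `sklearn.datasets`):

```python
import cmath
import math

def get_first_matching_attribute(modules, name):
    """Return `name` from the first module in `modules` that defines it.

    Earlier modules in the list take priority; raise AttributeError
    if no module defines the attribute.
    """
    for module in modules:
        if hasattr(module, name):
            return getattr(module, name)
    raise AttributeError("%r not found in any of %r" % (name, modules))

# math comes first in the priority list, so math.sqrt wins over cmath.sqrt
sqrt = get_first_matching_attribute([math, cmath], "sqrt")
assert sqrt is math.sqrt
```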
Plus some code cleanup enabled by the following issue getting fixed: dask/dask-ml#58
Regarding `datasets.py`:
The one error causing problems right now is that you can't do `df['y'] = <something>` when `df` is a Dask DataFrame backed by a Dask array, so we need a workaround here.
Right now (commit 08e99e3) the test suite passes if we use only functions from `sklearn`. You can check that yourself by changing line 571 of `xarray_filters/xarray_filters/datasets.py` at commit 08e99e3. Using the `dask_ml` backends (i.e., the code from that commit as written), we get some failures in the test suite (including unit tests and doctests):
The failures in […] to add a column. Many of the other failures (in […])
Also, the original code for […]
The name […]
I'd suggest merging this because all the functionality related to MLDatasets seems to work. The remaining problems listed above could be addressed in other issues. If we want tests to pass before merging, we could make the little change that restricts support to just the `sklearn.datasets` functions.
Just fixed the outstanding merge conflicts after speaking to @gpfreitas.
I made a reminder issue for us to come back and fix any temporary dask-ml fixes we do here:
Addressing issue #17.