Default arguments for xarray_filters.datasets.make_* functions #17

Open
4 of 7 tasks
PeterDSteinberg opened this issue Sep 19, 2017 · 4 comments
@PeterDSteinberg
Contributor

PeterDSteinberg commented Sep 19, 2017

@gpfreitas This is related to issues #5 and #6 and tries to condense them into a TODO list.

Items to do related to the argument specs of make_* functions from xarray_filters.datasets:

  • Make MLDataset be the default return value rather than Dataset
  • Remove the requirement for the n_samples argument when shape is given, e.g. in MLDataset(make_blobs(n_samples=2000, shape=(200,10))), where n_samples can be taken from shape
  • For functions that also exist in dask_glm, e.g. make_classification, we should default to making an MLDataset as xarray_filters.datasets does so far, but use dask_glm's functions to put a dask.array in each DataArray rather than the numpy-based sklearn.datasets approach.
    • Provide a use_dask_glm=True keyword to control whether the functions in dask_glm.datasets are used.
  • Change the sequence of acceptable strings for astype to the following (or equivalent way of specifying the data structures below as the output type):
    ('pandas.dataframe', 'dask.array', 'dask.dataframe', 'numpy.ndarray', 'dataset', 'mldataset')
  • xnames should be layers
  • docstring edits: see below. This is the current docstring for make_blobs from xarray_filters. I think it needs to explain more of the transformation part, e.g. that it typically outputs N-D DataArrays in an MLDataset, and any differences between sklearn and xarray_filters such as n_samples versus shape:
In [3]: ?make_blobs
Signature: make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, *, astype='dataset', **kwargs)
Docstring:
Like sklearn.datasets.samples_generator.make_blobs, but with added functionality.

Parameters
---------------------
Same parameters/arguments as sklearn.datasets.samples_generator.make_blobs, in addition to the following
keyword-only arguments:

astype: str
    One of ('array', 'dataframe', 'dataset', 'mldataset') or None to return an NpXyTransformer. See documentation
    of NpXyTransformer.astype.

**kwargs: dict
    Optional arguments that depend on astype. See documentation of
    NpXyTransformer.astype.

Note: where I said dask_glm above, also look at dask-ml.
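
To make the n_samples item in the list above concrete, here is a minimal sketch of how the sample count could be derived from shape. The helper name resolve_n_samples is hypothetical and not part of xarray_filters, and letting shape take precedence when both arguments are given is an assumption of this sketch. The default of 100 mirrors the n_samples default in the signature shown above.

```python
from math import prod


def resolve_n_samples(n_samples=None, shape=None, default=100):
    """Hypothetical helper: decide how many samples a make_* call should draw."""
    if shape is not None:
        # The sample count is the product of the dimension sizes,
        # e.g. shape=(200, 10) implies 200 * 10 = 2000 samples.
        # Assumption: shape wins if n_samples is also given.
        return prod(shape)
    if n_samples is not None:
        return n_samples
    # Neither argument given: fall back to the default sample count.
    return default
```

With this sketch, resolve_n_samples(shape=(200, 10)) returns 2000, so the explicit n_samples=2000 in the example call would be redundant.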

@PeterDSteinberg
Contributor Author

PeterDSteinberg commented Sep 22, 2017

Other TODOs I need to add:

  • Ensure that the dimension names can be controlled, i.e. that dims can be named x, y, z, t rather than getting the default names dim_0, dim_1.

@gpfreitas
Contributor

gpfreitas commented Oct 3, 2017

  • MLDataset default: check
  • no need for n_samples when shape passed: check (I chose to let shape override n_samples)
  • layers instead of xnames: check

I think letting shape be a dict should be enough for letting the user customize dimension names.
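
As a rough illustration of the dict idea, the sketch below splits a shape spec into dimension names, sizes, and the implied sample count. The helper split_shape is hypothetical and only shows one way the keys of a dict could become dimension names; the actual handling in xarray_filters may differ.

```python
from math import prod


def split_shape(shape):
    """Hypothetical helper: return (dims, sizes, n_samples) for a shape spec."""
    if isinstance(shape, dict):
        # Keys become dimension names; dicts keep insertion order (Python 3.7+).
        dims = tuple(shape)
        sizes = tuple(shape.values())
    else:
        # A plain sequence gets the default names dim_0, dim_1, ...
        sizes = tuple(shape)
        dims = tuple(f"dim_{i}" for i in range(len(sizes)))
    return dims, sizes, prod(sizes)
```

Here split_shape({"x": 200, "y": 10}) gives (("x", "y"), (200, 10), 2000), while split_shape((200, 10)) falls back to the names ("dim_0", "dim_1").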

So, what's left is the harder part:

  • support dask data structures, see dask-glm
  • change the sequence of acceptable strings for astype (already supported in master)

For astype, @PeterDSteinberg, we should leave the to_* methods intact, right? So, passing astype='numpy.ndarray' would call XyTransformer.to_array. Sounds good?
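
A minimal sketch of that dispatch, assuming a simple lookup table from astype strings to method names. XyTransformerSketch is a stand-in class, not the real XyTransformer. Only the 'numpy.ndarray' to to_array pairing comes from this thread; the other method names are guesses, and the dask entries are left out because that support is still pending.

```python
class XyTransformerSketch:
    """Stand-in for XyTransformer; only the astype dispatch is illustrated."""

    # Assumed mapping from astype strings to existing to_* method names.
    _ASTYPE_METHODS = {
        "numpy.ndarray": "to_array",
        "pandas.dataframe": "to_dataframe",
        "dataset": "to_dataset",
        "mldataset": "to_mldataset",
    }

    # Placeholder conversions: each returns a tag instead of real data.
    def to_array(self, **kwargs):
        return ("array", kwargs)

    def to_dataframe(self, **kwargs):
        return ("dataframe", kwargs)

    def to_dataset(self, **kwargs):
        return ("dataset", kwargs)

    def to_mldataset(self, **kwargs):
        return ("mldataset", kwargs)

    def astype(self, to_type, **kwargs):
        # None returns the transformer itself, as the docstring describes.
        if to_type is None:
            return self
        try:
            method_name = self._ASTYPE_METHODS[to_type.lower()]
        except KeyError:
            valid = ", ".join(sorted(self._ASTYPE_METHODS))
            raise ValueError(
                f"Unknown astype {to_type!r}; expected one of: {valid}"
            ) from None
        # The to_* methods stay intact; astype only forwards to them.
        return getattr(self, method_name)(**kwargs)
```

Keeping the lookup in one table means new strings such as 'dask.array' could be added later without touching the to_* methods.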

@gpfreitas
Contributor

Working on the dask-glm support.

@PeterDSteinberg
Contributor Author

Note the dask-ml / dask-glm related work is being addressed in a separate issue: #36
