-
-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make automatic centering in PCA methods optional #808
base: main
Are you sure you want to change the base?
Conversation
Once the maintainers confirm approval of the proposed changes, I'll be happy to update the |
dask_ml/decomposition/pca.py
Outdated
warnings.warn( | ||
"Whitening requires centering. Please, ensure that your data " | ||
"is already centered, in order to avoid unexpected behavior.", | ||
RuntimeWarning, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users should have some way to silence this warning. I'm inclined to just document this requirement, and trust that the user has done it (no warning).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll remove the warning (the new param has been already added to the doc-string). Please, let me know if any other documentation-related changes are necessary. I've checked api.rst
and compose.rst
but it doesn't seem that we need any updates to either.
From the latest CI runs (e.g., this job). black, version 19.10b0
would reformat /home/vsts/work/1/s/tests/test_pca.py
Oh no! 💥 💔 💥
1 file would be reformatted, 99 files would be left unchanged. Oops - pretty sure that I did this check locally. Will re-run Update: This is strange. I'll check if my local branch has actually propagated, and is the same as what the Azure pipeline has picked up. Result of ./ci/code_checks.sh run locally from the root dir of `dask-ml`:Checking flake8...
Checking flake8... DONE
Checking black...
black, version 20.8b1
would reformat /git/dask-ml/dask_ml/model_selection/utils_test.py
would reformat /git/dask-ml/tests/ensemble/test_blockwise.py
would reformat /git/dask-ml/tests/model_selection/test_keras.py
would reformat /git/dask-ml/tests/model_selection/test_hyperband.py
would reformat /git/dask-ml/tests/model_selection/dask_searchcv/test_model_selection.py
would reformat /git/dask-ml/tests/model_selection/test_incremental.py
Oh no! 💥 💔 💥
6 files would be reformatted, 94 files would be left unchanged.
Checking black... DONE
Checking isort...
5.6.4
ERROR: /git/dask-ml/tests/test_spectral_clustering.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/test_incremental_pca.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/linear_model/test_glm.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/model_selection/dask_searchcv/test_model_selection_sklearn.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/docs/dimensions.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/dask_ml/_compat.py Imports are incorrectly sorted and/or formatted.
... Update (2): Seems that it's the same code version, both locally and remotely. Locally:
Remotely: The most recent commit (and only one since Tom's review): 01dd9d4. I'll try reproducing Update (3): Victory! It was indeed due to the different local version of |
266ae3f
to
4c82add
Compare
Hi @TomAugspurger - I'm writing to check if you've had a chance to look into the further changes that I pushed since your review, and whether you'd like any additional refinements to be done on top of that. |
It'll be a week or so before I can take a look. Thanks for your patience! In the meantime, perhaps @jameslamb has a chance to glance through this (no worries if you don't though)? |
It's all good - just wanted to check if this is still on your radar. I'm happy to wait as long as necessary, and incorporate potential further feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some time to go through and review today. Overall, I'm in favor of the change and I can understand how it could be a nice optimization for experienced users.
Please consider some of the review comments I left.
I also think it would improve understanding for people who find this pull request from search or see it in a changelog if you change the PR title to something like "Make automatic centering in PCA methods optional". By itself, the fact that there is now a parameter called center
doesn't tell you much about the functional benefit of the changes in this pull request.
@@ -127,6 +127,7 @@ def __init__( | |||
self, | |||
n_components=None, | |||
whiten=False, | |||
center=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introducing new arguments in the middle of a signature is a breaking change for anyone using these classes with positional arguments. Can you please put this new argument at the end of the signature instead of in the middle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hristog you marked this as resolved, but I don't think it is yet. Could you put copy
back between whiten
and batch_size
and make center
the last keyword argument?
dask_ml/decomposition/pca.py
Outdated
center : bool, optional (default True) | ||
When False (True by default), the underlying data gets centered at zero | ||
by subtracting the mean of the data from the data itself. | ||
|
||
PCA is performed on centered data due to its being a regression model, | ||
without an intercept. As such, its pricipal components originate at the | ||
origin of the transformed space. | ||
|
||
`center` set to False may be employed when performing PCA on already | ||
centered data. | ||
|
||
Since centering is a required step as part of whitening, `center` set | ||
to False and `whiten` set to True is a combination which may result in | ||
unexpected behavior, if performed on not previously centered data. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
center : bool, optional (default True) | |
When False (True by default), the underlying data gets centered at zero | |
by subtracting the mean of the data from the data itself. | |
PCA is performed on centered data due to its being a regression model, | |
without an intercept. As such, its pricipal components originate at the | |
origin of the transformed space. | |
`center` set to False may be employed when performing PCA on already | |
centered data. | |
Since centering is a required step as part of whitening, `center` set | |
to False and `whiten` set to True is a combination which may result in | |
unexpected behavior, if performed on not previously centered data. | |
center : bool, optional (default True) | |
When True (the default), the underlying data gets centered at zero | |
by subtracting the mean of the data from the data itself. | |
PCA is performed on centered data due to its being a regression model, | |
without an intercept. As such, its principal components originate at the | |
origin of the transformed space. | |
``center=False`` may be employed when performing PCA on already | |
centered data. | |
Since centering is a required step as part of whitening, ``center`` set | |
to False and ``whiten`` set to True is a combination which may result in | |
unexpected behavior, if performed on not previously centered data. | |
I believe there are a few mistakes in this documentation as it's currently written.
- two backticks should be used for inline code formatting, since these docstrings are interpreted as reStructuredText by Sphinx
- the docstring should say that the default behavior (
center=True
) is to automatically center the data (it currently says the opposite)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the docstring should say that the default behavior (center=True) is to automatically center the data (it currently says the opposite)
Well spotted, thanks!
Updating as per the double backtick suggestion. I'll update the rest of the affected docstrings, to conform to the same guideline (unless we'd like to do that in a separate, dedicated PR?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I was a maintainer here, I'd prefer that a docstring-only change be made in a separate pull request, and that this comment only apply to the lines you are already adding / changing as of this pull request
dask_ml/decomposition/pca.py
Outdated
@@ -149,21 +164,28 @@ class PCA(sklearn.decomposition.PCA): | |||
>>> dX = da.from_array(X, chunks=X.shape) | |||
>>> pca = PCA(n_components=2) | |||
>>> pca.fit(dX) | |||
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None, | |||
svd_solver='auto', tol=0.0, whiten=False) | |||
PCA(n_components=2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why were these changes necessary? They seem unrelated to the current pull request. Could you please revert them or explain why they're necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It really depends whether those are considered to be demonstrative examples (in doctest style) or actual doctests which are meant to pass.
If the case is the latter, then the proposed changes should probably still hold.
For example, in their original state, these doctests fail when run with:
py.test --verbose --doctest-modules dask_ml/
If, on the other hand, they're just meant to serve for demonstration only (section 15 of the numpydoc
standard), then I'll be definitely happy to revert back these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is adding one new parameter with a default value set to the current behavior. I wouldn't expect any tests to fail, and if they are I'd like to know more about what failed (a link to logs, if you have it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the doctests get run as part of the CI steps, and I imagine the doctests haven't been updated for quite some time (haven't checked with git blame
) or the intention has never been that they get run as actual first-class-citizen test (but, instead, serve for demonstration).
E.g., when running against the latest main
:
$ py.test --verbose --doctest-modules dask_ml/
...
dask-ml/dask_ml/wrappers.py:92: DocTestFailure
___________________ [doctest] dask_ml.decomposition.pca.PCA ____________________
142
143 Examples
144 --------
145 >>> import numpy as np
146 >>> import dask.array as da
147 >>> from dask_ml.decomposition import PCA
148 >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
149 >>> dX = da.from_array(X, chunks=X.shape)
150 >>> pca = PCA(n_components=2)
151 >>> pca.fit(dX)
Expected:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
Got:
PCA(n_components=2)
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh, ok I see. I'm not too familiar with that type of test, apologies. But it does seem unrelated to the change in this pull request, so I think it would be better to revert those changes here and open a separate issue / PR about the general topic of "doctest tests are failing / not run"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall if we run doctests here or not. I wouldn't worry about that failing one. Scikit-Learn changed the repr of their objects to only show the parameters that are not the default, so the docstring changed.
self.copy = copy | ||
self.batch_size = batch_size | ||
self.svd_solver = svd_solver | ||
self.iterated_power = iterated_power | ||
self.random_state = random_state | ||
|
||
def _check_params(self): | ||
super()._check_params() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of defining an empty _check_params()
on the parent class and then calling it here? In my opinion, that is adding a bit of additional indirection for no functional benefit. Could you please remove this line and the parent class's empty _check_params()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose is to future-proof this by ensuring any checks that get eventually introduced in the superclass, get considered at this level as well.
Happy to remove the perceived as redundant check, at the current stage. Just my opinion is that it introduces a bit more exposure to bugs, if the class hierarchy doesn't get utilized in cases like this. But it may as well be done at a further stage.
Thanks for your review, @jameslamb! I'll try to address your comments and make changes, wherever applicable, ASAP. |
center
param to PCA (#734)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that https://github.com/dask/dask-ml/pull/808/files#r615299186 is the only outstanding issue. Good to go once that's fixed.
b2689d3
to
d241040
Compare
black
/pyflakes
passing: Yes.tests/test_pca.py
tests/test_incremental_pca.py
IncrementalPCA
doesn't get affected by the new parameter andIncrementalPCA(..., center=False ,...)
is unsupported.whiten=True
does not imply, nor enforcecenter=True
. Instead it warns about possible unexpected behavior, if the underlying data hasn't been centered already.