
Scikitlearn clustering methods cleanup #209

Open
ablaom opened this issue Mar 12, 2020 · 2 comments
Labels
help wanted Extra attention is needed

Comments

@ablaom
Member

ablaom commented Mar 12, 2020

I think these methods could do with a review. Here are a few things that look like issues to me:

  • Some models do not implement a `transform` method (presumably because Python scikit-learn does not) but could, no? Example: DBSCAN

  • Recall that MLJ makes a distinction between `report` and `fitted_params`: the latter is for the learned parameters (in this case, what is needed to assign a new observation to a class), and everything else goes in `report`. It seems that in the scikit-learn clustering wrappers everything is just lumped into `fitted_params`. In particular, this has led to inconsistency with the Clustering.jl models KMeans and KMedoids (which separate things correctly, as far as I can tell). A sketch of the intended split follows this list.
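
For concreteness, here is a minimal sketch of the intended split, using the Clustering.jl wrap of KMeans (the field names mentioned in the comments are illustrative only):

```julia
using MLJ  # assumes the Clustering.jl wrappers are registered with MLJ

X = table(rand(100, 3))               # any Tables.jl-compatible data
KMeans = @load KMeans pkg=Clustering
mach = machine(KMeans(k=3), X)
fit!(mach)

fitted_params(mach)  # learned parameters only, e.g. the cluster centers
report(mach)         # everything else, e.g. assignments and training costs
```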

cc: @tlienart

@ablaom ablaom added the help wanted Extra attention is needed label Mar 12, 2020
@tlienart
Collaborator

tlienart commented Mar 13, 2020

> Some models do not implement a `transform` method (presumably because Python scikit-learn does not) but could, no? Example: DBSCAN

Hmm, I think this requires a bit more thought. For instance, what would you suggest the transform of DBSCAN correspond to? DBSCAN doesn't have centroids etc.; you run the algorithm and you get an "answer" in the form of labels. So the only meaningful thing I can see here is to map `transform` to `predict`, which just returns those labels (see the sketch below). But I think this is confusing, and I would prefer to keep the distinction. What do you think?
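
To make that concrete, the aliasing would amount to no more than something like this (`SKDBSCAN` is a stand-in name, not the actual wrapper type):

```julia
import MLJModelInterface
const MMI = MLJModelInterface

# Stand-in for the scikit-learn DBSCAN wrapper type (hypothetical):
mutable struct SKDBSCAN <: MMI.Unsupervised
    eps::Float64
end

# The mapping under discussion: `transform` would simply forward to
# `predict`, returning the same labels under a second name.
MMI.transform(m::SKDBSCAN, fitresult, X) = MMI.predict(m, fitresult, X)
```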

> Recall that MLJ makes a distinction between `report` and `fitted_params`: the latter is for the learned parameters (in this case, what is needed to assign a new observation to a class), and everything else goes in `report`. It seems that in the scikit-learn clustering wrappers everything is just lumped into `fitted_params`. In particular, this has led to inconsistency with the Clustering.jl models KMeans and KMedoids (which separate things correctly, as far as I can tell).

Yes, so what I did for all sklearn models is to dump into `fitted_params` the output of the `fit` method of scikit-learn. I think it makes sense to stay close to scikit-learn, so that people know where to find things, but granted this is probably not ideal with MLJ's internal `report`/`fitted_params` split.

I'm not very happy with this separation between `report` and `fitted_params` (sorry for bringing this back up). I understand their purpose, but in practice I'd only want to bother with one of them. Can I maybe suggest that `fitted_params` be part of the `report`? I.e., that all of the fields of `fitted_params` also be returned in `report` (a sketch follows below)? The big advantage is that it makes things easier to find: you can just tell people to always use `report`, while internally we use `fitted_params`.
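
At the machine level the idea is just a merge of two named tuples (a sketch only; `report_with_params` is a hypothetical name):

```julia
using MLJ

# Both `report(mach)` and `fitted_params(mach)` return named tuples, so the
# combined view is a one-liner (fields of `fitted_params` win on clashes):
report_with_params(mach) = merge(report(mach), fitted_params(mach))
```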

Anyway, back on topic: (1) I need your thoughts first; (2) the refactoring is doable but will be annoying, and I'm wondering whether adding `fitted_params` to the `report` might not just make our lives generally easier.

@ablaom
Member Author

ablaom commented Mar 15, 2020

@tlienart You are right. DBSCAN is not like KMeans clustering; I stand corrected. However, I do wonder if the sk-learn way of conceptualising this class of clustering problems is the most useful. Might it not be better to view DBSCAN as a static transformer that transforms features X into labels y (or perhaps into labelled features (X, y))? So there is no `fit` method, or rather, `fit` in our implementation would be a no-op (a sketch follows below). (Perhaps this is what you are also suggesting above?) The problem with this is that there is then no interface point for returning any other information about the transformation procedure, apart from the labels. What are your thoughts?
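
Assuming MLJModelInterface's `Static` model type, the static version might look roughly like this (`cluster_labels` is a hypothetical stand-in for the actual DBSCAN computation):

```julia
import MLJModelInterface
const MMI = MLJModelInterface

mutable struct StaticDBSCAN <: MMI.Static
    radius::Float64
end

# Hypothetical stand-in for the real DBSCAN routine:
cluster_labels(X, radius) = error("stand-in for the actual computation")

# No fit to speak of: for Static models the MLJModelInterface fallback
# makes fit a no-op, and transform maps the features X directly to labels.
MMI.transform(m::StaticDBSCAN, ::Nothing, X) = cluster_labels(X, m.radius)
```

Note that `transform` here returns only the labels, which is exactly the missing-interface-point problem just mentioned.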

> I'm not very happy with this separation between `report` and `fitted_params` (sorry for bringing this back up). I understand their purpose, but in practice I'd only want to bother with one of them. Can I maybe suggest that `fitted_params` be part of the `report`? I.e., that all of the fields of `fitted_params` also be returned in `report`?

There was quite a bit of discussion in deciding upon this design, where, in my view, good points were made for separating the two interface points. I think there would need to be a pretty strong argument for changing the design now. I do think the name `report` is unfortunate, as it is a bit vague; in retrospect, `extras` would have been better. Then people would be less likely to look for learned parameters there.
