Discussion: Outlier Detection API in MLJ #780
@davnn Thanks indeed for looking into this. I should like to help out with this discussion but currently am on leave until 10th May. More then.
Okay, OutlierDetection.jl looks like a substantial piece of work! We should definitely make a nice MLJ API to handle it. Let me start by acknowledging a super effort here to wrap your head around the MLJ API. I realize this takes a bit of work, and I can see the evidence of this in your code. I am not so familiar with OD, so I will start with a couple of naive questions.
Right, that's how I would differentiate outlier detection from rare class classification.
Using scores enables interesting ensemble learning approaches, e.g. (1) being able to use different detectors on the same data and (2) being able to use different detectors on different data. I think it's best to show examples. First prepare some example data.

using MLJ # or MLJBase
using OutlierDetection # installed from master

X, train, test = rand(1000, 10), 1:500, 501:1000;
nodemachine(model, X) = transform(machine(model, X), X)

Example for use case (1), multiple detectors, single data:

Xs = source(table(X));
scores_knn = map(k -> nodemachine(KNN(k=k), Xs), [5, 10, 15])
scores_lof = map(k -> nodemachine(LOF(k=k), Xs), [5, 10, 15])
network = transform(machine(Binarize()), scores_knn..., scores_lof...)
fit!(network, rows=train)
network(rows=test)

Example for use case (2), multiple detectors, multiple data:

Xs1 = source(table(X[:,1:5]))
Xs2 = source(table(X[:,6:10]))
scores_knn = map(k -> nodemachine(KNN(k=k), Xs1), [5, 10, 15])
scores_lof = map(k -> nodemachine(LOF(k=k), Xs2), [5, 10, 15])
network = transform(machine(Binarize()), scores_knn..., scores_lof...)
fit!(network, rows=train)
network(rows=test)

I would say both use cases are quite common, but there are, as far as I know, no libraries that support those use cases nicely (also in other programming languages).
That would certainly be a nice way to approach the problem. I have to read a bit more about the topic as there are many possible ways to create such a probabilistic scoring function, e.g. unify, all with their pros and cons.

Edit: What do you think about

Edit2: Is
Not very deep - I will definitely look into it more thoroughly. Generally, it appears that methods like
I experimented a bit with different strategies regarding scoring and evaluation. Regarding (3):
Thus, still not really sure how to proceed with this API. Regarding (4):
Thanks for these further explanations and for your continued patience. I do think we are making progress and I have a few ideas about the API. However, allow me to press you on some details that I don't quite understand yet.
This is very helpful. I'm a little confused still why you need the threshold to construct the probability scoring rule. Indeed, in your vanilla implementation, it seems the

If I'm missing the point here (quite likely) that's fine. If however this is easily explained, this would be helpful. More important is that I really understand your use cases for separating the raw-scoring from the classification. Is the idea in use case (1) that you want to pool the raw training scores for multiple detectors, and simultaneously generate vector-valued scores for each new observation - one element per detector? And that your

Regarding just case (2) (multiple data), I'm not sure this can be immediately fit into the learning networks scheme, unless you begin with a step that splits some super data table
Very sorry, it's
Whether or not this becomes an interface we use, it is possible to make this work in the learning network context. The idea is that instead of using MLJ
EvoTreesRegressor = @iload EvoTreeRegressor
# helper to extract the names of the 2 most important features from a
# `report`:
function important_features(report)
importance_given_feature = report.feature_importances
names = first.(sort(collect(importance_given_feature), by=last, rev=true))
return Symbol.(names[1:2])
end
# model and dummy data:
rgs = EvoTreesRegressor()
X, y = @load_boston
# # LEARNING NETWORK
Xs = source(X)
ys = source(y)
# define a node returning best feature names:
mach = machine(rgs, Xs, ys)
r = node(report, mach) # now `r()` returns a report
features = node(important_features, r)
# define a node returning the reduced data:
Xreduced = node((X, features) -> selectcols(X, features), Xs, features)
# build a new machine based on reduced data:
mach2 = machine(rgs, Xreduced, ys)
# define the prediction node for model trained on reduced data:
yhat = predict(mach2, Xreduced)
fit!(yhat)
yhat()
Mmm. Now that I think about it, pooling scores from different detectors doesn't make sense. I'm pretty confused about what it is you want here. How do I interpret this signature
Hi again, sorry for the late reply, life is quite busy at the moment.
You're right, you would only have to specify a normalization strategy in advance; the threshold would only be necessary when converting the scores to classes, and the mentioned
In the current API that's the case, yes.
I think multiple data was the wrong formulation; in the end it's about splitting the features of a single dataset across multiple detectors, which admittedly is mainly useful in specific use cases. In the code example, I learn multiple KNN detectors for features 1 to 5, and multiple LOF detectors for features 6 to 10, thus the output should be for the whole dataset, not one feature subset. I think this feature, however, does not need any special API as it appears to be working quite well already.
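To make the feature-splitting idea concrete, here is a minimal sketch (the `select_features` helper is hypothetical; `source`, `node`, `table` and `selectcols` are used as in the examples above) of feeding two feature subsets of one dataset to different detectors:

```julia
using MLJ  # for source, node, table, selectcols

X = rand(1000, 10)
Xs = source(table(X))

# hypothetical helper: a static node selecting a subset of columns
select_features(Xs, features) = node(X -> selectcols(X, features), Xs)

Xs1 = select_features(Xs, 1:5)   # features 1:5, e.g. for the KNN detectors
Xs2 = select_features(Xs, 6:10)  # features 6:10, e.g. for the LOF detectors
```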
Very interesting, thanks for explaining how this works! I will have a look at whether this could improve the API. I think the main open point is how to implement the probabilistic normalization such that it works for individual models and for learning networks consisting of multiple combined models. I did a proof of concept enriching each detector with a

One more open point from before is how one could access the raw scores?
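One way, building on the `node(report, mach)` trick from the earlier comment, might look like the sketch below; it assumes (hypothetically) that a detector stores its training scores under a `scores` key in its report:

```julia
# `mach` is assumed to be a detector's machine inside a learning network
r = node(report, mach)                            # r() returns the machine's report
raw_training_scores = node(rep -> rep.scores, r)  # raw_training_scores() returns the scores
```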
@davnn Thanks again for your patience. I think I understand the goals sufficiently now and a design is coalescing in my mind. Before getting your feedback, I am playing around with a POC to make sure of some details. I hope to get back to you soon.
Thank you! FYI: I'm currently working on what I think is a nice API for MLJ, however I think that this API is mainly suited for single-model use-cases, not the more flexible use cases described above. The design is as follows (for both supervised and unsupervised models):
Evaluation is still a bit problematic as you have to write a wrapper to turn unsupervised models into supervised surrogates (not possible, e.g., for a library only depending on the model interface). Something like

function MLJ.evaluate(model::T, X, y, measure; args...) where {T <: Unsupervised}
    ptype = prediction_type(measure)
    @assert ptype in (:probabilistic, :deterministic)
    Xs, ys = source(X), source(y)
    # transform unsupervised model to supervised surrogate
    mach = ptype == :probabilistic ?
        machine(Probabilistic(), Xs, ys, predict=predict(machine(model, Xs), Xs)) :
        machine(Deterministic(), Xs, ys, predict=predict_mode(machine(model, Xs), Xs))
    evaluate!(mach; measure=measure, args...)
end

Unfortunately, for the use cases above with multiple models, this API is not nice to work with, and would likely require some ensemble / meta-model wrappers.

EDIT: If you would like to check the API out, I pushed the changes to the normalization-extended branch.
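A hypothetical call of the wrapper above, reusing X from the earlier examples (the measure and the label coercion are illustrative assumptions, not part of the proposal):

```julia
# hypothetical usage of the evaluate wrapper sketched above
y = coerce(rand(["inlier", "outlier"], 1000), OrderedFactor)

evaluate(KNN(), table(X), y, area_under_curve; resampling=CV(nfolds=5))
```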
Okay, the POC is taking a little longer, so in the meantime here is a sketch of my proposal. I will continue my work on the POC after getting your feedback. If you are able to provide this soon, I should be able to push the POC along soon also. There are three types of model objects I think we should cater for in a design:
I propose introducing two new types:

Bare detectors

BareDetector <: Model

The
First of all, I'm very grateful for your input @ablaom, thank you!
I would merge the bare detectors and integrated detectors by adding a
I would start with probabilistic only as I'm not aware of a model that doesn't fit into this schema.
I'm actually quite happy with the following API (check out the normalization-extended branch if you would like to try it out):

using MLJ
using OutlierDetection # installed from normalization-extended branch

X, train, test = rand(1000, 10), 1:500, 501:1000;
Xs = source(table(X));

# this helper function should probably be provided
scores(mach) = node((mach, Xs) -> (report(mach).scores, transform(mach, Xs)), mach, Xs)

# basic use case (integrated detectors)
mach = machine(KNN(), X)
fit!(mach, rows=train)
transform(mach, rows=test)     # raw scores
predict(mach, rows=test)       # normalized scores / probas
predict_mode(mach, rows=test)  # labels

# (1) multiple detectors, single data:
machines_knn = map(k -> machine(KNN(k=k), Xs), [5, 10, 15])
machines_lof = map(k -> machine(LOF(k=k), Xs), [5, 10, 15])
knn_scores, lof_scores = scores.(machines_knn), scores.(machines_lof)
network = transform(machine(Scores()), knn_scores..., lof_scores...)
fit!(network, rows=train)
network(rows=test)

# (2) multiple detectors, multiple data:
Xs1 = source(table(X[:,1:5]))
Xs2 = source(table(X[:,6:10]))
machines_knn = map(k -> machine(KNN(k=k), Xs1), [5, 10, 15])
machines_lof = map(k -> machine(LOF(k=k), Xs2), [5, 10, 15])
knn_scores, lof_scores = scores.(machines_knn), scores.(machines_lof)
network = transform(machine(Scores()), knn_scores..., lof_scores...)
fit!(network, rows=train)
network(rows=test)

Remaining "warts" that come to my mind:
By the way, I'm on vacation for the next two weeks.
Hi @ablaom, sorry for my rushed previous answer, I wanted to make the proposal before my vacation. I just read https://alan-turing-institute.github.io/MLJ.jl/dev/working_with_categorical_data/ and now understand your idea behind using

Regarding the
Would you fix the type of the raw scores to

Edit: Regarding the static helpers; it doesn't seem possible to infer the

Regarding "treating unsupervised models as probabilistic", I have to experiment with this approach in the coming days.
@davnn Thanks for the update. I am close to finishing a POC which should accommodate the evaluation business without wrapping detectors as supervised models. I am now convinced that: (i) detectors should get their own new abstract types, and (ii) normalised scoring should be viewed as just a special case of raw score prediction, rather than "optional" functionality with a distinct interface (for composability). I hope you will be patient a little longer while I finish this more detailed proposal, which I am confident will meet all your use cases and be extensible.
Hi @davnn, okay, I have finished my POC. Thanks for all the feedback; the new branch

Please ignore part (ii) of my previous comment. I think we are nearly on

Quick overview of the proposal implemented by the POC.
To give you a quick flavor of some of this, here is your excellent example reworked:

# data:
X, train, test = rand(1000, 10), 1:500, 501:1000;
# detectors:
knn_detectors = [KNN(k=k) for k in [5, 10, 15]]
lof_detectors = [LOF(k=k) for k in [5, 10, 15]]
detectors = vcat(knn_detectors, lof_detectors)
# network:
Xs = source(table(X));
machs = map(d -> machine(d, Xs), detectors)
augmented_scores = map(m -> augmented_transform(m, Xs), machs)
score_transformer = OutlierDetection.Scores() # modified version! see Note below
augmented_probs = transform(machine(score_transformer), augmented_scores...)
probs = last(augmented_probs)
fit!(probs, rows=train)
julia> probs(rows=test)
500-element UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
UnivariateFinite{OrderedFactor{2}}(inlier=>0.153, outlier=>0.847)
UnivariateFinite{OrderedFactor{2}}(inlier=>0.147, outlier=>0.853)
UnivariateFinite{OrderedFactor{2}}(inlier=>0.235, outlier=>0.765)
UnivariateFinite{OrderedFactor{2}}(inlier=>0.513, outlier=>0.487)
UnivariateFinite{OrderedFactor{2}}(inlier=>0.35, outlier=>0.65)
You won't be able to run this without KNN and LOF buying into the

Trying the proposal out

You can see how the wrappers

You can see an example of implementing the basic

Note. To make my wrapper work, I needed to make one

Points for further discussion

I do hope this proposal can work for you with little modification, as I don't really

1. Although the new interface does not require OutlierDetection.jl

julia> ProbabilisticDetector(knn=KNN(), lof=LOF())
ProbabilisticUnsupervisedDetector(
normalize = OutlierDetection.normalize,
combine = OutlierDetection.combine_mean,
knn = KNN(
k = 5,
metric = Distances.Euclidean(0.0),
algorithm = :kdtree,
leafsize = 10,
reorder = true,
parallel = false,
reduction = :maximum,
normalize = OutlierDetection.normalize, <--- dead
classify = OutlierDetection.classify, <--- dead
threshold = 0.9)
lof = LOF(
k = 5,
metric = Distances.Euclidean(0.0),
algorithm = :kdtree,
leafsize = 10,
reorder = true,
parallel = false,
normalize = OutlierDetection.normalize, <--- dead
classify = OutlierDetection.classify, <--- dead
threshold = 0.9)) @163

I suppose you could keep the strapped-on functionality in those models

2. Pipelines. Given the possibilities provided by the wrappers, do

3. Where should code go? Assuming you are happy with the design,

4. Do you have objections to the changes to the

5. Since we are essentially introducing a new task, we could

6. What about transformers that drop observations in data (or
It looks like the repo doesn't exist yet?

My apologies. Private repo. I've invited you.

Sorry, but I didn't get an invite.

Oh bother. An unfortunate slip on my part. There is a davnnn as well. You should have it now, and I have copied the invite to someone with your name in the Julia Slack channel.
Thanks a lot for your input! I'm still working on it. How would you split the packages for outlier detection models? Each model in an individual package? Bundle packages with a similar "backend", e.g. OutlierDetectionNeighbors, OutlierDetectionFlux, OutlierDetectionPy...? Or keep all models in one repo?
Of course that's up to you. I've spent a lot of time breaking MLJ functionality up into different packages and wish I had started out more modular. From the point of view of the MLJ user, splitting them won't make them less discoverable, so long as all the packages are "registered" with MLJ. The user will be able to do, eg,

As I say, I would put any meta-models (eg, the Probabilistic wrapper I'm proposing) in their own package (MLJOutlierDetection). And it would be great if you would be happy to maintain this over the longer term. If you're happy with the proposal, I can set MLJOutlierDetection up as a package providing the wrapper if you need me to.
👍
I don't think it's necessary to add this new method. If we standardize the report key (e.g.
👍
👍
👍
👍
👍 Additionally we provide a
👍
👍 Removed the hyperparameter again.
Don't really know to be honest.
I explain the repo structure below.
I would return only the test scores (and don't add the
I don't think a special name for scoring is necessary if it does not provide any additional functionality compared to
Not sure yet.

Explanation of changes

I have split OutlierDetection.jl into multiple repos. OutlierDetection.jl now only includes some helper functions and the probabilistic and deterministic wrappers. We could move that package to

OutlierDetectionInterface.jl defines a unified interface for outlier detection algorithms in the

Basic testing and benchmarking now lives in OutlierDetectionTest.jl and OutlierDetectionBenchmark.jl. All the algorithms live in their corresponding "backend" repository, e.g. OutlierDetectionNeighbors.jl. I have created 3 pull requests from your proposal:
The only missing pull request is the addition of the current outlier detection models to the MLJ registry. I have already tested it locally and

Let me know what you think about that structure!
I've not looked at this in detail but at first glance this all looks good and I'll get to it soon.
I'm a bit surprised you don't like this. I thought

Could you say what you don't like about
I just thought it's not necessary to add a new method to the API (just because it's always easier to add than to remove something from the API imho). Generally, I like the idea of an

Edit: You could even define a default implementation that caches the training data prediction results for those methods if the result is not available directly from
Fair point.
Okay, let's put this off for now.
Hi @ablaom, once all packages are registered, I will add the models to the MLJ registry. Currently, the detectors are named

Best,
David
Yes, registering the models would be the next step. You're welcome to have a go, but I usually do this, as it is still a bit of a quirky operation which sometimes requires compatibility updates elsewhere in the ecosystem.
Right. Make sure you have the

Have you thought about where the
Alright, I tested it locally (only the OD packages) and it LGTM.
👍
Currently, the code resides in OutlierDetection.jl, which can be seen as the main entry point to the

Edit: I would also not have registered the wrappers as they are only really useful when importing
Okay, then do make a PR to MLJModels and I will take a look!
All good. I hadn't realised you had already found a home. Actually quite happy for another org to take responsibility for the code. :-)
Makes sense. If you do have wrappers in registered packages, just set
Hi @ablaom, I have now been working with the API for a bit, and one major shortcoming appears to be that you need to re-learn the models each time you want a different output. For example,

using MLJ, OutlierDetection

X = rand(10, 100)
KNN = @iload KNNDetector pkg=OutlierDetectionNeighbors
knn = KNN()

knn_raw = machine(knn, X) |> fit!
knn_probabilistic = machine(ProbabilisticDetector(knn), X) |> fit!
knn_deterministic = machine(DeterministicDetector(knn), X) |> fit!

would learn the exact same model three times with different outputs. Do you see any way of improving on this?
Well, a model can implement more than one operation, eg,

MMI.fit(...)
    ...
    network_mach = machine(ProbabilisticUnsupervisedDetector(), Xs, predict=X_probs, transform=X_raw)
    return!(network_mach, model, verbosity)
end

where

In the case of the
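For what it's worth, a fuller sketch of what such an exported-network fit could look like is given below; the wrapper type and its `detector`/`normalizer` fields are hypothetical, and only the surrogate-machine plus `return!` pattern comes from the comment above:

```julia
using MLJBase                      # machine, source, transform, return!
import MLJModelInterface as MMI

# hypothetical composite wrapper; field names are assumptions
mutable struct MyProbabilisticWrapper <: ProbabilisticUnsupervisedDetector
    detector      # any raw-score detector, e.g. KNN()
    normalizer    # a static score-to-probability transformer, e.g. Scores()
end

function MMI.fit(model::MyProbabilisticWrapper, verbosity, X)
    Xs = source(X)
    detector_mach = machine(model.detector, Xs)
    X_raw = transform(detector_mach, Xs)                    # raw-score node
    X_probs = transform(machine(model.normalizer), X_raw)   # probability node
    # export both operations from one learning network, so the detector
    # is trained only once:
    network_mach = machine(ProbabilisticUnsupervisedDetector(), Xs,
                           predict=X_probs, transform=X_raw)
    return!(network_mach, model, verbosity)
end
```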
I know that this is possible, but that doesn't improve the situation imho. In reality the score transformations are static operations, which should not be bound to the training of the model. I have to say that the first implementation with the ugly tuple

Maybe we should add the proposed

Edit: From MLJ's viewpoint I would not implement the detector-specific changes and would like to keep the API as general as possible, such that it fits detectors naturally; see the proposed type hierarchy mentioned below. By the way, the

I'm happy to help add the necessary features to make things work.
Yes, with helpers to identify the threshold, which you don't know without looking at your training results. The

Edit: The downside of such a change would be that we lose custom functionality for detectors, no default
That might be so, but for now at least I prefer to avoid breaking changes to MLJModelInterface. See my comments at your suggestion. I suggest you work the best you can with the status quo. I'm a bit confused by your comments, possibly because of the edits. Could you more succinctly summarize the proposed changes (consistent with the current API)?
Alright, I think we can close this issue now as we have found an initial, workable API.
I will open another issue to discuss possible future MLJ changes in more detail.
@davnn I wonder if you would be able to help me complete the above checklist, especially the last point. I think enough time has passed to consider the API as "stable for now", yes? The essential points are to explain the contract a model enters into when it implements a subtype of one of the following six abstract types:

abstract type UnsupervisedDetector <: UnsupervisedAnnotator end
abstract type SupervisedDetector <: SupervisedAnnotator end
abstract type ProbabilisticSupervisedDetector <: SupervisedDetector end
abstract type ProbabilisticUnsupervisedDetector <: UnsupervisedDetector end
abstract type DeterministicSupervisedDetector <: SupervisedDetector end
abstract type DeterministicUnsupervisedDetector <: UnsupervisedDetector end

I am increasingly hazy about some of the final details.
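For illustration, the contract for, say, a ProbabilisticUnsupervisedDetector might look roughly as sketched below; the report key `scores`, the inlier/outlier labels and the UnivariateFinite construction are assumptions drawn from the discussion above, not the final documented API:

```julia
import MLJModelInterface as MMI

# purely illustrative detector; the hyperparameter is a placeholder
mutable struct MyDetector <: MMI.ProbabilisticUnsupervisedDetector
    k::Int
end

function MMI.fit(model::MyDetector, verbosity, X)
    fitresult = nothing                        # whatever is needed to score new data
    report = (scores = zeros(MMI.nrows(X)),)   # raw training scores (key name assumed)
    return fitresult, nothing, report
end

# raw outlier scores for new observations
MMI.transform(model::MyDetector, fitresult, Xnew) = zeros(MMI.nrows(Xnew))

# probabilistic predictions on an ordered inlier/outlier scale
function MMI.predict(model::MyDetector, fitresult, Xnew)
    outlier_probs = clamp.(MMI.transform(model, fitresult, Xnew), 0, 1)
    return MMI.UnivariateFinite(["inlier", "outlier"],
                                hcat(1 .- outlier_probs, outlier_probs),
                                pool=missing, ordered=true)
end
```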
Sorry for being a bit hesitant to implement the changes in the docs. I'm simply not sure if we can consider the API "stable for now" as it is, in my opinion, closely related to other open issues, especially the clustering API. Once we want to evaluate clustering algorithms, or implement semi-supervised clusterers, we would basically have to define

abstract type UnsupervisedClusterer <: UnsupervisedAnnotator end
abstract type SupervisedClusterer <: SupervisedAnnotator end
abstract type ProbabilisticSupervisedClusterer <: SupervisedClusterer end
abstract type ProbabilisticUnsupervisedClusterer <: UnsupervisedClusterer end
abstract type DeterministicSupervisedClusterer <: SupervisedClusterer end
abstract type DeterministicUnsupervisedClusterer <: UnsupervisedClusterer end

I think we agree that this is not the way to go. Would you say a clustering API is out of scope for 1.0?
@davnn I think we are on the same page in terms of future directions, just not on the same page in terms of timeframe. Moving from types to a more flexible traits-only system requires a breaking change to MLJModelInterface, on which about two dozen repositories depend; many of these will be broken by the changes. Yes, I am proposing to add documentation we will need to change later, but I expect this is normal for a project as complex as MLJ. We are building a car. But we are also driving it and writing the manual at the same time. In any case, I would be happy to mark the Outlier Detection section of the manual as "Experimental", as we have done elsewhere.
Hi everyone,
I would like to discuss what an outlier/anomaly detection API could look like in MLJ.
Is your feature request related to a problem? Please describe.
I'm working on a package for outlier detection, OutlierDetection.jl, and its MLJ integration, but there are a couple of problems I stumbled into.
What do I mean when talking about Outlier Detection:
To differentiate between outlier detection and classification, I would rather consider (imbalanced) one-class classification as a binary classification task. It looks like binary classification is quite straightforward to implement in MLJ, but working with scores is tricky. The most challenging aspect is accessing a fit result (training scores) later on in a pipeline/network when deciding to convert outlier scores to binary labels.
Additionally, it should be possible to evaluate unsupervised (ranking) algorithms with evaluate, but that should be relatively straightforward, I believe.

Describe the solution you'd like
Provide a beautiful API for outlier detection with MLJ.
Currently, transform/predict return train/test score tuples, which makes it possible to use learning networks / ensembles, but I would like to discuss whether that's a feasible API design, or whether something like a wrapped ensemble model makes more sense.

Let's see how the current API looks in practice with MLJ.
KNN is an outlier detection model and Binarize transforms tuples of train/test scores to binary labels; X is a dataframe and train/test are indices.

Basic usage:
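A minimal sketch of what this could look like, assuming the API described above in which transform returns the train/test score tuple and Binarize converts scores to labels (exact signatures are assumptions):

```julia
using MLJ
using OutlierDetection

X, train, test = rand(1000, 10), 1:500, 501:1000;

mach = machine(KNN(), X)
fit!(mach, rows=train)
scores = transform(mach, rows=test)               # (train scores, test scores) tuple
labels = transform(machine(Binarize()), scores)   # binary inlier/outlier labels
```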
Linear pipeline:
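A sketch assuming MLJ's @pipeline composition of the time (the exact macro syntax is an assumption):

```julia
# hypothetical: chaining the detector and the score-to-label transformer
pipe = @pipeline(KNN(), Binarize())
mach = machine(pipe, X)
fit!(mach, rows=train)
transform(mach, rows=test)
```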
Learning networks:
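A sketch mirroring the use-case (1) network shown in the comments above:

```julia
Xs = source(table(X))
nodemachine(model, Xs) = transform(machine(model, Xs), Xs)

scores_knn = map(k -> nodemachine(KNN(k=k), Xs), [5, 10, 15])
scores_lof = map(k -> nodemachine(LOF(k=k), Xs), [5, 10, 15])
network = transform(machine(Binarize()), scores_knn..., scores_lof...)
fit!(network, rows=train)
network(rows=test)
```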
I would love to hear your feedback!
Describe alternatives you've considered
At first, I tried to use the fitted_params/report functionality to access training scores (obtained when fitting the model), which works alright in toy examples with a single model, but doesn't work with learning networks / ensembles.

Additional context
Previous discussion about Anomaly detection integration: #51