Preprocess to remove correlated features #313
Hey, have you looked into https://github.com/topepo/caret/blob/master/pkg/caret/R/findCorrelation.R? Other than that, gladly create a PR. Best, Florian
I did look at the code for `findCorrelation`. Additionally, I think the optimal solution to this problem would keep the maximal number of features while keeping each pairwise correlation below the threshold. I mean, if that were not the goal, one could just omit features at random until each pairwise correlation is below the threshold. If the objective is to acquire close to the largest possible subset of features under that pairwise correlation constraint, then `findCorrelation` does not seem to achieve it. This does not mean I think `findCorrelation` is not useful.
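To make the stated objective concrete, here is a minimal sketch (not from this thread; the helper name and threshold are illustrative) of the constraint that any kept feature subset should satisfy, with larger subsets preferred:

```r
# Sketch: does a kept feature subset satisfy the pairwise-correlation
# constraint? Among all subsets that do, a larger one is preferred.
# `data` is assumed to be a data.frame of numeric features.
satisfies_constraint = function(data, kept, threshold = 0.9) {
  cm = abs(cor(as.data.frame(data)[, kept, drop = FALSE]))
  all(cm[upper.tri(cm)] < threshold)
}
```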
Sorry for being blunt, but why isn't this a filter?
Ah ok, `findCorrelation` gives a mapping threshold -> set of features, but what we would need is that [set of features] is monotonic in [threshold] (but that is the case, right?!), and ideally we want the points of [threshold] where [set of features] changes. I still believe this should be implemented in filters if we have to re-implement this in any case.
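One way to probe this empirically (a sketch only; the cutoff grid and the sonar task are arbitrary choices, and caret must be installed) would be to compare the kept feature sets across increasing cutoffs:

```r
library(mlr3)

# Sketch: check empirically whether the set of kept features grows
# monotonically with the cutoff (i.e. each set is contained in the next).
task = tsk("sonar")
data = task$data(cols = task$feature_names)
kept = lapply(seq(0.5, 0.9, by = 0.1), function(cutoff) {
  drop = caret::findCorrelation(cor(data), cutoff = cutoff, exact = TRUE, names = TRUE)
  setdiff(task$feature_names, drop)
})
mapply(function(a, b) all(a %in% b), kept[-length(kept)], kept[-1])
```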
I guess for reference: …
This is unfortunately not the case, see mlr-org/mlr3filters#61 (comment). Currently I am thinking of setting this aside and using another approach altogether, which would measure association with the target while removing redundancy, like mRMR does (mlr-org/mlr3filters#61 (comment)). Unfortunately mRMR performs worse in my current project compared to using information gain, for instance, which selects fewer features with better trained-learner performance. So I thought about two simple-to-implement solutions in mlr3filters: …
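For reference, the two filters mentioned above can both be computed through mlr3filters; this is only a sketch (it assumes the praznik and FSelectorRcpp backends are installed) of how their scores would be obtained:

```r
library(mlr3)
library(mlr3filters)

task = tsk("sonar")

# mRMR scores (redundancy-aware) ...
f_mrmr = flt("mrmr")
f_mrmr$calculate(task)

# ... versus plain information gain scores
f_ig = flt("information_gain")
f_ig$calculate(task)

head(as.data.table(f_ig))   # feature / score table
```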
I used this in a project of mine quite successfully. The nicest way to make this work would be to add an appropriate `selector` to `po("select")`; the caret `findCorrelation()` function can do the actual work. The following

```r
pose = po("select")
pose$param_set$values$selector = function(task) {
  fn = task$feature_names
  data = task$data(cols = fn)
  drop = caret::findCorrelation(cor(data), cutoff = 0.6, exact = TRUE, names = TRUE)
  setdiff(fn, drop)
}
pose$train(list(tsk("sonar")))
```

gives

```
$output
<TaskClassif:sonar> (208 x 24)
* Target: Class
* Properties: twoclass
* Features (23):
  - dbl (23): V1, V10, V12, V17, V22, V24, V28, V31, V34, V37, V40,
    V44, V47, V5, V51, V53, V54, V55, V56, V57, V58, V60, V7
```

It should also be possible to tune over the `cutoff` value. Would it make sense to turn this into its own `PipeOp`?

(Edit: Fixed bug where …)
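As a small follow-up, the same selector can be produced for an arbitrary cutoff with a little factory function, which is handy before wiring up actual tuning (a sketch; `make_cor_selector` is an illustrative name, not part of mlr3pipelines):

```r
library(mlr3)
library(mlr3pipelines)

# Sketch: build a correlation-based selector for any cutoff value.
make_cor_selector = function(cutoff) {
  function(task) {
    fn = task$feature_names
    data = task$data(cols = fn)
    drop = caret::findCorrelation(cor(data), cutoff = cutoff, exact = TRUE, names = TRUE)
    setdiff(fn, drop)
  }
}

pose = po("select", selector = make_cor_selector(0.6))
pose$train(list(tsk("sonar")))[[1]]
```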
Sorry for the delayed reply. I was traveling with very limited connectivity.
This sounds great. Why isn't it implemented? Is it due to not having a simple solution to handle multi-class tasks? Why not just return a column per class and let the user decide on the aggregation?
I like you.
I would be grateful if you could provide an example of how to create a selector with tunable parameters. This is mlr3 book material in my opinion.
There is no need if users can easily create their own tunable selector functions.
I haven't tested the following code, but tuning should work with something similar to this:

```r
ps = ParamSet$new(list(ParamDbl$new("cutoff", 0, 1)))
ps$trafo = function(x, param_set) {
  cutoff = x$cutoff
  x$selector = function(task) {
    fn = task$feature_names
    data = task$data(cols = fn)
    drop = caret::findCorrelation(cor(data), cutoff = cutoff, exact = TRUE, names = TRUE)
    setdiff(fn, drop)
  }
  x$cutoff = NULL
  x
}
```

We will think about if …

(Edit: bugfixes)
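For anyone wanting to sanity-check the transformation without running a tuner, the trafo can be applied by hand (a sketch; the cutoff value is arbitrary):

```r
library(paradox)

# Sketch: apply the trafo to one candidate value by hand. The tuner-facing
# "cutoff" entry is replaced by a "selector" function that po("select")
# understands.
x = ps$trafo(list(cutoff = 0.6), ps)
names(x)                # "selector"
is.function(x$selector) # TRUE
```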
I am not able to get it to work. Anyhow, for now I can just stick to using a custom preproc function when I wish to do this sort of thing. Thanks.
A full example would be

```r
library("mlr3")
library("mlr3pipelines")
library("paradox")
library("mlr3tuning")

ps = ParamSet$new(list(ParamDbl$new("cutoff", 0, 1)))
ps$trafo = function(x, param_set) {
  cutoff = x$cutoff
  x$select.selector = function(task) {
    fn = task$feature_names
    data = task$data(cols = fn)
    drop = caret::findCorrelation(cor(data), cutoff = cutoff, exact = TRUE, names = TRUE)
    setdiff(fn, drop)
  }
  x$cutoff = NULL
  x
}

pipeline = po("select") %>>% lrn("classif.rpart")

inst = TuningInstance$new(
  task = tsk("iris"),
  learner = pipeline,
  resampling = rsmp("cv"),
  measures = msr("classif.ce"),
  param_set = ps,
  terminator = term("none"),
  # don't need the following line for optimization, this is for
  # demonstration that different features were selected
  bm_args = list(store_models = TRUE)
)

tnr("grid_search")$tune(inst)
```

Just for demonstration that different cutoff values result in different features being selected, we can run the following to inspect the trained models:

(Edit: note this inspects only the trained models of the first CV fold of each evaluated configuration. The features being excluded depend on the training data seen by the pipeline and may differ between folds, even at the same cutoff value.)

```r
inst$archive(unnest = "tune_x")[order(cutoff),
  list(cutoff, classif.ce,
    featurenames = lapply(resample_result, function(x)
      x$learners[[1]]$model$classif.rpart$train_task$feature_names
    ))]
```

which gives

```
       cutoff classif.ce                                       featurenames
 1: 0.0000000 0.26666667                                       Sepal.Length
 2: 0.1111111 0.26000000                           Sepal.Length,Sepal.Width
 3: 0.2222222 0.25333333                           Sepal.Length,Sepal.Width
 4: 0.3333333 0.25333333                           Sepal.Length,Sepal.Width
 5: 0.4444444 0.25333333                           Sepal.Length,Sepal.Width
 6: 0.5555556 0.25333333                           Sepal.Length,Sepal.Width
 7: 0.6666667 0.25333333                           Sepal.Length,Sepal.Width
 8: 0.7777778 0.25333333                           Sepal.Length,Sepal.Width
 9: 0.8888889 0.06666667               Petal.Width,Sepal.Length,Sepal.Width
10: 1.0000000 0.08666667  Petal.Length,Petal.Width,Sepal.Length,Sepal.Width
```
Thank you.
I guess we should either add this as a …
I don't like either solution, but I think for the time being this should be a documented example. I hope we have some nicer way of doing this at some point, however. Maybe when mlr-org/paradox#231 is solved we could have parameterized selector functions. I don't think this should be a specialized pipeop, because I believe "do something non-filter-like to select features" is a quite general use case.
Added to the gallery here: mlr-org/mlr3gallery#9
Hi,
I felt compelled to continue the discussion (mlr-org/mlr3filters#61).
I tried to make a function to remove correlated features, similar to `findCorrelation()` but without actually looking at what that function does internally, so that they don't end up too similar. I came up with this:
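The snippet itself is not reproduced here; purely as a hypothetical sketch of such a greedy approach (the `findCor` name, its signature, and the strategy below are illustrative assumptions, not the original code from this comment):

```r
# Hypothetical sketch only (not the original findCor from this issue):
# repeatedly drop the feature involved in the largest number of
# above-cutoff absolute correlations, and return the names of the
# dropped features.
findCor = function(data, cutoff = 0.9) {
  cm = abs(cor(data))
  diag(cm) = 0
  dropped = character(0)
  while (max(cm) > cutoff) {
    worst = names(which.max(colSums(cm > cutoff)))
    dropped = c(dropped, worst)
    keep = setdiff(colnames(cm), worst)
    cm = cm[keep, keep, drop = FALSE]
  }
  dropped
}
```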
It can most likely be simplified, but that is beside the point. The results of testing were interesting to me: `findCor()` identifies 32 features to be removed in this data set, while `findCorrelation()` with `exact = TRUE` finds 37 features and with `exact = FALSE` finds 38 features. This trend appears to be present at different threshold values. Looking at the correlation of the remaining features, `findCor()` keeps more features while still maintaining the maximum allowed absolute correlation.
Comparison with `step_corr()`: 22 out of 60 features are kept, so this removes the same number of features as `findCorrelation()` with `exact = FALSE`.
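The comparison above could be reproduced along these lines (a sketch; it assumes the recipes and mlbench packages and an illustrative threshold, not necessarily the one used for the numbers quoted above):

```r
library(recipes)
library(mlbench)

# Sketch: count how many predictors step_corr() keeps on the Sonar data.
data(Sonar)
rec = prep(step_corr(recipe(Class ~ ., data = Sonar),
                     all_numeric_predictors(), threshold = 0.6))
ncol(juice(rec)) - 1  # number of predictors kept (excluding the Class column)
```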
The function as a pipeop:
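The original snippet is not shown here either; a minimal sketch of how such a function could be plugged into a pipeline, reusing the hypothetical `findCor()` sketched above together with `po("select")` (rather than a dedicated `PipeOp` class), might be:

```r
library(mlr3)
library(mlr3pipelines)

# Sketch: plug the hypothetical findCor() from above into po("select").
# The 0.6 cutoff is illustrative.
po_cor = po("select", selector = function(task) {
  data = task$data(cols = task$feature_names)
  setdiff(task$feature_names, findCor(data, cutoff = 0.6))
})
po_cor$train(list(tsk("sonar")))[[1]]
```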
I apologize for the rant.
If you like it, I can submit a PR.
All the best,
Milan