feat: add section about Filter-based feature selection #398
Conversation
ping
Thanks, I'll try to work it in this week.
Intro
A broader, more general introduction is needed which explains what filters and their "scores" are and how they differ from wrappers (one paragraph).
Conclusion
The following points should be briefly mentioned (maybe even in the intro, depends how you structure the section):
- Filters reduce the feature space and thereby make models "simpler"
- They can be integrated into the tuning layer of the learner
- They can make use of caching and only need to be calculated once (a large advantage compared to wrappers)
We should also mention ensemble filters, even though they are not yet available (only in the old mlr).
book/optimization.qmd
Outdated
### Filter-based Feature Selection

A common use case for filters is to conduct feature selection based on the filter scores.
This can be achieved using `r ref("mlr_pipeops_filter", text = "PipeOpFilter")`.
Using filters in pipelines is an option, but we should probably start out with a generic intro, i.e. show some "plain" examples of how to calculate filter scores, and move the pipelines handling below it.
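Something along these lines, for example (just a minimal sketch with `mlr3filters`; the task and filter are arbitrary placeholders):

```r
library(mlr3)
library(mlr3filters)

task = tsk("sonar")

# compute filter scores on the task
filter = flt("information_gain")  # needs the FSelectorRcpp package installed
filter$calculate(task)

# scores are a named numeric vector, highest (most informative) first
head(filter$scores)
```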
book/optimization.qmd
Outdated
A common use case for filters is to conduct feature selection based on the filter scores.
This can be achieved using `r ref("mlr_pipeops_filter", text = "PipeOpFilter")`.
This PipeOp takes as input a Task, applies the filter, and selects the features based on the calculated scores.
The method how the features are subset can be defined in four different ways, each corresponding to a different parameter:
The terms "method" and "parameter" clash a bit here, as in the current phrasing they refer to the same thing.
Suggestion for structuring and wording:
Method: "subset based on the 'best' X features"
Parameter: `filter.nfeat`
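For illustration, the `filter.nfeat` variant could then be shown roughly like this (a sketch assuming `mlr3pipelines`; the other three parameters would be demonstrated analogously):

```r
library(mlr3)
library(mlr3filters)
library(mlr3pipelines)

task = tsk("sonar")

# keep only the 5 highest-scoring features according to the filter
po_filter = po("filter", filter = flt("information_gain"), filter.nfeat = 5)

# applying the PipeOp returns the task with the reduced feature set
filtered_task = po_filter$train(list(task))[[1]]
filtered_task$feature_names
```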
book/optimization.qmd
Outdated
We will first subset the features based on the `r ref("mlr3filters::mlr_filters_information_gain", text = "FilterInformationGain")` and then fit a `r ref("mlr3learners::mlr_learners_classif.lda", text = "LDA")`.
We will tune the parameter `filter.frac` with a simple grid search and visualize the classification error for the different fractions.
At this point it would be good to mention that the tuning of filter hyperparameters can be fused with learner hyperparameter tuning. This is in fact one of the big advantages of filters: they don't need an extra layer of tuning (as wrappers do).
Maybe even worth putting this into a "tip" block.
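A rough sketch of what such fused tuning could look like (assuming `mlr3verse`; the exact tuning helpers (`ti()`, `tnr()`, etc.) and the `information_gain.` parameter prefix depend on the mlr3tuning/mlr3pipelines versions in use, so treat this as illustrative only):

```r
library(mlr3verse)  # attaches mlr3, mlr3filters, mlr3pipelines, mlr3tuning, paradox, ...

task = tsk("sonar")

# fuse the filter and the learner into a single GraphLearner
graph = po("filter", filter = flt("information_gain"), filter.frac = 0.5) %>>%
  lrn("classif.lda")
graph_learner = as_learner(graph)

# filter.frac is tuned like any other learner hyperparameter;
# its id is assumed to be prefixed with the PipeOp id ("information_gain")
instance = ti(
  task = task,
  learner = graph_learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  terminator = trm("none"),
  search_space = ps(information_gain.filter.frac = p_dbl(0.1, 1))
)
tnr("grid_search", resolution = 10)$optimize(instance)
instance$result
```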
book/optimization.qmd
Outdated
)
```

We can see that using 70%-100% of the features seems to produce fairly similar results.
There's some more substantial content required here; this does not help the user much.
Maybe 2-3 sentences describing the results and then 1-2 interpreting/discussing them (e.g. "the learner seems to produce the best results with most features present, which means that there is only little noise in the data and most features are important") etc.
It might be worth searching for a task in which feature selection actually has an effect on the performance (and the feature space), otherwise users might think: "what do I need it for, it does not make a difference".
This might not be easy though, and if we keep it like this, we need to explain to the user what the actual point is (and why we see 70-100% here and what that means). (Note: it also highly depends on the learner.)
(I know it's an example and you know that, but in the book we should address this in somewhat scientific detail, maybe even with some references :) )
Co-authored-by: Patrick Schratz <[email protected]>
I don't think we should mention the caching, because afaik we don't cache intermediate results in pipelines.
This sounds like you don't really want to change anything substantial.
Apparently we don't have it yet in mlr3pipelines (mlr-org/mlr3pipelines#16) which is somewhat of a pity as it is very important for filters.
This is quite a general reply. I wouldn't have raised these points if my feeling was that they are appropriately covered.
Sorry to be frank here, but this sounds to me like an excuse for low motivation against moderately complicated change requests ;)
They belong to the topic and are available in {mlr}, hence one sentence about them (whether they exist and/or why not) is certainly of interest to the reader. I stand by my opinion that important information/content is missing or should be restructured, and I would not be happy to see this merged in its current form. If you do, I might want to rewrite it afterwards. Please don't get this wrong, it's great that we finally get a section about filters (and I could have done so earlier!), but if we have one, I'd like to see it being of high quality and would rather spend a bit more time on it.
We discussed this a little today, and it was suggested to move the feature selection out of the optimization chapter.
@sebffischer @be-marc how does this PR fit in with #412?
Thanks for the reminder. I moved the section into the feature-selection chapter. I guess the part is not perfect, but Marvin will now rework this chapter. Maybe we should just merge it and he can decide which parts to keep and which not?
I think everything here is covered now in Chapters 5 and 6 |
TODO:
* [ ] Mention filter-based feature selection in the previous sections that list the different methods (is already mentioned)