
feat: add section about Filter-based feature selection #398

Closed
wants to merge 9 commits

Conversation

@sebffischer (Member) commented Aug 16, 2022

  • Fixes a typo
  • Adds a brief section about how to do filter-based feature selection

TODO:
* [ ] Mention filter-based feature selection in the previous sections that list the different methods (is already mentioned)

@sebffischer requested review from mllg and pat-s on August 16, 2022 11:32
@sebffischer (Member Author) commented:

ping

@pat-s (Member) commented Aug 22, 2022

Thanks, I'll try to work it in this week.

@pat-s (Member) left a comment:


Intro

A broader, more general introduction is needed which explains what filters and their "scores" are and how they differ from wrappers (one paragraph).

Conclusion

The following points should be briefly mentioned (maybe even in the intro, depends how you structure the section):

  • Filters reduce the feature space and thereby make models "simpler"
  • They can be integrated into the tuning layer of the learner
  • They can make use of caching and only need to be calculated once (large advantage compared to wrappers)

We should also mention ensemble filters, even though they are not yet available (only in the old mlr).

### Filter-based Feature Selection

A common use case for filters is to conduct feature selection based on the filter scores.
This can be achieved using `r ref("mlr_pipeops_filter", text = "PipeOpFilter")`.

Using filters in pipelines is an option, but we should probably start out with a generic intro, i.e. showing some "plain" examples of how to calculate filter scores, and move the pipelines handling below it.

A common use case for filters is to conduct feature selection based on the filter scores.
This can be achieved using `r ref("mlr_pipeops_filter", text = "PipeOpFilter")`.
This PipeOp takes as input a Task, applies the filter, and selects the features based on the calculated scores.
The method how the features are subset can be defined in four different ways, each corresponding to a different parameter:

The terms "method" and "parameter" clash a bit here as they refer to the same thing context-wise in this sentence in its current form.

Suggestion for structuring and wording:

Method: "subset based on the 'best' X features"
Parameter: filter.nfeat

Comment on lines 1111 to 1112
We will first subset the features based on the `r ref("mlr3filters::mlr_filters_information_gain", text = "FilterInformationGain")` and then fit a `r ref("mlr3learners::mlr_learners_classif.lda", text = "LDA")`.
We will tune the parameter `filter.frac` with a simple grid search and visualize the classification error for the different fractions.
At this point it would be good to mention that the tuning of filter hyperparameters can be fused with learner hyperparameter tuning. This is in fact one of the big advantages of filters: they don't need an extra layer of tuning (as wrappers do).

Maybe even worth putting this into a "tip" block.

)
```

We can see that using 70% - 100% of the features seems to produce fairly similar results.

There's some more beefy content required here, this does not help the user much.
Maybe 2-3 sentences describing the results and then 1-2 interpreting/discussing them (e.g. "the learner seems to produce the best results with most features present, which means that there is only little noise in the data and most features are important") etc.

It might be worth searching for a task in which feature selection actually has an effect on the performance (and the feature space), otherwise users might think: "what do I need it for, it does not make a difference".
This might not be easy though, and if we keep it like this, we need to explain to the user what the actual point is (and why we see 70-100% here and what that means). (Note: it also highly depends on the learner.)

(I know it's an example and you know that, but in the book we should address this in a somewhat scientific detail, maybe even with some references :) )

sebffischer and others added 2 commits August 24, 2022 10:58
Co-authored-by: Patrick Schratz <[email protected]>
Co-authored-by: Patrick Schratz <[email protected]>
@sebffischer (Member Author) commented Aug 24, 2022

I don't think we should mention the caching, because afaik we don't cache intermediate results in pipelines.
The other things you suggest are partly already mentioned in the intro.
In general I do not want to overdo it here, because it is a tutorial on usage.
Also I don't think we need to mention ensemble filters in the mlr3 book if they are not available.

@pat-s (Member) commented Aug 24, 2022

> I don't think we should mention the caching, because afaik we don't cache intermediate results in pipelines.
> The other things you suggest are partly already mentioned in the intro.
> In general I do not want to overdo it here, because it is a tutorial on usage.
> Also I don't think we need to mention ensemble filters in the mlr3 book if they are not available.

This sounds like you don't really want to change anything substantial.

> I don't think we should mention the caching, because afaik we don't cache intermediate results in pipelines.

Apparently we don't have it yet in mlr3pipelines (mlr-org/mlr3pipelines#16), which is somewhat of a pity as it is very important for filters.
This is one of the large benefits of using filters over wrappers.
AFAIR I didn't implement the same logic back then as in {mlr} because we wanted to do it in pipelines, but it looks like this will never happen.

> The other things you suggest are partly already mentioned in the intro.

This is quite a general reply. I wouldn't have raised these points if I felt they were appropriately covered.

> In general I do not want to overdo it here, because it is a tutorial on usage.

Sorry to be frank here, but to me this sounds like an excuse for low motivation to tackle moderately complicated change requests ;)

> Also I don't think we need to mention ensemble filters in the mlr3 book if they are not available.

They belong to the topic and are available in {mlr}, hence one sentence about them (whether they exist and/or why not) is certainly of interest to the reader.


I stand by my opinion that important information/content is missing or should be restructured, and I would not be happy to see this merged in its current form. If you do, I might want to rewrite it afterwards.

Please don't get this wrong, it's great that we finally get a section about filters (and I could have done so earlier!), but if we have one, I'd like it to be high quality and would rather spend a bit more time on it.

@sebffischer (Member Author) commented:

We discussed this a little today, and it was suggested to move the feature selection out of the optimization chapter.

@RaphaelS1 (Contributor) commented:

@sebffischer @be-marc how does this PR fit in with #412 ?

@sebffischer (Member Author) commented:

> @sebffischer @be-marc how does this PR fit in with #412 ?

Thanks for the reminder. I moved the section into the feature-selection chapter. I guess the part is not perfect, but Marvin will now rework this chapter. Maybe we should just merge it and he can decide which parts to keep and which not?

@RaphaelS1 (Contributor) commented:

I think everything here is covered now in Chapters 5 and 6

@RaphaelS1 closed this Dec 19, 2022