New param `use_groups` for `PipeOpSubsample` and rework for `task_filter_ex()` #834

advieser · 2024-10-03T18:57:31Z

Closes #567

If `use_groups = TRUE` (the default), whole groups are subsampled. As a result, the fraction of rows retained may deviate from `frac`.
Stratification (`stratify = TRUE`) and subsampling of grouped data are currently not supported at the same time, as with Resamplings in mlr3.
Note that this changes the default behavior for tasks that have a column with role "group".
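
To illustrate why `frac` is only approximate when whole groups are drawn, here is a minimal base-R sketch. It is not the `PipeOpSubsample` implementation: the helper name `subsample_groups` is hypothetical, and it assumes `frac` is applied to the number of groups and rounded to an integer.

```r
# Hypothetical helper for illustration; not the PipeOpSubsample implementation.
# Assumption: `frac` is applied to the number of groups, rounded to an integer.
subsample_groups = function(groups, frac) {
  ugroups = unique(groups)
  n_keep = round(length(ugroups) * frac)
  kept = ugroups[sample.int(length(ugroups), n_keep)]
  # Keep every row whose group was drawn, so groups are never split.
  which(groups %in% kept)
}

set.seed(1)
groups = c("a", "a", "a", "b", "c", "c")
idx = subsample_groups(groups, frac = 2 / 3)
# Realized row fraction; it depends on which groups were drawn
# and generally differs from `frac` when group sizes are unequal.
length(idx) / length(groups)
```

With unequal group sizes, drawing two of the three groups keeps between three and five of the six rows, so the row fraction only matches `frac` by chance.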

Right now, `task$row_roles$use` is not respected when `use_groups = TRUE`. The question is how we want to handle that: if a group contains any row that is not in `task$row_roles$use`, do we ignore the group for subsampling?

mb706 · 2024-10-17T08:05:21Z

todo: test with task where trailing rows were $filter()ed out

mb706 · 2024-11-05T09:34:24Z

R/utils.R

+      }, by = row_id]
+
+      # Use "new_groups" to update the group entries.
+      new_data[, (task$col_roles$group) := new_groups$group]


Test cases we want to check: consider a task with 3 rows in groups 'a', 'b', 'c', and row role "use" set to c(1, 1, 2, 2, 3).

- row_ids c(1, 1, 2, 2, 3) -> we get a flattened task that has the same groups as before (ids 1, 2, 3, 4, 5; groups 'a', 'b', 'c', 'a', 'b')
- row_ids c(1, 1, 1, 1, 2, 2, 3) -> adds group a_1, i.e. the task has ids c(1, 2, 3, 4, 5, 6, 7), groups 'a', 'b', 'c', 'a', 'b', 'a_1', 'a_1'
- row_ids c(1, 1, 1, 1, 3) -> same as above, but 'b' disappears: ids c(1, 3, 4, 5, 6), groups 'a', 'c', 'a', 'a_1', 'a_1'
- row_ids 3 -> groups 'a' and 'b' disappear: the task has row id 3, group 'c'
- row_ids c(1, 1, 1, 1, 3, 1, 1) -> ids c(1, 3, 4, 5, 6, 7, 8), groups 'a', 'c', 'a', 'a_1', 'a_1', 'a_2', 'a_2'
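
One reading of the cases above: if a row appears `m` times in row role "use", then every `m` requested copies of it form one complete copy of its group, and each further complete copy gets a suffix `_1`, `_2`, and so on. The base-R sketch below encodes only that labeling rule. The helper name `rename_dup_groups` is hypothetical, it is not the `task_filter_ex()` code, it assumes every requested id occurs in "use", and it returns labels in request order rather than reproducing the new row ids listed above.

```r
# Hypothetical helper for illustration; not the task_filter_ex() implementation.
# Assumption: every id in `row_ids` occurs in `use`, and `groups[i]` is the
# group of backend row i.
rename_dup_groups = function(row_ids, use, groups) {
  # How often each backend row occurs in row role "use".
  mult = table(use)
  # 0-based occurrence counter of each id within `row_ids`.
  occ = ave(seq_along(row_ids), row_ids, FUN = seq_along) - 1L
  # Number of complete group copies that precede this occurrence.
  inst = occ %/% as.integer(mult[as.character(row_ids)])
  base = groups[row_ids]
  # The first copy keeps the original name, later copies get a suffix.
  ifelse(inst == 0L, base, paste0(base, "_", inst))
}

use = c(1, 1, 2, 2, 3)
groups = c("a", "b", "c")
out = rename_dup_groups(c(1, 1, 1, 1, 2, 2, 3), use, groups)
# out is c("a", "a", "a_1", "a_1", "b", "b", "c")
```

Sorted, this gives the same set of group labels as the second test case. Row ids would still have to be assigned separately.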

advieser added 6 commits October 3, 2024 20:41

added param use_groups for subsampling grouped data (changes behaviour if groups exist)

a1d8b95

test for subsampling grouped data

0206cc7

document

cd714cc

updated NEWS.mde

af7244c

fixed error in examples

a09779c

fixed another bug in examples

d967849

mb706 requested changes Oct 8, 2024

changed logic for grouped data

35b544a

mb706 reviewed Oct 10, 2024

advieser added 6 commits October 15, 2024 18:20

removed probs for sampling

acf68fc

reworked task_filter_ex to handle non-feature/non-target col_roles

0d5fc84

typo, remove dev comment

d46e9cd

tests for grouped data (except optimal prop), tests for reworked task_filter_ex

0283529

fixing subsampling for replace=TRUE and introducing new problems

d83135b

...

78dd63b

advieser added 10 commits October 17, 2024 16:44

implement suggestions from code review

e200f19

ugly WIP: let task_filter_ex rename groups for duplicated rows

ec5b81e

WIP: task_factor_ex; can't change task in-place obvsly

87e7812

moved tests for task_filter_ex to own file

d0ca7a6

corrected test file name + WIP tests for task_filter_ex

32449cb

WIP task_filter_ex

9198829

WIP tests pipeopsubsample

d4dc63a

reworked task_filter_ex, this could actually work

6fc245d

updated test task_filter_ex

b28e070

improved task_filter_ex

5292f5c

advieser changed the title ~~New param use_groups for PipeOpSubsample~~ New param use_groups for PipeOpSubsample and rework for task_filter_ex() Oct 22, 2024

advieser added 3 commits November 2, 2024 20:59

typo

9096aca

fix + restructure comments

8701d9d

task_filter_ex tests

27d883c

advieser force-pushed the po_subsample_use_groups branch from 7dc3eeb to 27d883c Compare November 2, 2024 20:06

advieser added 4 commits November 2, 2024 22:17

handling for duplicates in used rows in task_filter_ex

05c54d3

simplified data.table expression

53879fc

reworked tests

be3e77c

Merge branch 'master' of https://github.com/mlr-org/mlr3pipelines into po_subsample_use_groups # Conflicts: # NEWS.md

cf7bca8

mb706 reviewed Nov 5, 2024

advieser and others added 5 commits November 5, 2024 19:05

docs: more detail for use_groups

4f540ae

removed test that relies on POFU

ce7cd6d

replace J with list for global function definition

f2d6952

replaced J with list for global function definition

0de1054

tests for group renaming given duplicates in row_roles use

daecb54
