Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New param use_groups for PipeOpSubsample and rework for task_filter_ex() #834

Open
wants to merge 35 commits into
base: master
Choose a base branch
from

Conversation

advieser
Copy link
Collaborator

@advieser advieser commented Oct 3, 2024

Closes #567

If use_groups = TRUE (default), we subsample whole groups. This leads to frac not being fully accurate.
We currently don't support stratification (stratify = TRUE) and subsampling grouped data at the same time, same as with Resamplings in mlr3.
This changes the default behavior for tasks with a column with role "group".

Right now, task$row_roles$use is not respected when use_groups = TRUE. Question is, how we would want to handle that? If a group contains any row that is not in task$row_roles$use , we ignore the group for subsampling?

R/PipeOpSubsample.R Outdated Show resolved Hide resolved
R/PipeOpSubsample.R Outdated Show resolved Hide resolved
R/PipeOpSubsample.R Outdated Show resolved Hide resolved
tests/testthat/test_pipeop_subsample.R Outdated Show resolved Hide resolved
R/PipeOpSubsample.R Outdated Show resolved Hide resolved
@mb706
Copy link
Collaborator

mb706 commented Oct 17, 2024

todo: test with task where trailing rows were $filter()ed out

@advieser advieser changed the title New param use_groups for PipeOpSubsample New param use_groups for PipeOpSubsample and rework for task_filter_ex() Oct 22, 2024
R/PipeOpSubsample.R Outdated Show resolved Hide resolved
}, by = row_id]

# Use "new_groups" to update the group entries.
new_data[, (task$col_roles$group) := new_groups$group]
Copy link
Collaborator

@mb706 mb706 Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test cases we want to check: Consider task with 3 rows, group 'a', 'b', 'c'; row roles "use" c(1, 1, 2, 2, 3)

  • row_ids c(1, 1, 2, 2, 3) -> we get flattened task that has the same groups as before (has ids 1, 2, 3, 4, 5, groups 'a', 'b', 'c', 'a', 'b')
  • row_ids c(1, 1, 1, 1, 2, 2, 3) -> add group a_1, i.e. task has c(1, 2, 3, 4, 5, 6, 7), groups 'a', 'b', 'c', 'a', 'b', 'a_1', 'a_1'
  • row_ids c(1, 1, 1, 1, 3) -> same as above, but 'b' disappears: c(1, 3, 4, 5, 6), groups 'a', 'c', 'a', 'a_1', 'a_1'
  • row_ids 3 -> groups a and b disappear: task has row-id 3, group 'c'

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • row_ids c(1, 1, 1, 1, 3, 1, 1) -> c(1, 3, 4, 5, 6, 7, 8), groups 'a', 'c', 'a', 'a_1', 'a_1', 'a_2', 'a_2'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PipeOpSubsample: allow stratification by other columns, eg. group column
2 participants