New param use_groups for PipeOpSubsample and rework for task_filter_ex() #834
base: master
todo: test with task where trailing rows were |
use_groups for PipeOpSubsample
use_groups for PipeOpSubsample and rework for task_filter_ex() 7dc3eeb to 27d883c
…o po_subsample_use_groups # Conflicts: # NEWS.md
}, by = row_id]

# Use "new_groups" to update the group entries.
new_data[, (task$col_roles$group) := new_groups$group]
Test cases we want to check: Consider a task with 3 rows in groups 'a', 'b', 'c' and row role "use" set to c(1, 1, 2, 2, 3).
- row_ids c(1, 1, 2, 2, 3) -> we get a flattened task that has the same groups as before (ids 1, 2, 3, 4, 5, groups 'a', 'b', 'c', 'a', 'b')
- row_ids c(1, 1, 1, 1, 2, 2, 3) -> adds group a_1, i.e. the task has ids c(1, 2, 3, 4, 5, 6, 7), groups 'a', 'b', 'c', 'a', 'b', 'a_1', 'a_1'
- row_ids c(1, 1, 1, 1, 3) -> same as above, but 'b' disappears: ids c(1, 3, 4, 5, 6), groups 'a', 'c', 'a', 'a_1', 'a_1'
- row_ids 3 -> groups 'a' and 'b' disappear: the task has row id 3, group 'c'
- row_ids c(1, 1, 1, 1, 3, 1, 1) -> c(1, 3, 4, 5, 6, 7, 8), groups 'a', 'c', 'a', 'a_1', 'a_1', 'a_2', 'a_2'
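
To make the expected outputs in these cases easier to check, here is a rough base R sketch that reproduces them. The helper expected_groups() is hypothetical and not part of mlr3pipelines or the diff. It assumes that row i belongs to groups[i], that each id is requested a whole multiple of the number of times it appears in the "use" row role, and that new row ids are assigned in the order implied by the examples (duplicates within the first copy first, then further copies).

```r
# Hypothetical helper, only meant to reproduce the test cases listed above.
# Assumption: row i belongs to groups[i]; `use` mirrors task$row_roles$use.
expected_groups <- function(row_ids, use = c(1, 1, 2, 2, 3),
                            groups = c("a", "b", "c")) {
  ids <- sort(unique(row_ids))
  rows <- do.call(rbind, lapply(ids, function(i) {
    per_copy <- sum(use == i)               # rows in one copy of the group
    stopifnot(per_copy > 0)
    copies <- sum(row_ids == i) / per_copy  # how often the group was requested
    stopifnot(copies == round(copies))
    data.frame(
      orig = i,
      copy = rep(seq_len(copies) - 1, each = per_copy),
      within = rep(seq_len(per_copy), times = copies)
    )
  }))
  # The first copy keeps the group name, later copies get a numeric suffix.
  rows$group <- ifelse(rows$copy == 0, groups[rows$orig],
                       paste0(groups[rows$orig], "_", rows$copy))
  # One row per original id keeps its id; all others get fresh ids.
  kept <- rows$copy == 0 & rows$within == 1
  first <- rows[kept, ]
  rest <- rows[!kept, ]
  rest <- rest[order(rest$copy, rest$orig, rest$within), ]
  first$row_id <- first$orig
  rest$row_id <- length(groups) + seq_len(nrow(rest))
  out <- rbind(first, rest)[, c("row_id", "group")]
  rownames(out) <- NULL
  out
}

res <- expected_groups(c(1, 1, 1, 1, 3, 1, 1))
```

Calling it with the row ids from each case should return the row ids and group labels given in the corresponding bullet, which could serve as a reference when writing the actual tests.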
Closes #567
If use_groups = TRUE (default), we subsample whole groups. This leads to frac not being fully accurate.

We currently don't support stratification (stratify = TRUE) and subsampling grouped data at the same time, same as with Resamplings in mlr3.

This changes the default behavior for tasks with a column with role "group".

Right now, task$row_roles$use is not respected when use_groups = TRUE. The question is how we want to handle that. If a group contains any row that is not in task$row_roles$use, do we ignore the group for subsampling?
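
To illustrate why frac is only approximate when whole groups are drawn, here is a minimal base R sketch. It is not the code from this PR: the function subsample_groups() and the rounding rule for the number of groups are assumptions made for illustration.

```r
# Hypothetical sketch of group-wise subsampling, not the PipeOpSubsample code.
# Assumption: frac is applied to the number of groups, rounded to an integer.
subsample_groups <- function(group, frac) {
  ids <- unique(group)
  n_keep <- round(frac * length(ids))
  keep <- ids[sample.int(length(ids), n_keep)]  # draw whole groups
  which(group %in% keep)                        # all rows of the drawn groups
}

set.seed(1)
group <- c("a", "a", "a", "b", "c", "c", "d")
rows <- subsample_groups(group, frac = 0.5)

# Realized share of rows; it depends on the sizes of the drawn groups
# and need not equal frac.
realized <- length(rows) / length(group)
```

With 4 groups and frac = 0.5, two groups are kept, but the share of rows kept ranges from 2/7 to 5/7 depending on which groups are drawn.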