Feature suggestion: `most()` and `assert_count_true()` #539

billdenney · 2023-04-27T13:56:59Z

I sometimes get dirty data with multiple conflicting values from which I need to choose one. In a recent example, I received a dataset where an individual had multiple values for their sex (both male and female when they definitely did not undergo gender reassignment between the measurements).

To work with these types of issues, I think that two different types of functions can help:

most(x) is a companion to any() and all() from base R. It takes in a logical vector, x, and returns TRUE if more than half of its values are TRUE.

assert_count_true(x, n) takes in a logical vector, x, and the number of its values that are expected to be TRUE, n. If sum(x) == n, it returns x unchanged. If a different number of values are TRUE, it raises an error reporting the mismatch in count.
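A minimal sketch of what most() could look like, for illustration only. It is not an existing janitor function, and the na.rm argument is an assumption added here to mirror any() and all(); the proposal itself does not say how NA values should be handled.

```r
# Hypothetical sketch of most(); not part of janitor.
# na.rm is an assumed argument, modeled on any() and all().
most <- function(x, na.rm = FALSE) {
  stopifnot(is.logical(x))
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  # mean() of a logical vector is the fraction of TRUE values.
  # If NA values remain, the result is NA rather than TRUE or FALSE.
  mean(x) > 0.5
}

most(c(TRUE, TRUE, FALSE))
#> [1] TRUE
most(c(TRUE, FALSE))
#> [1] FALSE
```

An exact tie returns FALSE because the comparison is strictly "more than half".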


matanhakim · 2024-01-30T18:15:18Z

Is this suggestion still relevant?
If so, I might take a crack at most().

billdenney · 2024-01-30T19:27:32Z

These are different from some of the typical janitor functions, so I'd like @sfirke to weigh in on whether they feel like a good fit.

sfirke · 2024-01-30T19:50:27Z

I'm fine with adding most(). In your example, might you use it like:

dat %>%
  group_by(id) %>%
  filter(most(gender == "male"))

To get the data for all participants for whom most of their gender values are male? Just checking my understanding.

Should it take an optional cutoff value that defaults to 0.5? Then maybe we would be talking about calling it at_least() ...

Not trying to muddy the waters, just want to get precise on design and use cases.

sfirke · 2024-01-30T19:52:08Z

Could you share an example of using assert_count_true()? With mtcars or similar? I don't quite grasp how I would use it. There has been some talk about assertive checks in janitor; I can't remember what approach we landed on, but in general I support them. janitor::compare_df_cols_same() is made for assertion.

billdenney · 2024-01-30T20:48:21Z

I like at_least(x, fraction = 0.5) more than most() as it covers a more general case with no more user effort.
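A rough sketch of at_least(x, fraction = 0.5), offered as one possible reading of the idea rather than a settled design. The na.rm argument and the input checks are assumptions. The sketch uses >=, which matches the name "at least" but differs from most() as first described (strictly more than half), so a tie at the cutoff returns TRUE here.

```r
# Hypothetical sketch of at_least(); not part of janitor.
# na.rm and the validation of `fraction` are assumptions.
at_least <- function(x, fraction = 0.5, na.rm = FALSE) {
  stopifnot(
    is.logical(x),
    is.numeric(fraction),
    length(fraction) == 1,
    fraction >= 0,
    fraction <= 1
  )
  # The fraction of TRUE values must reach the cutoff.
  mean(x, na.rm = na.rm) >= fraction
}

at_least(c(TRUE, FALSE))
#> [1] TRUE
at_least(c(TRUE, FALSE, FALSE))
#> [1] FALSE
```

Whether the boundary should be inclusive is a design choice the thread leaves open.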

For assert_count_true(), I often use something like it in my data cleaning routines. I have data where I know that one particular row is bad. I want to make sure that I only match that one row and no more or fewer (or maybe 5 rows or...). My use case looks like:

cleaned_data <-
  data |>
  mutate(
    Age =
      case_when(
        assert_count_true(Person == "Bill" & Age == 40, n = 1) ~ 29, # The fountain of youth :)
        TRUE ~ Age
      )
  )

My implementation looks something like the following. I typed it directly into the issue, so it is untested, and I may not be using deparse() correctly; the idea is to tell the user the actual expression that was passed as x:

assert_count_true <- function(x, n = 1) {
  stopifnot(is.logical(x))
  # Capture the expression the caller passed, for use in error messages
  x_label <- deparse(substitute(x))
  if (any(is.na(x))) {
    stop(x_label, " has NA values")
  }
  if (sum(x) != n) {
    stop(x_label, " expected ", n, " TRUE values, but ", sum(x), " were found")
  }
  x
}

billdenney · 2024-01-31T11:39:40Z

Here's some better, working code for assert_count_true() with more helpful and grammatically correct error messages:

assert_count_true <- function(x, n = 1) {
  stopifnot(is.logical(x))
  if (any(is.na(x))) {
    stop(deparse(substitute(x)), " has NA values")
  }
  if (sum(x) != n) {
    stop_message <-
      sprintf(
        "`%s` expected %g `TRUE` %s but %g %s found.",
        deparse(substitute(x)),
        n,
        ngettext(n, "value", "values"),
        sum(x),
        ngettext(sum(x), "was", "were")
      )
    stop(stop_message)
  }
  x
}

foo <- c(TRUE, TRUE, FALSE)
assert_count_true(foo, n = 1)
#> Error in assert_count_true(foo, n = 1): `foo` expected 1 `TRUE` value but 2 were found.

bar <- c("Bill", "Sam", "Matan")
assert_count_true(bar == "Bill", n = 1)
#> [1]  TRUE FALSE FALSE

bar <- c("Bill", "Sam", "Matan")
assert_count_true(bar == "Bill", n = 2)
#> Error in assert_count_true(bar == "Bill", n = 2): `bar == "Bill"` expected 2 `TRUE` values but 1 was found.

Created on 2024-01-31 with reprex v2.0.2
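Since an mtcars example was requested earlier in the thread, here is a hedged sketch of how the assertion might guard a targeted fix. The function body is a condensed copy of the one above, repeated so the block runs on its own; the replacement value 330 and the variable names cars, fix_row, and msg are made up for illustration.

```r
# Condensed copy of assert_count_true() so this block is self-contained.
assert_count_true <- function(x, n = 1) {
  stopifnot(is.logical(x))
  x_label <- deparse(substitute(x))
  if (any(is.na(x))) {
    stop(x_label, " has NA values")
  }
  if (sum(x) != n) {
    stop(sprintf(
      "`%s` expected %g `TRUE` %s but %g %s found.",
      x_label, n, ngettext(n, "value", "values"),
      sum(x), ngettext(sum(x), "was", "were")
    ))
  }
  x
}

cars <- mtcars

# Exactly one car has hp == 335, so the assertion passes and the
# returned logical vector can be used to index the row to change.
fix_row <- assert_count_true(cars$hp == 335, n = 1)
cars$hp[fix_row] <- 330 # hypothetical corrected value

# A condition that matches more rows than expected stops with an error;
# tryCatch() is used here only to capture the message.
msg <- tryCatch(
  assert_count_true(cars$cyl == 4, n = 1),
  error = function(e) conditionMessage(e)
)
msg
#> [1] "`cars$cyl == 4` expected 1 `TRUE` value but 11 were found."
```

The point of the pattern is that a cleaning rule written for one known bad row fails loudly if the data later changes and the rule starts matching more or fewer rows.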

billdenney mentioned this issue May 22, 2024

Add assert_count_true() to verify that an expected number of values are TRUE #573

Open
