Lookup ambiguity in data-expressions #76

Closed
ghost opened this issue Sep 24, 2018 · 11 comments

@ghost

ghost commented Sep 24, 2018

@JohnMount commented on Sep 23, 2018, 2:35 PM UTC:

dplyr::select() returns a wrong column, should probably throw a "no such column/value" style exception.

library("dplyr")
packageVersion("dplyr")

    ## [1] '0.7.6'

packageVersion("rlang")

    ## [1] '0.2.2'

packageVersion("tidyselect")

    ## [1] '0.2.4'

y <- "x"

dxy <- data.frame(x = 1, y = 2)
dx <- data.frame(x = 1) 

# returns column x (correct)
dx %>%
 select(y)

    ##   x
    ## 1 1

# returns column y (incorrect)
dxy %>%
 select(y)

    ##   y
    ## 1 2

This issue was moved by romainfrancois from tidyverse/dplyr#3851.

@ghost
Author

ghost commented Sep 24, 2018

@romainfrancois commented on Sep 24, 2018, 7:35 AM UTC:

This is the documented behavior. Columns have priority in dxy %>% select(y) so the y column is selected. There is no y in dx so y is evaluated (to "x").

This is, in essence, tidyselect semantics. I guess what you describe is the standard evaluation semantics of [.

@krlmlr
Member

krlmlr commented Sep 24, 2018

This was previously discussed in #49, with the resolution to keep the behavior. But that issue didn't include the example here:

library(tidyselect)

# Dangerous: two ways to interpret
x <- "y"
vars_select("x", x)
#>   x 
#> "x"
vars_select("x", x:y)
#> Error in is_character(x, encoding = encoding, n = 1L): object 'y' not found
vars_select("x", -x)
#> named character(0)
vars_select("y", x)
#>   y 
#> "y"
vars_select("y", x:y)
#>   y 
#> "y"
vars_select("y", -x)
#> named character(0)


# Safe
vars_select("x", !!x)
#> Error: Unknown column `y` 
#> Backtrace:
#>  ─tryCatch(...)
#>  ─vars_select("x", !!x)
#>  ─vars_select_eval(.vars, quos) /home/kirill/git/R/tidyselect/R/vars-select.R:118:2
#>  ─map_if(ind_list, is_character, match_strings, names = TRUE) /home/kirill/git/R/tidyselect/R/vars-select.R:236:2
#>  ─map(.x[sel], .f, ...) /tmp/RtmpqZVXSG/R.INSTALL62801d434702/purrr/R/map.R:112:2
#>  ─.f(.x[[i]], ...) /tmp/RtmpqZVXSG/R.INSTALL62801d434702/purrr/R/map.R:104:2
#>  ─bad_unknown_vars(vars, unknown) /home/kirill/git/R/tidyselect/R/vars-select.R:272:4
vars_select("x", (!!x):y)
#> Error: Unknown column `y` 
#> Backtrace:
#>  ─tryCatch(...)
#>  ─"y":y
#>  ─match_strings(x) /home/kirill/git/R/tidyselect/R/vars-select.R:243:4
#>  ─bad_unknown_vars(vars, unknown) /home/kirill/git/R/tidyselect/R/vars-select.R:272:4
vars_select("x", -!!x)
#> Error: Unknown column `y` 
#> Backtrace:
#>  ─tryCatch(...)
#>  ─-"y"
#>  ─match_strings(x) /home/kirill/git/R/tidyselect/R/vars-select.R:257:4
#>  ─bad_unknown_vars(vars, unknown) /home/kirill/git/R/tidyselect/R/vars-select.R:272:4
vars_select("y", !!x)
#>   y 
#> "y"
vars_select("y", (!!x):y)
#>   y 
#> "y"
vars_select("y", -!!x)
#> named character(0)

# Correct (but why is the error message misleading?)
x <- rlang::sym("y")
vars_select("x", !!x)
#> Error in .f(.x[[i]], ...): object 'y' not found
vars_select("x", (!!x):y)
#> Error in is_character(x, encoding = encoding, n = 1L): object 'y' not found
vars_select("x", -!!x)
#> Error in is_character(x): object 'y' not found
vars_select("y", !!x)
#>   y 
#> "y"
vars_select("y", (!!x):y)
#>   y 
#> "y"
vars_select("y", -!!x)
#> named character(0)

Created on 2018-09-24 by the reprex package (v0.2.1.9000)

For now the easiest remedy might be to give an appropriate warning in the documentation, and refer to select(!!x), select(one_of()) or select_at(vars()) for programming.
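A minimal sketch of the unambiguous forms, reusing the data from the original report. It assumes a tidyselect release that provides all_of(), which is newer than the versions listed at the top of this thread; the !! form should also work with the versions shown there.

```r
library(dplyr)

y <- "x"
dxy <- data.frame(x = 1, y = 2)

# Ambiguous: the bare name `y` matches the column `y` first.
by_bare_name <- names(select(dxy, y))

# Unambiguous: inject the value of the external variable `y` ("x").
by_injection <- names(select(dxy, !!y))

# Unambiguous: all_of() always treats `y` as an external character vector.
by_all_of <- names(select(dxy, all_of(y)))
```

Both by_injection and by_all_of pick the column named by the external variable, regardless of whether the data also contains a column called y.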

@lionel- lionel- added the dsl label Sep 9, 2019
@lionel- lionel- changed the title dplyr::select() returns a wrong column, should probably throw a "no such column/value" style exception Data expressions should not look up values outside the data mask Sep 9, 2019
@lionel- lionel- added the feature a feature request or enhancement label Sep 9, 2019
lionel- added a commit to lionel-/tidyselect that referenced this issue Oct 18, 2019
@lionel-
Member

lionel- commented Oct 18, 2019

This type of variable lookup in selection contexts now triggers a message:

vars <- c("cyl", "disp")
mtcars %>% dplyr::select(vars) %>% head()
#> Note: Selecting non-column variables is brittle.
#> ℹ If the data contains `vars` it will be selected instead.
#> ℹ Use `all_of(vars)` to silence this message.
#>                   cyl disp
#> Mazda RX4           6  160
#> Mazda RX4 Wag       6  160
#> Datsun 710          4  108
#> Hornet 4 Drive      6  258
#> Hornet Sportabout   8  360
#> Valiant             6  225

This is the first step towards deprecation of this behaviour.

@lionel- lionel- added this to the future milestone Nov 7, 2019
@lionel- lionel- changed the title Data expressions should not look up values outside the data mask Lookup ambiguity in data-expressions Nov 14, 2019
@rcragun

rcragun commented Jun 23, 2021

I think I found an odd case that is part of this issue. Piping a selection helper into across() produces different behavior than typing the helper inside across():

# Very simple function: returns input
self = function(x){x}

# Data to manipulate
dtemp = tibble(var = 1:2)

# No message when selection helper is inside across()
dtemp %>% mutate(across(matches("var"), self))

# Generates message when selection helper is piped to across()
dtemp %>% mutate(matches("var") %>% across(self))

The last line recommends all_of(), but the line before it does not.

I do not know if this is an issue with the code that generates the message or with my understanding of the pipe operator. I think most users understand a %>% b to be equivalent to b(a), but perhaps a %>% b evaluates a and then applies b() to the value returned from evaluating a while b(a) evaluates a in an environment somehow determined by b(). Even if this is how the pipe works, the message could be more helpful to the user because this example does not clearly reference an external vector.

dplyr version: 1.0.6
tidyverse version: 1.3.1

@lionel-
Member

lionel- commented Jun 28, 2021

I think most users understand a %>% b to be equivalent to b(a), but perhaps a %>% b evaluates a and then applies b() to the value returned from evaluating a

This is more or less how %>% works. On the other hand, the base pipe |> of R 4.1 operates at parsing time rather than evaluation time (in other words it's like a macro instead of like a function), which means a |> b() is perfectly equivalent to b(a). With the base pipe you won't get the message.
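The parse-time difference can be checked without evaluating anything. This is a small sketch, assuming R 4.1 or later so that |> parses; a and b are placeholder names that are never evaluated.

```r
# quote() captures the parsed expression without evaluating it,
# so neither magrittr nor the objects `a` and `b` need to exist.
base_pipe <- quote(a |> b())
magrittr_pipe <- quote(a %>% b())

# The base pipe is rewritten by the parser into a plain call b(a).
base_is_plain_call <- identical(base_pipe, quote(b(a)))

# The magrittr pipe stays an ordinary call to the function `%>%`,
# which only decides what to do with `a` and `b()` at evaluation time.
magrittr_head <- as.character(magrittr_pipe[[1]])
```

Here base_is_plain_call is TRUE and magrittr_head is "%>%", which is why across() sees the selection helper directly only in the base-pipe form.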

@rcragun

rcragun commented Aug 6, 2021

Is there any way to identify when the user has used %>% and add to the warning the suggestion that that may be the reason for the unexpected (to them) behavior?

@lionel-
Member

lionel- commented Aug 16, 2021

hmm not really. The check would have to be implemented at the level of mutate() and this sort of linting doesn't seem like a great fit for that function.

@hadley
Member

hadley commented Aug 11, 2022

While this combination of behaviours certainly has a couple of somewhat surprising edge cases, we believe that the individual behaviours are sound, and can't see an obvious way to avoid the problem that arises in their combination. So our plan is to leave tidyselect semantics as they currently are, but we'll certainly revisit if we discover that these problems are more widespread than we initially thought.

@hadley hadley closed this as completed Aug 11, 2022
@yannsay-impact

The warning message below mentions dplyr::select and the FAQ also mentions tidyr::pivot_longer. I have come across the same message with tidyr::unite. I am guessing it applies to all the functions that take an argument of type tidy-select. Would it help to update the FAQ so that it does not name only select and pivot_longer?

── Warning ('test-xxx.R:120:3'): expect equal ────────────────────────
Using an external vector in selections was deprecated in tidyselect 1.1.0.
i Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(variables_to_add)

  # Now:
  data %>% select(all_of(variables_to_add))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
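As an illustration for tidyr::unite(), the same all_of() wrapping appears to apply to its column selection. This is a sketch on made-up data; df, cols and the column names are hypothetical and not taken from the test that produced the warning above.

```r
library(tidyr)

# Hypothetical data and external character vector of column names.
df <- data.frame(a = c("x", "y"), b = c("1", "2"))
cols <- c("a", "b")

# all_of() marks `cols` as an external vector, so no column lookup is attempted.
united <- unite(df, "ab", all_of(cols), sep = "_")
```

With the default remove = TRUE, united keeps only the new column ab.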

@yfarjoun

This is an old issue, but I'm still not managing to find good documentation with examples for how to resolve other sources of this warning.

I seem to be getting it when using mutate (with .data[[.]]) and group_by_at but I have not managed to resolve this.

I'm worried that in a future version the warning will become an error, but am confused about how to use external vectors/variables in methods other than select and pivot_longer.
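One pattern that may help here, sketched on the built-in mtcars data: wrap external column names in all_of() inside across() for grouping, and use .data[[...]] for a single column in data-masking verbs. The variable names group_cols and value_col are hypothetical, and this is not confirmed in the thread as the fix for the case described above.

```r
library(dplyr)

# External character vectors holding column names.
group_cols <- "cyl"
value_col <- "mpg"

result <- mtcars %>%
  # Selection context: all_of() marks group_cols as an external vector.
  group_by(across(all_of(group_cols))) %>%
  # Data-masking context: .data[[...]] looks the column up by its name.
  summarise(mean_value = mean(.data[[value_col]]), .groups = "drop")
```

The group_by(across(all_of(...))) form plays the role that group_by_at() with a character vector used to play.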

@amelialouise

amelialouise commented May 17, 2024

Ran into this old closed issue today with similar concerns as @yfarjoun above. Some additional documentation I found that helped me to resolve the Using an external vector in selections was deprecated in tidyselect 1.1.0 warning came from the last bullet point on this dplyr_tidy_select help page:

If you want the user to be able to supply a tidyselect specification in a function argument, embrace the function argument, e.g. select(df, {{ vars }}).

Embracing the function argument in the select() statement resolved the warning in my case. So the use of all_of() and any_of() could be avoided. Not sure how strict the use of these functions may become, but below is a reprex that shows some interesting ways to encounter the warning and resolve it (some ways are without doing what the warning says).

reprex:

library(dplyr)
mydata <- tibble(var1 = c("a", "b", "c"), var2 = 11:13)

# functions that run but generate the warning 
select_and_rename_bad1 <- function(.data, old_var1, old_var2){
  .data %>% 
    select(old_var1, old_var2) %>%
    rename(
      "new_name_v1" = old_var1,
      "new_name_v2" = old_var2
    )
}

select_and_rename_bad2 <- function(.data, vars = c(old_var1, old_var2)){
  .data %>% 
    select(vars) %>% 
    rename(
      "new_name_v1" = old_var1,
      "new_name_v2" = old_var2
    )
}

Generate the warning by using either function, e.g.

mydata %>% 
  select_and_rename_bad2( 
    vars = c(
      old_var1 = "var1", 
      old_var2 = "var2"
  )
)

Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
Please use all_of() or any_of() instead.
# Was:
data %>% select(vars)
# Now:
data %>% select(all_of(vars))
See https://tidyselect.r-lib.org/reference/faq-external-vector.html.

Btw, the warning seems to get suppressed after it appears once in an R session. So if you try both functions above in the same R session, the second one you try might seem to actually not generate a warning. Refresh your R session and try again to confirm it's actually still generating it.

That dplyr_tidy_select help page says we should embrace {{ }} the function argument in our select statement, and sure enough, modifying the second function above with vars embraced runs and generates no warning:

# fixed version of `select_and_rename_bad2`
select_and_rename_good <- function(.data, vars = c(old_var1, old_var2)){
  .data %>% 
    select({{vars}}) %>% 
    rename(
      "new_name_v1" = old_var1,
      "new_name_v2" = old_var2
    )
}

Try it out

mydata %>% 
  select_and_rename_good(
    vars = c(
      old_var1 = "var1", 
      old_var2 = "var2"
    )
)

I failed a bunch of times trying to resolve the warning for functions that listed the variables in the function argument instead of listing them in a character vector in the function argument, e.g. functions like select_and_rename_bad1 above. The warning message suggested to use all_of or any_of in the select statement for each variable:

Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
Please use all_of() or any_of() instead.
# Was:
data %>% select(old_var1)
# Now:
data %>% select(all_of(old_var1))
See https://tidyselect.r-lib.org/reference/faq-external-vector.html.
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
Please use all_of() or any_of() instead.
# Was:
data %>% select(old_var2)
# Now:
data %>% select(all_of(old_var2))
See https://tidyselect.r-lib.org/reference/faq-external-vector.html.

The first modified select statement I tried for this function resulted in a massive error message, because all_of() and any_of() each take a single character vector rather than several separate arguments:
select(all_of(old_var1, old_var2))

The correct way to modify the select statement is

select(all_of(old_var1), all_of(old_var2))

Some other modified select statements that ran but generated the same warning:

select(c(old_var1, old_var2))
select(all_of(c(old_var1, old_var2)))
select(any_of(c({{old_var1}}, {{old_var2}})))
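A possible variant of select_and_rename_bad1 that wraps the external names in all_of() in both select() and rename(). This is a sketch built on an assumption the thread does not confirm: that some of the remaining warnings came from the bare old_var1 and old_var2 in rename(), which also uses tidyselect. The function name select_and_rename_all_of is hypothetical.

```r
library(dplyr)

mydata <- tibble(var1 = c("a", "b", "c"), var2 = 11:13)

# Hypothetical rewrite: every use of an external name goes through all_of().
select_and_rename_all_of <- function(.data, old_var1, old_var2) {
  .data %>%
    select(all_of(c(old_var1, old_var2))) %>%
    rename(
      new_name_v1 = all_of(old_var1),
      new_name_v2 = all_of(old_var2)
    )
}

renamed <- select_and_rename_all_of(mydata, "var1", "var2")
names(renamed)
```

Because the deprecation warning is shown only once per session, a fresh R session is the safest way to check whether any warning remains.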

Session info

R 4.2.2
dplyr 1.1.2 & dplyr 1.1.4
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22631)

R 4.3.1
dplyr 1.1.3
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2019 x64
