---
title: "useR to programmeR"
subtitle: "👋 & Functions 1"
author: "Emma Rand and Ian Lyttle <br>WiFi: Posit Conf 2024<br>Password: conf2024"
format: 
  revealjs:
    theme: [simple, styles.scss]
    footer: <https://pos.it/programming_r_24>
    slide-number: true
    chalkboard: true
    code-link: true
    code-line-numbers: false
bibliography: references.bib
---

# 👋 Welcome

## Introductions

This is a two-day, hands-on workshop for those who have embraced the tidyverse and want to improve their R programming skills and, especially, reduce the amount of duplication in their code.

-   do you have experience equivalent to an introductory data science course using tidyverse?
-   are you comfortable with the [Whole game](https://r4ds.hadley.nz/whole-game.html) chapter of [R for Data Science (2nd Edition)](https://r4ds.hadley.nz/) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund?

\

## Material

-   <http://pos.it/programming_r_24>

## The team

::: columns
::: {.column width="55%"}
Emma Rand

🐘[\@3mma\@mastodon.social](https://mastodon.social/@3mma)


Lionel Henry
:::

::: {.column width="45%"}
Ian Lyttle

🐘[\@ijlyttle\@vis.social](https://mastodon.social/@ijlyttle@vis.social)

Jonathan McPherson
:::
:::

 
## Standing on the shoulders of

-   [R for Data Science (2e)](https://r4ds.hadley.nz/) @wickham2023

-   [The tidyverse style guide](https://style.tidyverse.org/index.html) @wickham-style

-   [Programming with dplyr vignette](https://dplyr.tidyverse.org/articles/programming.html) @dplyr

### WiFi

-   Network: Posit Conf 2024
-   Password: conf2024

## Introductions

To each other! With help from Yorkshire!

. . .

::: columns
::: {.column width="36%"}
::: {style="font-size: 40%;"}
![](images/quality-street.png){width="300"}

*Ingredients*: Sugar, Glucose syrup, Cocoa mass, Vegetable fats (Palm, Rapeseed, Sunflower, Coconut, Mango kernel/ Sal/ Shea), Sweetened condensed skimmed milk (Skimmed milk, Sugar), Cocoa butter, Dried whole milk, Glucose-fructose syrup, Coconut, Lactose and proteins from whey (from Milk), Whey powder (from Milk), Hazelnuts, Skimmed milk powder, Butter (from Milk), Emulsifiers (Sunflower lecithin, E471), Flavourings, Butterfat (from Milk), Fat-reduced cocoa powder, Salt, Lactic acid.
:::
:::

::: {.column width="32%"}
::: {style="font-size: 40%;"}
![](images/after-eight-thin-mint-squares-25-piece-box.jpg){width="300"}

*Ingredients*: Sugar, Semi-Sweet Chocolate (Sugar, Chocolate, Cocoa Butter, Milkfat, Soy and Sunflower Lecithin, Natural Vanilla Flavor), Glucose Syrup, Peppermint Oil, Citric Acid, Invertase.
:::
:::

::: {.column width="32%"}
::: {style="font-size: 40%;"}
![](images/haribo-strawbs.jpeg){width="300"}

*Ingredients*: Glucose Syrup, Sugar, Starch, Acid: Citric Acid, Flavouring, Fruit and Plant Concentrates: Aronia, Blackcurrant, Elderberry, Grape, Lemon, Orange, Safflower Spirulina, Caramelised Sugar Syrup, Glazing Agents: Beeswax, Carnauba Wax, Elderberry Extract.
:::
:::
:::

## Code of Conduct

[Code of Conduct](https://posit.co/code-of-conduct/). **Please Review**

-   💙 Treat everyone with respect
-   🧡 Everyone should feel welcome and safe
-   Red lanyard = ❌📷

Reporting:

-   🗣️ any posit::conf staff member (t-shirt) or Info desk
-   📧 `codeofconduct@posit.com`
-   ☎️ 844-448-1212

## Housekeeping

-   There are gender-neutral bathrooms on levels 3, 4, 5, 6 & 7
-   Meditation/prayer rooms: in 503. Open Mon/Tues 0700 - 1900, Wed 0700 - 1700
-   Lactation room: in 509. Open Mon/Tues 0700 - 1900, Wed 0700 - 1700

## 🙏 to

-   Lionel and Jonathan

-   colleagues, friends and learners at Schneider Electric, University of York and RForwards!

-   Posit team and especially Mine Çetinkaya-Rundel

. . .

-   Ian!

. . .

-   Experience 🍱 🥗 🌮 🍴 🕐

## Prerequisites

We built this course using the most recent versions of R (4.4) and RStudio (2024.04). However, things *should* work with at least R 4.2 and RStudio 2023.03. You will need these packages:

-   {tidyverse}
-   {palmerpenguins}
-   {devtools}
-   {here}

🎬 Detailed instructions for installing these were covered in [Prerequisites](pre-reqs.html)

## Schedule {.smaller}

| Time          | Activity                                                                                    |
|:----------------------|:------------------------------------------------|
| 09:00 - 10:30 | [Functions 1](01-functions-01.html) Introduction, vector and dataframe functions, embracing |
| 10:30 - 11:00 | ☕ *Coffee break*                                                                           |
| 11:00 - 12:30 | [Functions 2](02-functions-02.html) Plot functions, style and side effects                  |
| 12:30 - 13:30 | 🍱 🥗 🌮 🍴 *Lunch break*                                                                   |
| 13:30 - 15:00 | [Iteration 1](03-iteration-01.html) Introduction and modifying multiple columns             |
| 15:00 - 15:30 | ☕ *Coffee break*                                                                           |
| 15:30 - 17:00 | [Iteration 2](04-iteration-02.html) Reading and writing multiple files                      |

## How we will work

-   stickies (TODO, update with current colors)

    -   🟦 I'm all good, I'm done

    -   🟪 I could do with some help

-   Discord

-   no stupid questions

-   🎬 Action!

## Learning Objectives

At the end of this section you will be able to:

::: {style="font-size: 80%;"}
-   explain the rationale for writing functions
-   write vector functions
    -   that take one or more vectors as input and output a vector
    -   that take one or more vectors as input and output a single value
-   specify defaults for function arguments
-   write functions that take dataframes as input and output a dataframe
-   use embracing to allow data masking and tidy selection within functions
:::

# Set up

## Project

<https://github.com/posit-conf-2024/programming-r-exercises>

🎬 Create a Project:

```{r}
#| eval: false

usethis::use_course("posit-conf-2024/programming-r-exercises")
```

## 

```         
> usethis::use_course("posit-conf-2024/programming-r-exercises")
✔ Downloading from 'https://github.com/posit-conf-2024/programming-r-exercises/zipball/HEAD'
Downloaded: 0.26 MB  
✔ Download stored in 'C:/Users/er13/OneDrive - University of York/Desktop/Desktop/posit-conf-2024-programming-r-exercises-978baff.zip'
✔ Unpacking ZIP file into 'posit-conf-2024-programming-r-exercises-978baff/' (45 files extracted)
Shall we delete the ZIP file ('posit-conf-2024-programming-r-exercises-978baff.zip')?

1: Not now
2: Yeah
3: Nope
```

🎬 Choose the option that means yes!

## 

```         
✔ Deleting 'posit-conf-2024-programming-r-exercises-978baff.zip'
✔ Opening project in RStudio
```

. . .

RStudio will restart

## Create a `.R` file

```{r}
#| eval: false

usethis::use_r("functions-01")
```

## Packages

🎬 Load packages:

```{r}
library(tidyverse)
library(palmerpenguins)
```

```         
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
```

## Load `penguins`

🎬 Load `penguins` data set

```{r}
data(penguins)
glimpse(penguins)
```

# Why write functions?

## Rationale

-   impact from code: reach and clarity
-   efficiency: update code in one place, decrease error rate, improve ability to reuse

## Example

We have several measurements:

-   `bill_length_mm`
-   `bill_depth_mm`
-   `flipper_length_mm`
-   `body_mass_g`

These are on very different scales

## 

```{r}
#| echo: false
#| layout-ncol: 2

penguins |> ggplot(aes(x = bill_length_mm)) +
  geom_histogram(bins = 20) +
  theme_gray(base_size = 22)
penguins |> ggplot(aes(x = bill_depth_mm)) +
  geom_histogram(bins = 20) +
  theme_gray(base_size = 22)
penguins |> ggplot(aes(x = flipper_length_mm)) +
  geom_histogram(bins = 20) +
  theme_gray(base_size = 22)
penguins |> ggplot(aes(x = body_mass_g)) +
  geom_histogram(bins = 20) +
  theme_gray(base_size = 22)
```

## Example

-   It is difficult to plot these on the same axis, or to judge whether a value is large for its variable

-   A common solution is to apply a $z$ score transformation to each variable

-   This standardises the values to have a mean of 0 and a standard deviation of 1

$$z = \frac{x - \bar{x}}{s.d.}$$

## Apply transformation

We can apply the same transformation to each variable:

```{r}
penguins <- penguins |>
  mutate(
    z_bill_length_mm = (bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) / sd(bill_length_mm, na.rm = TRUE),
    z_bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm, na.rm = TRUE)) / sd(bill_depth_mm, na.rm = TRUE),
    z_flipper_length_mm = (flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE)) / sd(flipper_length_mm, na.rm = TRUE),
    z_body_mass_g = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) / sd(body_mass_g, na.rm = TRUE)
  )
```

## Long, unclear

`(bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) / sd(bill_length_mm, na.rm = TRUE)`

-   Quite a lot of code
-   Difficult to determine what the transformation is

How to shorten and make more clear?

## Copying and pasting

-   is error prone

How to make fewer mistakes?

. . .

Writing a function:

-   can be named to make transformation transparent
-   will make code shorter
-   can be reused

🔑️ You may think you have to write complex functions - you don't! Start with the simple things.

# Types of function

## Types of function {auto-animate="true"}

We will cover two types of function

1.  vector functions: one or more vectors as input, one vector as output

. . .

2.  data frame functions: df as input and df as output

## Types of function {auto-animate="true"}

We will cover two types of function

1.  vector functions: one or more vectors as input, one vector as output

    i.  "mutate" functions: output is the same length as the input, so they work well in `mutate()` and `filter()`. We will use these to cover the principles of writing functions

    ii. summary functions: input is vector, output is a single value

2.  data frame functions: df as input and df as output

# Vector functions

## Output same length as input

-   output same length as input
-   work well in `mutate()`
-   appropriate for the *z*-transformation example

## General

To turn your code into a function you need:

-   a name
-   the arguments - which represent the bits that vary
-   the code body for the function

. . .

``` r
name <- function(arguments) {
  code body
}
```

## Function name

Use a verb: this is the recommendation of [The tidyverse style guide](https://style.tidyverse.org/index.html) [@wickham-style], but it is good advice regardless

. . .

Difficulty in naming? Should this be two or three functions?

. . .

What should we call the function we write to do a $z$ score transformation?

## Arguments

-   the input vector

-   additional arguments

Naming conventions

-   x for the vector input

``` r
name <- function(x) {
  body does things with x
}
```

## Example

$$z = \frac{x - \bar{x}}{s.d.}$$

``` r
penguins <- penguins |>
  mutate(
    z_bill_length_mm = (bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) / sd(bill_length_mm, na.rm = TRUE),
    z_bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm, na.rm = TRUE)) / sd(bill_depth_mm, na.rm = TRUE),
    z_flipper_length_mm = (flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE)) / sd(flipper_length_mm, na.rm = TRUE),
    z_body_mass_g = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) / sd(body_mass_g, na.rm = TRUE)
  )
```

## Example

Identify the arguments: the things that vary across calls

::: {style="font-size: 60%;"}
``` r
(bill_length_mm    - mean(bill_length_mm,    na.rm = TRUE)) / sd(bill_length_mm,    na.rm = TRUE)
(bill_depth_mm     - mean(bill_depth_mm,     na.rm = TRUE)) / sd(bill_depth_mm,     na.rm = TRUE)
(flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE)) / sd(flipper_length_mm, na.rm = TRUE)
(body_mass_g       - mean(body_mass_g,       na.rm = TRUE)) / sd(body_mass_g,       na.rm = TRUE)
```
:::

\

. . .

::: {style="font-size: 60%;"}
``` r
(🟧 - mean(🟧, na.rm = TRUE)) / sd(🟧, na.rm = TRUE)
(🟧 - mean(🟧, na.rm = TRUE)) / sd(🟧, na.rm = TRUE)
(🟧 - mean(🟧, na.rm = TRUE)) / sd(🟧, na.rm = TRUE)
(🟧 - mean(🟧, na.rm = TRUE)) / sd(🟧, na.rm = TRUE)
```
:::

🟧 is x

## Example

Put into the template

``` r
name <- function(x) {
  body does things with x
}
```

\

```{r}
to_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
```

## Apply

Rewrite the call to `mutate()` as:

```{r}
penguins <- penguins |>
  mutate(
    z_bill_length_mm = to_z(bill_length_mm),
    z_bill_depth_mm = to_z(bill_depth_mm),
    z_flipper_length_mm = to_z(flipper_length_mm),
    z_body_mass_g = to_z(body_mass_g)
  )
```

. . .

Much shorter, much more clear.

## A modification

`mean()` has a `trim` argument: `mean(x, trim = 0, na.rm = FALSE, ...)`

*the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.*

. . .

Suppose we want to specify the *middle* proportion left rather than the proportion trimmed from each end.

## A modification

-   A value of 0.1 for `trim` trims 0.1 from each end leaving 0.8 in the middle

-   `trim = (1 - middle) / 2`

![Trim is the proportion trimmed off each end; middle is what's left](images/vector-functions-trim.png){fig-alt="schematic of trim and middle demonstrating that trim = (1 - middle)/2"}
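
## A modification

A small sketch of the relationship, using a made-up vector `x` (illustrative values only, not workshop data). We pick `middle = 0.5` so that `trim` works out to exactly 0.25:

``` r
# Illustrative data only: eight values with one extreme at each end
x <- c(-100, 1, 2, 3, 4, 5, 6, 100)

middle <- 0.5
trim <- (1 - middle) / 2   # 0.25 trimmed from each end

mean(x)                # untrimmed mean uses all eight values
mean(x, trim = trim)   # drops the two lowest and two highest values first
```

The trimmed mean is the mean of the middle four values, so the two extremes no longer pull it around.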

## Add an argument

```{r}
to_z <- function(x, middle) {
  trim <- (1 - middle) / 2
  (x - mean(x, na.rm = TRUE, trim = trim)) / sd(x, na.rm = TRUE)
}
```

## Try it out

::: {style="font-size: 80%;"}
```{r}
to_z(penguins$bill_length_mm, middle = 0.2)

```
:::

## But what if we forget?

```{r}
#| error: true
to_z(penguins$bill_length_mm)

```

## Give a default

Give defaults whenever possible:

```{r}
to_z <- function(x, middle = 1) {
  trim <- (1 - middle) / 2
  (x - mean(x, na.rm = TRUE, trim = trim)) / sd(x, na.rm = TRUE)
}

```

## Try it out

::: {style="font-size: 80%;"}
```{r}
to_z(penguins$bill_length_mm)
```
:::
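
## Check the default

A quick sanity check, sketched with made-up values rather than the penguins data: with the default `middle = 1`, `trim` is 0, so the result should have a mean of about 0 and a standard deviation of about 1.

``` r
to_z <- function(x, middle = 1) {
  trim <- (1 - middle) / 2
  (x - mean(x, na.rm = TRUE, trim = trim)) / sd(x, na.rm = TRUE)
}

# Made-up values, including a missing one
x <- c(10, 12, NA, 15, 19)

z <- to_z(x)            # middle defaults to 1, so nothing is trimmed
mean(z, na.rm = TRUE)   # approximately 0
sd(z, na.rm = TRUE)     # approximately 1
is.na(z)                # the missing value stays missing, in the same position
```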

## Your turn

🎬 Write a function that performs the Box-Cox power transformation using the value of (non-zero) lambda ($\lambda$) supplied.

## Your turn: Box-Cox

$$bc = \frac{x^{\lambda} - 1}{\lambda}  \text{ for }\lambda \ne 0$$

-   Set the default $\lambda = 1$

. . .

-   Still have time? Check $\lambda \ne 0$
-   Still have time? Check and amend for:

$$
bc = \begin{cases}
  \frac{x^{\lambda} - 1}{\lambda} & \text{for }\lambda \ne 0\\    
  \log(x) & \text{for }\lambda = 0    
\end{cases}
$$

## A solution 1

```{r}
to_box_cox <- function(x, lambda = 1) {
  (x^lambda - 1) / lambda
}
```

## A solution 1 - test

```{r}
vals <- rexp(10000, 10) 
vals |> hist()
```

## A solution 1 - test

```{r}
to_box_cox(vals, 0.3) |> 
  hist()
```

## A solution 2

Check $\lambda \ne 0$
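
One possible sketch, not the only approach: stop with an error when `lambda` is zero. The wording of the error message is our own choice.

``` r
to_box_cox <- function(x, lambda = 1) {
  # The formula divides by lambda, so it is undefined at lambda = 0
  if (lambda == 0) {
    stop("`lambda` must be non-zero.")
  }
  (x^lambda - 1) / lambda
}
```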

## A solution 3

Check and amend for:

$$
bc = \begin{cases}
  \frac{x^{\lambda} - 1}{\lambda} & \text{for }\lambda \ne 0\\    
  \log(x) & \text{for }\lambda = 0    
\end{cases}
$$
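
One way this might look, as a sketch that assumes `lambda` is a single number: branch on `lambda` and use the natural log for the zero case.

``` r
to_box_cox <- function(x, lambda = 1) {
  if (lambda == 0) {
    # Special case from the definition above
    log(x)
  } else {
    (x^lambda - 1) / lambda
  }
}
```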

## Types of function

We will cover two types of function

1.  vector functions: one or more vectors as input, one vector as output

    i.  ✔️ output same length as input.

    **ii. ➡️ summary functions: input is vector, output is a single value**

2.  data frame functions: df as input and df as output

## Summary functions

-   input is vector
-   output is a single value
-   could be used in `summarise()`

## Example

Write a function to compute the standard error of a sample.

$$s.e. = \frac{s.d.}{\sqrt{n}}$$

## Example

```{r}
sd_error <- function(x){
  sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
}

```

. . .

Note: `sum(TRUE)` is 1 and `sum(FALSE)` is 0. Thus, `sum(!is.na(x))` gives you the number of `TRUE` values (i.e., the number of non-NA values) and is a bit shorter than `length(x[!is.na(x)])`.
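
## Counting non-missing values

A tiny illustration of the counting idea, using a made-up vector (the values are arbitrary):

``` r
x <- c(4.2, NA, 5.1, 3.8, NA)   # made-up values

!is.na(x)              # TRUE where a value is present
sum(!is.na(x))         # each TRUE counts as 1: the number of non-NA values
length(x[!is.na(x)])   # the same count, written the longer way
```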

## Try it out

🎬 Call the function on `penguins$bill_length_mm`

```{r}
sd_error(penguins$bill_length_mm)
```

. . .

Or in a pipeline

```{r}
penguins |> 
  summarise(se = sd_error(bill_length_mm))
```

## Your turn

🎬 Write a function to compute the sums of squares (sum of the squared deviations from the mean)

$$SS(x) = \sum{(x - \bar{x})^2}$$

or

$$SS(x) = s^2 \times (n - 1)$$

## A solution - 1

```{r}
sum_sq <- function(x) {
  sum((x[!is.na(x)] - mean(x[!is.na(x)]))^2)
}
```

. . .

🎬 Try it out

```{r}
sum_sq(penguins$bill_length_mm)
```
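
## A solution - 2

An alternative sketch based on the second formula, $s^2 \times (n - 1)$. The name `sum_sq_var()` is hypothetical, chosen here only to keep the two versions apart; the test vector is made up.

``` r
# Version from the previous slide, repeated so this block stands alone
sum_sq <- function(x) {
  sum((x[!is.na(x)] - mean(x[!is.na(x)]))^2)
}

# Hypothetical alternative: variance times (n - 1), with n the non-NA count
sum_sq_var <- function(x) {
  n <- sum(!is.na(x))
  var(x, na.rm = TRUE) * (n - 1)
}

x <- c(2, 4, NA, 6, 8)   # made-up values
sum_sq(x)
sum_sq_var(x)            # should agree with sum_sq(x)
```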

## Types of function

We will cover two types of function

1.  vector functions: one or more vectors as input, one vector as output

    i.  ✔️ output same length as input.

    ii. ✔️ summary functions: input is vector, output is a single value

**2. ➡️ data frame functions: df as input and df as output**

# Dataframe functions

## Dataframe functions

Dataframe as input and Dataframe as output

. . .

For example, we might summarise one of our columns like this:

```{r}
penguins |> 
  summarise(mean = mean(bill_length_mm, na.rm = TRUE),
            n = sum(!is.na(bill_length_mm)),
            sd = sd(bill_length_mm, na.rm = TRUE),
            se = sd_error(bill_length_mm))
```

Output is a dataframe

## Dataframe functions

We might also want to summarise other columns, or other dataframes, in the same way.

Good candidate for a function to avoid repetitive code: `my_summary()`

## Define `my_summary()` function

```{r}
my_summary <- function(df, column){
  df |> 
  summarise(mean = mean(column, na.rm = TRUE),
            n = sum(!is.na(column)),
            sd = sd(column, na.rm = TRUE),
            se = sd_error(column))
}
```

## Use function

```{r}
#| error: true
my_summary(penguins, bill_length_mm)
```

😕

## Tidy evaluation

`tidyverse` functions like `dplyr::summarise()` use "tidy evaluation" so you can refer to the names of variables inside dataframes. For example, you can use:

either

``` r
penguins |> summarise(mean = mean(bill_depth_mm))
```

Or

``` r
summarise(penguins, mean = mean(bill_depth_mm))
```

rather than `$` notation

``` r
summarise(penguins, mean = mean(penguins$bill_depth_mm))
```

. . .

This is known as data-masking: the dataframe environment masks the user environment by giving priority to the dataframe.

## Data masking is great....

and makes life easier when working interactively

. . .

But not so useful in functions

Because of data-masking, `summarise()` in `my_summary()` is looking for a column literally called `column` in the dataframe that has been passed in. It is not looking inside the argument `column` for the name of the column you want to give it.

. . .

[Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html)
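
## Data masking inside a function

A small sketch of the problem, assuming only {dplyr} is installed. The toy dataframe and the function name `mean_of()` are made up for illustration; the dataframe deliberately contains a column literally named `column`.

``` r
library(dplyr)

# Toy data: note the column that is literally called `column`
df <- tibble(column = c(1, 2, 3), other = c(10, 20, 30))

# No embracing: inside summarise(), `column` is looked up in the data first
mean_of <- function(df, column) {
  df |> summarise(mean = mean(column))
}

# We ask for `other`, but the result is the mean of the column named `column`
mean_of(df, other)
```

If the dataframe had no column called `column`, the same call would fail instead, which is what happened with `my_summary()`.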

## Fix `my_summary()` function

The solution is to use embracing: `{{ var }}`

```{r}
my_summary <- function(df, column){
  df |> 
  summarise(mean = mean({{ column }}, na.rm = TRUE),
            n = sum(!is.na({{ column }})),
            sd = sd({{ column }}, na.rm = TRUE),
            se = sd_error({{ column }}),
            .groups = "drop")
}
```

. . .

-   `{{ column }}` tells `summarise()` to look inside the `column` argument for the column to use
-   style with spaces inside the braces
-   `.groups = "drop"` to avoid the message and leave the data in an ungrouped state

## Use function

```{r}
my_summary(penguins, bill_length_mm)
```

🎉

## When to embrace?

When tidy evaluation is used
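
A minimal sketch of both situations, assuming {dplyr} is installed. The toy dataframe and the function names `keep_above()` and `keep_columns()` are hypothetical, made up for this illustration.

``` r
library(dplyr)

# Toy data for illustration only
df <- tibble(id = 1:4, mass = c(3.1, 4.7, 5.2, 2.9), len = c(40, 45, 50, 38))

# Data masking: `column` is computed on inside filter(), so embrace it.
# `threshold` is an ordinary value, so it needs no embracing.
keep_above <- function(df, column, threshold) {
  df |> filter({{ column }} > threshold)
}

# Tidy selection: `columns` picks columns inside select(), so embrace it too
keep_columns <- function(df, columns) {
  df |> select({{ columns }})
}

keep_above(df, mass, 4)
keep_columns(df, c(id, len))
```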

## Your turn

🎬 Write a function to calculate the median, maximum and minimum values of a variable grouped by another variable.

## A solution - 1

```{r}
my_summary <- function(df, summary_var, group_var) {
  df |> 
    group_by({{ group_var }}) |> 
    summarise(median = median({{ summary_var }}, na.rm = TRUE),
              minimum = min({{ summary_var }}, na.rm = TRUE),
              maximum = max({{ summary_var }}, na.rm = TRUE),
              .groups = "drop")
}
```

## Your turn

🎬 Try it out

```{r}
my_summary(penguins, bill_length_mm, species)
```

## A solution - 2

Improvement: Have a default of `NULL` for the grouping variable

```{r}
my_summary <- function(df, summary_var, group_var = NULL) {
  df |> 
    group_by({{ group_var }}) |> 
    summarise(median = median({{ summary_var }}, na.rm = TRUE),
              minimum = min({{ summary_var }}, na.rm = TRUE),
              maximum = max({{ summary_var }}, na.rm = TRUE),
              .groups = "drop")
}
```

## Your turn

🎬 Try it out

```{r}
my_summary(penguins, bill_length_mm)
```

## Your turn

🎬 Try it out with more than one group

```{r}
#| error: true
my_summary(penguins, bill_length_mm, c(species, island))
```

😕

## A solution - 3

Use `pick()` which allows you to select a subset of columns inside a data masking function:

. . .

```{r}
my_summary <- function(df, summary_var, group_var = NULL) {
  df |> 
    group_by(pick({{ group_var }})) |> 
    summarise(median = median({{ summary_var }}, na.rm = TRUE),
              minimum = min({{ summary_var }}, na.rm = TRUE),
              maximum = max({{ summary_var }}, na.rm = TRUE),
              .groups = "drop")
}
```

## 

🎬 Try it out with more than one group

```{r}
my_summary(penguins, bill_length_mm, c(species, island))
```

## Extras

-   Shortcuts:

    -   put cursor on a function call and press F2 to find its definition
    -   Ctrl+. opens section/file search

## Summary ☕

::: {style="font-size: 80%;"}
-   Writing functions can make you more efficient and make your code more readable. This can be just for your benefit.

-   Vector functions take one or more vectors as input; output can be a vector (useful in `mutate()` and `filter()`) or a single value (useful in `summarise()`)

-   Dataframe functions take a dataframe as input and output a dataframe

-   Give arguments a default where possible

-   We use `{{ var }}` embracing to manage data masking

-   We use `pick()` to select more than one variable
:::

## References