Commit

Fix typo (closes #1681) + various other copy edits
mine-cetinkaya-rundel authored Sep 2, 2024
1 parent 643ab1b commit 9a9ec24
Showing 1 changed file with 37 additions and 36 deletions.
73 changes: 37 additions & 36 deletions data-transform.qmd
@@ -15,12 +15,12 @@ You'll learn how to do all that (and more!) in this chapter, which will introduc
The goal of this chapter is to give you an overview of all the key tools for transforming a data frame.
We'll start with functions that operate on rows and then columns of a data frame, then circle back to talk more about the pipe, an important tool that you use to combine verbs.
We will then introduce the ability to work with groups.
We will end the chapter with a case study that showcases these functions in action and we'll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g., numbers, strings, dates).
We will end the chapter with a case study that showcases these functions in action. In later chapters, we'll return to the functions in more detail as we start to dig into specific types of data (e.g., numbers, strings, dates).

### Prerequisites

In this chapter we'll focus on the dplyr package, another core member of the tidyverse.
We'll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.
In this chapter, we'll focus on the dplyr package, another core member of the tidyverse.
We'll illustrate the key ideas using data from the nycflights13 package and use ggplot2 to help us understand the data.

```{r}
#| label: setup
@@ -32,14 +32,14 @@ library(tidyverse)
Take careful note of the conflicts message that's printed when you load the tidyverse.
It tells you that dplyr overwrites some functions in base R.
If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()` and `stats::lag()`.
So far we've mostly ignored which package a function comes from because most of the time it doesn't matter.
So far, we've mostly ignored which package a function comes from because it doesn't usually matter.
However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we'll use the same syntax as R: `packagename::functionname()`.
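
To make the `packagename::functionname()` syntax concrete, here is a minimal sketch that calls both versions of `filter()` by their full names. It uses the built-in `mtcars` dataset and the object names `small` and `smoothed`, which are our own choices for illustration and are not part of this chapter's data.

```{r}
# dplyr::filter() keeps the rows of a data frame that satisfy a condition.
# (Assumes the dplyr package is installed; mtcars ships with base R.)
small <- dplyr::filter(mtcars, cyl == 4)
nrow(small)

# stats::filter() is a different function entirely: it applies a linear
# filter (here, a three-point moving average) to a numeric series.
smoothed <- stats::filter(1:10, rep(1 / 3, 3))
smoothed
```

Because each call names its package explicitly, the code behaves the same whether or not dplyr has been loaded with `library()`.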

### nycflights13

To explore the basic dplyr verbs, we're going to use `nycflights13::flights`.
To explore the basic dplyr verbs, we will use `nycflights13::flights`.
This dataset contains all `r format(nrow(nycflights13::flights), big.mark = ",")` flights that departed from New York City in 2013.
The data comes from the US [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr), and is documented in `?flights`.
The data comes from the US [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr) and is documented in `?flights`.

```{r}
flights
@@ -48,24 +48,24 @@ flights
`flights` is a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas.
The most important difference between tibbles and data frames is the way tibbles print; they are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
There are a few options to see everything.
If you're using RStudio, the most convenient is probably `View(flights)`, which will open an interactive scrollable and filterable view.
If you're using RStudio, the most convenient is probably `View(flights)`, which opens an interactive, scrollable, and filterable view.
Otherwise you can use `print(flights, width = Inf)` to show all columns, or use `glimpse()`:

```{r}
glimpse(flights)
```

In both views, the variables names are followed by abbreviations that tell you the type of each variable: `<int>` is short for integer, `<dbl>` is short for double (aka real numbers), `<chr>` for character (aka strings), and `<dttm>` for date-time.
These are important because the operations you can perform on a column depend so much on its "type".
In both views, the variable names are followed by abbreviations that tell you the type of each variable: `<int>` is short for integer, `<dbl>` is short for double (aka real numbers), `<chr>` for character (aka strings), and `<dttm>` for date-time.
These are important because the operations you can perform on a column depend heavily on its "type."

### dplyr basics

You're about to learn the primary dplyr verbs (functions) which will allow you to solve the vast majority of your data manipulation challenges.
You're about to learn the primary dplyr verbs (functions), which will allow you to solve the vast majority of your data manipulation challenges.
But before we discuss their individual differences, it's worth stating what they have in common:

1. The first argument is always a data frame.

2. The subsequent arguments typically describe which columns to operate on, using the variable names (without quotes).
2. The subsequent arguments typically describe which columns to operate on using the variable names (without quotes).

3. The output is always a new data frame.
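
As a rough illustration of these three properties, the sketch below applies `filter()` to a small made-up tibble. The names `df`, `x`, `y`, and `kept` are hypothetical and chosen only for this example.

```{r}
library(dplyr)

# A tiny illustrative tibble (not part of nycflights13).
df <- tibble(x = c(3, 1, 2), y = c("a", "b", "c"))

# 1. The first argument is the data frame.
# 2. The column `x` is referred to by name, without quotes.
kept <- filter(df, x > 1)

# 3. The result is a new data frame; `df` itself is left unchanged.
is.data.frame(kept)
nrow(kept)
nrow(df)
```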

@@ -86,14 +86,15 @@ flights |>
```

dplyr's verbs are organized into four groups based on what they operate on: **rows**, **columns**, **groups**, or **tables**.
In the following sections you'll learn the most important verbs for rows, columns, and groups, then we'll come back to the join verbs that work on tables in @sec-joins.
In the following sections, you'll learn the most important verbs for rows, columns, and groups. Then, we'll return to the join verbs that work on tables in @sec-joins.
Let's dive in!

## Rows

The most important verbs that operate on rows of a dataset are `filter()`, which changes which rows are present without changing their order, and `arrange()`, which changes the order of the rows without changing which are present.
Both functions only affect the rows, and the columns are left unchanged.
We'll also discuss `distinct()` which finds rows with unique values but unlike `arrange()` and `filter()` it can also optionally modify the columns.
We'll also discuss `distinct()` which finds rows with unique values.
Unlike `arrange()` and `filter()` it can also optionally modify the columns.

### `filter()`

@@ -102,7 +103,7 @@ The first argument is the data frame.
The second and subsequent arguments are the conditions that must be true to keep the row.
For example, we could find all flights that departed more than 120 minutes (two hours) late:

[^data-transform-1]: Later, you'll learn about the `slice_*()` family which allows you to choose rows based on their positions.
[^data-transform-1]: Later, you'll learn about the `slice_*()` family, which allows you to choose rows based on their positions.

```{r}
flights |>
@@ -170,9 +171,9 @@ We'll learn more about what's happening here and why in @sec-order-operations-bo

`arrange()` changes the order of the rows based on the value of the columns.
It takes a data frame and a set of column names (or more complicated expressions) to order by.
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
If you provide more than one column name, each additional column will be used to break ties in the values of the preceding columns.
For example, the following code sorts by the departure time, which is spread over four columns.
We get the earliest years first, then within a year the earliest months, etc.
We get the earliest years first, then within a year, the earliest months, etc.

```{r}
flights |>
@@ -191,7 +192,7 @@ Note that the number of rows has not changed -- we're only arranging the data, w

### `distinct()`

`distinct()` finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows.
`distinct()` finds all the unique rows in a dataset, so technically, it primarily operates on the rows.
Most of the time, however, you'll want the distinct combination of some variables, so you can also optionally supply column names:

```{r}
@@ -204,7 +205,7 @@ flights |>
distinct(origin, dest)
```

Alternatively, if you want to the keep other columns when filtering for unique rows, you can use the `.keep_all = TRUE` option.
Alternatively, if you want to keep other columns when filtering for unique rows, you can use the `.keep_all = TRUE` option.

```{r}
flights |>
@@ -213,7 +214,7 @@ flights |>

It's not a coincidence that all of these distinct flights are on January 1: `distinct()` will find the first occurrence of a unique row in the dataset and discard the rest.

If you want to find the number of occurrences instead, you're better off swapping `distinct()` for `count()`, and with the `sort = TRUE` argument you can arrange them in descending order of number of occurrences.
If you want to find the number of occurrences instead, you're better off swapping `distinct()` for `count()`. With the `sort = TRUE` argument, you can arrange them in descending order of the number of occurrences.
You'll learn more about count in @sec-counts.

```{r}
@@ -229,10 +230,10 @@ flights |>
- Flew to Houston (`IAH` or `HOU`)
- Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn't leave late
- Arrived more than two hours late but didn't leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight

2. Sort `flights` to find the flights with longest departure delays.
2. Sort `flights` to find the flights with the longest departure delays.
Find the flights that left earliest in the morning.

3. Sort `flights` to find the fastest flights.
@@ -265,8 +266,8 @@ flights |>
)
```

By default, `mutate()` adds new columns on the right hand side of your dataset, which makes it difficult to see what's happening here.
We can use the `.before` argument to instead add the variables to the left hand side[^data-transform-2]:
By default, `mutate()` adds new columns on the right-hand side of your dataset, which makes it difficult to see what's happening here.
We can use the `.before` argument to instead add the variables to the left-hand side[^data-transform-2]:

[^data-transform-2]: Remember that in RStudio, the easiest way to see a dataset with many columns is `View()`.

@@ -279,7 +280,7 @@ flights |>
)
```

The `.` is a sign that `.before` is an argument to the function, not the name of a third new variable we are creating.
The `.` indicates that `.before` is an argument to the function, not the name of a third new variable we are creating.
You can also use `.after` to add after a variable, and in both `.before` and `.after` you can use the variable name instead of a position.
For example, we could add the new variables after `day`:

@@ -371,7 +372,7 @@ See `?select` for more details.
Once you know regular expressions (the topic of @sec-regular-expressions) you'll also be able to use `matches()` to select variables that match a pattern.
You can rename variables as you `select()` them by using `=`.
The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
The new name appears on the left-hand side of the `=`, and the old variable appears on the right-hand side:
```{r}
flights |>
@@ -587,10 +588,10 @@ flights |>
)
```

Uhoh!
Something has gone wrong and all of our results are `NA`s (pronounced "N-A"), R's symbol for missing value.
Uh-oh!
Something has gone wrong, and all of our results are `NA`s (pronounced "N-A"), R's symbol for missing value.
This happened because some of the observed flights had missing data in the delay column, and so when we calculated the mean including those values, we got an `NA` result.
We'll come back to discuss missing values in detail in @sec-missing-values, but for now we'll tell the `mean()` function to ignore all missing values by setting the argument `na.rm` to `TRUE`:
We'll come back to discuss missing values in detail in @sec-missing-values, but for now, we'll tell the `mean()` function to ignore all missing values by setting the argument `na.rm` to `TRUE`:

```{r}
flights |>
@@ -616,7 +617,7 @@ Means and counts can get you a surprisingly long way in data science!

### The `slice_` functions

There are five handy functions that allow you extract specific rows within each group:
There are five handy functions that allow you to extract specific rows within each group:

- `df |> slice_head(n = 1)` takes the first row from each group.
- `df |> slice_tail(n = 1)` takes the last row in each group.
@@ -740,7 +741,7 @@ You can learn more about it in the [dplyr 1.1.0 blog post](https://www.tidyverse

2. Find the flights that are most delayed upon departure from each destination.

3. How do delays vary over the course of the day.
3. How do delays vary over the course of the day?
Illustrate your answer with a plot.

4. What happens if you supply a negative `n` to `slice_min()` and friends?
@@ -768,7 +769,7 @@ You can learn more about it in the [dplyr 1.1.0 blog post](https://www.tidyverse
```
b. Write down what you think the output will look like, then check if you were correct, and describe what `arrange()` does.
Also comment on how it's different from the `group_by()` in part (a).
Also, comment on how it's different from the `group_by()` in part (a).
```{r}
#| eval: false
@@ -853,10 +854,10 @@ When we plot the skill of the batter (measured by the batting average, `performa
```{r}
#| warning: false
#| fig-alt: |
#| A scatterplot of number of batting performance vs. batting opportunites
#| A scatterplot of the number of batting performances vs. batting opportunities
#| overlaid with a smoothed line. Average performance increases sharply
#| from 0.2 at when n is ~100 to 0.25 when n is ~1000. Average performance
#| continues to increase linearly at a much shallower slope reaching
#| continues to increase linearly at a much shallower slope, reaching
#| 0.3 when n is ~12,000.
batters |>
@@ -882,8 +883,8 @@ You can find a good explanation of this problem and how to overcome it at <http:
## Summary

In this chapter, you've learned the tools that dplyr provides for working with data frames.
The tools are roughly grouped into three categories: those that manipulate the rows (like `filter()` and `arrange()`), those that manipulate the columns (like `select()` and `mutate()`), and those that manipulate groups (like `group_by()` and `summarize()`).
The tools are roughly grouped into three categories: those that manipulate the rows (like `filter()` and `arrange()`), those that manipulate the columns (like `select()` and `mutate()`) and those that manipulate groups (like `group_by()` and `summarize()`).
In this chapter, we've focused on these "whole data frame" tools, but you haven't yet learned much about what you can do with the individual variable.
We'll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.
We'll return to that in the Transform part of the book, where each chapter provides tools for a specific type of variable.

In the next chapter, we'll pivot back to workflow to discuss the importance of code style, keeping your code well organized in order to make it easy for you and others to read and understand your code.
In the next chapter, we'll pivot back to workflow to discuss the importance of code style and keeping your code well organized to make it easy for you and others to read and understand.
