# Factors {#sec-factors}
```{r}
#| echo: false
source("_common.R")
```
## Introduction
Factors are used for categorical variables, variables that have a fixed and known set of possible values.
They are also useful when you want to display character vectors in a non-alphabetical order.
We'll start by motivating why factors are needed for data analysis[^factors-1] and how you can create them with `factor()`. We'll then introduce you to the `gss_cat` dataset which contains a bunch of categorical variables to experiment with.
You'll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.
[^factors-1]: They're also really important for modelling.
### Prerequisites
Base R provides some basic tools for creating and manipulating factors.
We'll supplement these with the **forcats** package, which is part of the core tidyverse.
It provides a wide range of helpers for working with **cat**egorical variables (and it's an anagram of factors!).
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
## Factor basics
Imagine that you have a variable that records month:
```{r}
x1 <- c("Dec", "Apr", "Jan", "Mar")
```
Using a string to record this variable has two problems:
1.  There are only twelve possible months, and there's nothing saving you from typos:
    ```{r}
    x2 <- c("Dec", "Apr", "Jam", "Mar")
    ```
2.  It doesn't sort in a useful way:
    ```{r}
    sort(x1)
    ```
You can fix both of these problems with a factor.
To create a factor you must start by creating a character vector of the valid **levels**:
```{r}
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
```
Now you can create a factor:
```{r}
y1 <- factor(x1, levels = month_levels)
y1
sort(y1)
```
Any values that are not among the levels will be silently converted to `NA`:
```{r}
y2 <- factor(x2, levels = month_levels)
y2
```
This seems risky, so you might want to use `forcats::fct()` instead:
```{r}
#| error: true
y2 <- fct(x2, levels = month_levels)
```
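If you do need to stay with `factor()`, one way to guard against silent `NA`s is to check the input against the levels yourself. The following is a minimal sketch in base R that reuses the month example; the name `bad_values` is only illustrative.

```r
# Values and levels mirror the month example above.
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Any input value that is not a valid level would become NA in factor().
bad_values <- setdiff(x2, month_levels)
bad_values
#> [1] "Jam"

# Equivalent check after conversion: NA in the factor but not in the input.
y2 <- factor(x2, levels = month_levels)
x2[is.na(y2) & !is.na(x2)]
#> [1] "Jam"
```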
If you omit the levels, they'll be taken from the data in alphabetical order:
```{r}
factor(x1)
```
Sorting alphabetically is slightly risky because not every computer will sort strings in the same way.
So `forcats::fct()` orders by first appearance:
```{r}
fct(x1)
```
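To make the sorting concern concrete, here is a small sketch with mixed-case strings (made-up data, not from the chapter). `sort(method = "radix")` always uses C-locale ordering, whereas `factor()` sorts with the session's locale, so its level order may differ between machines.

```r
library(forcats)

# Hypothetical values chosen because case affects their sort order.
x <- c("banana", "Apple", "apple", "Banana")

# C-locale ordering (uppercase before lowercase), whatever the session locale.
c_sorted <- sort(x, method = "radix")
c_sorted
#> [1] "Apple"  "Banana" "apple"  "banana"

# factor() sorts with the session locale, so this result can vary.
levels(factor(x))

# fct() uses order of first appearance, which is the same everywhere.
levels(fct(x))
#> [1] "banana" "Apple"  "apple"  "Banana"
```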
If you ever need to access the set of valid levels directly, you can do so with `levels()`:
```{r}
levels(y2)
```
You can also create a factor while reading your data with readr by using `col_factor()`:
```{r}
csv <- "
month,value
Jan,12
Feb,56
Mar,12"
df <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
df$month
```
## General Social Survey
For the rest of this chapter, we're going to use `forcats::gss_cat`.
It's a sample of data from the [General Social Survey](https://gss.norc.org), a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
The survey has thousands of questions, so in `gss_cat` Hadley selected a handful that will illustrate some common challenges you'll encounter when working with factors.
```{r}
gss_cat
```
(Remember, since this dataset is provided by a package, you can get more information about the variables with `?gss_cat`.)
When factors are stored in a tibble, you can't see their levels so easily.
One way to view them is with `count()`:
```{r}
gss_cat |>
  count(race)
```
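Note that `count()` by default only reports levels that actually occur in the data. As a sketch on a made-up tibble (the names `toy` and `all_levels` are hypothetical), `levels()` shows the full set and `count(.drop = FALSE)` keeps the empty levels:

```r
library(dplyr)

# Hypothetical toy data: "c" is a valid level that never occurs.
toy <- tibble(f = factor(c("a", "b", "a"), levels = c("a", "b", "c")))

# The full set of levels, whether or not they occur.
levels(toy$f)
#> [1] "a" "b" "c"

# By default count() returns rows only for levels that occur (a and b).
observed <- toy |> count(f)

# With .drop = FALSE, empty levels are kept with n = 0.
all_levels <- toy |> count(f, .drop = FALSE)
all_levels
```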
When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels.
Those operations are described in the sections below.
### Exercises
1. Explore the distribution of `rincome` (reported income).
What makes the default bar chart hard to understand?
How could you improve the plot?
2. What is the most common `relig` in this survey?
What's the most common `partyid`?
3. Which `relig` does `denom` (denomination) apply to?
How can you find out with a table?
How can you find out with a visualization?
## Modifying factor order {#sec-modifying-factor-order}
It's often useful to change the order of the factor levels in a visualization.
For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
```{r}
#| fig-alt: |
#|   A scatterplot with tvhours on the x-axis and religion on the y-axis.
#|   The y-axis is ordered seemingly arbitrarily, making it hard to get
#|   any sense of overall pattern.
relig_summary <- gss_cat |>
  group_by(relig) |>
  summarize(
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(x = tvhours, y = relig)) +
  geom_point()
```
It is hard to read this plot because there's no overall pattern.
We can improve it by reordering the levels of `relig` using `fct_reorder()`.
`fct_reorder()` takes three arguments:
- `.f`, the factor whose levels you want to modify.
- `.x`, a numeric vector that you want to use to reorder the levels.
- Optionally, `.fun`, a function that's used if there are multiple values of `.x` for each value of `.f`. The default value is `median`.
```{r}
#| fig-alt: |
#|   The same scatterplot as above, but now the religion is displayed in
#|   increasing order of tvhours. "Other eastern" has the fewest tvhours
#|   under 2, and "Don't know" has the highest (over 5).
ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
  geom_point()
```
Reordering religion makes it much easier to see that people in the "Don't know" category watch much more TV, and people in the "Hinduism" and "Other eastern" categories watch much less.
As you start making more complicated transformations, we recommend moving them out of `aes()` and into a separate `mutate()` step.
For example, you could rewrite the plot above as:
```{r}
#| eval: false
relig_summary |>
  mutate(
    relig = fct_reorder(relig, tvhours)
  ) |>
  ggplot(aes(x = tvhours, y = relig)) +
  geom_point()
```
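In the plots above each religion has a single summarized value, so `.fun` has nothing to do. The sketch below uses made-up data with several values per level to show how `.fun` and `.desc` change the result; the vectors `f` and `x` are hypothetical.

```r
library(forcats)

# Hypothetical toy data with two values of x for each level of f.
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 20, 4, 5, 2, 30)

# Default .fun = median: b (4.5), a (10.5), c (16).
by_median <- levels(fct_reorder(f, x))
by_median
#> [1] "b" "a" "c"

# Order by the smallest value in each group instead: a (1), c (2), b (4).
by_min <- levels(fct_reorder(f, x, .fun = min))
by_min
#> [1] "a" "c" "b"

# .desc = TRUE reverses the ordering by median.
by_median_desc <- levels(fct_reorder(f, x, .desc = TRUE))
by_median_desc
#> [1] "c" "a" "b"
```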
What if we create a similar plot looking at how average age varies across reported income level?
```{r}
#| fig-alt: |
#|   A scatterplot with age on the x-axis and income on the y-axis. Income
#|   has been reordered in order of average age which doesn't make much
#|   sense. One section of the y-axis goes from $6000-6999, then <$1000,
#|   then $8000-9999.
rincome_summary <- gss_cat |>
  group_by(rincome) |>
  summarize(
    age = mean(age, na.rm = TRUE),
    n = n()
  )

ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) +
  geom_point()
```
Here, arbitrarily reordering the levels isn't a good idea!
That's because `rincome` already has a principled order that we shouldn't mess with.
Reserve `fct_reorder()` for factors whose levels are arbitrarily ordered.
However, it does make sense to pull "Not applicable" to the front with the other special levels.
You can use `fct_relevel()`.
It takes a factor, `.f`, and then any number of levels that you want to move to the front of the line.
```{r}
#| fig-alt: |
#|   The same scatterplot but now "Not applicable" is displayed at the
#|   bottom of the y-axis. Generally there is a positive association
#|   between income and age, and the income band with the highest average
#|   age is "Not applicable".
ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, "Not applicable"))) +
  geom_point()
```
Why do you think the average age for "Not applicable" is so high?
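The behavior of `fct_relevel()` is easier to see on a small factor. This sketch uses made-up levels and also shows the `after` argument, which moves levels somewhere other than the front:

```r
library(forcats)

# Hypothetical factor whose special level "n/a" sits last.
lvls <- c("low", "mid", "high", "n/a")
f <- factor(lvls, levels = lvls)

# Named levels move to the front; the rest keep their relative order.
to_front <- levels(fct_relevel(f, "n/a"))
to_front
#> [1] "n/a"  "low"  "mid"  "high"

# after = Inf moves the named levels to the end instead.
to_end <- levels(fct_relevel(f, "low", after = Inf))
to_end
#> [1] "mid"  "high" "n/a"  "low"
```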
Another type of reordering is useful when you are coloring the lines on a plot.
`fct_reorder2(.f, .x, .y)` reorders the factor `.f` by the `.y` values associated with the largest `.x` values.
This makes the plot easier to read because the order of the colors in the legend will match the order of the lines at the far right of the plot.
```{r}
#| layout-ncol: 2
#| fig-width: 3
#| fig-alt: |
#|   A line plot with age on the x-axis and proportion on the y-axis.
#|   There is one line for each category of marital status: no answer,
#|   never married, separated, divorced, widowed, and married. It is
#|   a little hard to read the plot because the order of the legend is
#|   unrelated to the lines on the plot. Rearranging the legend makes
#|   the plot easier to read because the legend colors now match the
#|   order of the lines on the far right of the plot. You can see some
#|   unsurprising patterns: the proportion never married decreases with
#|   age, married forms an upside down U shape, and widowed starts off
#|   low but increases steeply after age 60.
by_age <- gss_cat |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  group_by(age) |>
  mutate(
    prop = n / sum(n)
  )

ggplot(by_age, aes(x = age, y = prop, color = marital)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1")

ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(color = "marital")
```
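To see what `fct_reorder2()` does without a plot, consider this sketch on made-up data where each group has a `y` value at `x = 1` and `x = 2`. With the default settings, the levels end up ordered by the `y` value at the largest `x`, from highest to lowest.

```r
library(forcats)

# Hypothetical data: three groups, each observed at x = 1 and x = 2.
g <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 2, 1, 2, 1, 2)
y <- c(5, 1, 2, 9, 3, 4)

# y at the largest x is 1 for a, 9 for b, and 4 for c,
# so the levels are ordered b, c, a (descending by default).
reordered <- levels(fct_reorder2(g, x, y))
reordered
#> [1] "b" "c" "a"
```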
Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables.
Combine it with `fct_rev()` if you want them in increasing frequency, so that the largest values in the bar plot are on the right, not the left.
```{r}
#| fig-alt: |
#|   A bar chart of marital status ordered from least to most common:
#|   no answer (~0), separated (~1,000), widowed (~2,000), divorced
#|   (~3,000), never married (~5,000), married (~10,000).
gss_cat |>
  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
  ggplot(aes(x = marital)) +
  geom_bar()
```
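As a quick check of what these two helpers do to the levels themselves, here is a sketch on a tiny made-up factor:

```r
library(forcats)

# Hypothetical factor: "z" occurs three times, "y" twice, "x" once.
f <- factor(c("x", "y", "y", "z", "z", "z"))

# Most frequent level first.
decreasing <- levels(fct_infreq(f))
decreasing
#> [1] "z" "y" "x"

# Reverse to get the least frequent level first.
increasing <- levels(fct_rev(fct_infreq(f)))
increasing
#> [1] "x" "y" "z"
```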
### Exercises
1. There are some suspiciously high numbers in `tvhours`.
Is the mean a good summary?
2. For each factor in `gss_cat` identify whether the order of the levels is arbitrary or principled.
3. Why did moving "Not applicable" to the front of the levels move it to the bottom of the plot?
## Modifying factor levels
More powerful than changing the order of the levels is changing their values.
This allows you to clarify labels for publication, and collapse levels for high-level displays.
The most general and powerful tool is `fct_recode()`.
It allows you to recode, or change, the value of each level.
For example, take the `partyid` variable from the `gss_cat` data frame:
```{r}
gss_cat |> count(partyid)
```
The levels are terse and inconsistent.
Let's tweak them to be longer and use a parallel construction.
Like most rename and recoding functions in the tidyverse, the new values go on the left and the old values go on the right:
```{r}
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
    )
  ) |>
  count(partyid)
```
`fct_recode()` will leave the levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
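Both behaviors can be checked on a small made-up factor. In this sketch the level `"zzz"` is deliberately nonexistent, and `tryCatch()` is used only to capture the warning text so it can be inspected:

```r
library(forcats)

# Hypothetical factor with three levels.
f <- factor(c("a", "b", "c"))

# Only "a" is recoded; "b" and "c" are left as is.
recoded <- levels(fct_recode(f, "Apple" = "a"))
recoded
#> [1] "Apple" "b"     "c"

# Referring to a level that doesn't exist triggers a warning;
# here the warning message is captured instead of printed.
msg <- tryCatch(
  fct_recode(f, "Zed" = "zzz"),
  warning = function(w) conditionMessage(w)
)
msg
```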
To combine groups, you can assign multiple old levels to the same new level:
```{r}
#| results: false
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  )
```
Use this technique with care: if you group together categories that are truly different you will end up with misleading results.
If you want to collapse a lot of levels, `fct_collapse()` is a useful variant of `fct_recode()`.
For each new level, you can provide a vector of old levels:
```{r}
gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
```
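The relationship between the two functions can be sketched on a small made-up factor: collapsing with a vector of old levels gives the same levels as repeating the new name in `fct_recode()`.

```r
library(forcats)

# Hypothetical factor with four levels.
f <- factor(c("a", "b", "c", "d"))

# One new level, given a vector of old levels.
collapsed <- fct_collapse(f, ab = c("a", "b"))

# The same result written with fct_recode(), repeating the new name.
recoded <- fct_recode(f, ab = "a", ab = "b")

levels(collapsed)
#> [1] "ab" "c"  "d"
identical(levels(collapsed), levels(recoded))
#> [1] TRUE
```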
Sometimes you just want to lump together the small groups to make a plot or table simpler.
That's the job of the `fct_lump_*()` family of functions.
`fct_lump_lowfreq()` is a simple starting point that progressively lumps the smallest groups into "Other", always keeping "Other" as the smallest category.
```{r}
gss_cat |>
  mutate(relig = fct_lump_lowfreq(relig)) |>
  count(relig)
```
In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we'd probably like to see some more details!
Instead, we can use `fct_lump_n()` to specify that we want exactly 10 groups:
```{r}
gss_cat |>
  mutate(relig = fct_lump_n(relig, n = 10)) |>
  count(relig, sort = TRUE)
```
Read the documentation to learn about `fct_lump_min()` and `fct_lump_prop()` which are useful in other cases.
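As a rough illustration of those two variants, here is a sketch on a made-up factor with counts of 10, 5, 2, and 1. The thresholds are arbitrary and chosen only so that the two smallest levels are lumped.

```r
library(forcats)

# Hypothetical factor: a occurs 10 times, b 5 times, c twice, d once.
f <- factor(c(rep("a", 10), rep("b", 5), rep("c", 2), "d"))

# Lump levels that appear fewer than 3 times.
by_min <- fct_lump_min(f, min = 3)
levels(by_min)
#> [1] "a"     "b"     "Other"

# Lump levels that make up at most 20% of the observations.
by_prop <- fct_lump_prop(f, prop = 0.2)
levels(by_prop)
#> [1] "a"     "b"     "Other"

# "Other" now holds the three observations of c and d.
table(by_min)
```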
### Exercises
1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
2. How could you collapse `rincome` into a small set of categories?
3. Notice there are 9 groups (excluding "Other") in the `fct_lump_n()` example above.
Why not 10?
(Hint: type `?fct_lump_n`, and note that the default for the argument `other_level` is "Other".)
## Ordered factors {#sec-ordered-factors}
Before we continue, it's important to briefly mention a special type of factor: ordered factors.
Created with the `ordered()` function, ordered factors imply a strict ordering between levels, but don't specify anything about the magnitude of the differences between the levels.
You use ordered factors when you know the levels are ranked, but there's no precise numerical ranking.
You can identify an ordered factor when it's printed because it uses `<` symbols between the factor levels:
```{r}
ordered(c("a", "b", "c"))
```
In both base R and the tidyverse, ordered factors behave very similarly to regular factors.
There are only two places where you might notice different behavior:
- If you map an ordered factor to color or fill in ggplot2, it will default to `scale_color_viridis_d()`/`scale_fill_viridis_d()`, a color scale that implies a ranking.
- If you use an ordered predictor in a linear model, it will use "polynomial contrasts". These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don't routinely interpret them. If you want to learn more, we recommend `vignette("contrasts", package = "faux")` by Lisa DeBruine.
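The difference in contrasts can be inspected directly in base R. This is a minimal sketch on made-up data, assuming the default `options("contrasts")` (treatment contrasts for regular factors, polynomial contrasts for ordered ones):

```r
# Hypothetical sizes with an explicit ranking.
sizes <- c("small", "large", "medium", "small")
size_levels <- c("small", "medium", "large")

f_regular <- factor(sizes, levels = size_levels)
f_ordered <- ordered(sizes, levels = size_levels)

is.ordered(f_regular)
#> [1] FALSE
is.ordered(f_ordered)
#> [1] TRUE

# Regular factor: one indicator column per non-reference level.
colnames(contrasts(f_regular))
#> [1] "medium" "large"

# Ordered factor: linear (.L) and quadratic (.Q) polynomial terms.
colnames(contrasts(f_ordered))
#> [1] ".L" ".Q"
```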
For the purposes of this book, correctly distinguishing between regular and ordered factors is not particularly important.
More broadly, however, certain fields (particularly the social sciences) do use ordered factors extensively.
In these contexts, it's important to correctly identify them so that other analysis packages can offer the appropriate behavior.
## Summary
This chapter introduced you to the handy forcats package for working with factors and to its most commonly used functions.
forcats contains a wide range of other helpers that we didn't have space to discuss here, so whenever you're facing a challenge with factors that you haven't encountered before, we highly recommend skimming the [reference index](https://forcats.tidyverse.org/reference/index.html) to see if there's a canned function that can help solve your problem.
If you want to learn more about factors after reading this chapter, we recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](https://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!
In the next chapter we'll switch gears to start learning about dates and times in R.
Dates and times seem deceptively simple, but as you'll soon see, the more you learn about them, the more complex they seem to get!