| 1 | +# Factors |
| 2 | + |
| 3 | +## Introduction |
| 4 | + |
| 5 | +In R, factors are used to work with categorical variables: variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order. |
| 6 | + |
| 7 | +Historically, factors were much easier to work with than characters, so many functions in base R automatically convert characters to factors (controlled by the dreaded `stringsAsFactors` argument). For more historical context, you might want to read [_stringsAsFactors: An unauthorized biography_](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [_stringsAsFactors = \<sigh\>_](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. |
| 8 | + |
| 9 | +Factors aren't as common in the tidyverse, because no tidyverse function will automatically turn a character vector into a factor. It is, however, a good idea to use factors when appropriate, and controlling their levels is particularly useful for tailoring visualisations of categorical data. |
| 10 | + |
| 11 | +### Prerequisites |
| 12 | + |
| 13 | +To work with factors, we'll use the __forcats__ package (tools for dealing with **cat**egorical variables; the name is also an anagram of factors). It provides a wide range of helpers for working with factors. We'll also use ggplot2, because factors are particularly important for visualisation, and dplyr for data manipulation. |
| 14 | + |
| 15 | +```{r setup, message = FALSE} |
| 16 | +# devtools::install_github("hadley/forcats") |
| 17 | +library(forcats) |
| 18 | +library(ggplot2) |
| 19 | +library(dplyr) |
| 20 | +``` |
| 21 | + |
| 22 | +## Creating factors |
| 23 | + |
| 24 | +There are two ways to create a factor: during import with readr, using `col_factor()`, or after the fact, by turning a string into a factor. Often you'll need to do a little experimentation, so I recommend starting with strings. |
| 25 | + |
| 26 | +To turn a string into a factor, call `factor()`, supplying a list of possible values: |
| 27 | + |
| 28 | +```{r} |
| 29 | + |
| 30 | +``` |
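As a minimal sketch of the `factor()` call described above, assuming a hypothetical vector of month abbreviations (`x` and `month_levels` are illustrative names, not part of the chapter's data):

```r
# Hypothetical example data: month abbreviations stored as strings
x <- c("Dec", "Apr", "Jan", "Mar")

# The possible values, in the order we want them treated
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Supplying `levels` fixes both the allowed values and their order
f <- factor(x, levels = month_levels)

levels(f)  # all twelve months, including those that do not occur in x
sort(f)    # sorted by level order (Jan, Mar, Apr, Dec), not alphabetically
```

Any value of `x` that is not listed in `levels` is converted to `NA` without a warning, which is one reason to experiment with strings first.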
| 31 | + |
| 32 | +For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of variables from the [General Social Survey](https://gssdataexplorer.norc.org/). The variables have been selected to illustrate a number of challenges with working with factors. |
| 33 | + |
| 34 | +```{r} |
| 35 | +gss_cat |
| 36 | +``` |
| 37 | + |
| 38 | +You can see the levels of a factor with `levels()`: |
| 39 | + |
| 40 | +```{r} |
| 41 | +levels(gss_cat$race) |
| 42 | +``` |
| 43 | + |
| 44 | +And this order is preserved in operations like `count()`: |
| 45 | + |
| 46 | +```{r} |
| 47 | +gss_cat %>% |
| 48 | + count(race) |
| 49 | +``` |
| 50 | + |
| 51 | +And in visualisations like `geom_bar()`: |
| 52 | + |
| 53 | +```{r} |
| 54 | +ggplot(gss_cat, aes(race)) + |
| 55 | + geom_bar() |
| 56 | +``` |
| 57 | + |
| 58 | +Note that by default, ggplot2 will drop levels that don't have any values. You can force them to appear with `scale_x_discrete(drop = FALSE)`: |
| 59 | + |
| 60 | +```{r} |
| 61 | +ggplot(gss_cat, aes(race)) + |
| 62 | + geom_bar() + |
| 63 | + scale_x_discrete(drop = FALSE) |
| 64 | +``` |
| 65 | + |
| 66 | +Currently dplyr doesn't have a `drop` option, but it will in the future. |
| 67 | + |
| 68 | +## Modifying factor order |
| 69 | + |
| 70 | +```{r} |
| 71 | +relig <- gss_cat %>% |
| 72 | + group_by(relig) %>% |
| 73 | + summarise( |
| 74 | + age = mean(age, na.rm = TRUE), |
| 75 | + tvhours = mean(tvhours, na.rm = TRUE), |
| 76 | + n = n() |
| 77 | + ) |
| 78 | + |
| 79 | +ggplot(relig, aes(tvhours, relig)) + geom_point() |
| 80 | +ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) + geom_point() |
| 81 | +``` |
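The second plot above relies on `fct_reorder()`, which reorders the levels of a factor by a summary of another variable (the median, by default). A small sketch with made-up values (`f` and `x` are hypothetical, chosen only to make the reordering easy to follow):

```r
library(forcats)

# Hypothetical factor with one numeric value per level
f <- factor(c("a", "b", "c"))
x <- c(3, 1, 2)

levels(f)                  # "a" "b" "c"

# Levels are reordered by increasing value of x
levels(fct_reorder(f, x))  # "b" "c" "a"
```

Only the level order changes; the values stored in the factor stay the same.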
| 82 | + |
| 83 | +If you just want to pull a couple of levels out to the front, you can use `fct_relevel()`. |
| 84 | + |
| 85 | +```{r} |
| 86 | +rincome <- gss_cat %>% |
| 87 | + group_by(rincome) %>% |
| 88 | + summarise( |
| 89 | + age = mean(age, na.rm = TRUE), |
| 90 | + tvhours = mean(tvhours, na.rm = TRUE), |
| 91 | + n = n() |
| 92 | + ) |
| 93 | + |
| 94 | +ggplot(rincome, aes(age, rincome)) + geom_point() |
| 95 | + |
| 96 | +gss_cat %>% count(fct_rev(rincome)) |
| 97 | +``` |
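The chunk above does not itself call `fct_relevel()`, so here is a small sketch of what it does, using a made-up factor (`income` and its levels are hypothetical): the named levels move to the front, and the remaining levels keep their existing order.

```r
library(forcats)

# Hypothetical ordered-looking factor with a non-response level at the end
income <- factor(
  c("low", "high", "n/a", "mid"),
  levels = c("low", "mid", "high", "n/a")
)

# Pull "n/a" to the front; the other levels are left in place
moved <- fct_relevel(income, "n/a")
levels(moved)  # "n/a" "low" "mid" "high"
```

The same idea could plausibly be applied to `rincome`, for example to move a non-response level such as "Not applicable" to the front so that it sits apart from the income brackets.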
| 98 | + |
| 99 | +`fct_rev(rincome)` |
| 100 | +`fct_reorder(religion, rincome)` |
| 101 | +`fct_reorder2(religion, year, rincome)` |
| 102 | + |
| 103 | + |
| 104 | +```{r} |
| 105 | +by_year <- gss_cat %>% |
| 106 | +  count(year, marital) %>% |
| 107 | +  group_by(year) %>% |
| 108 | +  mutate(prop = n / sum(n)) |
| 109 | + |
| 110 | +ggplot(by_year, aes(year, prop, colour = marital)) + |
| 111 | + geom_line() |
| 112 | + |
| 113 | +ggplot(by_year, aes(year, prop, colour = fct_reorder2(marital, year, prop))) + |
| 114 | + geom_line() |
| 115 | + |
| 116 | +``` |
| 117 | + |
| 118 | +## Modifying factor levels |
| 119 | + |
| 120 | +`fct_recode()` is the most general tool. It lets you rename levels, or merge several levels by giving them the same new name. |
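A minimal sketch of `fct_recode()` on a made-up factor (`x` is hypothetical): each argument has the form `new_level = "old_level"`, and giving two old levels the same new name merges them.

```r
library(forcats)

# Hypothetical factor; default levels are alphabetical
x <- factor(c("apple", "bear", "banana", "dear"))

# new_name = "old_name"; levels not mentioned are left unchanged
recoded <- fct_recode(x, fruit = "apple", fruit = "banana")
levels(recoded)  # "fruit" "bear" "dear"
```

When many old levels map to one new level, `fct_collapse()` expresses the same thing more compactly, as in the next section.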
| 121 | + |
| 122 | +### Manually grouping |
| 123 | + |
| 124 | +```{r} |
| 125 | +fct_count(fct_collapse(gss_cat$partyid, |
| 126 | + other = c("No answer", "Don't know", "Other party"), |
| 127 | + rep = c("Strong republican", "Not str republican"), |
| 128 | + ind = c("Ind,near rep", "Independent", "Ind,near dem"), |
| 129 | + dem = c("Not str democrat", "Strong democrat") |
| 130 | +)) |
| 131 | +``` |
| 132 | + |
| 133 | +### Lumping small groups together |
| 134 | + |
| 135 | +```{r} |
| 136 | +gss_cat %>% mutate(relig = fct_lump(relig)) %>% count(relig) |
| 137 | +gss_cat %>% mutate(relig = fct_lump(relig, 5)) %>% count(relig, sort = TRUE) |
| 138 | +``` |
| 139 | + |
| 140 | +```{r} |
| 141 | +gss_cat$relig %>% fct_infreq() %>% fct_lump(5) %>% fct_count() |
| 142 | +gss_cat$relig %>% fct_lump(5) %>% fct_infreq() %>% fct_count() |
| 143 | +``` |
| 144 | + |
| 145 | +`fct_reorder()` is sometimes also useful. It... |