Commit 8815e8f

Start banging out factors chapter
1 parent 5a29a71 commit 8815e8f

6 files changed

+155
-6
lines changed

DESCRIPTION

+2
Original file line number | Diff line number | Diff line change
@@ -12,6 +12,7 @@ Imports:
1212
broom,
1313
condvis,
1414
dplyr,
15+
forcats,
1516
gapminder,
1617
ggplot2,
1718
ggrepel,
@@ -37,6 +38,7 @@ Imports:
3738
tidyr,
3839
viridis
3940
Remotes:
41+
hadley/forcats,
4042
hadley/modelr,
4143
hadley/stringr,
4244
hadley/tibble,

_bookdown.yml

+1
@@ -15,6 +15,7 @@ rmd_files: [
1515
"tidy.Rmd",
1616
"relational-data.Rmd",
1717
"strings.Rmd",
18+
"factors.Rmd",
1819
"datetimes.Rmd",
1920

2021
"program.Rmd",

communicate-plots.Rmd

+2-2
@@ -10,7 +10,7 @@ Now you need to _communicate_ the result of your analysis to others. Your audien
1010

1111
In this chapter, we'll focus once again on ggplot2. We'll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including __ggrepel__ and __viridis__. Rather than loading those extensions here, we'll refer to their functions explicitly with the `::` notation. That will help make it obvious what functions are built into ggplot2, and what functions come from other packages.
1212

13-
```{r}
13+
```{r, message = FALSE}
1414
library(ggplot2)
1515
library(dplyr)
1616
```
@@ -473,7 +473,7 @@ ggplot(mpg, aes(displ, hwy)) +
473473
theme_bw()
474474
```
475475

476-
ggplot2 includes eight themes by default, as shown in Figure \@ref(fig:themes). Many more are included in add-on packages like __ggthemes__ (<https://github.com/jrnold/ggthemes>), by Jeremy Arnold.
476+
ggplot2 includes eight themes by default, as shown in Figure \@ref(fig:themes). Many more are included in add-on packages like __ggthemes__ (<https://github.com/jrnold/ggthemes>), by Jeffrey Arnold.
477477

478478
```{r themes, echo = FALSE, fig.cap = "The eight themes built-in to ggplot2."}
479479
knitr::include_graphics("images/visualization-themes.png")

factors.Rmd

+145
@@ -0,0 +1,145 @@
1+
# Factors
2+
3+
## Introduction
4+
5+
In R, factors are used to work with categorical variables: variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
6+
7+
Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [_stringsAsFactors: An unauthorized biography_](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [_stringsAsFactors = \<sigh\>_](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley.
8+
9+
Factors aren't as common in the tidyverse, because no function will automatically turn a character vector into a factor. It is, however, a good idea to use factors when appropriate, and controlling their levels can be particularly useful for tailoring visualisations of categorical data.
10+
11+
### Prerequisites
12+
13+
To work with factors, we'll use the __forcats__ package (tools for dealing with **cat**egorical variables, and an anagram of factors). It provides a wide range of helpers for working with factors. We'll also use ggplot2 because factors are particularly important for visualisation.
14+
15+
```{r setup, message = FALSE}
16+
# devtools::install_github("hadley/forcats")
17+
library(forcats)
18+
library(ggplot2)
19+
library(dplyr)
20+
```
21+
22+
## Creating factors
23+
24+
There are two ways to create a factor: during import with readr, using `col_factor()`, or after the fact, turning a string into a factor. Often you'll need to do a little experimentation, so I recommend starting with strings.
25+
26+
To turn a string into a factor, call `factor()`, supplying a list of possible values:
27+
28+
```{r}
29+
30+
```
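
The chunk above is still empty in this commit. As a minimal sketch of what it might contain, assuming a made-up vector of month abbreviations (`x` and `month_levels` are hypothetical names, not part of `gss_cat`), `factor()` with an explicit `levels` argument could be used like this:

```r
# Hypothetical data: month abbreviations, used only for illustration
x <- c("Dec", "Apr", "Jan", "Mar")
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Supplying levels fixes both the set of allowed values and their order
f <- factor(x, levels = month_levels)

# Sorting now follows the level order rather than the alphabet
sort(f)

# Values that are not among the levels are silently converted to NA
g <- factor(c("Dec", "Jam"), levels = month_levels)
g
```

Without the `levels` argument, `factor()` takes the levels from the data in alphabetical order, which is rarely what you want for something like months.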
31+
32+
For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of variables from the [General Social Survey](https://gssdataexplorer.norc.org/). The variables have been selected to illustrate a number of challenges with working with factors.
33+
34+
```{r}
35+
gss_cat
36+
```
37+
38+
You can see the levels of a factor with `levels()`:
39+
40+
```{r}
41+
levels(gss_cat$race)
42+
```
43+
44+
And this order is preserved in operations like `count()`:
45+
46+
```{r}
47+
gss_cat %>%
48+
count(race)
49+
```
50+
51+
And in visualisations like `geom_bar()`:
52+
53+
```{r}
54+
ggplot(gss_cat, aes(race)) +
55+
geom_bar()
56+
```
57+
58+
Note that by default, ggplot2 will drop levels that don't have any values. You can force them to appear with `scale_x_discrete(drop = FALSE)`:
59+
60+
```{r}
61+
ggplot(gss_cat, aes(race)) +
62+
geom_bar() +
63+
scale_x_discrete(drop = FALSE)
64+
```
65+
66+
Currently dplyr doesn't have a `drop` option, but it will in the future.
67+
68+
## Modifying factor order
69+
70+
```{r}
71+
relig <- gss_cat %>%
72+
group_by(relig) %>%
73+
summarise(
74+
age = mean(age, na.rm = TRUE),
75+
tvhours = mean(tvhours, na.rm = TRUE),
76+
n = n()
77+
)
78+
79+
ggplot(relig, aes(tvhours, relig)) + geom_point()
80+
ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) + geom_point()
81+
```
82+
83+
If you just want to pull a couple of levels out to the front, you can use `fct_relevel()`.
84+
85+
```{r}
86+
rincome <- gss_cat %>%
87+
group_by(rincome) %>%
88+
summarise(
89+
age = mean(age, na.rm = TRUE),
90+
tvhours = mean(tvhours, na.rm = TRUE),
91+
n = n()
92+
)
93+
94+
ggplot(rincome, aes(age, rincome)) + geom_point()
95+
96+
gss_cat %>% count(fct_rev(rincome))
97+
```
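
The chunk above does not yet call `fct_relevel()` itself. A minimal sketch of the idea, on a made-up factor (`f` and its values are hypothetical, chosen only to keep the output small), might look like this:

```r
library(forcats)

# Hypothetical factor, used only to illustrate the call
f <- factor(
  c("low", "high", "n/a", "medium"),
  levels = c("high", "low", "medium", "n/a")
)

# Move "n/a" to the front; the remaining levels keep their relative order
f_front <- fct_relevel(f, "n/a")
levels(f_front)
```

The same pattern should carry over to `rincome`, where a level such as "Not applicable" could be pulled to the front before plotting.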
98+
99+
`fct_rev(rincome)`
100+
`fct_reorder(religion, rincome)`
101+
`fct_reorder2(religion, year, rincome)`
102+
103+
104+
```{r}
105+
by_year <- gss_cat %>%
106+
count(year, marital) %>%
107+
group_by(year) %>%
108+
mutate(prop = n / sum(n))
109+
110+
ggplot(by_year, aes(year, prop, colour = marital)) +
111+
geom_line()
112+
113+
ggplot(by_year, aes(year, prop, colour = fct_reorder2(marital, year, prop))) +
114+
geom_line()
115+
116+
```
117+
118+
## Modifying factor levels
119+
120+
`fct_recode()` is the most general. It allows you to transform levels.
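
As a rough sketch of how `fct_recode()` is called (the factor `f` and the new level name `fruit` are made up for illustration), each argument has the form `new_name = "old_name"`:

```r
library(forcats)

# Hypothetical factor; the new level name on the left is illustrative
f <- factor(c("apple", "bear", "banana", "dear"))

# Rename two old levels to the same new level, which merges them;
# levels that are not mentioned are left unchanged
f2 <- fct_recode(f, fruit = "apple", fruit = "banana")
levels(f2)
```

Assigning several old levels to the same new name is what makes `fct_recode()` general enough to cover collapsing as well as renaming.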
121+
122+
### Manually grouping
123+
124+
```{r}
125+
fct_count(fct_collapse(gss_cat$partyid,
126+
other = c("No answer", "Don't know", "Other party"),
127+
rep = c("Strong republican", "Not str republican"),
128+
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
129+
dem = c("Not str democrat", "Strong democrat")
130+
))
131+
```
132+
133+
### Lumping small groups together
134+
135+
```{r}
136+
gss_cat %>% mutate(relig = fct_lump(relig)) %>% count(relig)
137+
gss_cat %>% mutate(relig = fct_lump(relig, 5)) %>% count(relig, sort = TRUE)
138+
```
139+
140+
```{r}
141+
gss_cat$relig %>% fct_infreq() %>% fct_lump(5) %>% fct_count()
142+
gss_cat$relig %>% fct_lump(5) %>% fct_infreq() %>% fct_count()
143+
```
144+
145+
`fct_reorder()` is sometimes also useful. It...

vectors.Rmd

+1-4
@@ -597,17 +597,14 @@ typeof(x)
597597
attributes(x)
598598
```
599599

600-
Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \<sigh\>](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. The motivation for factors is modelling. If you're going to fit a model to categorical data, you need to know in advance all the possible values. There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow".
601-
602-
Factors aren't common in the tidyverse, but you will need to deal with them if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first place. Often there will be `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can apply `as.character()` to the column to explicitly turn back into a character vector.
600+
You can create them from scratch with `factor()` or from a character vector with `as.factor()`.
603601

604602
```{r}
605603
x <- factor(letters[1:5])
606604
is.factor(x)
607605
as.factor(letters[1:5])
608606
```
609607

610-
Otherwise, you might try my __forcats__ package, which provides handy functions for working with factors (forcats = tools **for** **cat**egorical variables, and is an anagram of factors!). At the time of writing it was only available on github, <https://github.com/hadley/forcats>, but it may have made it to CRAN by the time you read this book.
611608

612609
### Dates and date-times
613610

wrangle.Rmd

+4
@@ -30,6 +30,10 @@ Data wrangling also encompasses data transformation, which you've already learn
3030

3131
* [Strings] will introduce regular expressions, a powerful tool for
3232
manipulating strings.
33+
34+
* [Factors] are how R stores categorical data. They are used when a variable
35+
has a fixed set of possible values, or when you want a non-alphabetical
36+
ordering of a string.
3337

3438
* [Dates and times] will give you the key tools for working with
3539
dates and date-times.
