Chapter 7 Introduction to dplyr
In today’s class we introduce a new package, dplyr
, which, along with ggplot2 will be used in almost every class session. We will also introduce the readr
package, for reading tabular data.
7.1 Libraries
Both readr
and dplyr
are members of the tidyverse, so a single invocation of library()
makes the functions defined in these two packages available for our use:
library(tidyverse)
7.2 Reading data with the readr
package
The readr
package defines a number of functions for reading data tables from common file formats like Comma-Separated-Value (CSV) and Tab-Separated-Value (TSV) files.
The two most frequently used readr
functions we’ll use in this class are read_csv()
and read_tsv()
for reading CSV and TSV files respectively. There are some variants of these basic function, which you can read about by invoking the help system (?read_csv
).
7.2.1 Reading Excel files
The tidyverse also includes a package called readxl
which can be used to read Excel spreadsheets (recent versions with .xls
and .xlsx
extensions). Excel files are somewhat more complicated to deal with because they can include separate “sheets”. We won’t use readxl
in this class, but documentation and examples of how readxl
is used can be found at the page linked above.
7.2.2 Example data: NC Births
For today’s hands on session we’ll use a data set that contains information on 150 cases of mothers and their newborns in North Carolina in 2004. This data set is available at the following URL:
The births data is a TSV file, so we’ll use the read_tsv()
function to read it:
<- read_tsv("https://github.com/Bio723-class/example-datasets/raw/master/nc-births.txt")
births ## Rows: 150 Columns: 9
## ── Column specification ──────────────────
## Delimiter: "\t"
## chr (3): premature, sexBaby, smoke
## dbl (6): fAge, mAge, weeks, visits, gained, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice that when you used read_tsv()
the function printed information about how it “parsed” the data (i.e. the types it assigned to each of the columns).
The variables in the data set are:
- father’s age (
fAge
), - mother’s age (
mAge
),
- weeks of gestation (
weeks
) - whether the birth was premature or full term (
premature
) - number of OB/GYN visits (
visits
) - mother’s weight gained in pounds (
gained
) - babies birth weight (
weight
) - sex of the baby (
sexBaby
) - whether the mother was a smoker (
smoke
).
Notice too that we read the TSV file directly from a remote location via a URL. If instead, you wanted to load a local file on your computer you would specify the “path” – i.e. the location on your hard drive where you stored the file. For example, here is how I would load the same file if it was stored in the Downloads directory on my Mac laptop:
# load the data from a local file
<- read_tsv("/Users/pmagwene/Downloads/nc-births.txt") births
7.3 A note on “tibbles”
You may have noticed that most of the functions defined in tidyverse related packages return not data frames, but rather something called a “tibble”. You can think about tibbles as light-weight data frames. In fact if you ask about the “class” of a tibble you’ll see that it includes data.frame
as one of it’s classes as well as tbl
and tbl_df
.
class(births)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
There are some minor differences between data frame and tibbles. For example, tibbles print differently in the console and don’t automatically change variable names and types in the same way that standard data frames do. Usually tibbles can be used wherever a standard data frame is expected, but you may occasionally find a function that only works with a standard data frame. It’s easy to convert a tibble to a standard data frame using the as.data.frame
function:
<- as.data.frame(births) births.std.df
For more details about tibbles, see the Tibbles chapter in R for Data Analysis.
7.4 Data filtering and transformation with dplyr
dplyr
is powerful tool for data filter and transformation. In the same way that ggplot2
attempts to provide a “grammar of graphics”, dplyr
aims to provide a “grammar of data manipulation”. In today’s material we will see how dplyr
complements and simplifies standard data frame indexing and subsetting operations. However, dplyr
is focused only on data frames and doesn’t completely replace the basic subsetting operations, and so being adept with both dplyr
and the indexing approaches we’ve seen previously is important. If you’re curious about the name “dplyr”, the package’s originator Hadley Wickham says it’s supposed to invoke the idea of pliers for data frames (Github: Meaning of dplyrs name)
7.5 dplyr’s “verbs”
The primary functions in the dplyr
package can be thought of as a set of “verbs”, each verb corresponding to a common data manipulation task. Some of the most frequently used verbs/functions in dplyr
include:
select
– select columnsfilter
– filter rowsmutate
– create new columnsarrange
– reorder rowssummarize
– summarize valuesgroup_by
– split data frame on some grouping variable. Can be powerfully combined withsummarize
All of these functions return new data frames rather than modifying the existing data frame (though some of the functions support in place modification of data frames via optional arguments).We illustrate these below by example using the NC births data.
7.5.1 select
The select
function subsets the columns (variables) of a data frame. For example, to select just the weeks and weight columns from the births data set we could do:
# note I'm prefixing select with the package name (dplyr)
# to avoid name clashes with built-in select function
<- dplyr::select(births, weeks, weight)
wks.weight dim(wks.weight) # dim should be 50 x 2
## [1] 150 2
head(wks.weight)
## # A tibble: 6 × 2
## weeks weight
## <dbl> <dbl>
## 1 39 6.88
## 2 39 7.69
## 3 40 8.88
## 4 40 9
## 5 40 7.94
## 6 40 8.25
The equivalent using standard indexing would be:
<- births[c("weeks", "weight")]
wks.wt.alt dim(wks.wt.alt)
## [1] 150 2
head(wks.wt.alt)
## # A tibble: 6 × 2
## weeks weight
## <dbl> <dbl>
## 1 39 6.88
## 2 39 7.69
## 3 40 8.88
## 4 40 9
## 5 40 7.94
## 6 40 8.25
Notes:
* The first argument to all of the dplyr
functions is the data frame you’re operating on
- When using functions defined in
dplyr
andggplot2
variable names are (usually) not quoted or used with the$
operator. This is a design feature of these libraries and makes it easier to carry out interactive analyes because it saves a fair amount of typing.
7.5.2 filter
The filter
function returns those rows of the data set that meet the given logical criterion.
For example, to get all the premature babies in the data set we could use filter as so:
<- filter(births, premature == "premie")
premies dim(premies)
## [1] 21 9
The equivalent using standard indexing would be:
<- births[births$premature == "premie",] premies.alt
The filter
function will work with more than one logical argument, and these are joined together using Boolean AND logic (i.e. intersection). For example, to find those babies that were premature and whose mothers were smokers we could do:
<- filter(births, premature == "premie", smoke == "smoker") smoking.premies
The equivalent call using standard indexing is:
# don't forget the trailing comma to indicate rows!
<- births[(births$premature == "premie") & (births$smoke == "smoker"),] smoking.premies.alt
filter
also accepts logical statements chained together using the standard Boolean operators. For example, to find babies who were premature or whose moms were older than 35 you could use the OR operator |
:
<- filter(births, premature == "premie" | fAge > 35) premies.or.oldmom
7.5.3 mutate
The mutate
function creates a new data frame that is the same as input data frame but with additional variables (columns) as specified by the function arguments. In the example below, I create two new variables, weight.in.kg
and a mom.smoked
:
# to make code more readable it's sometime useful to spread out
# function arguments over multiple lines like I've done here
<- mutate(births,
births.plus weight.in.kg = weight / 2.2,
mom.smoked = (smoke == "smoker"))
head(births.plus)
## # A tibble: 6 × 11
## fAge mAge weeks premature visits gained weight sexBaby smoke weigh…¹ mom.s…²
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <lgl>
## 1 31 30 39 full term 13 1 6.88 male smok… 3.13 TRUE
## 2 34 36 39 full term 5 35 7.69 male nons… 3.50 FALSE
## 3 36 35 40 full term 12 29 8.88 male nons… 4.04 FALSE
## 4 41 40 40 full term 13 30 9 female nons… 4.09 FALSE
## 5 42 37 40 full term NA 10 7.94 male nons… 3.61 FALSE
## 6 37 28 40 full term 12 35 8.25 male smok… 3.75 TRUE
## # … with abbreviated variable names ¹weight.in.kg, ²mom.smoked
The equivalent using standard indexing would be to create a new data frame from births
, appending the new variables to the end as so:
<- data.frame(births,
births.plus.alt weight.in.kg = births$weight / 2.2,
mom.smoked = (births$smoke == "smoker"))
7.5.4 arrange
Arrange creates a new data frame where the rows are sorted according to their values for one or more variables. For example, to sort by mothers age we could do:
<- arrange(births, mAge)
young.moms.first head(young.moms.first)
## # A tibble: 6 × 9
## fAge mAge weeks premature visits gained weight sexBaby smoke
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 18 15 37 full term 12 76 8.44 male nonsmoker
## 2 NA 16 40 full term 4 12 6 female nonsmoker
## 3 21 16 38 full term 15 75 7.56 female smoker
## 4 26 17 38 full term 11 30 9.5 female nonsmoker
## 5 17 17 29 premie 4 10 2.63 female nonsmoker
## 6 20 17 40 full term 17 38 7.19 male nonsmoker
The equivalent to arrange
using standard indexing would be to use the information returned by the order
function:
<- births[order(births$mAge),]
young.moms.first.alt head(young.moms.first.alt)
## # A tibble: 6 × 9
## fAge mAge weeks premature visits gained weight sexBaby smoke
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 18 15 37 full term 12 76 8.44 male nonsmoker
## 2 NA 16 40 full term 4 12 6 female nonsmoker
## 3 21 16 38 full term 15 75 7.56 female smoker
## 4 26 17 38 full term 11 30 9.5 female nonsmoker
## 5 17 17 29 premie 4 10 2.63 female nonsmoker
## 6 20 17 40 full term 17 38 7.19 male nonsmoker
When using arrange
, multiple sorting variables can be specified:
<- arrange(births, mAge, fAge)
sorted.by.moms.and.dads head(sorted.by.moms.and.dads)
## # A tibble: 6 × 9
## fAge mAge weeks premature visits gained weight sexBaby smoke
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 18 15 37 full term 12 76 8.44 male nonsmoker
## 2 21 16 38 full term 15 75 7.56 female smoker
## 3 NA 16 40 full term 4 12 6 female nonsmoker
## 4 17 17 29 premie 4 10 2.63 female nonsmoker
## 5 20 17 40 full term 17 38 7.19 male nonsmoker
## 6 26 17 38 full term 11 30 9.5 female nonsmoker
If you want to sort in descending order, you can combing arrange
with the desc
(=descend) function, also defined in dplyr
:
<- arrange(births, desc(mAge))
old.moms.first head(old.moms.first)
## # A tibble: 6 × 9
## fAge mAge weeks premature visits gained weight sexBaby smoke
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 NA 41 33 premie 13 0 5.69 female nonsmoker
## 2 41 40 40 full term 13 30 9 female nonsmoker
## 3 33 40 36 premie 13 23 7.81 female nonsmoker
## 4 40 40 38 full term 13 38 7.31 male nonsmoker
## 5 46 39 38 full term 10 35 6.75 male smoker
## 6 NA 38 32 premie 10 16 2.19 female smoker
7.5.5 summarize
summarize
applies a function of interest to one or more variables in a data frame, reducing a vector of values to a single value and returning the results in a data frame. This is most often used to calculate statistics like means, medians, count, etc. As we’ll see below, this is powerful when combined with the group_by
function.
summarize(births,
mean.wt = mean(weight),
median.wks = median(weeks))
## # A tibble: 1 × 2
## mean.wt median.wks
## <dbl> <dbl>
## 1 7.05 39
You’ll need to be diligent if your data has missing values (NAs). For example, by default the mean
function returns NA if any of the input values are NA:
summarize(births,
mean.gained = mean(gained))
## # A tibble: 1 × 1
## mean.gained
## <dbl>
## 1 NA
However, if you read the mean
docs (?mean
) you’ll see that there is an na.rm
argument that indicates whether NA values should be removed before computing the mean. This is what we want so we instead call summarize as follows:
summarize(births,
mean.gained = mean(gained, na.rm = TRUE))
## # A tibble: 1 × 1
## mean.gained
## <dbl>
## 1 32.5
7.5.6 group_by
The group_by
function implicitly adds grouping information to a data frame.
# group the births by whether mom smoked or not
<- group_by(births, smoke) by_smoking
The object returned by group_by
is a “grouped data frame”:
class(by_smoking)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
Some functions, like count()
and summarize()
(see below) know how to use the grouping information. For example, to count the number of births conditional on mother smoking status we could do:
count(by_smoking)
## # A tibble: 2 × 2
## smoke n
## <chr> <int>
## 1 nonsmoker 100
## 2 smoker 50
group_by
also works with multiple grouping variables, with each added grouping variable specified as an additional argument:
<- group_by(births, smoke, mAge > 35) by_smoking.and.mAge
7.5.7 Combining grouping and summarizing
Grouped data frames can be combined with the summarize
function we saw above. For example, if we wanted to calculate mean birth weight, broken down by whether the baby’s mother smoked or not we could call summarize
with our by_smoking
grouped data frame:
summarize(by_smoking, mean.wt = mean(weight))
## # A tibble: 2 × 2
## smoke mean.wt
## <chr> <dbl>
## 1 nonsmoker 7.18
## 2 smoker 6.78
Similarly to get the mean birth weight of children conditioned on mothers smoking status and age:
summarize(by_smoking.and.mAge, mean(weight))
## `summarise()` has grouped output by
## 'smoke'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## smoke `mAge > 35` `mean(weight)`
## <chr> <lgl> <dbl>
## 1 nonsmoker FALSE 7.17
## 2 nonsmoker TRUE 7.26
## 3 smoker FALSE 6.83
## 4 smoker TRUE 6.30
7.5.8 Scoped variants of mutate
and summarize
Both the mutate()
and summarize()
functions provide “scoped” alternatives, that allow us to apply the operation on a selection of variables. These variants are often used in combination with grouping. We’ll look at the summarize
versions – summarize_all()
, summarize_at()
, and summarize_if()
. See the documentation (?mutate_all
) for descriptions of the mutate
versions.
7.5.8.1 summarize_all()
summarize_all()
applies a one or more functions to all columns in a data frame. Here we illustrate a simple version of this with the iris data:
# group by species
<- group_by(iris, Species)
by_species # calculate the mean of every variable, grouped by species
summarize_all(by_species, mean)
## # A tibble: 3 × 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versicolor 5.94 2.77 4.26 1.33
## 3 virginica 6.59 2.97 5.55 2.03
Note that if we try and apply summarize_all()
in the same way to the grouped data frame by_smoking
we’ll get a bunch of warning messages:
summarize_all(by_smoking, mean)
## # A tibble: 2 × 9
## smoke fAge mAge weeks premature visits gained weight sexBaby
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 nonsmoker NA 26.9 38.6 NA NA NA 7.18 NA
## 2 smoker NA 26 38.5 NA 10.8 NA 6.78 NA
Here’s an example of one of these warnings:
Warning messages:
1: In mean.default(premature) :
argument is not numeric or logical: returning NA
This message is telling us that we can’t apply the mean()
function to the data frame column premature
because this is not a numerical or logical vector. Despite this and the other similar warnings, summarize_all()
does return a result, but the means for any non-numeric values are replaced with NAs, as shown below:
# A tibble: 2 x 9
smoke fAge mAge weeks premature visits gained weight sexBaby
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 nonsmoker NA 26.9 38.6 NA NA NA 7.18 NA
2 smoker NA 26.0 38.5 NA 10.8 NA 6.78 NA
If you examine the output above, you’ll see that there are several variables that are numeric, however we still got NAs when we calculated the grouped means. This is because those variables contain NA values. The mean
function has an optional argument, na.rm
, which tells the function to remove any missing data before calculating the mean. Thus we can modify our call to summarize_all
as follows:
# calculate mean of all variables, grouped by smoking status
summarize_all(by_smoking, mean, na.rm = TRUE)
## # A tibble: 2 × 9
## smoke fAge mAge weeks premature visits gained weight sexBaby
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 nonsmoker 29.8 26.9 38.6 NA 11.9 32.5 7.18 NA
## 2 smoker 29.7 26 38.5 NA 10.8 32.3 6.78 NA
Note that the non-numeric data columns still lead to NA values.
7.5.8.2 summarize_if()
summarize_if()
is similar to summarize_all()
, except it only applies the function of interest to those variables that match a particular predicate (i.e. are TRUE for a particular TRUE/FALSE test).
Here we use summarize_if()
to apply the mean()
function to only those variables (columns) that are numeric.
# calculate mean of all numeric variables, grouped by smoking status
summarize_if(by_smoking, is.numeric, mean, na.rm = TRUE)
## # A tibble: 2 × 7
## smoke fAge mAge weeks visits gained weight
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 nonsmoker 29.8 26.9 38.6 11.9 32.5 7.18
## 2 smoker 29.7 26 38.5 10.8 32.3 6.78
7.5.8.3 summarize_at()
summarize_at()
allows us to apply functions of interest only to specific variables.
# calculate mean of gained and weight variables, grouped by smoking status
summarize_at(by_smoking, c("gained", "weight"), mean, na.rm = TRUE)
## # A tibble: 2 × 3
## smoke gained weight
## <chr> <dbl> <dbl>
## 1 nonsmoker 32.5 7.18
## 2 smoker 32.3 6.78
All three of the scoped summarize functions can also be used to apply multiple functions, by wrapping the function names in a call to dplyr::funs()
:
# calculate mean and std deviation of
# gained and weight variables, grouped by smoking status
summarize_at(by_smoking, c("gained", "weight"), funs(mean, sd), na.rm = TRUE)
## # A tibble: 2 × 5
## smoke gained_mean weight_mean gained_sd weight_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 nonsmoker 32.5 7.18 15.2 1.43
## 2 smoker 32.3 6.78 16.6 1.60
summarize_at()
accepts as the the argument for variables a character vector of column names, a numeric vector of column positions, or a list of columns generated by the dplyr::vars()
function, which can be be used as so:
# reformatted to promote readability of arguments
summarize_at(by_smoking,
vars(gained, weight),
funs(mean, sd),
na.rm = TRUE)
## # A tibble: 2 × 5
## smoke gained_mean weight_mean gained_sd weight_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 nonsmoker 32.5 7.18 15.2 1.43
## 2 smoker 32.3 6.78 16.6 1.60
7.5.9 Combining summarize with grouping aesthetics in ggplot2
We’ve already seen an instance of grouping (conditioning) when we used aesthetics like color or fill to distinguish subgroups in different types of statistical graphics. Below is an example where we integrate information from a group_by
/summarize
operation into a plot:
# calculate mean weights, conditioned on smoking status
<-
wt.by.smoking summarize(by_smoking, mean_weight = mean(weight, na.rm = TRUE))
# create density plot for all the data
# and then use geom_vline to draw vertical lines at the means for
# each group
ggplot(births) +
geom_density(aes(x = weight, color = smoke)) + # data drawn from births
geom_vline(data = wt.by.smoking, # note use of different data frame!
mapping = aes(xintercept = mean_weight, color = smoke),
linetype = 'dashed')
7.6 Pipes
dplyr
includes a very useful operator available called a pipe available to us. Pipes are powerful because they allow us to chain together sets of operations in a very intuitive fashion while minimizing nested function calls. We can think of pipes as taking the output of one function and feeding it as the first argument to another function call, where we’ve already specified the subsequent arguments.
Pipes are actually defined in another packaged called magrittr
. We’ll look at the basic pipe operator and then look at a few additional “special” pipes that magrittr
provides.
7.6.1 Install and load magrittr
In magrittr in not already installed, install it via the command line or the RStudio GUI. Having done so, you will need to load magrittr via the library()
function:
library(magrittr)
7.6.2 The basic pipe operator
The pipe operator is designated by %>%
. Using pipes, the expression x %>% f()
is equivalent to f(x)
and the expression x %>% f(y)
is equivalent to f(x,y)
. The documentation on pipes (see ?magrittr
) uses the notation lhs %>% rhs
where lhs
and rhs
are short for “left-hand side” and “right-hand side” respectively. I’ll use this same notation in some of the explanations that follow.
%>% head() # same as head(births)
births ## # A tibble: 6 × 9
## fAge mAge weeks premature visits gained weight sexBaby smoke
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 31 30 39 full term 13 1 6.88 male smoker
## 2 34 36 39 full term 5 35 7.69 male nonsmoker
## 3 36 35 40 full term 12 29 8.88 male nonsmoker
## 4 41 40 40 full term 13 30 9 female nonsmoker
## 5 42 37 40 full term NA 10 7.94 male nonsmoker
## 6 37 28 40 full term 12 35 8.25 male smoker
%>% head # you can even leave the parentheses out
births ## # A tibble: 6 × 9
## fAge mAge weeks premature visits gained weight sexBaby smoke
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 31 30 39 full term 13 1 6.88 male smoker
## 2 34 36 39 full term 5 35 7.69 male nonsmoker
## 3 36 35 40 full term 12 29 8.88 male nonsmoker
## 4 41 40 40 full term 13 30 9 female nonsmoker
## 5 42 37 40 full term NA 10 7.94 male nonsmoker
## 6 37 28 40 full term 12 35 8.25 male smoker
%>% head(10) # same as head(births, 10)
births ## # A tibble: 10 × 9
## fAge mAge weeks premature visits gained weight sexBaby smoke
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 31 30 39 full term 13 1 6.88 male smoker
## 2 34 36 39 full term 5 35 7.69 male nonsmoker
## 3 36 35 40 full term 12 29 8.88 male nonsmoker
## 4 41 40 40 full term 13 30 9 female nonsmoker
## 5 42 37 40 full term NA 10 7.94 male nonsmoker
## 6 37 28 40 full term 12 35 8.25 male smoker
## 7 35 35 28 premie 6 29 1.63 female nonsmoker
## 8 28 21 35 premie 9 15 5.5 female smoker
## 9 22 20 32 premie 5 40 2.69 male smoker
## 10 36 25 40 full term 13 34 8.75 female nonsmoker
Multiple pipes can be chained together, such that x %>% f() %>% g() %>% h()
is equivalent to h(g(f(x)))
.
# equivalent to: head(arrange(births, weight), 10)
%>% arrange(weight) %>% head(10)
births ## # A tibble: 10 × 9
## fAge mAge weeks premature visits gained weight sexBaby smoke
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 35 35 28 premie 6 29 1.63 female nonsmoker
## 2 NA 18 33 premie 7 40 1.69 male smoker
## 3 NA 38 32 premie 10 16 2.19 female smoker
## 4 17 17 29 premie 4 10 2.63 female nonsmoker
## 5 22 20 32 premie 5 40 2.69 male smoker
## 6 38 37 26 premie 5 25 3.63 male nonsmoker
## 7 25 22 34 premie 10 20 3.75 male nonsmoker
## 8 NA 24 38 full term 16 50 3.75 female nonsmoker
## 9 30 25 35 premie 15 40 4.5 male smoker
## 10 19 20 34 premie 13 6 4.5 male nonsmoker
When there are multiple piping operations, I like to arrange the statements vertically to help emphasize the flow of processing and to facilitate debugging and/or modification. I would usually rearrange the above code block as follows:
%>%
births arrange(weight) %>%
head(10)
## # A tibble: 10 × 9
## fAge mAge weeks premature visits gained weight sexBaby smoke
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 35 35 28 premie 6 29 1.63 female nonsmoker
## 2 NA 18 33 premie 7 40 1.69 male smoker
## 3 NA 38 32 premie 10 16 2.19 female smoker
## 4 17 17 29 premie 4 10 2.63 female nonsmoker
## 5 22 20 32 premie 5 40 2.69 male smoker
## 6 38 37 26 premie 5 25 3.63 male nonsmoker
## 7 25 22 34 premie 10 20 3.75 male nonsmoker
## 8 NA 24 38 full term 16 50 3.75 female nonsmoker
## 9 30 25 35 premie 15 40 4.5 male smoker
## 10 19 20 34 premie 13 6 4.5 male nonsmoker
7.6.3 An example without pipes
To illustrate how pipes help us, first let’s look at an example set of analysis steps without using pipes. Let’s say we wanted to explore the relationship between father’s age and baby’s birth weight. We’ll start this process of exploration by generating a bivariate scatter plot. Being good scientists we want to express our data in SI units, so we’ll need to converts pounds to kilograms. You’ll also recall that a number of the cases have missing data on father’s age, so we’ll want to remove those before we plot them. Here’s how we might accomplish these steps:
# add a new column for weight in kg
<- mutate(births, weight.kg = weight / 2.2)
births.kg
# filter out the NA fathers
<- filter(births.kg, !is.na(fAge))
filtered.births
# create our plot
ggplot(filtered.births, aes(x = fAge, y = weight.kg)) +
geom_point() +
labs(x = "Father's Age (years)", y = "Birth Weight (kg)")
Notice that we created two “temporary” data frames along the way – births.kg
and filtered.births
. These probably aren’t of particular interest to us, but we needed to generate them to build the plot we wanted. If you were particularly masochistic you could avoid these temporary data frames by using nested functions call like this:
# You SHOULD NOT write nested code like this.
# Code like this is hard to debug and understand!
ggplot(filter(mutate(births, weight.kg = weight / 2.2), !is.na(fAge)),
aes(x = fAge, y = weight.kg)) +
geom_point() +
labs(x = "Father's Age (years)", y = "Birth Weight (kg)")
7.6.4 The same example using pipes
The pipe operator makes the output of one statement (lhs
) as the first input of a following function (rhs
). This simplifies the above example to:
%>%
births mutate(weight.kg = weight / 2.2) %>%
filter(!is.na(fAge)) %>%
ggplot(aes(x = fAge, y = weight.kg)) +
geom_point() +
labs(x = "Father's Age (years)", y = "Birth Weight (kg)")
In the example above, we feed the data frame into the mutate
function. mutate
expects a data frame as a first argument, and subsequent arguments specify the new variables to be created. births %>% mutate(weight.kg = weight / 2.2)
is thus equivalent to mutate(births, weight.kg = weight / 2.2))
. We then pipe the output to filter
, removing NA fathers, and then pipe that output as the input to ggplot.
As mentioned previously, it’s good coding style to write each discrete step as its own line when using piping. This make it easier to understand what the steps of the analysis are as well as facilitating changes to the code (commenting out lines, adding lines, etc)
7.6.5 Assigning the output of a statement involving pipes to a variable
It’s important to recognize that pipes are simply a convenient way to chain together a series of expression. Just like any other compound expression, the output of a series of pipe statements can be assigned to a variable, like so:
<-
stats.old.moms %>%
births filter(mAge > 35) %>%
summarize(median.gestation = median(weeks),
mean.weight = mean(weight))
stats.old.moms## # A tibble: 1 × 2
## median.gestation mean.weight
## <dbl> <dbl>
## 1 38 6.94
Note that our summary table, stats.old.moms
, is itself a data frame.
7.6.6 Compound assignment pipe operator
A fairly common operation when working interactively in R is to update an existing data frame. magrittr
defines another pipe operator – %<>%
– called the “compound assignment” pipe operator, to facilitate this. The compound assignment pipe operator has the basic usage lhs %<>% rhs
. This operator evaluates the function on the rhs
using the lhs
as the first argument, and then updates the lhs
with the resulting value. This is simply shorthand for writing lhs <- lhs %>% rhs
.
%<>% # note compound pipe operator!
stats.old.moms mutate(mean.weight.kg = mean.weight / 2.2)
7.6.7 The dot operator with pipes
When working with pipes, sometimes you’ll want to use the lhs
in multiple places on the rhs
, or as something other than the first argument to the rhs
. magrittr
provides for this situation by using the dot (.
) operator as a placeholder. Using the dot operator, the expression y %>% f(x, .)
is equivalent to f(x,y)
.
c("dog", "cakes", "sauce", "house") %>% # create a vector
sample(1) %>% # pick a random single element of that vector
str_c("hot", .) # string concatenate the pick with the word "hot"
## [1] "hothouse"
7.6.8 The exposition pipe operator
magrittr
defines another operator called the “exposition pipe operator”, designed %$%
. This operator exposes the names in the lhs
to the expression on the rhs
.
Here is an example of using the exposition pipe operator to simply return the vector of weights:
%>%
births filter(premature == "premie") %$% # note the different pipe operator!
weight## [1] 1.63 5.50 2.69 6.50 7.81 4.75 3.75 2.19 6.81 4.69 6.75 4.50 5.94 4.50 5.06
## [16] 5.69 1.69 6.31 2.63 5.88 3.63
If we wanted to calculate the minimum and maximum weight of premature babies in the data set we could do the following (though I’d usually prefer summarize()
unless I needed the results in the form of a vector):
%>%
births filter(mAge > 35) %$% # note the different pipe operator!
c(min(weight), max(weight))
## [1] 2.19 10.13