Chapter 5 Functions and control flow statements

5.1 Writing your own functions

So far we’ve been using a variety of built-in functions in R. However, the real power of a programming language is the ability to write your own functions. Functions are a mechanism for organizing and abstracting a set of related computations. We usually write functions to represent sets of computations that we apply frequently, or to represent some conceptually coherent set of manipulations to data.

The simplest form of an R function is function() {expr}. The function() keyword packages the expressions inside {} as a function, which can then be assigned to a name and run by that name. For example,

# assign a function called "PrintSignature"
PrintSignature <- function() {
  paste("Dr. Paul Magwene at Duke University", date())
}
PrintSignature()  # run the function by its name
## [1] "Dr. Paul Magwene at Duke University Fri Feb  3 16:56:47 2023"

The more general form of an R function is as follows:

funcname <- function(arg1, arg2) {
 # one or more expressions that operate on the fxn arguments
 # last expression is the object returned
 # or you can explicitly return an object
}

arg1 and arg2 are function arguments that let you pass different values to the expressions each time you run the function. Function arguments are listed in the parentheses after function and separated by commas. You can add as many arguments as you want. To make this concrete, here’s an example where we define a function to calculate the area of a circle:

area.of.circle <- function(r){
  return(pi * r^2)
}

Since R returns the value of the last expression in the function, the return call is optional and we could have simply written:

area.of.circle <- function(r){
  pi * r^2
}

Very short and concise functions are often written as a single line. In practice I’d probably write the above function as:

area.of.circle <- function(r) {pi * r^2}

The area.of.circle function takes one argument, r, and calculates the area of a circle with radius r. Having defined the function we can immediately put it to use:

area.of.circle(3)
## [1] 28.27433

radius <- 4
area.of.circle(radius)
## [1] 50.26548

If you type a function name without parentheses, R shows you the function’s definition. This works for built-in functions as well (though some of these functions are defined in C code, in which case R will tell you that the function is a .Primitive).
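
As a quick illustration of this behavior, the sketch below prints a user-defined function and a built-in one. The choice of sum as the built-in is ours, and the exact text R prints may differ slightly between versions.

```r
area.of.circle <- function(r) {pi * r^2}

area.of.circle     # no parentheses: R prints the function's definition
sum                # a built-in implemented in C; R reports it as a .Primitive
is.primitive(sum)  # TRUE
```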

5.1.1 Function arguments

Function arguments specify the data that a function operates on or the parameters that the function uses, i.e., the “input” or “independent variable” of the function. Any R object can be passed as a function argument, including bare numbers, vectors, lists, data frames, previously assigned variables, and even other functions. Each argument takes a single R object, so if your input is complicated or of unknown length, it’s better to design arguments that take vectors or lists. Function arguments can be either required or optional. For optional arguments, a default value is used if the argument is not given.
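
To make the idea of passing vectors and even functions as arguments concrete, here is a minimal sketch. The name summarize.with is hypothetical and chosen only for illustration; it is not a built-in R function.

```r
# summarize.with is a hypothetical helper: it takes a vector (values)
# and a function (fxn), and applies the function to the vector
summarize.with <- function(values, fxn) {
  fxn(values)
}

heights <- c(1, 2, 3, 10)
summarize.with(heights, mean)  # 4
summarize.with(heights, max)   # 10
```

Because values is a single vector, the same function works no matter how many numbers it contains.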

Take for example the log function. If you examine the help file for the log function (type ?log now) you’ll see that it takes two arguments, referred to as x and base. The argument x represents the numeric vector you pass to the function and is a required argument (see what happens when you type log() without giving an argument). The argument base is optional. By default the value of base is \(e = 2.71828\ldots\). Therefore, by default the log function returns natural logarithms. If you want logarithms to a different base you can change the base argument as in the following examples:

log(2) # log of 2, base e
## [1] 0.6931472
log(2,2) # log of 2, base 2
## [1] 1
log(2, 4) # log of 2, base 4
## [1] 0.5

Because base 2 and base 10 logarithms are fairly commonly used, there are convenient aliases for calling log with these bases.

log2(8)
## [1] 3
log10(100)
## [1] 2
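
Arguments can also be supplied by name rather than by position, which tends to make calls easier to read. The short sketch below uses only the x and base arguments described above; the input values are arbitrary.

```r
log(8, base = 2)               # naming the optional argument; same as log2(8)
log(base = 2, x = 8)           # named arguments can be given in any order
log(c(1, 10, 100), base = 10)  # x can be a vector: returns 0, 1, 2
```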

5.1.2 Writing functions with optional arguments

To write a function that has an optional argument, you can simply specify the optional argument and its default value in the function definition as so:

# a function to substitute missing values in a vector
sub.missing <- function(x, sub.value = -99){
  x[is.na(x)] <- sub.value
  return(x)
}

You can then use this function as so:

m <- c(1, 2, NA, 4)
sub.missing(m, -999)  # explicitly define sub.value
## [1]    1    2 -999    4
sub.missing(m, sub.value = -333) # more explicit syntax
## [1]    1    2 -333    4
sub.missing(m)   # use default sub.value
## [1]   1   2 -99   4
m  # notice that m wasn't modified within the function
## [1]  1  2 NA  4

Notice that when we called sub.missing with our vector m, the vector did not get modified in the function body. Rather, a new vector, x, was created within the function and returned. However, if you did the missing value substitution outside of a function call, then the vector would be modified:

n <- c(1, 2, NA, 4)
n[is.na(n)] <- -99
n
## [1]   1   2 -99   4
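
If you do want to keep the result of a function like sub.missing, the usual approach is to assign the returned value to a name. The sketch below repeats the definition so that it is self-contained; the variable name cleaned is just an illustrative choice.

```r
# same definition as above, repeated so this example runs on its own
sub.missing <- function(x, sub.value = -99){
  x[is.na(x)] <- sub.value
  return(x)
}

m <- c(1, 2, NA, 4)
cleaned <- sub.missing(m)  # store the result under a new name; m is unchanged
m <- sub.missing(m)        # or overwrite m with the returned vector
m
```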

5.1.3 Putting R functions in Scripts

When you define a function at the interactive prompt and then close the interpreter, your function definition will be lost. The simple way around this is to define your R functions in a script that you can then access at any time.

In RStudio choose File > New File > R Script. This will bring up a blank editor window. Type your function(s) into the editor. Everything in this file will be interpreted as R code, so you should not use the code block notation that is used in Markdown notebooks. Save the source file in your R working directory with a name like myfxns.R.

# functions defined in myfxns.R

area.of.circle <- function(r) {pi * r^2}

area.of.rectangle <- function(l, w) {l * w}

area.of.triangle <- function(b, h) {0.5 * b * h }

Once your functions are in a script file you can make them accessible by using the source function, which reads the named file as input and evaluates any definitions or statements in the input file (see also the Source button in the RStudio GUI):

source("myfxns.R")

Having sourced the file you can now use your functions like so:

radius <- 3
len <- 4
width <- 5
base <- 6
height <- 7

area.of.circle(radius)
## [1] 28.27433
area.of.rectangle(len, width)
## [1] 20
area.of.triangle(base, height)
## [1] 21

Note that if you change the source file, such as correcting a mistake or adding a new function, you need to call the source function again to make those changes available.
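
The following sketch illustrates why re-sourcing is needed. To keep it self-contained it writes a small script to a temporary file rather than using myfxns.R; the function names area.of.square and perimeter.of.square are hypothetical examples.

```r
# write a one-function script to a temporary file and source it
fxn.file <- tempfile(fileext = ".R")
writeLines("area.of.square <- function(s) {s^2}", fxn.file)
source(fxn.file)
area.of.square(3)

# "edit" the script by adding a second function
writeLines(c("area.of.square <- function(s) {s^2}",
             "perimeter.of.square <- function(s) {4 * s}"), fxn.file)

# the new function is not available yet (FALSE in a fresh session)
available.before <- exists("perimeter.of.square")

source(fxn.file)  # re-source to pick up the change
perimeter.of.square(3)
```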

5.2 Control flow statements

Control flow statements control the order of execution of different pieces of code. They can be used to do things like make sure code is only run when certain conditions are met, to iterate through data structures, to repeat something until a specified event happens, etc. Control flow statements are frequently used when writing functions or carrying out complex data transformation.

5.2.1 `if` and `if-else` statements

if and if-else blocks allow you to structure the flow of execution so that certain expressions are executed only if particular conditions are met.

The general form of an if expression is:

if (Boolean expression) {
  Code to execute if 
  Boolean expression is true
}

Here’s a simple if expression in which we check whether a number is less than 0.5, and if so assign a value to a variable.

x <- runif(1)  # runif generates a random number between 0 and 1
face <- NULL  # set face to a NULL value

if (x < 0.5) {
  face <- "heads"
}
face
## NULL

The else clause specifies what to do in the event that the if condition is not true. The combined general form of an if-else expression is:

if (Boolean expression) {
  Code to execute if 
  Boolean expression is true
} else {
  Code to execute if 
  Boolean expression is false
}

Our previous example makes more sense if we include an else clause.

x <- runif(1)

if (x < 0.5) {
  face <- "heads"
} else {
  face <- "tails"
}

face
## [1] "heads"

With the addition of the else statement, this simple code block can be thought of as simulating the toss of a coin.
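
In R, an if-else is itself an expression that evaluates to a value, so the result can be assigned directly. The sketch below uses a fixed value of x rather than runif(1) so that the outcome is predictable.

```r
x <- 0.3  # a fixed value instead of runif(1), so the result is predictable

# the whole if-else evaluates to "heads" or "tails"
face <- if (x < 0.5) "heads" else "tails"
face  # "heads"
```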

5.2.1.1 `if-else` in a function

Let’s take our “if-else” example above and turn it into a function we’ll call coin.flip. A literal re-interpretation of our previous code in the context of a function is something like this:

# coin.flip.literal takes no arguments
coin.flip.literal <- function() {
  x <- runif(1)
  if (x < 0.5) {
    face <- "heads"
  } else {
    face <- "tails"
  }
  face
}

coin.flip.literal is pretty long for what it does — we created a temporary variable x that is only used once, and we created the variable face to hold the results of our if-else statement, but then immediately returned the result. This is inefficient and decreases readability of our function. A much more compact implementation of this function is as follows:

coin.flip <- function() {
  if (runif(1) < 0.5) {
    return("heads")
  } else {
    return("tails")
  }
}

Note that in our new version of coin.flip we don’t bother to create the temporary variables x and face, and we immediately return the results within the if-else statement.
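
Combining this with the optional arguments introduced earlier, one possible extension is a coin whose probability of heads can be changed. The function name biased.coin.flip and the argument p.heads are hypothetical and used only for illustration.

```r
# biased.coin.flip is a hypothetical variant of coin.flip;
# p.heads is the probability of "heads" and defaults to a fair coin
biased.coin.flip <- function(p.heads = 0.5) {
  if (runif(1) < p.heads) {
    return("heads")
  } else {
    return("tails")
  }
}

biased.coin.flip()     # fair coin: "heads" or "tails"
biased.coin.flip(0.9)  # "heads" about 90% of the time
biased.coin.flip(0)    # always "tails", since runif(1) is never below 0
```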

5.2.1.2 Multiple `if-else` statements

When there are more than two possible outcomes of interest, multiple if-else statements can be chained together. Here is an example with three outcomes:

x <- sample(-5:5, 1)  # sample a random integer between -5 and 5

if (x < 0) {
  sign.x <- "Negative"
} else if (x > 0) {
  sign.x <- "Positive"
} else {
  sign.x <- "Zero"
}

sign.x
## [1] "Positive"
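
The same chain of tests can be wrapped in a function so that it can be reused for any number. This is a minimal sketch; the name sign.label is hypothetical.

```r
# sign.label is a hypothetical helper that classifies a single number
sign.label <- function(x) {
  if (x < 0) {
    "Negative"
  } else if (x > 0) {
    "Positive"
  } else {
    "Zero"
  }
}

sign.label(-3)  # "Negative"
sign.label(0)   # "Zero"
sign.label(2)   # "Positive"
```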

5.2.2 for loops

A for statement iterates over the elements of a sequence (such as vectors or lists). A common use of for statements is to carry out a calculation on each element of a sequence (but see the discussion of map below) or to make a calculation that involves all the elements of a sequence.

The general form of a for loop is:

for (elem in sequence) {
  Do some calculations or
  Evaluate one or more expressions
}

As an example, say we wanted to call our coin.flip function multiple times. We could use a for loop to do so as follows:

flips <- c() # empty vector to hold outcomes of coin flips
for (i in 1:20) {
  flips <- c(flips, coin.flip())  # flip coin and add to our vector
}
flips
##  [1] "tails" "heads" "tails" "heads" "tails" "tails" "heads" "tails" "heads"
## [10] "tails" "tails" "heads" "heads" "tails" "tails" "tails" "heads" "tails"
## [19] "heads" "heads"

Let’s use a for loop to create a multi.coin.flip function that accepts an optional argument n that specifies the number of coin flips to carry out:

multi.coin.flip <- function(n = 1) {
  # create an empty character vector of length n
  # it's more efficient to create an empty vector of the right
  # length than to "grow" a vector with each iteration
  flips <- vector(mode="character", length=n)  
  for (i in seq_len(n)) {  # seq_len(n) is empty when n is 0, unlike 1:n
    flips[i] <- coin.flip()
  }
  flips
}

With this new definition, calling multi.coin.flip with no arguments returns a single outcome:

multi.coin.flip()
## [1] "tails"

And calling multi.coin.flip with a numeric argument returns multiple coin flips:

multi.coin.flip(n=10)
##  [1] "tails" "heads" "tails" "heads" "heads" "heads" "heads" "heads" "tails"
## [10] "tails"
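
One caveat when looping over a count: the expression 1:n counts down when n is 0, so a loop over it runs twice instead of not at all. The built-in seq_len function avoids this. The sketch below demonstrates the difference; count.iterations is a hypothetical helper written only for this demonstration.

```r
n <- 0
1:n         # 1 0 (counts down)
seq_len(n)  # integer(0) (an empty sequence)

# count.iterations is a hypothetical helper that reports
# how many times a for loop over idx would run
count.iterations <- function(idx) {
  ct <- 0
  for (i in idx) {
    ct <- ct + 1
  }
  ct
}

count.iterations(1:n)         # 2
count.iterations(seq_len(n))  # 0
```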

5.2.3 `break` statement

A break statement allows you to exit a loop even if it hasn’t completed. This is useful for ending a loop when some criterion has been satisfied. break statements are usually nested in if statements.

In the following example we use a break statement inside a for loop. In this example, we pick random real numbers between 0 and 1, accumulating them in a vector (random.numbers). The for loop ensures that we never pick more than 20 random numbers before the loop ends. However, the break statement allows the loop to end early if the number picked is greater than 0.95.

random.numbers <- c()

for (i in 1:20) {
  x <- runif(1)
  random.numbers <- c(random.numbers, x)
  if (x > 0.95) {
    break
  }
}

random.numbers
##  [1] 0.33824003 0.27150893 0.80147861 0.82639203 0.72485470 0.45984222
##  [7] 0.33742675 0.13972448 0.70618396 0.30628448 0.83748920 0.38625625
## [13] 0.72688588 0.83052833 0.26359268 0.81387537 0.65306014 0.84585978
## [19] 0.01309625 0.49304270
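
Because the example above is random, the loop may or may not end early on any given run. Here is a deterministic sketch of the same pattern, using a fixed vector (the values are made up) to find the position of the first element greater than 0.95.

```r
values <- c(0.2, 0.7, 0.96, 0.1, 0.99)  # made-up values for illustration
first.big <- NA  # stays NA if no element exceeds the threshold

for (i in seq_along(values)) {
  if (values[i] > 0.95) {
    first.big <- i  # record the position and stop looking
    break
  }
}

first.big  # 3
```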

5.2.4 `repeat` loops

A repeat loop will loop indefinitely until we explicitly break out of the loop with a break statement. For example, here’s how we can use repeat and break to simulate flipping coins until we get a head:

ct <- 0
repeat {
  flip <- coin.flip()
  ct <- ct + 1
  if (flip == "heads"){
    break
  }
}

ct
## [1] 1
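
For a version with a predictable result, the following sketch uses repeat and break to count how many times a value must be doubled before it exceeds 1000. The starting value and threshold are arbitrary choices.

```r
value <- 1
steps <- 0

repeat {
  value <- value * 2
  steps <- steps + 1
  if (value > 1000) {  # without this break the loop would never end
    break
  }
}

steps  # 10, since 2^10 = 1024 is the first power of 2 above 1000
```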

5.2.5 `next` statement

A next statement allows you to halt the processing of the current iteration of a loop and immediately move to the next item of the loop. This is useful when you want to skip calculations for certain elements of a sequence:

sum.not.div3 <- 0

for (i in 1:20) {
  if (i %% 3 == 0) { # skip summing values that are evenly divisible by three
    next
  }
  sum.not.div3 <- sum.not.div3 + i
}
sum.not.div3
## [1] 147
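
As a cross-check, the same sum can be computed without a loop by using logical indexing. This is an alternative approach, not a replacement for understanding next.

```r
i <- 1:20
keep <- i %% 3 != 0  # TRUE for values not evenly divisible by three
sum(i[keep])         # 147, matching the loop above
```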

5.2.6 while statements

A while statement iterates as long as the condition it contains is true. In the following example, the while loop calls coin.flip until “heads” is the result, and keeps track of the number of flips. Note that this represents the same logic as the repeat-break example we saw earlier, but in a more compact form.

first.head <- 1

while(coin.flip() == "tails"){
  first.head <- first.head + 1
}

first.head
## [1] 1
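
Here is a deterministic sketch of a while loop, counting how many times 100 must be halved before it drops below 1. It also shows that the body never runs if the condition is false at the start. The numbers are arbitrary.

```r
n <- 100
halvings <- 0

while (n >= 1) {
  n <- n / 2
  halvings <- halvings + 1
}

halvings  # 7, since 100 / 2^7 = 0.78125

# if the condition is false to begin with, the body is skipped entirely
skipped <- TRUE
while (FALSE) {
  skipped <- FALSE
}
skipped  # TRUE
```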

5.2.7 `ifelse`

The ifelse function is a vectorized equivalent of a for loop with a nested if-else statement. ifelse applies the specified test to each element of a vector and returns different values depending on whether the test is true or false.

Here’s an example of using ifelse to replace NA elements in a vector with zeros.

x <- c(3, 1, 4, 5, 9, NA, 2, 6, 5, 4)
newx <- ifelse(is.na(x), 0, x)
newx
##  [1] 3 1 4 5 9 0 2 6 5 4

The equivalent for-loop could be written as:

x <- c(3, 1, 4, 5, 9, NA, 2, 6, 5, 4)
newx <- c()  # create an empty vector
for (elem in x) {
  if (is.na(elem)) {
    newx <- c(newx, 0)  # append zero to newx
  } else {
    newx <- c(newx, elem)  # append elem to newx
  }
}
newx
##  [1] 3 1 4 5 9 0 2 6 5 4

The ifelse function is clearly a more compact and readable way to accomplish this.
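
ifelse can also return labels rather than modified copies of the input. As a sketch, here it converts a vector of numbers between 0 and 1 into coin-flip outcomes in a single call; the input values are fixed so the result is predictable.

```r
draws <- c(0.1, 0.7, 0.4, 0.9)  # fixed values standing in for runif(4)
faces <- ifelse(draws < 0.5, "heads", "tails")
faces  # "heads" "tails" "heads" "tails"
```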

5.3 `map` and related tools

Another common situation is applying a function to every element of a list or vector. Again, we could use a for loop, but the map functions are often better alternatives.

NOTE: map is a relative newcomer to R and must be loaded with the purrr package (purrr is loaded when we load tidyverse). Although base R has a complicated series of “apply” functions (apply, lapply, sapply, vapply, mapply), map provides similar functionality with a more consistent interface. We won’t use the apply functions in this class, but you may see them in older code.

library(tidyverse)
## ── Attaching packages ────────────────────
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ──── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

5.3.1 basic `map`

Typically, map takes two arguments – a sequence (a vector, list, or data frame) and a function. It then applies the function to each element of the sequence, returning the results as a list.

To illustrate map, let’s consider an example with a list of 2-vectors, where each vector gives the min and max values of some variable of interest for individuals in a sample (e.g. resting heart rate and maximum heart rate during exercise). We can use the map function to quickly generate the difference between the resting and maximum heart rates:

heart.rates <- list(bob = c(60, 120), fred = c(79, 150), jim = c(66, 110))
diff.fxn <- function(x) {x[2] - x[1]}

map(heart.rates, diff.fxn)
## $bob
## [1] 60
## 
## $fred
## [1] 71
## 
## $jim
## [1] 44

As a second example, here’s how we could use map to get the class of each object in a list:

x <- list(c(1,2,3), "a", "b", list(lead = "Michael", keyboard = "Jermaine"))
map(x, class)
## [[1]]
## [1] "numeric"
## 
## [[2]]
## [1] "character"
## 
## [[3]]
## [1] "character"
## 
## [[4]]
## [1] "list"
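
The function passed to map does not have to be defined ahead of time. The sketch below repeats the heart rate calculation using an anonymous function and purrr's formula shorthand, in which .x stands for the current element. It assumes the purrr package is installed.

```r
library(purrr)

heart.rates <- list(bob = c(60, 120), fred = c(79, 150), jim = c(66, 110))

# anonymous function defined in place
diffs.anon <- map(heart.rates, function(x) x[2] - x[1])

# purrr's formula shorthand: .x refers to the current element
diffs.formula <- map(heart.rates, ~ .x[2] - .x[1])

diffs.anon$fred  # 71
```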

5.3.2 `map_if` and `map_at`

map_if is a variant of map that takes a predicate function (a function that evaluates to TRUE or FALSE) to determine which elements of the input sequence are transformed by the map function. All elements of the sequence that do not meet the predicate are left un-transformed. Like map, map_if always returns a list.

Here’s an example where we use map_if to apply the stringr::str_to_upper function to those columns of a data frame that are character vectors, and apply abs to obtain the absolute value of a numeric column:

a <- rnorm(6)
b <- c("a", "b", "c", "d", "e", "f")
c <- c("u", "v", "w", "x", "y", "z")
df <- tibble(a, b, c)  # tibble() replaces the deprecated data_frame()
head(df)
## # A tibble: 6 × 3
##         a b     c    
##     <dbl> <chr> <chr>
## 1 -0.498  a     u    
## 2  0.428  b     v    
## 3 -0.371  c     w    
## 4  0.0106 d     x    
## 5  0.591  e     y    
## 6  0.807  f     z

df2 <- map_if(df, is.character, str_to_upper)
df2 <- map_if(df2, is.numeric, abs)
head(df2)
## $a
## [1] 0.49798655 0.42817641 0.37114033 0.01056529 0.59110949 0.80657685
## 
## $b
## [1] "A" "B" "C" "D" "E" "F"
## 
## $c
## [1] "U" "V" "W" "X" "Y" "Z"

Note that df2 is a list, not a data frame. We can convert df2 to a data frame df3 using the as_tibble() function (as_data_frame() is a deprecated alias for it):

# Next, create data frame df3
df3 <- as_tibble(df2)
head(df3)
## # A tibble: 6 × 3
##        a b     c    
##    <dbl> <chr> <chr>
## 1 0.498  A     U    
## 2 0.428  B     V    
## 3 0.371  C     W    
## 4 0.0106 D     X    
## 5 0.591  E     Y    
## 6 0.807  F     Z

Note that if our goal is to apply functions to the columns of a data frame, it may be easier with dplyr::mutate():


df4 <- df %>% as_tibble() %>%
  mutate(a = abs(a),
         b = str_to_upper(b),
         c = str_to_upper(c))

head(df4)
## # A tibble: 6 × 3
##        a b     c    
##    <dbl> <chr> <chr>
## 1 0.498  A     U    
## 2 0.428  B     V    
## 3 0.371  C     W    
## 4 0.0106 D     X    
## 5 0.591  E     Y    
## 6 0.807  F     Z
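
map_at, the other variant named in this section's heading, selects elements by name or position rather than by a predicate. The following is a minimal sketch assuming purrr is installed; the list measurements and its contents are made up for illustration.

```r
library(purrr)

measurements <- list(a = -1.5, b = "x", c = -3)  # made-up example data

# apply abs only to the elements named "a" and "c"; "b" is left as-is
result <- map_at(measurements, c("a", "c"), abs)
str(result)
```

Like map_if, map_at returns a list with the unselected elements left untransformed.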

5.3.3 mapping in parallel using `map2`

The map2 function applies a transformation function to two sequences in parallel. The following example illustrates this:

first.names <- c("John", "Mary", "Fred")
last.names <- c("Smith", "Hernandez", "Kidogo")

map2(first.names, last.names, str_c, sep=" ")
## [[1]]
## [1] "John Smith"
## 
## [[2]]
## [1] "Mary Hernandez"
## 
## [[3]]
## [1] "Fred Kidogo"

Note how we can specify arguments to the transformation function as additional arguments to map2 (i.e., the sep argument gets passed to str_c).
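
map2 works with any function of two arguments, including ones we write ourselves. As a sketch, here it pairs up lengths and widths and passes each pair to the area.of.rectangle function defined earlier in the chapter (repeated here so the example is self-contained). The input values are arbitrary, and purrr is assumed to be installed.

```r
library(purrr)

area.of.rectangle <- function(l, w) {l * w}

lens <- c(4, 2, 10)
widths <- c(5, 3, 0.5)

# the i-th length is paired with the i-th width
areas <- map2(lens, widths, area.of.rectangle)
str(areas)  # a list containing 20, 6, and 5
```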

5.3.4 `map` variants that return vectors

map, map_if, and map_at always return lists. The purrr package also has a series of map variants that return vectors:

map_lgl (for logical vectors)
map_chr (for character vectors)
map_int (integer vectors)
map_dbl (double vectors)

# compare the outputs of map and map_chr
a <- map(letters[1:6], str_to_upper)
str(a)
## List of 6
##  $ : chr "A"
##  $ : chr "B"
##  $ : chr "C"
##  $ : chr "D"
##  $ : chr "E"
##  $ : chr "F"

b <- map_chr(letters[1:6], str_to_upper)
str(b) # a vector
##  chr [1:6] "A" "B" "C" "D" "E" "F"

Here’s an example using map_dbl, where we create a data frame with three columns, and compute the median of each column:

# Make data frame for analysis
df <- tibble(a = rnorm(100), b = rnorm(100), c = rnorm(100))

map_dbl(df, median) # median of each column of df
##           a           b           c 
##  0.14931120  0.13070397 -0.09626582
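
The remaining variants work the same way. Here is a brief sketch of map_int and map_lgl applied to a small list; the list items is made up for illustration, and purrr is assumed to be installed.

```r
library(purrr)

items <- list(1:3, letters, c(TRUE, FALSE))  # made-up example list

map_int(items, length)      # integer vector: 3 26 2
map_lgl(items, is.numeric)  # logical vector: TRUE FALSE FALSE
```

Each variant raises an error if the function's result cannot be stored in the requested vector type, which can help catch mistakes early.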
