Chapter 4 Data structures

In computer science, the term “data structure” refers to the ways that data are stored, retrieved, and organized in a computer’s memory. Common examples include lists, hash tables (also called dictionaries), sets, queues, and trees. Different types of data structures are used to support different types of operations on data.

In R, the three basic data structures are vectors, lists, and data frames.

4.1 Vectors

Vectors are the core data structure in R. A vector stores an ordered sequence of items, all of the same type (i.e. the data in a vector are “homogeneous” with respect to their type).

The simplest way to create a vector at the interactive prompt is to use the c() function, which is shorthand for “combine” or “concatenate”.

x <- c(2, 4, 6, 8)  # create a vector and assign it to the variable name `x`
x
## [1] 2 4 6 8

Vectors in R always have a type (accessed with the typeof() function) and a length (accessed with the length() function).

length(x)
## [1] 4
typeof(x)
## [1] "double"

Vectors don’t have to be numerical; logical and character vectors work just as well.

y <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
y
## [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE
typeof(y)
## [1] "logical"
length(y)
## [1] 6

z <- c("How", "now", "brown", "cow")
z
## [1] "How"   "now"   "brown" "cow"
typeof(z)
## [1] "character"
length(z)
## [1] 4
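Because a vector can hold only one type, c() converts (“coerces”) its arguments to a common type when they differ. The following is a minimal sketch of that behavior; the variable names mixed and nums are illustrative only.

```r
# mixing numbers, strings, and logicals: everything becomes a character string
mixed <- c(1, "a", TRUE)
typeof(mixed)
## [1] "character"
mixed
## [1] "1"    "a"    "TRUE"

# mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
nums <- c(1.5, TRUE, FALSE)
typeof(nums)
## [1] "double"
nums
## [1] 1.5 1.0 0.0
```

No warning is given when this happens, so it is worth checking typeof() if a vector behaves unexpectedly.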

You can also use c() to concatenate two or more vectors together.

x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 7, 9)  # create another vector, labeled y
xy <- c(x,y)  # combine two vectors
xy
## [1] 2 4 6 8 1 3 5 7 9

z <- c(pi/4, pi/2, pi, 2*pi)
xyz <- c(x, y, z)  # combine three vectors
xyz
##  [1] 2.0000000 4.0000000 6.0000000 8.0000000 1.0000000 3.0000000 5.0000000
##  [8] 7.0000000 9.0000000 0.7853982 1.5707963 3.1415927 6.2831853

4.1.1 Vector arithmetic

The basic R arithmetic operations work on numeric vectors as well as on single numbers (in fact, behind the scenes in R single numbers are vectors!).

x <- c(2, 4, 6, 8, 10)
x * 2  # multiply each element of x by 2
## [1]  4  8 12 16 20
x - pi # subtract pi from each element of x
## [1] -1.1415927  0.8584073  2.8584073  4.8584073  6.8584073

y <- c(0, 1, 3, 5, 9)
x + y  # add together each matching element of x and y
## [1]  2  5  9 13 19
x * y # multiply each matching element of x and y
## [1]  0  4 18 40 90
x/y # divide each matching element of x and y
## [1]      Inf 4.000000 2.000000 1.600000 1.111111
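The remark that single numbers are themselves vectors can be checked directly. This is a small sketch; the variable name n is arbitrary.

```r
n <- 42
length(n)             # a single number is a vector of length one
## [1] 1
is.vector(n)
## [1] TRUE
identical(n, c(42))   # it is the same object that c() would build
## [1] TRUE
```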

Basic numerical functions operate element-wise on numerical vectors:

sin(x)
## [1]  0.9092974 -0.7568025 -0.2794155  0.9893582 -0.5440211
cos(x * pi)
## [1] 1 1 1 1 1
log(x)
## [1] 0.6931472 1.3862944 1.7917595 2.0794415 2.3025851

4.1.2 Vector recycling

When vectors are not of the same length, R “recycles” the elements of the shorter vector to make the lengths conform.

x <- c(2, 4, 6, 8, 10)
length(x)
## [1] 5
z <- c(1, 4, 7, 11)
length(z)
## [1] 4
x + z
## [1]  3  8 13 19 11

In the example above, z was treated as if it were the vector (1, 4, 7, 11, 1).

Recycling can be useful, but it can also be a subtle source of errors. R issues a warning when the length of the longer vector is not a multiple of the length of the shorter one (as is the case here, with lengths 5 and 4), but recycling is silent when the lengths divide evenly. Pay attention to such warnings when debugging your code.
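To see when recycling happens silently and when it produces a warning, consider the following sketch. The vectors a and b are made up for illustration, and tryCatch() is used only to capture the warning text so that it can be displayed.

```r
a <- c(1, 2, 3, 4, 5, 6)
b <- c(10, 20)
a + b   # length 6 is a multiple of length 2: b is recycled three times, silently
## [1] 11 22 13 24 15 26

# lengths 5 and 2 do not divide evenly, so R still recycles but also warns
msg <- tryCatch(c(1, 2, 3, 4, 5) + b,
                warning = function(w) conditionMessage(w))
msg
## [1] "longer object length is not a multiple of shorter object length"
```

The silent case is the one to watch for, since nothing signals that the two vectors had different lengths.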

4.1.3 Simple statistical functions for numeric vectors

Now that we’ve introduced vectors as the simplest data structure for holding collections of numerical values, we can introduce a few of the most common statistical functions that operate on such vectors.

First let’s create a vector to hold our sample data of interest. Here I’ve taken a random sample of the lengths of the last names of students enrolled in Bio 723 during Spring 2018.

len.name <- c(7, 7, 6, 2, 9, 9, 7, 4, 10, 5)

Some common statistics of interest include the sum, minimum, maximum, mean, median, variance, and standard deviation:

sum(len.name)
## [1] 66
min(len.name)
## [1] 2
max(len.name)
## [1] 10
mean(len.name)
## [1] 6.6
median(len.name)
## [1] 7
var(len.name)  # variance
## [1] 6.044444
sd(len.name)   # standard deviation
## [1] 2.458545
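As a check on what var() and sd() compute, the sample variance and standard deviation can be reproduced with the vector arithmetic introduced earlier. This is a sketch; manual.var and manual.sd are names chosen here for illustration.

```r
len.name <- c(7, 7, 6, 2, 9, 9, 7, 4, 10, 5)
n <- length(len.name)

# sample variance: sum of squared deviations from the mean, divided by n - 1
manual.var <- sum((len.name - mean(len.name))^2) / (n - 1)
manual.var
## [1] 6.044444

# the standard deviation is the square root of the variance
manual.sd <- sqrt(manual.var)
manual.sd
## [1] 2.458545

all.equal(manual.var, var(len.name))
## [1] TRUE
```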

The summary() function applied to a vector of doubles produces a useful table of some of these key statistics:

summary(len.name)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    5.25    7.00    6.60    8.50   10.00

4.1.4 Indexing vectors

Accessing the elements of a vector is called “indexing”. Indexing is the process of specifying the numerical positions (indices) of the elements that you want to access.

For a vector of length \(n\), we can access the elements by the indices \(1 \ldots n\). We say that R vectors (and other data structures like lists) are “one-indexed”. Many other programming languages, such as Python, C, and Java, use zero-indexing where the elements of a data structure are accessed by the indices \(0 \ldots n-1\). Indexing errors are a common source of bugs.

Indexing a vector is done by specifying the index in square brackets as shown below:

x <- c(2, 4, 6, 8, 10)
length(x)
## [1] 5

x[1]  # return the 1st element of x
## [1] 2

x[4]  # return the 4th element of x
## [1] 8
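Since off-by-one mistakes are easy to make when moving between zero-indexed and one-indexed languages, it helps to know what R does with indices outside the range 1 to n. A brief sketch:

```r
x <- c(2, 4, 6, 8, 10)
x[length(x)]  # the last element is at index length(x), not length(x) - 1
## [1] 10
x[0]          # index 0 is not an error; it returns an empty vector
## numeric(0)
x[6]          # an index past the end returns NA rather than an error
## [1] NA
```

Neither out-of-range case stops the program, which is part of why indexing bugs can go unnoticed.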

Negative indices are used to exclude particular elements. x[-1] returns all elements of x except the first.

x[-1]
## [1]  4  6  8 10

You can get multiple elements of a vector by indexing by another vector. In the example below, x[c(3,5)] returns the third and fifth elements of x.

x[c(3,5)]
## [1]  6 10

Besides numerical indexing, R allows logical indexing, which takes a vector of Booleans and returns the elements at the positions with TRUE values.

x[c(TRUE, FALSE, TRUE, FALSE, FALSE)]  # return 1st and 3rd elements but ignore 2nd, 4th and 5th
## [1] 2 6

4.1.5 Comparison operators applied to vectors

When the comparison operators, such as “greater than” (>), “less than or equal to” (<=), equality (==), etc., are applied to numeric vectors, they return logical vectors:

x <- c(2, 4, 6, 8, 10, 12)
x < 8  # returns TRUE for all elements less than 8
## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

Here’s a fancier example:

x > 4 & x < 10  # greater than 4 AND less than 10
## [1] FALSE FALSE  TRUE  TRUE FALSE FALSE

4.1.6 Combining indexing and comparison of vectors

A very powerful feature of R is the ability to combine the comparison operators (which return TRUE or FALSE values) with indexing. This facilitates data filtering and subsetting.

Here’s an example:

x <- c(2, 4, 6, 8, 10) 
x[x > 5]
## [1]  6  8 10

In the first example we retrieved all the elements of x that are larger than 5 (read as “x where x is greater than 5”). Notice how we got back all the elements where the statement in the brackets was TRUE.

You can string together comparisons for more complex filtering.

x[x < 4 | x > 8]  # less than four OR greater than 8
## [1]  2 10

In the second example we retrieved those elements of x that were smaller than four or greater than eight. Combining indexing and comparison is a concept which we’ll use repeatedly in this course.

4.1.7 Vector manipulation

You can combine indexing with assignment to change the elements of a vector:

x <- c(2, 4, 6, 8, 10)
x[2] <- -4 
x
## [1]  2 -4  6  8 10

You can also use indexing vectors to change multiple values at once:

x <- c(2, 4, 6, 8, 10)
x[c(1, 3, 5)]  <- 6
x
## [1] 6 4 6 8 6

Using logical vectors to manipulate the elements of a vector also works:

x <- c(2, 4, 6, 8, 10)
x[x > 5] <- 5    # cap all values at a maximum of 5
x
## [1] 2 4 5 5 5
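Assignment by index can also extend a vector. In this short sketch, assigning to a position past the current end grows the vector, and any skipped positions are filled with NA:

```r
x <- c(2, 4, 6)
x[5] <- 10   # position 4 was skipped, so it is filled with NA
x
## [1]  2  4  6 NA 10
length(x)
## [1] 5
```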

4.1.8 Vectors from regular sequences

There are a variety of functions for creating regular sequences in the form of vectors.

1:10  # create a vector with the integer values from 1 to 10
##  [1]  1  2  3  4  5  6  7  8  9 10
20:11  # a vector with the integer values from 20 to 11
##  [1] 20 19 18 17 16 15 14 13 12 11

seq(1, 10)  # like 1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1, 10, by = 2) # 1:10, in steps of 2
## [1] 1 3 5 7 9
seq(2, 4, by = 0.25) # 2 to 4, in steps of 0.25
## [1] 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00
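Two further base R tools for building regular vectors are the length.out argument of seq() and the rep() function. The values used below are arbitrary and chosen only for illustration.

```r
seq(0, 1, length.out = 5)   # 5 evenly spaced values from 0 to 1
## [1] 0.00 0.25 0.50 0.75 1.00
rep(c("M", "F"), 3)         # repeat the whole vector 3 times
## [1] "M" "F" "M" "F" "M" "F"
rep(c("M", "F"), each = 3)  # repeat each element 3 times
## [1] "M" "M" "M" "F" "F" "F"
```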

4.1.9 Additional functions for working with vectors

The function unique() returns the unique items in a vector:

x <- c(5, 2, 1, 4, 6, 9, 8, 5, 7, 9)
unique(x)
## [1] 5 2 1 4 6 9 8 7

rev() returns the items in reverse order (without changing the input vector):

y <- rev(x)
y
##  [1] 9 7 5 8 9 6 4 1 2 5
x  # x is still in original order
##  [1] 5 2 1 4 6 9 8 5 7 9

There are a number of useful functions related to sorting. Plain sort() returns a new vector with the items in sorted order:

sorted.x <- sort(x)  # returns items of x sorted
sorted.x
##  [1] 1 2 4 5 5 6 7 8 9 9

x        # but x remains in its unsorted state
##  [1] 5 2 1 4 6 9 8 5 7 9

The related function order() gives the indices which would rearrange the items into sorted order:

order(x)
##  [1]  3  2  4  1  8  5  9  7  6 10

order() can be useful when you want to sort one vector by the values of another:

students <- c("fred", "tabitha", "beatriz", "jose")
class.ranking <- c(4, 2, 1, 3)

students[order(class.ranking)]  # get the students sorted by their class.ranking
## [1] "beatriz" "tabitha" "jose"    "fred"
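The relationship between order() and sort() can be verified directly, and order() also accepts a decreasing argument for reverse sorting. A short sketch using the same data as above:

```r
x <- c(5, 2, 1, 4, 6, 9, 8, 5, 7, 9)
identical(x[order(x)], sort(x))  # indexing by order(x) reproduces sort(x)
## [1] TRUE

students <- c("fred", "tabitha", "beatriz", "jose")
class.ranking <- c(4, 2, 1, 3)
students[order(class.ranking, decreasing = TRUE)]  # ranking from 4 down to 1
## [1] "fred"    "jose"    "tabitha" "beatriz"
```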

any() and all() return single boolean values based on a specified comparison provided as an argument:

y <- c(2, 4, 5, 6, 8)

any(y > 5) # returns TRUE if any of the elements are TRUE
## [1] TRUE

all(y > 5) # returns TRUE if all of the elements are TRUE
## [1] FALSE

which() returns the indices of the vector for which the input is true:

which(y > 5)
## [1] 4 5
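The indices returned by which() can be used for indexing in turn, and the related base R functions which.max() and which.min() return the position of the largest and smallest element. A brief sketch:

```r
y <- c(2, 4, 5, 6, 8)
y[which(y > 5)]   # indexing by position gives the same result as y[y > 5]
## [1] 6 8
which.max(y)      # position of the largest element
## [1] 5
which.min(y)      # position of the smallest element
## [1] 1
```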

4.2 Lists

R lists are like vectors, but unlike a vector where all the elements are of the same type, the elements of a list can have arbitrary types (even other lists). Lists are a powerful data structure for organizing information, because there are few constraints on the shape or types of the data included in a list.

Lists are easy to create:

l <- list('Bob', pi, 10)

Note that lists can contain arbitrary data. Lists can even contain other lists:

l <- list('Bob', pi, 10, list("foo", "bar", "baz", "qux"))

Lists are displayed with a particular format, distinct from vectors:

l
## [[1]]
## [1] "Bob"
## 
## [[2]]
## [1] 3.141593
## 
## [[3]]
## [1] 10
## 
## [[4]]
## [[4]][[1]]
## [1] "foo"
## 
## [[4]][[2]]
## [1] "bar"
## 
## [[4]][[3]]
## [1] "baz"
## 
## [[4]][[4]]
## [1] "qux"

In the example above, the correspondence between the list and its display is obvious for the first three items. The fourth element may be a little confusing at first. Remember that the fourth item of l was another list. So what’s being shown in the output for the fourth item is the nested list.

An alternative way to display a list is using the str() function (short for “structure”). str() provides a more compact representation that also tells us what type of data each element is:

str(l)
## List of 4
##  $ : chr "Bob"
##  $ : num 3.14
##  $ : num 10
##  $ :List of 4
##   ..$ : chr "foo"
##   ..$ : chr "bar"
##   ..$ : chr "baz"
##   ..$ : chr "qux"

4.2.1 Length and type of lists

Like vectors, lists have length:

length(l)
## [1] 4

But the type of a list is simply “list”, not the type of the items within the list. This makes sense because lists are allowed to be heterogeneous (i.e. hold data of different types).

typeof(l)
## [1] "list"

4.2.2 Indexing lists

Lists have two indexing operators. Indexing a list with single brackets, like we did with vectors, returns a new list containing the element at index \(i\). Lists also support double bracket indexing (x[[i]]) which returns the bare element at index \(i\) (i.e. the element without the enclosing list). This is a subtle but important point so make sure you understand the difference between these two forms of indexing.

4.2.2.1 Single bracket list indexing

First, let’s demonstrate single bracket indexing of the list l we created above.

l[1]           # single brackets, returns list('Bob') 
## [[1]]
## [1] "Bob"
typeof(l[1])   # notice the list type
## [1] "list"

When using single brackets, lists support indexing with ranges and numeric vectors:

l[3:4]
## [[1]]
## [1] 10
## 
## [[2]]
## [[2]][[1]]
## [1] "foo"
## 
## [[2]][[2]]
## [1] "bar"
## 
## [[2]][[3]]
## [1] "baz"
## 
## [[2]][[4]]
## [1] "qux"
l[c(1, 3, 5)]  # l has only four items, so index 5 yields NULL in that position
## [[1]]
## [1] "Bob"
## 
## [[2]]
## [1] 10
## 
## [[3]]
## NULL

4.2.2.2 Double bracket list indexing

If double bracket indexing is used, the object at the given index in a list is returned:

l[[1]]         # double brackets, return plain 'Bob'
## [1] "Bob"
typeof(l[[1]]) # notice the 'character' type
## [1] "character"

Double bracket indexing cannot be used to extract several elements at once, but you can chain together double bracket operators to pull out the items of sublists. For example:

# second item of the fourth item of the list
l[[4]][[2]]  
## [1] "bar"

4.2.3 Naming list elements

The elements of a list can be given names when the list is created:

p <- list(first.name='Alice', last.name="Qux", age=27, years.in.school=10)

You can retrieve the names associated with a list using the names function:

names(p)
## [1] "first.name"      "last.name"       "age"             "years.in.school"

If a list has named elements, you can retrieve the corresponding elements by indexing with the quoted name in either single or double brackets. Consistent with previous usage, single brackets return a list with the corresponding named element, whereas double brackets return the bare element.

For example, make sure you understand the difference in the output generated by these two indexing calls:

p["first.name"]
## $first.name
## [1] "Alice"

p[["first.name"]]
## [1] "Alice"

4.2.4 The $ operator

Retrieving named elements of lists (and data frames as we’ll see) turns out to be a pretty common task (especially when doing interactive data analysis) so R has a special operator to make this more convenient. This is the $ operator, which is used as illustrated below:

p$first.name  # equivalent to p[["first.name"]]
## [1] "Alice"
p$age         # equivalent to p[["age"]]
## [1] 27
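One difference between $ and double brackets is worth knowing about: $ will match an unambiguous prefix of a name, while double brackets require the exact name by default. The sketch below re-creates the list p so that it runs on its own; middle.name is a deliberately nonexistent name.

```r
p <- list(first.name = "Alice", last.name = "Qux", age = 27, years.in.school = 10)

p$first          # `$` accepts an unambiguous prefix of a name (partial matching)
## [1] "Alice"
p[["first"]]     # `[[` requires the exact name by default, so this finds nothing
## NULL
p$middle.name    # a name that does not exist returns NULL rather than an error
## NULL
```

Because a misspelled name returns NULL silently, typos in element names can be hard to spot.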

4.2.5 Changing and adding list items

Combining indexing and assignment allows you to change items in a list:

suspect <- list(first.name = "unknown", 
                last.name = "unknown", 
                aka = "little")

suspect$first.name <- "Bo"
suspect$last.name <- "Peep"
suspect[[3]] <- "LITTLE"

str(suspect)
## List of 3
##  $ first.name: chr "Bo"
##  $ last.name : chr "Peep"
##  $ aka       : chr "LITTLE"

By combining assignment with a new name or an index past the end of the list you can add items to a list:

suspect$age <- 17  # add a new item named age
suspect[[5]] <- "shepardess"   # create an unnamed item at position 5

Be careful when adding an item using indexing, because if you skip an index an intervening NULL value is created:

# there are only five items in the list, what happens if we
# add a new item at position seven?
suspect[[7]] <- "wanted for sheep stealing"

str(suspect)
## List of 7
##  $ first.name: chr "Bo"
##  $ last.name : chr "Peep"
##  $ aka       : chr "LITTLE"
##  $ age       : num 17
##  $           : chr "shepardess"
##  $           : NULL
##  $           : chr "wanted for sheep stealing"
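Items can be removed from a list as well. In base R, assigning NULL to an element with $ or double brackets deletes that element. A minimal sketch, using a made-up list s:

```r
s <- list(a = 1, b = 2, c = 3)
s$b <- NULL     # assigning NULL deletes the element named b
names(s)
## [1] "a" "c"
length(s)
## [1] 2
```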

4.2.6 Combining lists

The c (combine) function we introduced to create vectors can also be used to combine lists:

list.a <- list("little", "bo", "peep")
list.b <- list("has lost", "her", "sheep")
list.c <- c(list.a, list.b)
list.c
## [[1]]
## [1] "little"
## 
## [[2]]
## [1] "bo"
## 
## [[3]]
## [1] "peep"
## 
## [[4]]
## [1] "has lost"
## 
## [[5]]
## [1] "her"
## 
## [[6]]
## [1] "sheep"

4.2.7 Converting lists to vectors

Sometimes it’s useful to convert a list to a vector. The unlist() function takes care of this for us.

# a homogeneous list
ex1 <- list(2, 4, 6, 8)
unlist(ex1)
## [1] 2 4 6 8

When you convert a list to a vector make sure you remember that vectors are homogeneous, so items within the new vector will be “coerced” to have the same type.

# a heterogeneous list
ex2 <- list(2, 4, 6, c("bob", "fred"), list(1 + 0i, 'foo'))
unlist(ex2)
## [1] "2"    "4"    "6"    "bob"  "fred" "1+0i" "foo"

Note that unlist() also unpacks nested vectors and lists as shown in the second example above.

4.3 Data frames

Along with vectors and lists, data frames are one of the core data structures when working in R. A data frame is essentially a list which represents a data table, where each column in the table has the same number of rows and every item in a column has to be of the same type. Unlike standard lists, the objects (columns) in a data frame must have names. We’ve seen data frames previously, for example when we loaded data sets using the read_csv function.

4.3.1 Creating a data frame

While data frames will often be created by reading in a data set from a file, they can also be created directly in the console as illustrated below:

age <- c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20)
sex <- rep(c("M","F"), 5)
wt.in.kg <- c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72)

df <- data.frame(age = age, sex = sex, wt = wt.in.kg)

Here we created a data frame with three columns, each of length 10.
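A quick way to confirm that the data frame was built as intended is to look at its column names and the type of each column. The sketch below repeats the construction so that it runs on its own; passing stringsAsFactors = FALSE is an addition made here so that the sex column is stored as character regardless of R version.

```r
age <- c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20)
sex <- rep(c("M", "F"), 5)
wt.in.kg <- c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72)
df <- data.frame(age = age, sex = sex, wt = wt.in.kg, stringsAsFactors = FALSE)

names(df)        # column names come from the argument names
## [1] "age" "sex" "wt"
typeof(df$age)   # each column is an ordinary vector with its own type
## [1] "double"
typeof(df$sex)
## [1] "character"
```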

4.3.2 Type and class for data frames

Data frames can be thought of as specialized lists, and in fact the type of a data frame is “list” as illustrated below:

typeof(df)
## [1] "list"

To distinguish a data frame from a generic list, we have to ask about its “class”.

class(df) # the class of our data frame
## [1] "data.frame"
class(l)  # compare to the class of our generic list
## [1] "list"
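Base R also provides predicate functions that test these properties directly. A small sketch with a two-row data frame made up for illustration:

```r
df <- data.frame(age = c(30, 26), wt = c(88, 76))
is.data.frame(df)          # df has class "data.frame"
## [1] TRUE
is.list(df)                # underneath, a data frame is still a list
## [1] TRUE
is.data.frame(list(1, 2))  # a plain list is not a data frame
## [1] FALSE
```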

The term “class” comes from a style/approach to programming called “object oriented programming”. We won’t go into explicit detail about how object oriented programming works in this class, though we will exploit many of the features of objects that have a particular class.

4.3.3 Length and dimension for data frames

Applying the length() function to a data frame returns the number of columns. This is consistent with the fact that data frames are specialized lists:

length(df)
## [1] 3

To get the dimensions (number of rows and columns) of a data frame, we use the dim() function. dim() returns a vector, whose first value is the number of rows and whose second value is the number of columns:

dim(df)
## [1] 10  3

We can get the number of rows and columns individually using the nrow() and ncol() functions:

nrow(df)  # number of rows
## [1] 10
ncol(df)  # number of columns
## [1] 3

4.3.4 Indexing and accessing data frames

Data frames can be indexed by either column index, column name, row number, or a combination of row and column numbers.

4.3.4.1 Single bracket indexing of the columns of a data frame

The single bracket operator with a single numeric index returns a data frame with the corresponding column.

df[1]  # get the first column (=age) of the data frame
## # A tibble: 10 × 1
##      age
##    <dbl>
##  1    30
##  2    26
##  3    21
##  4    29
##  5    25
##  6    22
##  7    28
##  8    24
##  9    23
## 10    20

The single bracket operator with multiple numeric indices returns a data frame with the corresponding columns.

df[1:2]  # first two columns
## # A tibble: 10 × 2
##      age sex  
##    <dbl> <chr>
##  1    30 M    
##  2    26 F    
##  3    21 M    
##  4    29 F    
##  5    25 M    
##  6    22 F    
##  7    28 M    
##  8    24 F    
##  9    23 M    
## 10    20 F
df[c(1, 3)]  # columns 1 (=age) and 3 (=wt)
## # A tibble: 10 × 2
##      age    wt
##    <dbl> <dbl>
##  1    30    88
##  2    26    76
##  3    21    67
##  4    29    66
##  5    25    56
##  6    22    74
##  7    28    71
##  8    24    60
##  9    23    52
## 10    20    72

Column names can be substituted for indices when using the single bracket operator:

df["age"]  
## # A tibble: 10 × 1
##      age
##    <dbl>
##  1    30
##  2    26
##  3    21
##  4    29
##  5    25
##  6    22
##  7    28
##  8    24
##  9    23
## 10    20

df[c("age", "wt")]
## # A tibble: 10 × 2
##      age    wt
##    <dbl> <dbl>
##  1    30    88
##  2    26    76
##  3    21    67
##  4    29    66
##  5    25    56
##  6    22    74
##  7    28    71
##  8    24    60
##  9    23    52
## 10    20    72

4.3.4.2 Single bracket indexing of the rows of a data frame

To get specific rows of a data frame, we use single bracket indexing with an additional comma following the index. For example, to get the first row of a data frame we would do:

df[1,]    # first row
## # A tibble: 1 × 3
##     age sex      wt
##   <dbl> <chr> <dbl>
## 1    30 M        88

This syntax extends to multiple rows:

df[1:2,]  # first two rows
## # A tibble: 2 × 3
##     age sex      wt
##   <dbl> <chr> <dbl>
## 1    30 M        88
## 2    26 F        76

df[c(1, 3, 5),]  # rows 1, 3 and 5
## # A tibble: 3 × 3
##     age sex      wt
##   <dbl> <chr> <dbl>
## 1    30 M        88
## 2    21 M        67
## 3    25 M        56

4.3.4.3 Single bracket indexing of both the rows and columns of a data frame

Single bracket indexing of data frames extends naturally to retrieve both rows and columns simultaneously:

df[1, 2]  # first row, second column
## [1] "M"
df[1:3, 2:3] # first three rows, columns 2 and 3
## # A tibble: 3 × 2
##   sex      wt
##   <chr> <dbl>
## 1 M        88
## 2 F        76
## 3 M        67

# you can even mix numerical indexing (rows) with named indexing of columns
df[5:10, c("age", "wt")]  
## # A tibble: 6 × 2
##     age    wt
##   <dbl> <dbl>
## 1    25    56
## 2    22    74
## 3    28    71
## 4    24    60
## 5    23    52
## 6    20    72

4.3.4.4 Double bracket and $ indexing of data frames

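Whereas single bracket indexing of a data frame usually returns a new data frame (an exception is [row, column] indexing that selects a single column, such as df[1, 2] above, which returns a vector), double bracket indexing and indexing using the $ operator return vectors.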

df[["age"]]
##  [1] 30 26 21 29 25 22 28 24 23 20
typeof(df[["age"]])
## [1] "double"

df$wt
##  [1] 88 76 67 66 56 74 71 60 52 72
typeof(df$wt)
## [1] "double"
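The difference between getting a vector and getting a data frame can be made explicit with the drop argument of single bracket indexing. The sketch below assumes a data frame created with base R's data.frame() (tibbles behave differently) and uses a small made-up example.

```r
df <- data.frame(age = c(30, 26, 21), wt = c(88, 76, 67))

df[, "age"]                       # one column via [row, column]: simplified to a vector
## [1] 30 26 21
class(df[, "age", drop = FALSE])  # drop = FALSE keeps the data frame
## [1] "data.frame"
class(df["age"])                  # list-style single brackets keep the data frame
## [1] "data.frame"
```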

4.3.5 Logical indexing of data frames

Logical indexing using boolean values works on data frames in much the same way it works on vectors. Typically, logical indexing of a data frame is used to filter the rows of a data frame.

For example, to get all the subjects in our example data frame who are older than 25 we could do:

# NOTE: the comma after 25 is important to ensure we're indexing rows!
df[df$age > 25, ] 
## # A tibble: 4 × 3
##     age sex      wt
##   <dbl> <chr> <dbl>
## 1    30 M        88
## 2    26 F        76
## 3    29 F        66
## 4    28 M        71

Similarly, to get all the individuals whose weight is between 60 and 70 kg we could do:

df[(df$wt >= 60 & df$wt <= 70),]
## # A tibble: 3 × 3
##     age sex      wt
##   <dbl> <chr> <dbl>
## 1    21 M        67
## 2    29 F        66
## 3    24 F        60
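Conditions on different columns can be combined in the same way. The following sketch rebuilds the example data frame so that it runs on its own, then keeps the rows for females older than 25; the name older.females is chosen here for illustration.

```r
df <- data.frame(age = c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20),
                 sex = rep(c("M", "F"), 5),
                 wt  = c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72),
                 stringsAsFactors = FALSE)

# rows where both conditions are TRUE; the trailing comma selects rows
older.females <- df[df$sex == "F" & df$age > 25, ]
nrow(older.females)
## [1] 2
older.females$wt
## [1] 76 66
```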

4.3.6 Adding columns to a data frame

Adding columns to a data frame is similar to adding items to a list. The easiest way to do so is using named indexing. For example, to add a new column to our data frame that gives the individuals’ ages in number of days, we could do:

df[["age.in.days"]] <- df$age * 365
dim(df)
## [1] 10  4
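Because data frames are specialized lists, the $ operator can be used for the same purpose, and assigning NULL to a column removes it. A minimal sketch with a small made-up data frame; the column name age.in.months is hypothetical.

```r
df <- data.frame(age = c(30, 26, 21), wt = c(88, 76, 67))

df$age.in.months <- df$age * 12   # `$` assignment adds a column
ncol(df)
## [1] 3
df$age.in.months <- NULL          # assigning NULL removes the column again
ncol(df)
## [1] 2
```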