Chapter 2 Getting Started with R

2.1 What is R?

R is a statistical computing environment and programming language. It is free, open source, and has a large and active community of developers and users. There are many different R packages (libraries) available for carrying out a wide variety of analyses, for everything from genome sequence data to geospatial information.

2.2 What is RStudio?

RStudio (http://www.rstudio.com/) is an open source integrated development environment (IDE) that provides a nicer graphical interface to R than does the default R GUI.

The figure below illustrates the RStudio interface in its default configuration. For the exercises below you’ll be primarily entering commands in the “console” window. We’ll review key parts of the RStudio interface in greater detail in class.

Figure 2.1: RStudio window with the panes labeled

2.3 Entering commands in the console

You can type commands directly in the console. When you hit Return (Enter) on your keyboard the text you typed is evaluated by the R interpreter. This means that the R program reads your commands, makes sure there are no syntax errors, and then carries out any commands that were specified.

Try evaluating the following arithmetic commands in the console:

10 + 5
10 - 5
10 / 5
10 * 5

If you type an incomplete command and then hit Return on your keyboard, the console will show a continuation line marked by a + symbol. For example, enter the incomplete statement (10 + 5 and then hit Return. You should see something like this:

> (10 + 5
+

The continuation line tells you that R is waiting for additional input before it evaluates what you typed. Either complete your command (e.g. type the closing parenthesis) and hit Return, or hit the “Esc” key to exit the continuation line without evaluating what you typed.

2.4 Comments

When working in the R console, or writing R code, the pound symbol (#) indicates the start of a comment. Anything after the #, up to the end of the current line, is ignored by the R interpreter.

# This line will be ignored
5 + 4 # the first part of this line, up to the #, will be evaluated

Throughout this course I will often include short explanatory comments in my code examples.

When I want to display the output generated by an R statement typed at the console I will generally use a display convention in which I prepend the results with the symbols ##.

5 + 4  # same as above but with output displayed
## [1] 9

2.5 Using R as a Calculator

The simplest way to use R is as a fancy calculator. Evaluate each of the following statements in the console.

10 + 2 # addition
10 - 2 # subtraction
10 * 2 # multiplication
10 / 2 # division
10 ^ 2 # exponentiation
10 ** 2 # alternate exponentiation
pi * 2.5^2 # R knows about some constants such as Pi
10 %% 3 # modulus operator -- gives remainder after division
10 %/% 3 # integer division
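
The modulus and integer division operators fit together: multiplying the whole-number quotient by the divisor and adding the remainder gives back the original number. A small sketch using the same numbers as above:

```r
10 %/% 3                      # whole-number quotient
## [1] 3
10 %% 3                       # remainder
## [1] 1
(10 %/% 3) * 3 + (10 %% 3)    # quotient * divisor + remainder recovers 10
## [1] 10
```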

Be aware that certain operators have precedence over others. For example, multiplication and division have higher precedence than addition and subtraction. Use parentheses to disambiguate potentially confusing statements.

(10 + 2)/4-5   # was the output what you expected?
(10 + 2)/(4-5) # compare the answer to the above
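
Exponentiation is another place where precedence can surprise you, because it is applied before both negation and multiplication. The following sketch shows how parentheses change the result:

```r
-2^2       # exponentiation happens before negation
## [1] -4
(-2)^2     # parentheses force the negation to happen first
## [1] 4
2 * 3^2    # exponentiation happens before multiplication
## [1] 18
(2 * 3)^2  # parentheses force the multiplication to happen first
## [1] 36
```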

Dividing a nonzero number by zero produces an object that represents infinity. Infinite values can be either positive or negative:

1/0 
## [1] Inf
-1/0
## [1] -Inf

Invalid calculations produce an object called NaN, which is short for “Not a Number”:

0/0  # invalid calculation
## [1] NaN
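
If you need to check whether a calculation produced one of these special values, base R provides the test functions is.finite(), is.infinite(), and is.nan(). These functions are not covered elsewhere in this chapter; the sketch below simply illustrates what each one reports:

```r
is.finite(1/0)     # Inf is not a finite number
## [1] FALSE
is.infinite(-1/0)  # -Inf counts as infinite
## [1] TRUE
is.nan(0/0)        # 0/0 is NaN
## [1] TRUE
is.nan(1/0)        # Inf is infinite, but it is not NaN
## [1] FALSE
```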

2.5.1 Common mathematical functions

Many commonly used mathematical functions are built into R. Here are some examples:

abs(-3)   # absolute value
## [1] 3
cos(pi/3) # cosine
## [1] 0.5
sin(pi/3) # sine
## [1] 0.8660254
log(10)   # natural logarithm
## [1] 2.302585
log10(10) # log base 10
## [1] 1
log2(10) # log base 2
## [1] 3.321928
exp(1)    # exponential function
## [1] 2.718282
sqrt(10)  # square root
## [1] 3.162278
10^0.5  # same as square root
## [1] 3.162278
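
The log() function also accepts an optional base argument, which is handy when you need a logarithm in a base other than e, 2, or 10. A brief sketch:

```r
log(8, base = 2)     # logarithm of 8 in base 2
## [1] 3
log(100, base = 10)  # same value as log10(100)
## [1] 2
log(exp(2))          # log() and exp() undo each other
## [1] 2
```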

2.6 Variable assignment

An important concept in all programming languages is that of “variable assignment”. Variable assignment is the act of creating labels that point to particular data values in a computer’s memory, which allows us to apply operations to the labels rather than directly to specific values. Variable assignment is an important mechanism for abstracting and generalizing computational operations.

Variable assignment in R is accomplished with the assignment operator, which is designated as <- (left arrow, constructed from a left angle bracket and the minus sign). This is illustrated below:

x <- 10  # assign the variable name 'x' the value 10
sin(x)   # apply the sin function to the value x points to
## [1] -0.5440211

x <- pi  # x now points to a different value
sin(x)   # the same function call now produces a different result 
## [1] 1.224647e-16
         # note that mathematically sin(pi) is 0, but R returns a floating
         # point value that is very close to, but not exactly, zero
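
One point worth seeing in action: a variable stores the value that was computed at the time of assignment, not the formula that produced it. In the sketch below (the names radius and area are chosen just for illustration), changing radius does not change area until area is assigned again:

```r
radius <- 2.5            # hypothetical radius of a circle
area <- pi * radius^2    # computed from the value radius points to right now
area
## [1] 19.63495

radius <- 5              # point radius at a new value ...
area                     # ... area still holds the previously computed value
## [1] 19.63495

area <- pi * radius^2    # recompute to pick up the new radius
area
## [1] 78.53982
```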

2.6.1 Valid variable names

As described in the R documentation, “A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ‘.2way’ are not valid, and neither are the reserved words.”

Here are some examples of valid and invalid variable names. Mentally evaluate these based on the definition above, and then evaluate them in the R interpreter to confirm your understanding:

x <- 10
x.prime <- 10
x_prime <- 10
my.long.variable.name <- 10
another_long_variable_name <- 10
_x <- 10
.x <- 10
2.x <- 2 * x
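
If you are unsure about a name, one way to check it without triggering a syntax error is the base R function make.names(), which converts a string into a syntactically valid name. A name that is already valid comes back unchanged, so comparing the input with the output tells you whether it was valid. This is only a sketch of one possible check:

```r
make.names("x_prime") == "x_prime"  # unchanged, so the name is valid
## [1] TRUE
make.names(".x") == ".x"            # a leading dot followed by a letter is fine
## [1] TRUE
make.names("_x") == "_x"            # altered, so the name is not valid
## [1] FALSE
make.names("2.x") == "2.x"          # names cannot start with a number
## [1] FALSE
```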

2.7 Data types

The phrase “data types” refers to the representations of information that a programming language provides. In R, there are three core data types representing numbers, logical values, and strings. You can use the function typeof() to get information about an object’s type in R.

2.7.1 Numeric data types

There are three standard types of numbers in R.

  1. “double” – this is the default numeric data type, and is used to represent both real numbers and whole numbers (unless you explicitly ask for integers, see below). “double” is short for “double precision floating point value”. All of the previous computations you’ve seen up until this point used data of type double.

    typeof(10.0)  # real number
    ## [1] "double"
    typeof(10)  # whole numbers default to doubles
    ## [1] "double"
  2. “integer” – when your numeric data involves only whole numbers, you can get slightly better performance using the integer data type. You must explicitly ask for numbers to be treated as integers.

    typeof(as.integer(10))  # now treated as an integer
    ## [1] "integer"
  3. “complex” – R has a built-in data type to represent complex numbers – numbers with a “real” and “imaginary” component. We won’t encounter the use of complex numbers in this course, but they do have many important uses in mathematics and engineering and also have some interesting applications in biology.

    typeof(1 + 0i)
    ## [1] "complex"
    sqrt(-1)      # sqrt of -1, using doubles
    ## [1] NaN
    sqrt(-1 + 0i) # sqrt of -1, using complex numbers
    ## [1] 0+1i
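
Besides as.integer(), R lets you request an integer directly by adding an L suffix to a whole number. The sketch below also shows that arithmetic on integers does not always stay in the integer type:

```r
typeof(10L)      # the L suffix asks for an integer directly
## [1] "integer"
typeof(5L * 2L)  # multiplying two integers gives an integer
## [1] "integer"
typeof(5L / 2L)  # but division always returns a double
## [1] "double"
5L / 2L
## [1] 2.5
```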

2.7.2 Logical values

When we compare values to each other, our calculations no longer return “doubles” but rather TRUE and FALSE values. This is illustrated below:

10 < 9  # is 10 less than 9?
## [1] FALSE
10 > 9  # is 10 greater than 9?
## [1] TRUE
10 <= (5 * 2) # less than or equal to?
## [1] TRUE
10 >= pi # greater than or equal to?
## [1] TRUE
10 == 10 # equals?
## [1] TRUE
10 != 10 # does not equal?
## [1] FALSE

TRUE and FALSE objects are of “logical” data type (known as “Booleans” in many other languages, after the mathematician George Boole).

typeof(TRUE)
typeof(FALSE)
x <- FALSE
typeof(x)  # x points to a logical
x <- 1
typeof(x)  # the variable x no longer points to a logical

When working with numerical data, tests of equality can be tricky. For example, consider the following two comparisons:

10 == (sqrt(10)^2) # Surprised by the result? See below.
4 == (sqrt(4)^2) # Even more confused?

Mathematically we know that both \((\sqrt{10})^2 = 10\) and \((\sqrt{4})^2 = 4\) are true statements. Why does R tell us the first statement is false? What we’re running into here are the limits of computer precision. A computer can’t represent \(\sqrt{10}\) exactly, whereas \(\sqrt{4}\) can be exactly represented. Precision in numerical computing is a complex subject and a detailed discussion is beyond the scope of this course. However, it’s important to be aware of this limitation (it applies to any programming language that uses floating point arithmetic, not just R).
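
You can see the size of the discrepancy by subtracting the expected value, or by asking print() to show more digits than the default. This is just an illustrative sketch; the exact digits you see may depend on your platform:

```r
sqrt(10)^2 - 10                 # a tiny, but nonzero, difference
sqrt(4)^2 - 4                   # exactly zero
## [1] 0
print(sqrt(10)^2, digits = 17)  # show more digits than are printed by default
```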

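To test “near equality” R provides a function called all.equal(). This function takes two inputs – the numerical values to be compared – and returns TRUE if their values are equal up to a certain level of tolerance (by default about 1.5e-8, a value derived from the numerical precision of your computer). If the values differ by more than the tolerance, all.equal() returns a text description of the difference rather than FALSE.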

all.equal(10, sqrt(10)^2)
## [1] TRUE

Here’s another example where the simple equality operator returns an unexpected result, but all.equal() produces the comparison we’re likely after.

sin(pi) == 0  
## [1] FALSE
all.equal(sin(pi), 0)  
## [1] TRUE
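
Because all.equal() reports a description of the difference, rather than FALSE, when two values are not nearly equal, it is common to wrap it in isTRUE() whenever you need a plain TRUE or FALSE answer. A short sketch of that idiom:

```r
all.equal(1, 1.1)                  # a description of the difference, not FALSE
## [1] "Mean relative difference: 0.1"
isTRUE(all.equal(1, 1.1))          # wrapping in isTRUE() gives a plain logical
## [1] FALSE
0.1 + 0.2 == 0.3                   # exact comparison fails due to rounding
## [1] FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))  # near-equality comparison succeeds
## [1] TRUE
```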

2.7.2.1 Logical operators

Logical values support Boolean operations, like logical negation (“not”), “and”, “or”, “xor”, etc. This is illustrated below:

!TRUE  # logical negation -- reads as "not x"
## [1] FALSE
TRUE & FALSE # AND: are x and y both TRUE?
## [1] FALSE
TRUE | FALSE # OR: are either x or y TRUE?
## [1] TRUE
xor(TRUE,FALSE)  # XOR: is either x or y TRUE, but not both?
## [1] TRUE
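
In practice, logical operators are most often used to combine comparisons. The sketch below uses an arbitrary value of 7 to test whether a number falls inside or outside a range:

```r
x <- 7
x > 0 & x < 10     # is x between 0 and 10?
## [1] TRUE
x < 0 | x > 10     # is x outside that range?
## [1] FALSE
!(x > 0 & x < 10)  # negating the first test asks the same question
## [1] FALSE
```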

The function isTRUE can be useful for evaluating the state of a variable:

x <- sample(1:10, 1) # sample a random number in the range 1 to 10
isTRUE(x > 5)  # was the random number picked greater than 5?
## [1] FALSE

2.7.3 Character strings

Character strings (“character”) represent single textual characters or longer sequences of characters. They are created by enclosing text in either single or double quotes.

typeof("abc")  # double quotes 
## [1] "character"
typeof('abc')  # single quotes
## [1] "character"

Character strings have a length, which can be found using the nchar function:

first.name <- "jasmine"
nchar(first.name)
## [1] 7

last.name <- 'smith'
nchar(last.name)
## [1] 5

There are a number of built-in functions for manipulating character strings. Here are some of the most common ones.

2.7.3.1 Joining strings

The paste() function joins two or more character strings together:

paste(first.name, last.name)  # join two strings
## [1] "jasmine smith"
paste("abc", "def")
## [1] "abc def"

Notice that paste() adds a space between the strings. If we don’t want the space, we can call the paste() function with an optional argument called sep (short for separator), which specifies the character(s) that are inserted between the joined strings.

paste("abc", "def", sep = "")  # join with no space; "" is an empty string
## [1] "abcdef"
paste0("abc", "def") # paste0() is equivalent to paste() with sep = ""
## [1] "abcdef"
paste("abc", "def", sep = "|") # join with a vertical bar
## [1] "abc|def"
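
paste() is not limited to two inputs, and it converts numbers to strings automatically, which is convenient for building labels. The labels below are made up for illustration:

```r
paste("a", "b", "c", sep = "-")  # more than two inputs
## [1] "a-b-c"
paste("sample", 1, sep = "_")    # the number 1 is converted to the string "1"
## [1] "sample_1"
paste0("chr", 21)                # paste0() works the same way, with no separator
## [1] "chr21"
```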

2.7.3.2 Splitting strings

The strsplit() function allows us to split a character string into substrings according to matches to a specified split string (see ?strsplit for details). For example, we could break a sentence into its constituent words as follows:

sentence <- "Call me Ishmael."
words <- strsplit(sentence, " ")  # split on space
words
## [[1]]
## [1] "Call"     "me"       "Ishmael."

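Notice that strsplit() is roughly the reverse of paste(): paste() joins strings around a separator, while strsplit() breaks a string apart at each occurrence of one. Unlike paste(), strsplit() returns a list, which is why the output above begins with [[1]].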
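
To work with the individual pieces you need to reach into the list that strsplit() returns. Lists and indexing are not covered in this chapter, so treat the following as a preview sketch rather than something you need to master now:

```r
sentence <- "Call me Ishmael."
words <- strsplit(sentence, " ")
typeof(words)                      # strsplit() returns a list ...
## [1] "list"
words[[1]]                         # ... whose first element holds the pieces
## [1] "Call"     "me"       "Ishmael."
words[[1]][2]                      # the second word
## [1] "me"
paste(words[[1]], collapse = " ")  # collapse glues the pieces back together
## [1] "Call me Ishmael."
```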

2.7.3.3 Substrings

The substr() function allows us to extract a substring from a character object by specifying the first and last positions (indices) to use in the extraction:

substr("abcdef", 2, 5)  # get substring from characters 2 to 5
## [1] "bcde"
substr(first.name, 1, 3) # get substring from characters 1 to 3
## [1] "jas"
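
substr() combines naturally with nchar() when you want characters counted from the end of a string. A small sketch, using the variable name n purely for illustration:

```r
first.name <- "jasmine"
n <- nchar(first.name)        # total number of characters
substr(first.name, n - 2, n)  # the last three characters
## [1] "ine"
substr(first.name, 5, 100)    # a stop position past the end is simply truncated
## [1] "ine"
```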

2.8 Packages

Packages are libraries of R functions and data that provide additional capabilities and tools beyond the standard library of functions included with R. Hundreds of people around the world have developed packages for R that provide functions and related data structures for conducting many different types of analyses.

Throughout this course you’ll need to install a variety of packages. Here I show the basic procedure for installing new packages from the console as well as from the RStudio interface.

2.8.1 Installing packages from the console

The function install.packages() provides a quick and convenient way to install packages from the R console.

2.8.2 Install the tidyverse package

To illustrate the use of install.packages(), we’ll install a collection of packages (a “meta-package”) called the tidyverse. Here’s how to install the tidyverse meta-package from the R console:

install.packages("tidyverse", dependencies = TRUE)

The first argument to install.packages() gives the name of the package we want to install. The second argument, dependencies = TRUE, tells R to also install any additional packages that tidyverse depends on.
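
Installing a package can take a while, so you may want to skip the installation when the package is already present. One possible approach, sketched below, uses the base R function requireNamespace(), which returns TRUE when a package is available. The variable names pkg and already_installed are my own, and I use the “stats” package as a placeholder because it ships with R, so running this sketch installs nothing:

```r
pkg <- "stats"  # placeholder package name; 'stats' ships with R

# requireNamespace() returns TRUE if the package can be loaded
already_installed <- requireNamespace(pkg, quietly = TRUE)

if (!already_installed) {
  install.packages(pkg)  # only runs when the package is missing
}
```

To apply the same check to the tidyverse, you would set pkg to "tidyverse" instead.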

2.8.3 Installing packages from the RStudio dialog

You can also install packages using a graphical dialog provided by RStudio. To do so, pick the Packages tab in RStudio, and then click the Install button.

Figure 2.2: The Packages tab in RStudio

In the packages entry box you can type the name of the package you wish to install. Let’s install another useful package called “stringr”. Type the package name in the “Packages” field, make sure the “Install dependencies” check box is checked, and then press the “Install” button.

Figure 2.3: Package Install Dialog

2.8.4 Loading packages with the library() function

Once a package is installed on your computer, the package can be loaded into your R session using the library() function. To ensure our previous install commands worked correctly, let’s load one of the packages we just installed.

library(tidyverse)

Since the tidyverse package is a “meta-package”, it provides some additional info about the sub-packages that got loaded.

When you load tidyverse, you will also see a message about “Conflicts”, because two functions provided by the dplyr package (a sub-package in tidyverse), filter and lag, share their names with functions in the “stats” package, which is usually loaded automatically when you start R. The stats versions of filter and lag are used in time series analysis; the dplyr functions are more generally useful. If you need the masked functions you can still access them by prefacing the function name with the name of the package (e.g. stats::filter).

We will use the “tidyverse” package for almost every class session and assignment in this class. Get in the habit of including the library(tidyverse) statement in all of your R documents.
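
The package::function form works whether or not the tidyverse is loaded, so it is a safe way to be explicit about which version of a function you mean. As a sketch, here is the stats version of filter computing a three-point moving average over the numbers 1 to 5 (the variable name smoothed is my own):

```r
# Call the stats version explicitly; this works with or without dplyr loaded
smoothed <- stats::filter(c(1, 2, 3, 4, 5), rep(1/3, 3))  # 3-point moving average
as.numeric(smoothed)  # the ends are NA because they lack a neighbor on one side
## [1] NA  2  3  4 NA
```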

2.9 The R Help System

R comes with fairly extensive documentation and a simple help system. You can access HTML versions of the R documentation under the Help tab in RStudio. The HTML documentation also includes information on any packages you’ve installed. Take a few minutes to browse through the R HTML documentation. In addition to the HTML documentation there is also a search box where you can enter a term to search on (see red arrow in figure below).

Figure 2.4: The RStudio Help tab

2.9.1 Getting help from the console

In addition to getting help from the RStudio help tab, you can directly search for help from the console. The help system can be invoked using the help function or the ? operator.

help("log")
?log

If you are using RStudio, the help results will appear in the “Help” tab of the Files/Plots/Packages/Help/Viewer pane (lower right window by default).

What if you don’t know the name of the function you want? You can use the help.search() function.

help.search("log")

In this case help.search("log") returns the help pages whose names, titles, or concepts match the string log. For more on help.search() type ?help.search.

Other useful help-related functions include apropos(), example(), and vignette(). apropos() returns the names of all objects (including variables and functions) in the current session that match the input string.

apropos("log")
##  [1] "as.data.frame.logical" "as.logical"            "as.logical.factor"    
##  [4] "dlogis"                "is.logical"            "log"                  
##  [7] "log10"                 "log1p"                 "log2"                 
## [10] "logb"                  "Logic"                 "logical"              
## [13] "logLik"                "loglin"                "plogis"               
## [16] "qlogis"                "rlogis"                "SSlogis"
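
The input to apropos() is treated as a regular expression, and matching ignores case by default, which is why names like "Logic" and "dlogis" appear above. If you only want names that begin with lowercase "log", one option is sketched below. The variable name log_names is my own, and the exact results depend on which packages are loaded in your session, so no output is shown:

```r
# "^" anchors the match to the start of the name;
# ignore.case = FALSE makes the match case sensitive
log_names <- apropos("^log", ignore.case = FALSE)
log_names
```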

example() provides examples of how a function is used.

example(log)
## 
## log> log(exp(3))
## [1] 3
## 
## log> log10(1e7) # = 7
## [1] 7
## 
## log> x <- 10^-(1+2*1:9)
## 
## log> cbind(x, log(1+x), log1p(x), exp(x)-1, expm1(x))
##           x                                                    
##  [1,] 1e-03 9.995003e-04 9.995003e-04 1.000500e-03 1.000500e-03
##  [2,] 1e-05 9.999950e-06 9.999950e-06 1.000005e-05 1.000005e-05
##  [3,] 1e-07 1.000000e-07 1.000000e-07 1.000000e-07 1.000000e-07
##  [4,] 1e-09 1.000000e-09 1.000000e-09 1.000000e-09 1.000000e-09
##  [5,] 1e-11 1.000000e-11 1.000000e-11 1.000000e-11 1.000000e-11
##  [6,] 1e-13 9.992007e-14 1.000000e-13 9.992007e-14 1.000000e-13
##  [7,] 1e-15 1.110223e-15 1.000000e-15 1.110223e-15 1.000000e-15
##  [8,] 1e-17 0.000000e+00 1.000000e-17 0.000000e+00 1.000000e-17
##  [9,] 1e-19 0.000000e+00 1.000000e-19 0.000000e+00 1.000000e-19

The vignette() function gives longer, more detailed documentation about packages. Not all packages include vignettes, but for those that do it’s usually a good place to get started. For example, the stringr package (which we installed above) includes a vignette. To read its vignette, type the following at the console:

vignette("stringr")