Chapter 2 Getting Started with R
2.1 What is R?
R is a statistical computing environment and programming language. It is free, open source, and has a large and active community of developers and users. There are many different R packages (libraries) available for carrying out a wide variety of analyses, for everything from genome sequence data to geospatial information.
2.2 What is RStudio?
RStudio (http://www.rstudio.com/) is an open source integrated development environment (IDE) that provides a nicer graphical interface to R than does the default R GUI.
The figure below illustrates the RStudio interface in its default configuration. For the exercises below you’ll be primarily entering commands in the “console” window. We’ll review key parts of the RStudio interface in greater detail in class.
2.3 Entering commands in the console
You can type commands directly in the console. When you hit Return (Enter) on your keyboard the text you typed is evaluated by the R interpreter. This means that the R program reads your commands, makes sure there are no syntax errors, and then carries out any commands that were specified.
Try evaluating the following arithmetic commands in the console:
10 + 5
10 - 5
10 / 5
10 * 5
If you type an incomplete command and then hit Return on your keyboard, the console will show a continuation line marked by a + symbol. For example, enter the incomplete statement (10 + 5 and then hit Enter. You should see something like this:
> (10 + 5
+
The continuation line tells you that R is waiting for additional input before it evaluates what you typed. Either complete your command (e.g. type the closing parenthesis) and hit Return, or hit the “Esc” key to exit the continuation line without evaluating what you typed.
2.5 Using R as a Calculator
The simplest way to use R is as a fancy calculator. Evaluate each of the following statements in the console.
10 + 2 # addition
10 - 2 # subtraction
10 * 2 # multiplication
10 / 2 # division
10 ^ 2 # exponentiation
10 ** 2 # alternate exponentiation
pi * 2.5^2 # R knows about some constants such as Pi
10 %% 3 # modulus operator -- gives remainder after division
10 %/% 3 # integer division
Be aware that certain operators have precedence over others. For example multiplication and division have higher precedence than addition and subtraction. Use parentheses to disambiguate potentially confusing statements.
(10 + 2)/4-5 # was the output what you expected?
(10 + 2)/(4-5) # compare the answer to the above
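To make the precedence rules concrete, here is a short sketch that evaluates the same numbers with different groupings. The specific expressions are my own illustrations; the last two lines show a case that often surprises people, namely that exponentiation is applied before the unary minus.

```r
10 + 2 / 4 - 5       # division happens first: 10 + 0.5 - 5
## [1] 5.5
(10 + 2) / 4 - 5     # parentheses force the addition first: 3 - 5
## [1] -2
(10 + 2) / (4 - 5)   # both groups are evaluated before dividing: 12 / -1
## [1] -12

-2^2                 # read as -(2^2), because ^ binds more tightly than the minus
## [1] -4
(-2)^2               # parentheses make the intent explicit
## [1] 4
```

When in doubt, adding parentheses costs nothing and makes the intended order obvious to anyone reading your code.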
Dividing a nonzero number by zero produces an object that represents infinity. Infinite values can be either positive or negative:
1/0
## [1] Inf
-1/0
## [1] -Inf
Invalid calculations produce an object called NaN, which is short for “Not a Number”:
0/0 # invalid calculation
## [1] NaN
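If you need to check whether a calculation produced one of these special values, base R provides the functions is.finite(), is.infinite(), and is.nan(). The following is a minimal sketch using the same calculations as above:

```r
is.infinite(1/0)    # is the result infinite?
## [1] TRUE
is.finite(-1/0)     # -Inf is not a finite number
## [1] FALSE
is.nan(0/0)         # is the result "Not a Number"?
## [1] TRUE
is.nan(1/0)         # Inf is not the same thing as NaN
## [1] FALSE
Inf - Inf           # arithmetic on infinities can itself be invalid
## [1] NaN
```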
2.5.1 Common mathematical functions
Many commonly used mathematical functions are built into R. Here are some examples:
abs(-3) # absolute value
## [1] 3
cos(pi/3) # cosine
## [1] 0.5
sin(pi/3) # sine
## [1] 0.8660254
log(10) # natural logarithm
## [1] 2.302585
log10(10) # log base 10
## [1] 1
log2(10) # log base 2
## [1] 3.321928
exp(1) # exponential function
## [1] 2.718282
sqrt(10) # square root
## [1] 3.162278
10^0.5 # same as square root
## [1] 3.162278
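A few more built-in functions are worth knowing about. The sketch below, using values I picked for illustration, shows that log() accepts an optional base argument, that log() and exp() undo each other, and how to round results.

```r
log(8, base = 2)      # logarithm with an explicitly chosen base
## [1] 3
log(exp(2))           # log() is the inverse of exp()
## [1] 2
round(sqrt(10), 2)    # round to two decimal places
## [1] 3.16
floor(3.7)            # round down to the nearest whole number
## [1] 3
ceiling(3.2)          # round up to the nearest whole number
## [1] 4
```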
2.6 Variable assignment
An important concept in all programming languages is that of “variable assignment”. Variable assignment is the act of creating labels that point to particular data values in a computer’s memory, which allows us to apply operations to the labels rather than directly to specific values. Variable assignment is an important mechanism for abstracting and generalizing computational operations.
Variable assignment in R is accomplished with the assignment operator, which is designated as <- (left arrow, constructed from a left angle bracket and the minus sign). This is illustrated below:
x <- 10 # assign the variable name 'x' the value 10
sin(x) # apply the sin function to the value x points to
## [1] -0.5440211
x <- pi # x now points to a different value
sin(x) # the same function call now produces a different result
## [1] 1.224647e-16
# note that mathematically sin(pi) is exactly 0, but R returns a
# floating point value that is very close to, but not exactly, zero
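Here is a small sketch of why variables are useful for generalizing a calculation. The variable names radius and area are ones I made up for illustration. Note that a variable holds the value it was given at the time of assignment: changing radius later does not automatically update area.

```r
radius <- 2.5            # hypothetical variable holding a circle's radius
area <- pi * radius^2    # compute the area from whatever radius points to
area
## [1] 19.63495

radius <- 5              # radius now points to a new value...
area                     # ...but area still holds the previously computed value
## [1] 19.63495

area <- pi * radius^2    # re-evaluate the expression to update area
area
## [1] 78.53982
```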
2.6.1 Valid variable names
As described in the R documentation, “A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ‘.2way’ are not valid, and neither are the reserved words.”
Here are some examples of valid and invalid variable names. Mentally evaluate these based on the definition above, and then evaluate them in the R interpreter to confirm your understanding:
x <- 10
x.prime <- 10
x_prime <- 10
my.long.variable.name <- 10
another_long_variable_name <- 10
_x <- 10
.x <- 10
2.x <- 2 * x
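One way to check a name without triggering a syntax error is the base R function make.names(), which converts any string into a syntactically valid name; a string that comes back unchanged was already valid. The helper is_valid_name() below is a hypothetical convenience function of my own, not something built into R.

```r
# is_valid_name() is a hypothetical helper: a name is valid if
# make.names() does not need to change it
is_valid_name <- function(name) make.names(name) == name

is_valid_name("x.prime")
## [1] TRUE
is_valid_name("_x")      # names cannot start with an underscore
## [1] FALSE
is_valid_name(".x")      # a leading dot is fine if not followed by a number
## [1] TRUE
is_valid_name("2.x")     # names cannot start with a number
## [1] FALSE
is_valid_name("if")      # reserved words are not valid names
## [1] FALSE
```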
2.7 Data types
The phrase “data types” refers to the representations of information that a programming language provides. In R, there are three core data types representing numbers, logical values, and strings. You can use the function typeof() to get information about an object’s type in R.
2.7.1 Numeric data types
There are three standard types of numbers in R.
“double” – this is the default numeric data type, and is used to represent both real numbers and whole numbers (unless you explicitly ask for integers, see below). “double” is short for “double precision floating point value”. All of the previous computations you’ve seen up until this point used data of type double.
typeof(10.0) # real number
## [1] "double"
typeof(10) # whole numbers default to doubles
## [1] "double"
“integer” – when your numeric data involves only whole numbers, you can get slightly better performance using the integer data type. You must explicitly ask for numbers to be treated as integers.
typeof(as.integer(10)) # now treated as an integer
## [1] "integer"
“complex” – R has a built-in data type to represent complex numbers – numbers with a “real” and “imaginary” component. We won’t encounter the use of complex numbers in this course, but they do have many important uses in mathematics and engineering and also have some interesting applications in biology.
typeof(1 + 0i)
## [1] "complex"
sqrt(-1) # sqrt of -1, using doubles
## [1] NaN
sqrt(-1 + 0i) # sqrt of -1, using complex numbers
## [1] 0+1i
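Besides as.integer(), R lets you write an integer directly by adding the suffix L to a whole number. The sketch below, with values chosen for illustration, also shows how the type of a result depends on the operation: ordinary division always produces a double, even when both inputs are integers.

```r
typeof(10L)          # the L suffix requests an integer
## [1] "integer"
typeof(5L + 2L)      # integer plus integer stays an integer
## [1] "integer"
typeof(5L / 2L)      # ordinary division always returns a double
## [1] "double"
typeof(5L %/% 2L)    # integer division of integers returns an integer
## [1] "integer"
typeof(5L * 2.0)     # mixing an integer with a double gives a double
## [1] "double"
```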
2.7.2 Logical values
When we compare values to each other, our calculations no longer return “doubles” but rather TRUE and FALSE values. This is illustrated below:
10 < 9 # is 10 less than 9?
## [1] FALSE
10 > 9 # is 10 greater than 9?
## [1] TRUE
10 <= (5 * 2) # less than or equal to?
## [1] TRUE
10 >= pi # greater than or equal to?
## [1] TRUE
10 == 10 # equals?
## [1] TRUE
10 != 10 # does not equal?
## [1] FALSE
TRUE and FALSE objects are of “logical” data type (known as “Booleans” in many other languages, after the mathematician George Boole).
typeof(TRUE)
typeof(FALSE)
x <- FALSE
typeof(x) # x points to a logical
x <- 1
typeof(x) # the variable x no longer points to a logical
When working with numerical data, tests of equality can be tricky. For example, consider the following two comparisons:
10 == (sqrt(10)^2) # Surprised by the result? See below.
4 == (sqrt(4)^2) # Even more confused?
Mathematically we know that both \((\sqrt{10})^2 = 10\) and \((\sqrt{4})^2 = 4\) are true statements. Why does R tell us the first statement is false? What we’re running into here are the limits of computer precision. A computer can’t represent \(\sqrt{10}\) exactly, whereas \(\sqrt{4}\) can be exactly represented. Precision in numerical computing is a complex subject and a detailed discussion is beyond the scope of this course. However, it’s important to be aware of this limitation (this limitation is true of any programming language, not just R).
To test “near equality” R provides a function called all.equal(). This function takes two inputs (the numerical values to be compared) and returns TRUE if their values are equal up to a certain level of tolerance (by default about 1.5e-8, a value derived from the built-in numerical precision of your computer).
all.equal(10, sqrt(10)^2)
## [1] TRUE
Here’s another example where the simple equality operator returns an unexpected result, but all.equal() produces the comparison we’re likely after.
sin(pi) == 0
## [1] FALSE
all.equal(sin(pi), 0)
## [1] TRUE
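One detail to be careful about: when the two values are not nearly equal, all.equal() does not return FALSE. Instead it returns a character string describing the difference. If you need a strict TRUE or FALSE answer, a common idiom is to wrap the call in isTRUE(). The sketch below uses numbers I chose for illustration; the tolerance argument lets you loosen or tighten the comparison.

```r
0.1 + 0.2 == 0.3                      # exact comparison fails due to rounding
## [1] FALSE
all.equal(0.1 + 0.2, 0.3)             # near equality succeeds
## [1] TRUE

all.equal(1, 1.1)                     # not nearly equal: returns a description
## [1] "Mean relative difference: 0.1"
isTRUE(all.equal(1, 1.1))             # wrap in isTRUE() to get a logical value
## [1] FALSE

all.equal(1, 1.1, tolerance = 0.2)    # a looser tolerance accepts the difference
## [1] TRUE
```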
2.7.2.1 Logical operators
Logical values support Boolean operations, like logical negation (“not”), “and”, “or”, “xor”, etc. This is illustrated below:
!TRUE # logical negation -- reads as "not x"
## [1] FALSE
TRUE & FALSE # AND: are x and y both TRUE?
## [1] FALSE
TRUE | FALSE # OR: are either x or y TRUE?
## [1] TRUE
xor(TRUE,FALSE) # XOR: is either x or y TRUE, but not both?
## [1] TRUE
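In practice, logical operators are most often used to combine comparisons. Here is a brief sketch; the variable x and its value are just assumptions for the sake of the example.

```r
x <- 7                 # an illustrative value

x > 5 & x < 10         # is x between 5 and 10?
## [1] TRUE
x < 5 | x > 10         # is x outside the range 5 to 10?
## [1] FALSE
!(x > 5)               # negate a comparison
## [1] FALSE
xor(x > 5, x > 6)      # both comparisons are TRUE, so xor is FALSE
## [1] FALSE
```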
The function isTRUE can be useful for evaluating the state of a variable:
x <- sample(1:10, 1) # sample a random number in the range 1 to 10
isTRUE(x > 5) # was the random number picked greater than 5?
## [1] FALSE
2.7.3 Character strings
Character strings (“character”) represent single textual characters or a longer sequence of characters. They are created by enclosing the text in either single or double quotes.
typeof("abc") # double quotes
## [1] "character"
typeof('abc') # single quotes
## [1] "character"
Character strings have a length, which can be found using the nchar function:
first.name <- "jasmine"
nchar(first.name)
## [1] 7
last.name <- 'smith'
nchar(last.name)
## [1] 5
There are a number of built-in functions for manipulating character strings. Here are some of the most common ones.
2.7.3.1 Joining strings
The paste() function joins two or more character strings together:
paste(first.name, last.name) # join two strings
## [1] "jasmine smith"
paste("abc", "def")
## [1] "abc def"
Notice that paste() adds a space between the strings. If we don’t want the space, we can call the paste() function with an optional argument called sep (short for separator), which specifies the character(s) that are inserted between the joined strings.
paste("abc", "def", sep = "") # join with no space; "" is an empty string
## [1] "abcdef"
paste0("abc", "def") # paste0() is equivalent to paste() with sep = ""
## [1] "abcdef"
paste("abc", "def", sep = "|") # join with a vertical bar
## [1] "abc|def"
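paste() is not limited to two strings, and it will convert numbers to text for you. The following sketch shows a few combinations; the sample and chromosome labels are made-up examples.

```r
paste("a", "b", "c", sep = "-")    # any number of strings can be joined
## [1] "a-b-c"
paste("Sample", 1, sep = "_")      # numbers are converted to character strings
## [1] "Sample_1"
paste0("chr", 21)                  # handy for building labels
## [1] "chr21"
nchar(paste("abc", "def"))         # the inserted space counts toward the length
## [1] 7
```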
2.7.3.2 Splitting strings
The strsplit() function allows us to split a character string into substrings according to matches to a specified split string (see ?strsplit for details). For example, we could break a sentence into its constituent words as follows:
sentence <- "Call me Ishmael."
words <- strsplit(sentence, " ") # split on space
words
## [[1]]
## [1] "Call" "me" "Ishmael."
Notice that strsplit() is the reverse of paste().
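The [[1]] in the output above is a hint that strsplit() returns a list rather than a plain set of strings. As a rough sketch of how to work with that result, the code below pulls out the words using [[1]] and then reverses the split using paste() with its collapse argument. We haven't covered lists or indexing yet, so treat this as a preview.

```r
sentence <- "Call me Ishmael."
typeof(strsplit(sentence, " "))          # the result is a list
## [1] "list"

words <- strsplit(sentence, " ")[[1]]    # extract the words from the list
length(words)                            # how many words did we get?
## [1] 3
words[3]                                 # the third word
## [1] "Ishmael."

rejoined <- paste(words, collapse = " ") # join the words back into one string
rejoined == sentence
## [1] TRUE
```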
2.7.3.3 Substrings
The substr() function allows us to extract a substring from a character object by specifying the first and last positions (indices) to use in the extraction:
substr("abcdef", 2, 5) # get substring from characters 2 to 5
## [1] "bcde"
substr(first.name, 1, 3) # get substring from characters 1 to 3
## [1] "jas"
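Combining substr() with nchar() lets you count positions from the end of a string. The sketch below also shows one possible way to capitalize a name using toupper() and paste0(); the variable n is just a convenience I introduced.

```r
first.name <- "jasmine"
n <- nchar(first.name)            # total number of characters

substr(first.name, n - 2, n)      # the last three characters
## [1] "ine"

# capitalize the first letter, then re-attach the rest of the name
paste0(toupper(substr(first.name, 1, 1)), substr(first.name, 2, n))
## [1] "Jasmine"
```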
2.8 Packages
Packages are libraries of R functions and data that provide additional capabilities and tools beyond the standard library of functions included with R. Hundreds of people around the world have developed packages for R that provide functions and related data structures for conducting many different types of analyses.
Throughout this course you’ll need to install a variety of packages. Here I show the basic procedure for installing new packages from the console as well as from the RStudio interface.
2.8.1 Installing packages from the console
The function install.packages() provides a quick and convenient way to install packages from the R console.
2.8.2 Install the tidyverse package
To illustrate the use of install.packages(), we’ll install a collection of packages (a “meta-package”) called the tidyverse. Here’s how to install the tidyverse meta-package from the R console:
install.packages("tidyverse", dependencies = TRUE)
The first argument to install.packages gives the name of the package we want to install. The second argument, dependencies = TRUE, tells R to install any additional packages that tidyverse depends on.
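Installing a package can take a while, so you may not want to reinstall it every time you run your code. One possible approach, sketched below, is to check whether the package is already available using the base R function requireNamespace() and install it only if it is missing. The function install_if_missing() is a hypothetical helper that I wrote for this example; it is not part of R.

```r
# install_if_missing() is a hypothetical helper, not a built-in R function
install_if_missing <- function(pkg) {
  # requireNamespace() returns FALSE if the package cannot be found
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  # report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call does not install anything
install_if_missing("stats")

# for this course you would call it like this instead:
# install_if_missing("tidyverse")
```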
2.8.3 Installing packages from the RStudio dialog
You can also install packages using a graphical dialog provided by RStudio. To do so pick the Packages tab in RStudio, and then click the Install button.
In the packages entry box you can type the name of the package you wish to install. Let’s install another useful package called “stringr”. Type the package name in the “Packages” field, make sure the “Install dependencies” check box is checked, and then press the “Install” button.
2.8.4 Loading packages with the library() function
Once a package is installed on your computer, the package can be loaded into your R session using the library function. To ensure our previous install commands worked correctly, let’s load one of the packages we just installed.
library(tidyverse)
Since the tidyverse package is a “meta-package” it provides some additional info about the sub-packages that got loaded.
When you load tidyverse, you will also see a message about “Conflicts”, because two functions provided by the dplyr package (a sub-package in tidyverse) have the same names as functions provided by the “stats” package, which is usually loaded automatically when you start R. The conflicting functions are filter and lag. The stats versions of filter and lag are used in time series analysis, whereas the dplyr versions are more generally useful. If you need the masked stats functions, you can still access them by prefacing the function name with the name of the package (e.g. stats::filter).
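To illustrate the package::function notation, here is a small sketch that calls the stats version of filter explicitly. It computes a three-point moving average of a short series of numbers that I made up; the first and last entries are NA because there are not enough neighboring values to average. This works whether or not tidyverse has been loaded.

```r
x <- c(1, 2, 3, 4, 5)                   # a short, made-up series of values

# explicitly request the filter function from the stats package;
# rep(1/3, 3) gives equal weight to each of three neighboring values
ma <- stats::filter(x, rep(1/3, 3))

as.numeric(ma)                          # moving average as plain numbers
## [1] NA  2  3  4 NA
```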
We will use the “tidyverse” package for almost every class session and assignment in this class. Get in the habit of including the library(tidyverse) statement in all of your R documents.
2.9 The R Help System
R comes with fairly extensive documentation and a simple help system. You can access HTML versions of the R documentation under the Help tab in RStudio. The HTML documentation also includes information on any packages you’ve installed. Take a few minutes to browse through the R HTML documentation. In addition to the HTML documentation there is also a search box where you can enter a term to search on (see red arrow in figure below).
2.9.1 Getting help from the console
In addition to getting help from the RStudio help tab, you can directly search for help from the console. The help system can be invoked using the help function or the ? operator.
help("log")
?log
If you are using RStudio, the help results will appear in the “Help” tab of the Files/Plots/Packages/Help/Viewer (lower right window by default).
What if you don’t know the name of the function you want? You can use the help.search() function.
help.search("log")
In this case help.search("log") returns the help entries whose names or titles contain the string log. For more on help.search type ?help.search.
Other useful help-related functions include apropos(), example(), and vignette(). apropos() returns the names of all objects (including variable names and function names) in the current session that match the input string.
apropos("log")
## [1] "as.data.frame.logical" "as.logical" "as.logical.factor"
## [4] "dlogis" "is.logical" "log"
## [7] "log10" "log1p" "log2"
## [10] "logb" "Logic" "logical"
## [13] "logLik" "loglin" "plogis"
## [16] "qlogis" "rlogis" "SSlogis"
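The input to apropos() is treated as a pattern, so you can narrow the search. In the sketch below, the ^ symbol restricts matches to names that begin with "log", and the mode argument restricts the results to functions. I have not shown the full output because it depends on which packages you have loaded.

```r
starts.with.log <- apropos("^log")        # only names that begin with "log"
"log10" %in% starts.with.log              # log10 begins with "log"
## [1] TRUE
"plogis" %in% starts.with.log             # plogis contains "log" but doesn't begin with it
## [1] FALSE

log.functions <- apropos("^log", mode = "function")   # restrict to functions
"log2" %in% log.functions
## [1] TRUE
```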
example() provides examples of how a function is used.
example(log)
##
## log> log(exp(3))
## [1] 3
##
## log> log10(1e7) # = 7
## [1] 7
##
## log> x <- 10^-(1+2*1:9)
##
## log> cbind(x, log(1+x), log1p(x), exp(x)-1, expm1(x))
## x
## [1,] 1e-03 9.995003e-04 9.995003e-04 1.000500e-03 1.000500e-03
## [2,] 1e-05 9.999950e-06 9.999950e-06 1.000005e-05 1.000005e-05
## [3,] 1e-07 1.000000e-07 1.000000e-07 1.000000e-07 1.000000e-07
## [4,] 1e-09 1.000000e-09 1.000000e-09 1.000000e-09 1.000000e-09
## [5,] 1e-11 1.000000e-11 1.000000e-11 1.000000e-11 1.000000e-11
## [6,] 1e-13 9.992007e-14 1.000000e-13 9.992007e-14 1.000000e-13
## [7,] 1e-15 1.110223e-15 1.000000e-15 1.110223e-15 1.000000e-15
## [8,] 1e-17 0.000000e+00 1.000000e-17 0.000000e+00 1.000000e-17
## [9,] 1e-19 0.000000e+00 1.000000e-19 0.000000e+00 1.000000e-19
The vignette() function gives longer, more detailed documentation about packages. Not all packages include vignettes, but for those that do, a vignette is usually a good place to get started. For example, the stringr package (which we installed above) includes a vignette. To read its vignette, type the following at the console:
vignette("stringr")
2.4 Comments
When working in the R console, or writing R code, the pound symbol (#) indicates the start of a comment. Anything after the #, up to the end of the current line, is ignored by the R interpreter. Throughout this course I will often include short explanatory comments in my code examples.
When I want to display the output generated by an R statement typed at the console I will generally use a display convention in which I prepend the results with the symbols ##.
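Here is a short sketch that puts both conventions together. The statements themselves are arbitrary examples; the point is where the # symbol appears and how the output is displayed. Note that a # inside quotes is part of the character string, not the start of a comment.

```r
# a comment can occupy an entire line
10 / 4      # or it can follow a statement on the same line
## [1] 2.5

nchar("#hashtag")   # the # inside the quotes is ordinary text
## [1] 8
```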