Instructions

Make your solutions computable

For each of the problems below (except in cases where you are asked to discuss your interpretation), write R code blocks that will compute appropriate solutions. A good rule of thumb for judging whether your solution is appropriately “computable” is to ask yourself, “If I added or changed observations in this data set, would my code still compute the right solution?”
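
To make the rule of thumb concrete, here is a minimal sketch contrasting a hard-coded answer with a computed one. The data frame `toy` and its values are hypothetical and have nothing to do with the assignment data.

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(id = 1:5, mass = c(2.1, 3.4, 2.8, 3.9, 3.0))

# Not computable: the answer is typed in by hand and goes stale if rows change
n_rows_hardcoded <- 5

# Computable: the answer is derived from the data itself
n_rows_computed <- nrow(toy)

# Adding an observation changes the computed answer without editing the code
toy2 <- rbind(toy, data.frame(id = 6, mass = 3.3))
nrow(toy2)
```

The hard-coded value would silently be wrong for `toy2`, while `nrow()` keeps tracking the data.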

Part 1

Data set: Brushtail possums

Lindenmayer and colleagues (Lindenmayer DB et al. 1995, Australian Journal of Zoology 43, 449-458) studied morphological variation among populations of the mountain brushtail possum (Trichosurus caninus) in Australia. The investigators recorded each possum’s sex, age, and collection site, along with a range of morphological measurements.

This data set is available as a Comma-Separated-Value (CSV) formatted spreadsheet via this link:

lindemayer-possums.csv

You can use the read_csv() function defined in the readr package (part of the tidyverse) to create a data frame from a CSV file.

library(tidyverse)
possums <- read_csv("https://tinyurl.com/lindenmayer-possums")

What are the dimensions of the possums data set?
What are the names of the variables in the possums data set?
Sites where animals were collected were assigned a site number (variable sites). How many unique sites are there in the data set?
Animals were categorized as coming from two different populations (variable Pop). What are the names of the populations?
Possums in the study were assigned an “age category” (variable age). There are several samples with missing age information (‘NA’ values). Read the help on the function is.na() and write code to compute the number of samples with missing age data. HINT: the sum() function applied to a Boolean (logical) vector counts the number of TRUE elements.
Create a histogram depicting the distribution of the tail length variable (taill). Make sure to pick an appropriate number of bins for your visualization.
Create a set of histograms depicting the distribution of tail lengths in the possums data set faceted by population.
Create a figure that uses boxplots to compare taill in the two different populations.
Histograms can be usefully combined with strip/jitter plots as shown in the figure below. Reproduce the figure below using geom_histogram and geom_jitter layers. Hint: To get the jittered points to sit at the base of each histogram, set the y aesthetic to zero.
Using pipes and dplyr::summarize, calculate the sample size, mean, standard deviation, and standard error (of the mean) of tail length for each population of possums. Your output should be a single table (tibble). The dplyr function dplyr::n() is useful for this problem, as are the base mean() and sd() functions.
A useful statistical rule of thumb is that means between groups are significantly different if the difference in means is more than two standard errors (assuming normality, homogeneity of variances, etc.). By this rule of thumb, is possum tail length different in the two populations?
Draw a scatter plot showing the relationship between tail length and total body length, using jitter to minimize overplotting, and color the points by the population variable.
Recreate the combined scatter plot / 2D density plot of tail length vs total body length, shown below. Note that I’ve added a little bit of jitter and alpha-transparency to the drawn points. Make sure to include titles, subtitles and axis labels.
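
The hints in the problems above (counting missing values with sum() and is.na(), and building a per-group table with dplyr::summarize and dplyr::n()) can be tried out on a small toy example before applying them to the possums data. The sketch below is only an illustration: the tibble `toy`, its columns `group` and `value`, and the output column names are all hypothetical.

```r
library(dplyr)

# Hypothetical toy data; names and values are illustrative only
toy <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 4, 6, NA)
)

# is.na() returns a logical vector; sum() treats TRUE as 1 and FALSE as 0,
# so the sum is the number of missing values
n_missing <- sum(is.na(toy$value))

# One row per group: sample size, mean, standard deviation, and
# standard error of the mean (sd divided by the square root of n)
toy_summary <- toy %>%
  filter(!is.na(value)) %>%   # drop missing values before summarizing
  group_by(group) %>%
  summarize(
    n_obs      = n(),
    mean_value = mean(value),
    sd_value   = sd(value),
    se_value   = sd_value / sqrt(n_obs)
  )

toy_summary
```

Dropping the missing rows first is one design choice; an alternative is to pass `na.rm = TRUE` to mean() and sd(), in which case n() would still count the rows with missing values.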

Part 2

Data set: NC Births

This set of problems uses a data set that contains information on 150 cases of mothers and their newborns in North Carolina in 2004. This data set is available at the following URL:

https://github.com/Bio723-class/example-datasets/raw/master/nc-births.txt

This file is formatted as Tab-Separated Values (TSV). The variables in the data set are:

father’s age (fAge)
mother’s age (mAge)
weeks of gestation (weeks)
whether the birth was premature or full term (premature)
number of OB/GYN visits (visits)
mother’s weight gained in pounds (gained)
baby’s birth weight (weight)
sex of the baby (sexBaby)
whether the mother was a smoker (smoke)

Include appropriate code to load the NC births data set before answering the following.
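
As a rough sketch of how tab-separated data can be read, the example below uses readr::read_tsv() on a few made-up rows supplied as literal text (wrapped in I(), which assumes readr 2.0 or later). Only the column names fAge, mAge, and weeks come from the variable list above; the values are invented for illustration.

```r
library(readr)

# Made-up rows in tab-separated form; "\t" is a tab, "\n" ends a row
tsv_text <- "fAge\tmAge\tweeks\n31\t30\t39\n34\t36\t40\n"

# I() marks the string as literal data rather than a file path
toy_births <- read_tsv(I(tsv_text), show_col_types = FALSE)

# For the real file, the URL string would be passed instead of literal text:
# births <- read_tsv("https://github.com/Bio723-class/example-datasets/raw/master/nc-births.txt")

dim(toy_births)
```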