For each of the problems below (except in cases where you are asked to discuss your interpretation) write R code blocks that will compute appropriate solutions. A good rule of thumb for judging whether your solution is appropriately “computable” is to ask yourself “If I added or changed observations to this data set, would my code still compute the right solution?”
Lindenmayer and colleagues (Lindenmayer DB et al. 1995, Australian Journal of Zoology 43, 449-458) studied morphological variation among populations of the mountain brushtail possum (Trichosurus caninus) in Australia. For each possum, the investigators recorded its sex, age, and collection site, along with a range of morphological measurements.
This data set is available as a Comma-Separated-Value (CSV) formatted spreadsheet at the URL used in the code below.
You can use the read_csv() function defined in the readr package (part of the tidyverse) to create a data frame from a CSV file.
library(tidyverse)
possums <- read_csv("https://tinyurl.com/lindenmayer-possums")
What are the dimensions of the possums data set?
What are the names of the variables in the possums data set?
Sites where animals were collected were assigned a site number (variable sites). How many unique sites are there in the data set?
Animals were categorized as coming from two different populations (variable Pop). What are the names of the populations?
Possums in the study were assigned an “age category” (variable age). Several samples have missing age information (‘NA’ values). Read the help on the function is.na() and write code to compute the number of samples with missing age data. HINT: sum() applied to a Boolean (logical) vector counts the number of TRUE elements.
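The hint about sum() can be illustrated on a small made-up vector. The values below are hypothetical and are not drawn from the possums data; this is only a sketch of the counting idiom.

```r
# Hypothetical age values with two missing entries (NOT from the possums data)
ages <- c(2, NA, 5, 3, NA, 1)

# is.na() returns a logical vector that is TRUE wherever a value is missing
missing_flags <- is.na(ages)

# TRUE is treated as 1 and FALSE as 0, so sum() counts the missing values
n_missing <- sum(missing_flags)
print(n_missing)
```

Because nothing about the vector's length or contents is hard-coded, the same two calls keep working if observations are added or changed.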
Create a histogram depicting the distribution of the tail length variable (taill). Make sure to pick an appropriate number of bins for your visualization.
Create a set of histograms depicting the distribution of tail lengths in the possums data set faceted by population.
Create a figure that uses boxplots to compare taill in the two different populations.
Histograms can be usefully combined with strip/jitter plots as shown in the figure below. Reproduce the figure below using geom_histogram and geom_jitter layers. Hint: To get the jittered points to sit at the base of each histogram, set the y aesthetic to zero.
Using pipes and dplyr::summarize(), calculate the sample size, mean, standard deviation, and standard error (of the mean) of tail length for each population of possums. Your output should be a single table (tibble). The dplyr function dplyr::n() is useful for this problem, as are the base mean() and sd() functions.
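As a reminder of the pattern involved, the sketch below computes grouped summary statistics on a small made-up tibble. The column names group and value and all of the numbers are hypothetical; the standard error of the mean is taken to be the standard deviation divided by the square root of the sample size.

```r
library(dplyr)

# Hypothetical measurements for two made-up groups (NOT the possums data)
toy <- tibble(
  group = c("a", "a", "a", "a", "b", "b", "b", "b"),
  value = c(1, 2, 3, 4, 2, 4, 6, 8)
)

toy_summary <- toy %>%
  group_by(group) %>%
  summarize(
    n = n(),                        # sample size per group
    mean_value = mean(value),
    sd_value = sd(value),
    se_value = sd_value / sqrt(n)   # standard error of the mean
  )
print(toy_summary)
```

Later expressions inside summarize() can refer to columns created earlier in the same call, which is why se_value can reuse sd_value and n.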
A useful statistical rule of thumb is that means between groups are significantly different if the difference in means is more than two standard errors (assuming normality, homogeneity of variances, etc). By this rule of thumb, is possum tail length different in the two populations?
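The rule of thumb can be sketched in base R on made-up numbers. The text does not say which standard error to compare against, so the sketch below assumes the standard error of the difference between two independent means; that choice, the helper se(), and the data are all assumptions made for illustration.

```r
# Hypothetical samples for two groups (NOT the possums data)
x <- c(10, 12, 11, 13, 12, 11)
y <- c(15, 14, 16, 15, 17, 16)

# Standard error of the mean: sd / sqrt(n) (hypothetical helper)
se <- function(v) sd(v) / sqrt(length(v))

diff_means <- abs(mean(x) - mean(y))

# ASSUMPTION: compare against the standard error of the difference
# between two independent means, sqrt(se_x^2 + se_y^2)
se_diff <- sqrt(se(x)^2 + se(y)^2)

exceeds_rule <- diff_means > 2 * se_diff
print(c(diff_means = diff_means, threshold = 2 * se_diff))
print(exceeds_rule)
```

A different reading of the rule (for example, comparing against the larger of the two group standard errors) would change only the se_diff line.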
Draw a scatter plot showing the relationship between tail length and total body length. Use jitter to minimize overplotting, and color the points by population.
Recreate the combined scatter plot / 2D density plot of tail length vs total body length, shown below. Note that I’ve added a little bit of jitter and alpha-transparency to the drawn points. Make sure to include titles, subtitles and axis labels.
This set of problems uses a data set that contains information on 150 cases of mothers and their newborns in North Carolina in 2004. This data set is available at the following URL:
This file is formatted as Tab-Separated Values (TSV). The variables in the data set are:
fAge, mAge, weeks, premature, visits, gained, weight, sexBaby, smoke.
Include appropriate code to load the NC births data set before answering the following.
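Tab-separated files can be read with readr::read_tsv(), the tab-delimited counterpart of read_csv(). The sketch below writes a tiny made-up file to a temporary path so that it runs without network access; the rows and the name toy_births are hypothetical and do not come from the NC births data.

```r
library(readr)

# Write a tiny, made-up TSV file to a temporary location
# (these rows are NOT from the NC births data)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(
  c("weeks\tweight\tsmoke",
    "39\t7.5\tnonsmoker",
    "36\t5.9\tsmoker"),
  tsv_path
)

# read_tsv() parses tab-delimited text into a tibble
toy_births <- read_tsv(tsv_path, show_col_types = FALSE)
print(dim(toy_births))
```

For the actual assignment, the same call would presumably take the data set's URL in place of tsv_path.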
Write a code block showing how to use dplyr::filter to get all the cases where both the mother and the father were 20 years old or younger [1 pt]
Write the equivalent code showing how to get the same cases using standard indexing [1 pt]
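One difference between the two approaches is how they treat missing values. The sketch below uses a made-up data frame with hypothetical columns x and y; it is meant to illustrate the behaviour, not to stand in for the births data.

```r
library(dplyr)

# Hypothetical data with some missing values (NOT the births data)
toy <- data.frame(
  x = c(1, 5, NA, 2),
  y = c(3, 1, 2, NA)
)

# filter() keeps only rows where every condition is TRUE;
# rows where a condition evaluates to NA are dropped
via_filter <- toy %>% filter(x <= 2, y <= 3)

# With standard indexing, NA entries in the logical vector would
# produce rows filled with NA, so which() is used to keep only
# the positions that are TRUE
keep <- toy$x <= 2 & toy$y <= 3
via_index <- toy[which(keep), ]

print(via_filter)
print(via_index)
```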
Write a code block that shows how to use dplyr::arrange to sort the births data by the babies’ birth weights [1 pt]
Using the output from the previous problem, in combination with standard indexing, show how to calculate the mean birth weight of the ten lightest babies [1 pt]
Show how to calculate the mean birth weight of the ten heaviest babies [1 pt]
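The sort-then-index pattern from the last three problems can be sketched on a small made-up tibble. The columns id and weight and their values are hypothetical, and the sketch takes the two smallest and two largest values rather than ten.

```r
library(dplyr)

# Hypothetical weights (NOT the births data)
toy <- tibble(
  id = 1:6,
  weight = c(7.1, 5.2, 8.4, 6.0, 9.3, 4.8)
)

# arrange() sorts rows in ascending order by default
sorted <- toy %>% arrange(weight)

# The first rows are the lightest cases
lightest_two <- sorted$weight[1:2]

# Compute the last positions from nrow() so the code still works
# if rows are added or removed
n_rows <- nrow(sorted)
heaviest_two <- sorted$weight[(n_rows - 1):n_rows]

print(mean(lightest_two))
print(mean(heaviest_two))
```

Deriving the index range from nrow() rather than typing row numbers keeps the code "computable" in the sense described at the top of the assignment.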
Write a code block that uses dplyr::group_by() and dplyr::count() to get the counts of cases by mother’s smoking status and the baby’s term status (premature or full term). That is, the output should report four counts, one for each combination of smoking status and term status [1 pt]
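Counting cases across two grouping variables can be sketched as follows. The tibble, the column names status and term, and their values are made up for illustration.

```r
library(dplyr)

# Hypothetical two-way categorical data (NOT the births data)
toy <- tibble(
  status = c("yes", "no", "no", "yes", "no", "yes"),
  term   = c("early", "full", "full", "full", "early", "full")
)

# Grouping by two variables and counting gives one row per
# combination that occurs in the data, with the count in column n
toy_counts <- toy %>%
  group_by(status, term) %>%
  count()
print(toy_counts)
```

Combinations that never occur in the data do not appear in the output, so a table with fewer rows than expected usually means an empty cell rather than an error.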
Create a boxplot figure like the one below to illustrate how birth weight varies conditional on term classification (premature or full term) and mother’s smoking status [1 pt]
Figure: Distribution of Birth Weights as a Function of Term and Mother’s Smoking
Use dplyr::summarize() to calculate the mean weights of babies for all four combinations of term classification and mother’s smoking [1 pt]
Write a code block that uses pipes to count the number of premature births in the data set. [1 pt]
Write a code block that uses pipes to calculate the mean weight, in kilograms, of babies classified as premature. [1 pt]
Write a code block that uses pipes to create a scatter plot depicting birth weight in kilograms (y-axis) versus weeks of gestation (x-axis) for babies born to non-smoking mothers, coloring the points according to whether the baby was premature or full term. Your figure should look similar to the one below [2 pts]
Figure: The relationship between birth weight and weeks of gestation for babies born to non-smoking mothers.
Consider the following code block which illustrates two ways to calculate the mean and median gestation time for babies of mothers who smoke:
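# %$% (the exposition pipe) is defined in magrittr, which library(tidyverse) does not attach
library(magrittr)

smokers.1 <-
  births %>%
  filter(smoke == "smoker") %>%
  summarize(mean.gestation = mean(weeks), median.gestation = median(weeks))

smokers.2 <-
  births %>%
  filter(smoke == "smoker") %$%
  c(mean(weeks), median(weeks))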
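For readers who have not met %$% before: it is magrittr's exposition pipe, which makes the columns of the left-hand data frame available by name to the expression on the right. A minimal sketch on a made-up data frame (the columns a and b are hypothetical):

```r
library(magrittr)

# Hypothetical data frame (NOT the births data)
toy <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6))

# %$% exposes the columns of toy, so cor() can refer to a and b directly
# instead of receiving the whole data frame as its first argument
r <- toy %$% cor(a, b)
print(r)
```

This is useful for functions such as cor() or c() that take vectors rather than a data frame as their first argument.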
What is the type and class of smokers.1? [0.5 pt]
What is the type and class of smokers.2? [0.5 pt]
Why does smokers.1$mean.gestation work, while smokers.2$mean.gestation raises an error? [1 pt]
Read the documentation on dplyr::filter_all(), dplyr::any_vars(), and dplyr::all_vars() and then show how to use these functions to filter out of the births data set those cases (rows) for which there is any missing data (across all variables) [1 pt]
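The general shape of these predicate helpers can be sketched on a small made-up tibble. The columns a and b are hypothetical; inside all_vars() and any_vars(), the dot stands for each column in turn.

```r
library(dplyr)

# Hypothetical data with missing values in different columns
# (NOT the births data)
toy <- tibble(
  a = c(1, NA, 3),
  b = c("x", "y", NA)
)

# all_vars(): keep rows where the predicate holds for every column
complete_rows <- toy %>% filter_all(all_vars(!is.na(.)))

# any_vars(): keep rows where the predicate holds for at least one column
rows_with_na <- toy %>% filter_all(any_vars(is.na(.)))

print(complete_rows)
print(rows_with_na)
```

The two results partition the rows of toy: every row either has no missing values or has at least one.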