Use the homework template linked to on the course website.
For each of the problems below (except in cases where you are asked to discuss your interpretation), write R code blocks that will compute appropriate solutions. A good rule of thumb for judging whether your solution is appropriately “computable” is to ask yourself “If I added or changed observations to this data set, would my code still compute the right solution?”
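As a rough illustration of the “computable” rule of thumb, the sketch below contrasts a hard-coded answer with a computed one. The data frame `toy` and its columns are made up for this example and are not part of the Spellman data.

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(
  gene = c("g1", "g2", "g3"),
  expression = c(0.2, 1.5, -0.7),
  stringsAsFactors = FALSE
)

# Hard-coded: only right for this exact data, and breaks silently
# if rows are added, removed, or reordered
hard_coded <- toy$gene[2]

# Computed: finds the row with the largest value, whatever the data look like
most_expressed <- toy$gene[which.max(toy$expression)]
most_expressed
```

Both lines give the same answer here, but only the second would still be correct if the observations changed.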
When completed, submit your R Markdown document (the file with the extension .Rmd) via Sakai.
The tidied “long” version of the Spellman gene expression data set is available as a CSV file at the following link:
Load the tidy, long version of the Spellman data set.
Show how to create a derived data frame that only includes the observations from the alpha factor (alpha) experiment, and genes that have no more than two missing (NA) observations. What are the dimensions of this data frame? [3 pts]
Using the data frame from the previous question, compute:
the gene names for the 500 most variable genes in the alpha factor experimental conditions [1 pt]
a corresponding data frame containing only the most variable genes. [1 pt]
Of the 500 genes under consideration, which shows the greatest variability? [0.5 pt]
Which gene shows the least variability? [0.5 pt]
Show how to find the index of the time point of minimal expression for each of the 500 most variable genes [2 pts].
Create a corresponding heat map showing the 500 genes sorted by the time point of minimal expression [3 pts]
Create a “wide” version of your 500 gene data frame, with genes as variables [2 pts]
Calculate a correlation matrix giving all pairwise correlations between the 500 genes [1 pt]
How many genes have correlations greater than 0.6 with the most variable gene in the data set (see earlier question)? [1 pt]
Create a heat map and corresponding line plots for the genes with correlations greater than 0.6 with the most variable gene in the data set. Both heat map and line plots should show expression of the correlated genes over time. In the line plot, highlight the most variable gene. Combine these two figures as (A) and (B) subfigures using cowplot. [5 pts]
NOTE: to get this figure to show up nicely in your knitted HTML, you can specify the width and height of the generated figure (in inches) in the code block header. For example: {r, fig.width=16, fig.height=10}
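For orientation, here is a minimal sketch of how two ggplot2 figures can be combined into labeled (A) and (B) panels with cowplot's `plot_grid`. It uses a small made-up data frame (`toy`, with hypothetical gene names and values), not the Spellman data, and the plot styling is only one of many reasonable choices. In an R Markdown document, this code would sit inside a chunk whose header sets `fig.width` and `fig.height` as in the note above.

```r
library(ggplot2)
library(cowplot)

# Toy long-format data (hypothetical genes and values, for illustration only)
toy <- expand.grid(gene = c("geneA", "geneB", "geneC"), time = 1:6)
toy$expression <- sin(toy$time + as.integer(toy$gene))

# Panel A: heat map of expression over time
heat <- ggplot(toy, aes(x = time, y = gene, fill = expression)) +
  geom_tile()

# Panel B: line plot, with one gene redrawn on top in a contrasting color
lines <- ggplot(toy, aes(x = time, y = expression, group = gene)) +
  geom_line(color = "gray60") +
  geom_line(data = subset(toy, gene == "geneA"), color = "red")

# Combine the two panels side by side with subfigure labels
combined <- plot_grid(heat, lines, labels = c("A", "B"))
combined
```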
Identify a “real-world” (i.e. published and available) data set from your scientific sub-field. Provide a reference to the paper this data set is associated with, a link to the data set, and describe (in broad terms) the structure of this data. How is the data organized in its current form? Does the data have missing values? If you wanted to analyze this data yourself, what sort of wrangling would you anticipate having to do with this data? [4 pts]