Instructions


Make your solutions computable

For each of the problems below (except where you are asked to discuss your interpretation), write R code blocks that compute appropriate solutions. A good rule of thumb for judging whether your solution is appropriately “computable” is to ask yourself: “If I added or changed observations in this data set, would my code still compute the right solution?”
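As an illustration of the rule of thumb above, here is a minimal sketch contrasting a hard-coded value with a computed one. The variable names (`n_hardcoded`, `n_computed`, `iris_subset`, `species_counts`) and the choice to drop the first ten rows are hypothetical, chosen only for this example.

```r
# Hard-coded: silently wrong as soon as rows are added or removed.
n_hardcoded <- 150

# Computable: derived from the data, so it tracks any change to the data set.
n_computed <- nrow(iris)

# Hypothetical modified data set: drop the first ten rows.
iris_subset <- iris[-(1:10), ]

# Recomputing gives the right answer for the modified data ...
nrow(iris_subset)

# ... whereas the hard-coded value no longer matches.
n_hardcoded == nrow(iris_subset)

# The same idea applies to per-group summaries: compute counts from the data
# rather than typing them in.
species_counts <- table(iris_subset$Species)
species_counts
```

The same test applies to cluster counts, label colors, and indices of particular specimens: derive them from the data rather than entering them by hand.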

When completed, submit your R Markdown document (the file with the extension .Rmd) and the knit HTML via Sakai.

Problems

1. Using the built-in iris data set, create a distance matrix (using dist()) representing the Euclidean distance between each pair of specimens (rows in the data frame). Make sure your distance matrix keeps the appropriate Species labels. [2 pts]
2. Carry out hierarchical clustering of the iris specimens using UPGMA agglomeration, based on the distance matrix from Problem 1. Draw a dendrogram representing this clustering using the dendextend package, coloring the labels (see dendextend::labels_colors()) according to Species assignment. That is, all the setosa specimens should have labels of the same color, all the virginica specimens should have labels of a different color, etc. [6 pts]
3. Cut your UPGMA dendrogram to yield 3 clusters, and plot the dendrogram, coloring the branches according to cluster assignment. [2 pts]
4. Carry out k-medoids clustering of the iris specimens using the distance matrix from Problem 1, specifying three clusters. [2 pts]
5. Generate a plot of the iris specimens in the space of the first two principal components (based on the covariance matrix). Color the specimens by their cluster membership according to the k-medoids clustering you carried out in Problem 4, and set their shape by the Species label. [4 pts]
6. Since we know the true Species groupings, it’s straightforward to visually inspect the figures you made above to identify mis-clustered samples. By visual inspection, how many of the iris specimens are mis-clustered under the UPGMA clustering? By visual inspection, how many of the specimens are mis-clustered under the k-medoids clustering? [2 pts]
7. EXTRA CREDIT: If you know the true groupings and can visualize the clustering and its implied groupings, it’s relatively straightforward to identify mis-clustered samples. It is a more challenging task to compute this. Show how to compute the mis-clustered samples for the UPGMA and k-medoids clusterings you carried out above. Your solution should be robust to re-orderings of the original data and should not rely on visual inspection. [5 pts]
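The core function calls behind the problems above can be sketched as follows. This is one possible starting point under stated assumptions, not a model solution: it assumes that UPGMA corresponds to the "average" linkage in hclust(), that k-medoids is done with cluster::pam(), and that unique row labels are built by pasting the species name to the row index. All variable names are hypothetical.

```r
library(cluster)  # for pam(); a recommended package shipped with most R installs

# Keep only the numeric measurement columns and label each row with its
# species plus the row index, so that the labels are unique.
measurements <- iris[, sapply(iris, is.numeric)]
rownames(measurements) <- paste(iris$Species, seq_len(nrow(iris)), sep = "_")

# Euclidean distance matrix; dist() carries the row names along as labels.
iris_dist <- dist(measurements, method = "euclidean")

# Assumption: UPGMA is hclust()'s "average" linkage.
upgma <- hclust(iris_dist, method = "average")
upgma_clusters <- cutree(upgma, k = 3)

# k-medoids (PAM) on the same distance matrix.
kmedoids <- pam(iris_dist, k = 3)

# PCA on the covariance matrix: prcomp() centers but does not scale by default.
pca <- prcomp(measurements)
pc_scores <- as.data.frame(pca$x[, 1:2])
pc_scores$cluster <- factor(kmedoids$clustering)
pc_scores$Species <- iris$Species

# Cross-tabulate cluster assignment against species as a quick check.
table(cluster = upgma_clusters, species = iris$Species)
```

Plotting is left out of this sketch. The dendrogram coloring would start from as.dendrogram(upgma) and the dendextend functions named in Problem 2, and pc_scores holds the columns needed for the scatter plot in Problem 5.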

Bio 723: Clustering assignment
