For each of the problems below (except in cases where you are asked to discuss your interpretation) write R code blocks that will compute appropriate solutions. A good rule of thumb for judging whether your solution is appropriately “computable” is to ask yourself “If I added or changed observations to this data set, would my code still compute the right solution?”
When completed, submit your R Markdown document (the file with the extension .Rmd) and the knit HTML via Sakai.
Using the built-in iris data set, create a distance matrix (using dist()) representing the Euclidean distance between each pair of specimens (rows in the data frame). Make sure your distance matrix keeps the appropriate Species labels. [2 pts]
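As a starting point, here is a minimal sketch of one way to build a labeled distance matrix. The names `iris_measurements` and `iris_dist` are illustrative, and using make.unique() to turn the Species names into distinct row labels is just one possible convention, not a required one.

```r
# Keep only the numeric measurement columns (robust to column reordering)
iris_measurements <- iris[, sapply(iris, is.numeric)]

# Row names must be distinct, so make.unique() appends a suffix to repeats:
# "setosa", "setosa.1", "setosa.2", ...
rownames(iris_measurements) <- make.unique(as.character(iris$Species))

# dist() carries the row names along as the "Labels" attribute
iris_dist <- dist(iris_measurements, method = "euclidean")

head(attr(iris_dist, "Labels"))
```

With this convention the species of any label can be recovered by stripping the numeric suffix, for example with sub("\\.[0-9]+$", "", label).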
Carry out hierarchical clustering of the iris specimens using UPGMA agglomeration based on the distance matrix from Problem 1. Draw a dendrogram representing this clustering using the dendextend package, coloring the labels (see dendextend::labels_colors()) according to Species assignment. That is, all the setosa specimens should have labels with the same color, all the virginica specimens should have labels in a different color, etc. [6 pts]
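The following sketch shows one possible approach, under the assumption that UPGMA is requested through hclust() with method = "average". The variable names and the rainbow() palette are arbitrary choices, and the coloring step only runs if dendextend is installed.

```r
# Rebuild the distance matrix so this sketch runs on its own
measurements <- iris[, sapply(iris, is.numeric)]
iris_dist <- dist(measurements)

# UPGMA corresponds to average linkage in hclust()
iris_upgma <- hclust(iris_dist, method = "average")
iris_dend <- as.dendrogram(iris_upgma)

# Species of each leaf, in the order the leaves appear in the dendrogram
leaf_species <- iris$Species[order.dendrogram(iris_dend)]

# One color per species level; the palette itself is an arbitrary choice
species_palette <- rainbow(nlevels(iris$Species))
leaf_colors <- species_palette[as.integer(leaf_species)]

# Label colors must be supplied in leaf order, not in data-frame order
if (requireNamespace("dendextend", quietly = TRUE)) {
  dendextend::labels_colors(iris_dend) <- leaf_colors
  plot(iris_dend)
}
```

Indexing Species by order.dendrogram() is what keeps the colors aligned with the leaves; passing colors in the original row order would mislabel the tree.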
Cut your UPGMA dendrogram to yield 3 clusters, and plot the dendrogram, coloring the branches according to cluster assignment. [2 pts]
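A hedged sketch of the cutting step, assuming cutree() for the cluster assignments and dendextend::color_branches() for the branch coloring. The names `upgma_clusters` and `colored_dend` are illustrative, and the plotting step only runs if dendextend is installed.

```r
# Rebuild the UPGMA clustering so this sketch runs on its own
measurements <- iris[, sapply(iris, is.numeric)]
iris_upgma <- hclust(dist(measurements), method = "average")

# Cluster membership for each specimen, in the original row order
upgma_clusters <- cutree(iris_upgma, k = 3)
table(upgma_clusters)

# color_branches() cuts the tree into k groups and colors each group's branches
if (requireNamespace("dendextend", quietly = TRUE)) {
  colored_dend <- dendextend::color_branches(as.dendrogram(iris_upgma), k = 3)
  plot(colored_dend)
}
```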
Carry out k-medoids clustering of the iris specimens using the distance matrix from Problem 1, specifying three clusters. [2 pts]
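One common way to do k-medoids in R is pam() from the cluster package. The sketch below assumes that package is available; the name `iris_pam` is illustrative.

```r
# Rebuild the distance matrix so this sketch runs on its own
measurements <- iris[, sapply(iris, is.numeric)]
iris_dist <- dist(measurements)

# pam() accepts a dist object directly and treats it as a dissimilarity
iris_pam <- cluster::pam(iris_dist, k = 3)

# Cluster membership for each specimen, in the original row order
table(iris_pam$clustering)
```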
Generate a plot of the iris specimens in the space of the first two principal components (based on the covariance matrix). Color the specimens by their cluster membership according to the k-medoids clustering you carried out in the previous problem, and specify their shape by the Species label. [4 pts]
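A minimal base-graphics sketch of such a plot, assuming prcomp() with scale. = FALSE (which gives a PCA based on the covariance matrix) and pam() from the cluster package. Variable names, colors, and plotting symbols are arbitrary choices; a ggplot2 version would work equally well.

```r
# Rebuild the inputs so this sketch runs on its own
measurements <- iris[, sapply(iris, is.numeric)]
iris_pam <- cluster::pam(dist(measurements), k = 3)

# Centering without scaling corresponds to PCA on the covariance matrix
iris_pca <- prcomp(measurements, center = TRUE, scale. = FALSE)
pc_scores <- iris_pca$x[, 1:2]

# Color encodes the k-medoids cluster, point shape encodes the Species
plot(pc_scores,
     col = iris_pam$clustering,
     pch = as.integer(iris$Species),
     xlab = "PC1", ylab = "PC2")
legend("topright",
       legend = levels(iris$Species),
       pch = seq_len(nlevels(iris$Species)),
       title = "Species")
```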
Since we know the true Species groupings, it’s straightforward to visually inspect the figures you made above to identify mis-clustered samples. By visual inspection, how many of the iris specimens are mis-clustered under the UPGMA clustering? By visual inspection, how many of the specimens are mis-clustered under the k-medoids clustering? [2 pts]
EXTRA CREDIT: If you know the true groupings, and can visualize the clustering and its implied groupings, it’s relatively straightforward to identify mis-clustered samples. It is a more challenging task to compute this. Show how to compute the mis-clustered samples for the UPGMA and k-medoids clusterings you carried out above. Your solution should be robust to re-orderings of the original data and should not rely on visual inspection [5 pts]
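One possible approach, sketched below, is to cross-tabulate cluster assignments against the true labels, treat the most frequent species in each cluster as that cluster's label, and count every specimen that disagrees. The function name `count_misclustered` is hypothetical, and majority-vote matching is an assumption: it does not prevent two clusters from mapping to the same species, so other matching schemes may be preferable.

```r
# Count samples that do not belong to the majority species of their cluster.
# `clusters` and `truth` must be in the same order as each other, but the
# result does not depend on how the rows of the data are ordered.
count_misclustered <- function(clusters, truth) {
  counts <- table(clusters, truth)
  # For each cluster (row), the largest cell is the majority species
  sum(counts) - sum(apply(counts, 1, max))
}

measurements <- iris[, sapply(iris, is.numeric)]
iris_dist <- dist(measurements)

upgma_clusters <- cutree(hclust(iris_dist, method = "average"), k = 3)
pam_clusters <- cluster::pam(iris_dist, k = 3)$clustering

count_misclustered(upgma_clusters, iris$Species)
count_misclustered(pam_clusters, iris$Species)
```

Because the count is derived from a contingency table, shuffling the rows of the data (and the labels with them) leaves the result unchanged.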