# Hierarchical Clustering & Principal Component Analysis Using R Programming

Hierarchical Clustering:

The dataset on American College and University Rankings contains information on 1302 American colleges and universities offering an undergraduate program. For each university, there are 17 continuous measurements (such as tuition and graduation rate) and 2 categorical measurements (such as location by state and whether it is a private or public school). Note that many records are missing some measurements. Our first goal is to estimate these missing values from “similar” records. This will be done by clustering the complete records and then finding the closest cluster for each of the partial records. The missing values will be imputed from the information in that cluster.

• Remove all records with missing measurements from the dataset.

• For all the continuous measurements, run hierarchical clustering using complete linkage and Euclidean distance. Make sure to normalize the measurements. From the dendrogram, how many clusters seem reasonable for describing these data?

• Compare the summary statistics for each cluster and describe each cluster in this context (e.g., “Universities with high tuition, low acceptance rate…”).

• Use the categorical measurements that were not used in the analysis (State and Private/Public) to characterize the different clusters. Is there any relationship between the clusters and the categorical information?

• What other external information can explain the contents of some or all of these clusters?

• Consider Tufts University, which is missing some information. Compute the Euclidean distance of this record from each of the clusters that you found above (using only the measurements that you have). Which cluster is it closest to? Impute the missing values for Tufts by taking the average of the cluster on those measurements.

Principal Component Analysis:

The file ToyotaCorolla.csv contains data on used cars (Toyota Corollas) on sale during late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal will be to predict the price of a used Toyota Corolla based on its specifications.

• Identify the categorical variables.

• Explain the relationship between a categorical variable and the series of binary dummy variables derived from it.

• How many dummy binary variables are required to capture the information in a categorical variable with N categories?

• Use R to convert the categorical variables in this dataset into dummy variables, and explain in words, for one record, the values in the derived binary dummies.

• Use R to produce a correlation matrix and matrix plot. Comment on the relationships among variables.

Consider the 3-means algorithm on a set S consisting of the following 6 points in the plane: a = (0,0), b = (8,0), c = (16,0), d = (0,6), e = (8,6), f = (16,6). The algorithm uses the Euclidean distance metric to assign each point to its nearest centroid; ties are broken in favor of the centroid to the left/down. A starting configuration is a subset of 3 starting points from S that form the initial centroids. A 3-partition is a partition of S into 3 subsets; thus {a,b,e}, {c,d}, {f} is a 3-partition; clearly any 3-partition induces a set of three centroids in the natural manner. A 3-partition is stable if repetition of the 3-means iteration with the induced centroids leaves it unchanged.

• How many starting configurations are there?

• What are the stable 3-partitions?

• What is the number of starting configurations leading to each of the stable 3-partitions found in the previous part?

• What is the maximum number of iterations from any starting configuration to its stable 3-partition?
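These questions can be checked empirically. The following sketch (plain base R written for this answer; the helper names `assign_points`, `run_kmeans` and `canon` are ours) enumerates every starting configuration, runs the 3-means iteration with the tie-breaking rule stated above, and tabulates the stable 3-partitions reached:

```
# The six points of S, one per row, labelled a..f
S <- matrix(c(0, 0,  8, 0,  16, 0,  0, 6,  8, 6,  16, 6),
            ncol = 2, byrow = TRUE,
            dimnames = list(letters[1:6], c("x", "y")))

# Assign each point to its nearest centroid; ties go to the
# centroid furthest to the left, then furthest down
assign_points <- function(pts, centroids) {
  apply(pts, 1, function(p) {
    d <- sqrt(rowSums((centroids - matrix(p, nrow(centroids), 2, byrow = TRUE))^2))
    cand <- which(d == min(d))
    cand[order(centroids[cand, 1], centroids[cand, 2])][1]
  })
}

# Run the 3-means iteration from one starting configuration until the
# partition stops changing; return the final labels and iteration count
run_kmeans <- function(start_idx) {
  centroids <- S[start_idx, , drop = FALSE]
  labels <- assign_points(S, centroids)
  iters <- 0
  repeat {
    for (k in 1:3) {                      # recompute the centroids
      members <- S[labels == k, , drop = FALSE]
      if (nrow(members) > 0) centroids[k, ] <- colMeans(members)
    }
    new_labels <- assign_points(S, centroids)
    iters <- iters + 1
    if (identical(new_labels, labels)) break
    labels <- new_labels
  }
  list(labels = labels, iters = iters)
}

# Canonical form of a partition (relabel clusters by first appearance)
canon <- function(l) paste(match(l, unique(l)), collapse = "")

starts <- combn(6, 3)                     # all starting configurations
print(ncol(starts))                       # choose(6, 3) = 20

res <- lapply(seq_len(ncol(starts)), function(j) run_kmeans(starts[, j]))
partitions <- sapply(res, function(r) canon(r$labels))
print(table(partitions))                  # stable partitions and their counts
print(max(sapply(res, function(r) r$iters)))  # max iterations to stability
```

The column partition {a,d}, {b,e}, {c,f} (canonical label "123123") is one of the stable 3-partitions this enumeration finds.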

Implementation

```
mydata <- read.csv("Universities.csv")  # read the csv file
mydata
```

#-------Q---1.a---removing missing values----------------------------------------------------------------------------

```
mydata_clean <- na.omit(mydata)

library(dplyr)

# print the number of rows and columns in the data frame
print(paste("No. of rows:", nrow(mydata_clean)))
print(paste("No. of cols:", ncol(mydata_clean)))
```

#------Q---1.b---Hierarchical_Clustering-----------------------------------------------------------------------------

#---------separating continuous and categorical data-----------------------------------------------------------------

#--------------Categorical_data-------------------------------------------------------

```
categorical_variables <- names(select(mydata_clean, "State", "Public..1...Private..2."))
print(categorical_variables)
```

#--------------Numerical_data---------------------------------------------------------

```
continuous_variables <- setdiff(names(which(sapply(mydata_clean, is.numeric))), "Public..1...Private..2.")
print(continuous_variables)
```

#--------------Normalise_data---------------------------------------------------------

`mydata_normalised <- as.data.frame(scale(mydata_clean[,continuous_variables]))`

#--------------dist_method------------------------------------------------------------

`dist_mat <- dist(mydata_normalised, method = 'euclidean')`

#---------------clustering------------------------------------------------------------

`hcluster <- hclust(dist_mat, method = "complete")`
```
# plot the dendrogram
plot(hcluster, cex = 0.6)
abline(h = 12, col = 'red')

# from the dendrogram we select 6 clusters
# draw the dendrogram with green borders around the 6 clusters
rect.hclust(hcluster, k = 6, border = "green")

# coloring the dendrogram branches according to cluster
suppressPackageStartupMessages(library(dendextend))
dend_obj <- as.dendrogram(hcluster)
col_dend <- color_branches(dend_obj, h = 12)
plot(col_dend)

# getting the clusters
sub_grp <- cutree(hcluster, k = 6)
# cluster counts
table(sub_grp)

# add the cluster assignments to the table in a column named "class"
mydata_clean['class'] <- as.factor(sub_grp)

# group the continuous data by 'class' and take the mean (cluster centers)
cluster_table <- aggregate(mydata_clean[continuous_variables], by = mydata_clean['class'], mean)
summary(cluster_table)
```

#-------Q---1.c------------------------------------------------------------------------------------------------------

```
# In this case there are three major clusters, i.e. clusters 1, 2 and 3, and three outlier clusters,
# i.e. clusters 4, 5 and 6.

# cluster 1: Universities with low tuition (both in-state and out-of-state) and high acceptance rates.

# cluster 2: Universities with high tuition (both in-state and out-of-state) and low acceptance rates.

# cluster 3: Universities with low in-state tuition, high out-of-state tuition and high acceptance rates.
```

#-------Q---1.d------------------------------------------------------------------------------------------------------

```
# cross-tabulating the clusters against the categorical variables
cat_cluster_table <- table(mydata_clean$class, mydata_clean$Public..1...Private..2.)
print(cat_cluster_table)
table(mydata_clean$State, mydata_clean$class)

# cluster 1
table(mydata_clean$Public..1...Private..2.[mydata_clean$class == 1])
print("Cluster 1 has a comparable number of Private and Public colleges")

# cluster 2
table(mydata_clean$Public..1...Private..2.[mydata_clean$class == 2])
print("Cluster 2 has a majority of Private colleges")

# cluster 3
table(mydata_clean$Public..1...Private..2.[mydata_clean$class == 3])
print("Cluster 3 has a majority of Public colleges")

# All in all, the three clusters contain more Private colleges than Public ones, simply because the
# dataset itself has more private colleges than public ones.

# We can infer from cluster 2 that Private colleges tend to have higher tuition fees.
```

#-------Q---1.e------------------------------------------------------------------------------------------------------

```
# Of the three major clusters, cluster 3 has the most full-time undergrads and cluster 2 has the least.

# There is a similar trend for part-time undergrads as for full-time undergrads.
# This suggests that more students choose Public colleges over Private colleges.

# Additional fees are higher in cluster 3 (majority Public colleges) than in the clusters with more
# Private colleges.

# The graduation rate is highest for cluster 2, i.e. universities with high tuition (both in-state and
# out-of-state) and low acceptance rates.

# The number of new students from the top 10 and top 25 of their class is also highest for cluster 2,
# i.e. universities with high tuition (both in-state and out-of-state) and low acceptance rates.
```

#-------Q---1.f---Centroid_Comparison--------------------------------------------------------------------------------

```
# preparing the data point for comparison
point_new <- mydata[mydata$College.Name == 'Tufts University', ]
point_new[is.na(point_new)] <- 0
point_new <- as.numeric(point_new[-c(1:3)])

# function to calculate the Euclidean distance
Euclidean_Distance <- function(x1, x2) sqrt(sum((x1 - x2)^2))

# calculating the Euclidean distance of the record from each cluster center
for (i in 1:nrow(cluster_table)) {
  point <- cluster_table[i, ][-1]
  print(paste('Distance from cluster', i, 'is', Euclidean_Distance(point_new, point)))
}

# From the result we can see that the Tufts University record is closest to cluster 2.

# The missing values can now be imputed from the cluster 2 averages in cluster_table.
```

#--------------------------------------------------------------------------------------------------------------------
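Since the imputation step is only described in a comment above, here is a minimal self-contained sketch of cluster-mean imputation on a toy data frame (the names `toy_clusters` and `new_record` and the values are made up for illustration; in the actual analysis, the cluster means come from `cluster_table` and the closest cluster is the one identified by the distance loop):

```
# Toy cluster means, standing in for the aggregated cluster_table
toy_clusters <- data.frame(class = 1:2, x = c(1, 10), y = c(2, 20))

# A record with one missing measurement, assumed closest to cluster 2
new_record <- c(x = 9, y = NA)
closest <- 2

# Impute each missing field with the mean of the assigned cluster
missing_cols <- names(new_record)[is.na(new_record)]
new_record[missing_cols] <- unlist(toy_clusters[toy_clusters$class == closest, missing_cols])

print(new_record)  # y is imputed as 20, the cluster 2 mean
```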

#--------------Question_2--------------------------------------------------------------------------------------------

#------Q--2.a---Categorical_columns----------------------------------------------------------------------------------

```
# loading the dataset
data_df <- read.csv("ToyotaCorolla.csv", stringsAsFactors = TRUE)
glimpse(data_df)  # info about the dataset

# check for factor-type columns
is.fact <- sapply(data_df, is.factor)
# print the categorical columns' names
print('The categorical columns are:')
print(names(data_df[, is.fact]))

# There are three categorical columns in total, but one of them, the model name of the car, is not a
# feature column. Thus, there are two categorical variables: "Fuel_Type" and "Color".
```

#------Q--2.b---relationship between a categorical variable and the series of binary dummy variables-----------------

```
# A dummy variable is a variable that takes the values 0 and 1, where the value indicates the presence
# or absence of a category.
# When a categorical variable has more than two categories, it can be represented by a set of dummy
# variables, with one variable for each category.
#
# For example, in this dataset the categorical variable "Fuel_Type" has three categories: "CNG",
# "Diesel" and "Petrol". Thus there will be three dummy columns, one per category. In the dummy column
# for Diesel, the records that have a Diesel fuel type get the value 1 and all other records get 0.
# The dummy columns for CNG and Petrol are filled in the same manner.
```

#-----Q--2.c---No. of dummy variables required-----------------------------------------------------------------------

```
# N (or N - 1) dummy binary variables are required to capture the information in a categorical variable
# with N categories.
#
# In some situations, such as linear regression, using all N dummies causes failure (the "dummy variable
# trap"), because the Nth variable contains redundant information and can be expressed as a linear
# combination of the others. Thus, we generally only need N - 1 dummy variables.
```
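To illustrate the N - 1 rule concretely, base R's `model.matrix` drops the first level of a factor by default, so a 3-level factor yields 2 dummy columns (the toy `fuel` vector below is made up for illustration):

```
# A 3-level factor, mirroring Fuel_Type: CNG, Diesel, Petrol
fuel <- factor(c("CNG", "Diesel", "Petrol", "Petrol"))

# model.matrix encodes the factor as an intercept plus N - 1 dummies;
# dropping the intercept column leaves the 2 dummy columns
dummies <- model.matrix(~ fuel)[, -1, drop = FALSE]
print(dummies)
```

A CNG record is identified by both dummies being 0, which is exactly the redundancy argument above: the dropped level is recoverable from the others.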

#-------Q--2.d---Creating_dummies------------------------------------------------------------------------------------

```
# install.packages("dummies")
library("dummies")

data_df$Model <- as.character(data_df$Model)
data_dummy <- dummy.data.frame(data_df, dummy.classes = "factor")

data_dummy <- data_dummy[, !colnames(data_dummy) %in% "Fuel_TypeCNG"]
data_dummy <- data_dummy[, !colnames(data_dummy) %in% "ColorBeige"]
as.array(names(data_dummy))

#
# Consider the dummy variable Fuel_TypeDiesel.
# Its values are as follows:
#
# 1 : if the record had Fuel_Type = "Diesel"
# 0 : if the record had Fuel_Type = "CNG" or "Petrol"
```

#------Q---2.e---Correlation_matrix----------------------------------------------------------------------------------

```
# correlation matrix (Cylinders is dropped because it is constant in this dataset)
data_corr <- subset(data_df, select = -c(Model, Fuel_Type, Color, Cylinders))
cor_matrix <- cor(data_corr)
data.frame(cor_matrix)

# matrix plot
heatmap(cor_matrix, Rowv = NA, Colv = NA)

# library(GGally)
# library(ggplot2)
# library(corrplot)
# library(gplots)
# ggcorr(data_corr, hjust = 0.75, color = "grey40", mid = "#FFFF66", name = "Correlation Plot")

# Age (in months) is negatively correlated with Price (-0.88): the older the car, the lower its price.
#
# KM is negatively correlated with Price (-0.57): as with age, the price of the car decreases as the
# kilometres driven increase.
#
# Weight is positively correlated with Price (0.58): heavier cars tend to have higher prices.
#
# KM is positively correlated with age (0.51): unsurprisingly, older cars tend to have been driven
# further.
#
# Weight is positively correlated with quarterly tax (0.63): the heavier the car, the higher the
# quarterly road tax.
#
# Radio and Radio Cassette have a very high correlation (0.99): a car that has a radio almost always
# also has a radio cassette player.
```