Implementing Both K-means Clustering And EM Algorithm On 2D Points



K-means

K-means clustering is a type of unsupervised learning, which is used with unlabeled dataset. The goal of this algorithm is to find K groups in the data. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:

  • The centroids of the K clusters, which can be used to label new data

  • Labels for the training data (each data point is assigned to a single cluster)

K-means works by defining spherical clusters that are separable in a way so that the mean value converges towards the cluster center. Because of this, K-Means may underperform sometimes.


Use Cases:

  • Document Classification

  • Delivery Store Optimization

  • Customer Segmentation

  • Insurance Fraud Detection etc.


Algorithm :

Κ-means clustering algorithm inputs are the number of clusters Κ and the data set. Algorithm starts with initial estimates for the Κ centroids, which can either be randomly generated or randomly selected from the data set. The algorithm then iterates between two steps:


1. Data assigment step:

Each centroid defines one of the clusters. In this step, each data point based on the squared Euclidean distance is assigned to its nearest centroid. If 𝑐𝑖ci is the collection of centroids in set C, then each data point x is assigned to a cluster based on

where dist( · ) is the standard (L2) Euclidean distance.


2. Centroid update step:

Centroids are recomputed by taking the mean of all data points assigned to that centroid's cluster.


The algorithm iterates between step one and two until a stopping criteria is met (no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached). In this assignment, we will use the first criterion, i.e. that no cluster centers change.


This algorithm may converge on a local optimum. Assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.


Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from scipy.stats import multivariate_normal

Generating the data

We will generate 3 datasets to use in this assignment. They are created in the following code, and stored in X, X1 and X2 respectively. They are each sets of 500 2D points. We store them as matrices of size 500×2.


Please use these variables X1, X2 and X3 to run your kmeans and EM in this assignment.


"""
This cell has been completed for you. You don't need to change it.
"""

X,Y = make_blobs(cluster_std=1.5,random_state=20,n_samples=500,centers=3)
X = np.dot(X,np.random.RandomState(0).randn(2,2))

print('X shape: ', X.shape)

plt.scatter([x[0] for x in X], [x[1] for x in X])
plt.title('X')
plt.show()

centers = [[4, 7], [9, 9], [9, 2]]

X1,Y1 = make_blobs(cluster_std=1.5,random_state=20,n_samples=500,centers=centers)
X1 = np.dot(X1,np.random.RandomState(0).randn(2,2))

print('X1 shape: ', X1.shape)

plt.scatter([x[0] for x in X1], [x[1] for x in X1])
plt.title('X1')
plt.show()

centers = [[5, 5]]
X21,Y21 = make_blobs(cluster_std=1.5,random_state=20,n_samples=200,centers=centers)
X21 = np.dot(X21, np.array([[1.0, 0], [0, 5.0]]))

centers = [[5, 5]]
X22,Y22 = make_blobs(cluster_std=1.5,random_state=20,n_samples=200,centers=centers)
X22 = np.dot(X21, np.array([[5.0, 0], [0, 1.0]]))

centers = [[7, 7]]
X23, Y23 = make_blobs(cluster_std=1.5,random_state=20,n_samples=100,centers=centers)
X23 = np.dot(X23, np.random.RandomState(0).randn(2,2))

X2 = np.vstack((X21, X22, X23))

print('X2 shape: ', X2.shape)

plt.scatter([x[0] for x in X2], [x[1] for x in X2])
plt.title('X2')
plt.show()

Output




























Creating Class

class KmeansModel:
    
    def __init__(self, X, k, max_iters):
        self.X = X
        self.k = k
        self.max_iters = max_iters
        
        self.dim = X.shape[1]
        self.N = X.shape[0]
        
        centroid_indices = np.random.RandomState(2).permutation(X.shape[0])[:k]
        self.centroids = X[centroid_indices]
        
        initial_labels = self.get_labels(self.X, self.centroids)
        self.plot_data(initial_labels, 'Data with initial random clusters')


def get_labels(self, X, centroids): 
    labels=[]
    for i in range(X.shape[0]):
        point=X[i]
        
        # initialise a list to store the distance between the data-point and the centroids
        Euclidean_distance=[] 
        for index,C in enumerate(centroids):         
            # calculating euclidean distance between the data-point and the k-centroid points and appending into a list
            Euclidean_distance.append(np.power((point[0]-C[0]),2)+np.power((point[1]-C[1]),2))
        
        # selecting the index with the least distance
        label=Euclidean_distance.index(min(Euclidean_distance))
            
        labels.append([label])
        
    return (np.array(labels))
   

KmeansModel.get_labels = get_labels # Here we are just adding the function to the KmeansModel class.

def plot_data(self, labels, title):
    fig = plt.figure(figsize=(12,6))
    ax0 = fig.add_subplot(111)
    ax0.scatter(self.X[:,0], self.X[:,1], c=labels)
    ax0.set_title(title)
KmeansModel.plot_data = plot_data

Creating Main Function

def run(self):
    
    iters = 0
    while True:
        
        labels=self.get_labels(self.X,self.centroids)
        
        #assigning points to cluster based on labels
        clusters={}
        for i,j in zip(labels,self.X):
            if i[0] not in clusters:
                clusters[i[0]]=[]
                clusters[i[0]].append(j)
            else:
                clusters[i[0]].append(j)
        
        # calculating new centroids
        new_centroid=[]
        for i in clusters:
            new_centroid.append(list(np.average(clusters[i],axis=0)))

        # ADD THE CONVERGENCE CONDITION HERE, IE IF THE CENTROIDS DON'T CHANGE SINCE THE LAST ITERATION THEN BREAK
        comparison = self.centroids == np.array(new_centroid) 
        equal_arrays = comparison.all()
        
        if equal_arrays==True:
            break
        
        # ALSO UPDATE THE CENTROIDS TO THE NEW VALUES
        self.centroids=np.array(new_centroid)
            
        if iters == self.max_iters:
            break
        iters += 1

    final_labels = self.get_labels(self.X, self.centroids)
    self.plot_data(final_labels, 'Final clusters')
    
KmeansModel.run = run

Finally, we create an instance of KmeansModel and run our algorithm on some data.


km1 = KmeansModel(X, 3, 100)
km1.run()

km2 = KmeansModel(X1, 3, 100)
km2.run()

km3 = KmeansModel(X2, 3, 100)
km3.run()

































EM Algorithms

EM Algorithm finds clusters by treating each cluster as a 2D Gaussian distribution, and then fits the parameters of each distribution to accurately model the data. The goal is to find optimal parameters for each of the Gaussians which model the data accurately.




Creating Class

class EMModel:
    
    def __init__(self, X, k, max_iters):
        """
        This function initializes our parameters (mu, pi and sigma) and plots our data points.
        """
        self.X = X
        self.k = k
        self.max_iters = max_iters
        
        self.dim = self.X.shape[1] # Equals 2, as we are considering 2D points
        self.N = self.X.shape[0] # Equals the number of points in the dataset
        
        """
        Here we initialize mu, pi and sigma. Mu is k X 2, Pi is k X 1, Sigma is k X 2 X 2
        """
        self.mu = np.random.randint(max(min(self.X[:, 0]), min(self.X[:, 1])), min(max(self.X[:, 0]), max(self.X[:, 1])), size=(self.k, self.dim))
        self.pi = np.ones(self.k) / self.k
        self.sigma = np.zeros((self.k, self.dim, self.dim))
        for i in range(self.k):
            np.fill_diagonal(self.sigma[i], 5.0)
            
        """
        This part is required to plot clusters
        """
        x,y = np.meshgrid(np.sort(self.X[:,0]),np.sort(self.X[:,1]))
        self.XY = np.array([x.flatten(),y.flatten()]).T
        
        """
        This variable is required in case the covariance matrix turns out to not be positive semidefinite
        """
        self.sigma_correction = 0.0*np.identity(self.dim)
         
        """
        Finally, let's visualize our data points
        """
        self.plot_data('Initial State')

def plot_data(self, title, colors=None):
    """
    This function creates a scatter plot of all the data points. It also creates a contour plot of the probability 
    distributions of each of the clusters (specified by mu, pi and sigma)
    """
    fig = plt.figure(figsize=(12,6))
    ax0 = fig.add_subplot(111)
    ax0.scatter(self.X[:,0], self.X[:,1], c=colors)
    ax0.set_title(title)
    for m,c in zip(self.mu,self.sigma):
        c += self.sigma_correction
        multi_normal = multivariate_normal(mean=m,cov=c)
        ax0.contour(np.sort(self.X[:,0]),np.sort(self.X[:,1]),multi_normal.pdf(self.XY).reshape(len(self.X),len(self.X)),colors='black',alpha=0.3)
        ax0.scatter(m[0],m[1],c='grey',zorder=10,s=100)
        
EMModel.plot_data = plot_data


def expectation(self):
    # REPLACE THIS WITH THE ACTUAL r_ic COMPUTATION
    r = np.zeros((self.N, self.k))
    
    for i in range(self.k):
        for n in range(self.N):
            r[n,i] = self.pi[i] * multivariate_normal.pdf(self.X[n],
                                           self.mu[i], self.sigma[i])
    cost = np.log(r.sum(axis=1)).sum()
    r = r / r.sum(axis=1, keepdims=True)
    
    return r      
EMModel.expectation = expectation