
K-Means Clustering


Import the following libraries:

  • random

  • numpy as np

  • matplotlib.pyplot as plt

  • KMeans from sklearn.cluster

  • make_blobs from sklearn.datasets (older scikit-learn versions exposed this under sklearn.datasets.samples_generator)

Also run %matplotlib inline since we will be plotting in this section.


#Import Libraries
import random 
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator in older scikit-learn versions
%matplotlib inline


Generating Random Data

So we will be creating our own dataset!

First we need to set a random seed. Use numpy's random.seed() function, with the seed set to 0.

ex.

np.random.seed(0)

Next we will make random clusters of points using the make_blobs function. make_blobs can take many inputs, but we will be using these specific ones. Input:

  • n_samples: The total number of points equally divided among clusters.

  • Value will be: 5000

  • centers: The number of centers to generate, or the fixed center locations.

  • Value will be: [[4, 4], [-2, -1], [2, -3],[1,1]]

  • cluster_std: The standard deviation of the clusters.

  • Value will be: 0.9

Output

  • X: Array of shape [n_samples, n_features]. (Feature Matrix)

  • The generated samples.

  • y: Array of shape [n_samples]. (Response Vector)

  • The integer labels for cluster membership of each sample.


X, y = make_blobs(n_samples=5000, centers=[[4,4], [-2, -1], [2, -3], [1, 1]], cluster_std=0.9)
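A quick look at what make_blobs returned (a small sanity check, assuming the call above):


print(X.shape)       # (5000, 2): 5000 points with 2 features each
print(np.unique(y))  # [0 1 2 3]: one integer label per center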

Display the scatter plot of the randomly generated data.


plt.scatter(X[:, 0], X[:, 1], marker='.')

Output: a scatter plot of the randomly generated points.

Setting up K-means

Now that we have our random data, let's set up our K-Means Clustering.

The KMeans class has many parameters that can be used, but we will be using these three:

  • init: Initialization method of the centroids.

  • Value will be: "k-means++"

  • k-means++: Selects initial cluster centers for k-means clustering in a smart way to speed up convergence (see the sketch after this list).

  • n_clusters: The number of clusters to form as well as the number of centroids to generate.

  • Value will be: 4 (since we have 4 centers)

  • n_init: Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.

  • Value will be: 12
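The seeding that "k-means++" performs can be sketched in a few lines (a minimal illustration of the idea, not scikit-learn's actual implementation; the helper name kmeanspp_init is made up for this example, and it reuses the np alias imported earlier): the first center is a random data point, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far.


def kmeanspp_init(X, n_clusters, rng=None):
    # Pick the first center uniformly at random, then pick each new center
    # with probability proportional to its squared distance from the nearest
    # center already chosen, so far-away points are favored.
    if rng is None:
        rng = np.random.default_rng(0)
    centers = [X[rng.integers(len(X))]]
    while len(centers) < n_clusters:
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)  # squared distance to nearest center
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)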

Initialize K-Means with these parameters, saving the resulting model object as k_means.


k_means = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)

Now let's fit the K-Means model with the feature matrix we created above, X


k_means.fit(X)

Output: KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=4, n_init=12, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
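Once fitted, the model can assign new points to their nearest learned centroid and reports its inertia (the sum of squared distances from each sample to its closest center). A quick check, assuming the k_means object above (the two query points are made up):


# Assign two hypothetical new points to their nearest learned centroid.
print(k_means.predict([[4, 4], [-2, -1]]))

# Sum of squared distances of samples to their closest cluster center.
print(k_means.inertia_)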


Now let's grab the labels for each point in the model using KMeans' .labels_ attribute and save it as k_means_labels


k_means_labels = k_means.labels_

We will also get the coordinates of the cluster centers using KMeans' .cluster_centers_ and save it as k_means_cluster_centers


k_means_cluster_centers = k_means.cluster_centers_
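As a quick sanity check (assuming the two variables just created), there is one label per sample and one center per cluster:


print(k_means_labels.shape)           # (5000,): one cluster index per point
print(k_means_cluster_centers.shape)  # (4, 2): one (x, y) center per cluster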

Creating the Visual Plot

So now that we have the random data generated and the KMeans model initialized, let's plot them and see what it looks like!

Please read through the code and comments to understand how to plot the model.


# Initialize the plot with the specified dimensions.
fig = plt.figure(figsize=(6, 4))

# Colors uses a color map, which will produce an array of colors based on
# the number of labels there are. We use set(k_means_labels) to get the
# unique labels.
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means_labels))))

# Create a plot with a black background (black makes it easier to see each
# point's connection to its centroid).
ax = fig.add_subplot(1, 1, 1, facecolor = 'black')

# For loop that plots the data points and centroids.
# k will range from 0-3, which will match the possible clusters that each
# data point is in.
for k, col in zip(range(len(k_means_cluster_centers)), colors):

    # Create a list of all data points, where the data points that are
    # in the current cluster (ex. cluster 0) are labeled as True, else
    # they are labeled as False.
    my_members = (k_means_labels == k)
    
    # Define the centroid, or cluster center.
    cluster_center = k_means_cluster_centers[k]
    
    # Plots the datapoints with color col.
    ax.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.')
    
    # Plots the centroids with specified color, but with a darker outline
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)

# Title of the plot
ax.set_title('KMeans')

# Remove x-axis ticks
ax.set_xticks(())

# Remove y-axis ticks
ax.set_yticks(())

# Show the plot
plt.show()

# Display the scatter plot from above for comparison.
plt.scatter(X[:, 0], X[:, 1], marker='.')

Output: the data points colored by cluster with the four centroids outlined in black, followed by the original unclustered scatter plot for comparison.

Clustering Iris Data

Import the following libraries:


  • Axes3D from mpl_toolkits.mplot3d

  • KMeans from sklearn.cluster

  • load_iris from sklearn.datasets


Note: It is presumed that numpy and matplotlib.pyplot are both imported as np and plt respectively from previous imports. If that is not the case, please import them!


#import libraries
from mpl_toolkits.mplot3d import Axes3D 
from sklearn.cluster import KMeans 
from sklearn.datasets import load_iris

Then we will set the random seed and the centers for K-means.


np.random.seed(5)
centers = [[1, 1], [-1, -1], [1, -1]]

Using the load_iris() function, declare the iris dataset as the variable iris.


iris = load_iris()

Also declare X as the iris data component and y as the iris target component.


X = iris.data 
y = iris.target
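A quick look at what we just loaded (the iris set has 150 samples, each with 4 measurements, belonging to 3 species):


print(X.shape)             # (150, 4)
print(iris.feature_names)  # sepal length/width and petal length/width (cm)
print(np.unique(y))        # [0 1 2]: setosa, versicolor, virginica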

Now let's run the rest of the code and see what K-Means produces!


estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1,
                                              init='random')}

fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

    plt.cla()
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))

    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    fignum = fignum + 1

# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

plt.cla()

for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()

Output: four 3-D scatter plots of the iris data (petal width, sepal length, petal length): one for each of the three estimators and one for the ground-truth labels.

Plots 1-3 show the different end results you obtain by using different initialization processes. Plot 4 shows what the answer should be; however, it is clear that K-means is heavily reliant on the initialization of the centroids.
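To see this sensitivity numerically, you can compare the inertia reached from a single random initialization with the inertia reached by k-means++ with several restarts (a small illustrative check on the iris feature matrix X; the random_state value is arbitrary, and a single random start is not always worse):


# One random start vs. k-means++ with 12 restarts; lower inertia is better.
single_random = KMeans(n_clusters=3, init='random', n_init=1, random_state=1).fit(X)
multi_kmeanspp = KMeans(n_clusters=3, init='k-means++', n_init=12, random_state=1).fit(X)

print(single_random.inertia_)
print(multi_kmeanspp.inertia_)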


Hierarchical Clustering

We will now look at the next clustering technique, which is Agglomerative Hierarchical Clustering. Remember that agglomerative is the bottom-up approach. In this lab, we will be looking at Agglomerative clustering, which is more popular than Divisive clustering. We will also be using Complete Linkage as the linkage criterion. NOTE: You can also try using Average Linkage wherever Complete Linkage would be used to see the difference!

Import Libraries:

  • numpy as np

  • ndimage from scipy

  • hierarchy from scipy.cluster

  • distance_matrix from scipy.spatial

  • pyplot as plt from matplotlib

  • manifold from sklearn

  • datasets from sklearn

  • AgglomerativeClustering from sklearn.cluster

  • make_blobs from sklearn.datasets (sklearn.datasets.samples_generator in older versions)

Also run %matplotlib inline if that wasn't run already.


#import libraries
import numpy as np 
from scipy import ndimage 
from scipy.cluster import hierarchy 
from scipy.spatial import distance_matrix 
from matplotlib import pyplot as plt 
from sklearn import manifold, datasets 
from sklearn.cluster import AgglomerativeClustering 
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator in older scikit-learn versions
%matplotlib inline
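Before generating data, here is a tiny illustration of the two linkage criteria using the hierarchy module imported above (the five-point array is made up for this example): complete linkage measures the distance between two clusters by their farthest pair of points, while average linkage uses the mean of all pairwise distances.


# Tiny made-up dataset: two tight pairs and one far-away point.
pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [12, 0]])

# Merge clusters by the distance of their farthest pair of points.
Z_complete = hierarchy.linkage(pts, method='complete')

# Merge clusters by the mean of all pairwise distances.
Z_average = hierarchy.linkage(pts, method='average')

# The merge heights in the two dendrograms show where the criteria differ.
hierarchy.dendrogram(Z_complete)
plt.show()
hierarchy.dendrogram(Z_average)
plt.show()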


Generating Random Data

We will be generating another set of data using the make_blobs function once again. This time you will input your own values! Input these parameters into make_blobs:

  • n_samples: The total number of points equally divided among clusters.

  • Choose a number from 10-1500

  • centers: The number of centers to generate, or the fixed center locations.

  • Choose arrays of x,y coordinates for generating the centers. Have 1-10 centers (ex. centers=[[1,1], [2,5]])

  • cluster_std: The standard deviation of the clusters. The larger the number, the more spread out the points within each cluster

  • Choose a number between 0.5-1.5

Save the result to X2 and y2.



X2, y2 = make_blobs(n_samples=50, centers=[[4,4], [-2, -1], [1, 1], [10,4]], cluster_std=0.9)

Plot the scatter plot of the randomly generated data.


plt.scatter(X2[:, 0], X2[:, 1], marker='.') 

Output: a scatter plot of the randomly generated points.

Agglomerative Clustering

We will start by clustering the random data points we just created.

The AgglomerativeClustering class will require two inputs:

  • n_clusters: The number of clusters to form as well as the number of centroids to generate.

  • Value will be: 4

  • linkage: Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.

  • Value will be: 'complete'

  • Note: It is recommended you try everything with 'average' as well (the example below uses 'average', so try 'complete' to compare)


Save the result to a variable called agglom


agglom = AgglomerativeClustering(n_clusters = 4, linkage = 'average')

Fit the model with X2 and y2 from the generated data above.


agglom.fit(X2,y2)

Output:

AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='average', memory=None,
            n_clusters=4, pooling_func=<function mean at 0x2b3e3efb8048>)
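The fitted model stores the cluster assignment of each point in its labels_ attribute, which the plotting code below relies on. A quick look, assuming the agglom object fitted above:


# Integer cluster index (0-3) assigned to each generated point.
print(agglom.labels_[:10])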

Run the following code to show the clustering!

Remember to read the code and comments to gain more understanding on how the plotting works.


# Create a figure of size 6 inches by 4 inches.
plt.figure(figsize=(6,4))

# These two lines of code are used to scale the data points down,
# or else the data points will be scattered very far apart.

# Create a minimum and maximum range of X2.
x_min, x_max = np.min(X2, axis=0), np.max(X2, axis=0)

# Rescale X2 to the range [0, 1] (min-max normalization).
X2 = (X2 - x_min) / (x_max - x_min)

# This loop displays all of the datapoints.
for i in range(X2.shape[0]):
    # Plot each point as the text of its original blob label (y2[i]),
    # color coded by its agglomerative cluster via a colormap (plt.cm.Spectral).
    plt.text(X2[i, 0], X2[i, 1], str(y2[i]),
             color=plt.cm.Spectral(agglom.labels_[i] / 10.),
             fontdict={'weight': 'bold', 'size': 9})
    
# Remove the x ticks, y ticks, x and y axis
plt.xticks([])
plt.yticks([])
plt.axis('off')

# Display the plot
plt.show()

# Display a scatter plot of the (rescaled) data points without cluster coloring.
plt.scatter(X2[:, 0], X2[:, 1], marker='.')