
K-Means Clustering


Import the following libraries:

  • random

  • numpy as np

  • matplotlib.pyplot as plt

  • KMeans from sklearn.cluster

  • make_blobs from sklearn.datasets (older scikit-learn versions exposed this under sklearn.datasets.samples_generator)

Also run %matplotlib inline since we will be plotting in this section.


#Import Libraries
import random 
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator in older scikit-learn versions
%matplotlib inline


Generating Random Data

So we will be creating our own dataset!

First we need to set a random seed. Use numpy's random.seed() function, with the seed set to 0.

ex.

np.random.seed(0)

Next we will make random clusters of points using the make_blobs function. make_blobs can take many inputs, but we will be using these specific ones. Input:

  • n_samples: The total number of points equally divided among clusters.

  • Value will be: 5000

  • centers: The number of centers to generate, or the fixed center locations.

  • Value will be: [[4, 4], [-2, -1], [2, -3],[1,1]]

  • cluster_std: The standard deviation of the clusters.

  • Value will be: 0.9

Output

  • X: Array of shape [n_samples, n_features]. (Feature Matrix)

  • The generated samples.

  • y: Array of shape [n_samples]. (Response Vector)

  • The integer labels for cluster membership of each sample.


X, y = make_blobs(n_samples=5000, centers=[[4,4], [-2, -1], [2, -3], [1, 1]], cluster_std=0.9)
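A quick look at what make_blobs returned (a small sanity check, assuming the call above):


print(X.shape)       # (5000, 2): 5000 points with 2 features each
print(np.unique(y))  # [0 1 2 3]: one integer label per center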

Display the scatter plot of the randomly generated data.


plt.scatter(X[:, 0], X[:, 1], marker='.')

Output: a scatter plot of the randomly generated points.

Setting up K-means

Now that we have our random data, let's set up our K-Means Clustering.

The KMeans class has many parameters that can be used, but we will be using these three:

  • init: Initialization method of the centroids.

  • Value will be: "k-means++"

  • k-means++: Selects initial cluster centers for k-means clustering in a smart way to speed up convergence (see the sketch after this list).

  • n_clusters: The number of clusters to form as well as the number of centroids to generate.

  • Value will be: 4 (since we have 4 centers)

  • n_init: Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.

  • Value will be: 12
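The seeding that "k-means++" performs can be sketched in a few lines (a minimal illustration of the idea, not scikit-learn's actual implementation; the helper name kmeanspp_init is made up for this example, and it reuses the np alias imported earlier): the first center is a random data point, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far.


def kmeanspp_init(X, n_clusters, rng=None):
    # Pick the first center uniformly at random, then pick each new center
    # with probability proportional to its squared distance from the nearest
    # center already chosen, so far-away points are favored.
    if rng is None:
        rng = np.random.default_rng(0)
    centers = [X[rng.integers(len(X))]]
    while len(centers) < n_clusters:
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)  # squared distance to nearest center
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)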

Initialize K-Means with these parameters, saving the resulting model object as k_means.


k_means = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)

Now let's fit the K-Means model with the feature matrix we created above, X


k_means.fit(X)

Output: KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=4, n_init=12, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
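Once fitted, the model can assign new points to their nearest learned centroid and reports its inertia (the sum of squared distances from each sample to its closest center). A quick check, assuming the k_means object above (the two query points are made up):


# Assign two hypothetical new points to their nearest learned centroid.
print(k_means.predict([[4, 4], [-2, -1]]))

# Sum of squared distances of samples to their closest cluster center.
print(k_means.inertia_)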


Now let's grab the labels for each point in the model using KMeans' .labels_ attribute and save it as k_means_labels


k_means_labels = k_means.labels_

We will also get the coordinates of the cluster centers using KMeans' .cluster_centers_ and save it as k_means_cluster_centers


k_means_cluster_centers = k_means.cluster_centers_
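As a quick sanity check (assuming the two variables just created), there is one label per sample and one center per cluster:


print(k_means_labels.shape)           # (5000,): one cluster index per point
print(k_means_cluster_centers.shape)  # (4, 2): one (x, y) center per cluster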

Creating the Visual Plot

So now that we have the random data generated and the KMeans model initialized, let's plot them and see what it looks like!

Please read through the code and comments to understand how to plot the model.


# Initialize the plot with the specified dimensions.
fig = plt.figure(figsize=(6, 4))

# Colors uses a color map, which will produce an array of colors based on
# the number of labels there are. We use set(k_means_labels) to get the
# unique labels.
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means_labels))))

# Create a plot with a black background (black makes it easier to see each
# point's connection to its centroid).
ax = fig.add_subplot(1, 1, 1, facecolor = 'black')

# For loop that plots the data points and centroids.
# k will range from 0-3, which will match the possible clusters that each
# data point is in.
for k, col in zip(range(len(k_means_cluster_centers)), colors):

    # Create a list of all data points, where the data points that are
    # in the current cluster (ex. cluster 0) are labeled as True, else
    # they are labeled as False.
    my_members = (k_means_labels == k)
    
    # Define the centroid, or cluster center.
    cluster_center = k_means_cluster_centers[k]
    
    # Plots the datapoints with color col.
    ax.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.')
    
    # Plots the centroids with specified color, but with a darker outline
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)

# Title of the plot
ax.set_title('KMeans')

# Remove x-axis ticks
ax.set_xticks(())

# Remove y-axis ticks
ax.set_yticks(())

# Show the plot
plt.show()

# Display the scatter plot from above for comparison.
plt.scatter(X[:, 0], X[:, 1], marker='.')

Output: the data points colored by cluster with the four centroids outlined in black, followed by the original unclustered scatter plot for comparison.

Clustering Iris Data

Import the following libraries:


  • Axes3D from mpl_toolkits.mplot3d

  • KMeans from sklearn.cluster

  • load_iris from sklearn.datasets


Note: It is presumed that numpy and matplotlib.pyplot are both imported as np and plt respectively from previous imports. If that is not the case, please import them!


#import libraries
from mpl_toolkits.mplot3d import Axes3D 
from sklearn.cluster import KMeans 
from sklearn.datasets import load_iris

Then we will set the random seed and the centers for K-means.


np.random.seed(5)
centers = [[1, 1], [-1, -1], [1, -1]]

Using the load_iris() function, declare the iris dataset as the variable iris.


iris = load_iris()

Also declare X as the iris data component and y as the iris target component.


X = iris.data 
y = iris.target
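A quick look at what we just loaded (the iris set has 150 samples, each with 4 measurements, belonging to 3 species):


print(X.shape)             # (150, 4)
print(iris.feature_names)  # sepal length/width and petal length/width (cm)
print(np.unique(y))        # [0 1 2]: setosa, versicolor, virginica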

Now let's run the rest of the code and see what K-Means produces!


estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1,
                                              init='random')}

fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

    plt.cla()
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))

    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    fignum = fignum + 1

# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

plt.cla()

for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()

Output: four 3-D scatter plots of the iris data (petal width, sepal length, petal length): one for each of the three estimators and one for the ground-truth labels.

Plots 1-3 show the different end results you obtain by using different initialization processes. Plot 4 shows what the answer should be; however, it is clear that K-means is heavily reliant on the initialization of the centroids.
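To see this sensitivity numerically, you can compare the inertia reached from a single random initialization with the inertia reached by k-means++ with several restarts (a small illustrative check on the iris feature matrix X; the random_state value is arbitrary, and a single random start is not always worse):


# One random start vs. k-means++ with 12 restarts; lower inertia is better.
single_random = KMeans(n_clusters=3, init='random', n_init=1, random_state=1).fit(X)
multi_kmeanspp = KMeans(n_clusters=3, init='k-means++', n_init=12, random_state=1).fit(X)

print(single_random.inertia_)
print(multi_kmeanspp.inertia_)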


Hierarchical Clustering

We will now look at the next clustering technique, which is Agglomerative Hierarchical Clustering. Remember that agglomerative is the bottom-up approach. In this lab, we will be looking at Agglomerative clustering, which is more popular than Divisive clustering. We will also be using Complete Linkage as the linkage criterion. NOTE: You can also try using Average Linkage wherever Complete Linkage would be used to see the difference!

Import Libraries:

  • numpy as np

  • ndimage from scipy

  • hierarchy from scipy.cluster

  • distance_matrix from scipy.spatial

  • pyplot as plt from matplotlib

  • manifold from sklearn

  • datasets from sklearn

  • AgglomerativeClustering from sklearn.cluster

  • make_blobs from sklearn.datasets (sklearn.datasets.samples_generator in older versions)

Also run %matplotlib inline if that wasn't run already.


#import libraries
import numpy as np 
from scipy import ndimage 
from scipy.cluster import hierarchy 
from scipy.spatial import distance_matrix 
from matplotlib import pyplot as plt 
from sklearn import manifold, datasets 
from sklearn.cluster import AgglomerativeClustering 
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator in older scikit-learn versions
%matplotlib inline
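Before generating data, here is a tiny illustration of the two linkage criteria using the hierarchy module imported above (the five-point array is made up for this example): complete linkage measures the distance between two clusters by their farthest pair of points, while average linkage uses the mean of all pairwise distances.


# Tiny made-up dataset: two tight pairs and one far-away point.
pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [12, 0]])

# Merge clusters by the distance of their farthest pair of points.
Z_complete = hierarchy.linkage(pts, method='complete')

# Merge clusters by the mean of all pairwise distances.
Z_average = hierarchy.linkage(pts, method='average')

# The merge heights in the two dendrograms show where the criteria differ.
hierarchy.dendrogram(Z_complete)
plt.show()
hierarchy.dendrogram(Z_average)
plt.show()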


Generating Random Data

We will be generating another set of data using the make_blobs function once again. This time you will input your own values! Input these parameters into make_blobs:

  • n_samples: The total number of points equally divided among clusters.

  • Choose a number from 10-1500

  • centers: The number of centers to generate, or the fixed center locations.

  • Choose arrays of x,y coordinates for generating the centers. Have 1-10 centers (ex. centers=[[1,1], [2,5]])

  • cluster_std: The standard deviation of the clusters. The larger the number, the more spread out the points within each cluster

  • Choose a number between 0.5-1.5

Save the result to X2 and y2.



X2, y2 = make_blobs(n_samples=50, centers=[[4,4], [-2, -1], [1, 1], [10,4]], cluster_std=0.9)

Plot the scatter plot of the randomly generated data.


plt.scatter(X2[:, 0], X2[:, 1], marker='.') 

Output: a scatter plot of the randomly generated points.

Agglomerative Clustering

We will start by clustering the random data points we just created.

The AgglomerativeClustering class will require two inputs:

  • n_clusters: The number of clusters to form as well as the number of centroids to generate.

  • Value will be: 4

  • linkage: Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.

  • Value will be: 'complete'

  • Note: It is recommended you try everything with 'average' as well (the example below uses 'average', so try 'complete' to compare)


Save the result to a variable called agglom


agglom = AgglomerativeClustering(n_clusters = 4, linkage = 'average')

Fit the model with X2 and y2 from the generated data above.


agglom.fit(X2,y2)

Output:

AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='average', memory=None,
            n_clusters=4, pooling_func=<function mean at 0x2b3e3efb8048>)
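The fitted model stores the cluster assignment of each point in its labels_ attribute, which the plotting code below relies on. A quick look, assuming the agglom object fitted above:


# Integer cluster index (0-3) assigned to each generated point.
print(agglom.labels_[:10])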

Run the following code to show the clustering!

Remember to read the code and comments to gain more understanding on how the plotting works.


# Create a figure of size 6 inches by 4 inches.
plt.figure(figsize=(6,4))

# These two lines of code are used to scale the data points down,
# or else the data points will be scattered very far apart.

# Create a minimum and maximum range of X2.
x_min, x_max = np.min(X2, axis=0), np.max(X2, axis=0)

# Rescale X2 to the range [0, 1] (min-max normalization).
X2 = (X2 - x_min) / (x_max - x_min)

# This loop displays all of the datapoints.
for i in range(X2.shape[0]):
    # Plot each point as the text of its original blob label (y2[i]),
    # color coded by its agglomerative cluster via a colormap (plt.cm.Spectral).
    plt.text(X2[i, 0], X2[i, 1], str(y2[i]),
             color=plt.cm.Spectral(agglom.labels_[i] / 10.),
             fontdict={'weight': 'bold', 'size': 9})
    
# Remove the x ticks, y ticks, x and y axis
plt.xticks([])
plt.yticks([])
plt.axis('off')

# Display the plot
plt.show()

# Display a scatter plot of the (rescaled) data points without cluster coloring.
plt.scatter(X2[:, 0], X2[:, 1], marker='.')