# K-Means Clustering Homework Help | What is K-Means Clustering?

**K-Means Clustering**

**Import the following libraries:**

random

numpy as np

matplotlib.pyplot as plt

KMeans from sklearn.cluster

make_blobs from sklearn.datasets

Also run **%matplotlib inline** since we will be plotting in this section.

```
#Import Libraries
import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
%matplotlib inline
```

**Generating Random Data**

So we will be creating our own dataset!

First we need to set a random seed. Use **numpy's random.seed()** function, with the seed set to **0**.

ex.

`np.random.seed(0)`

Next we will be making *random clusters* of points by using the **make_blobs** function. **make_blobs** can take in many inputs, but we will be using these specific ones.
__Input__

**n_samples**: The total number of points equally divided among clusters. Value will be: 5000

**centers**: The number of centers to generate, or the fixed center locations. Value will be: [[4, 4], [-2, -1], [2, -3], [1, 1]]

**cluster_std**: The standard deviation of the clusters. Value will be: 0.9

__Output__

**X**: Array of shape [n_samples, n_features] (the feature matrix). The generated samples.

**y**: Array of shape [n_samples] (the response vector). The integer labels for cluster membership of each sample.

`X, y = make_blobs(n_samples=5000, centers=[[4,4], [-2, -1], [2, -3], [1, 1]], cluster_std=0.9)`
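As a quick sanity check (a sketch; `make_blobs` defaults to 2 features per sample), the shapes should match the output description above:

```python
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=5000,
                  centers=[[4, 4], [-2, -1], [2, -3], [1, 1]],
                  cluster_std=0.9)

# 5000 samples, 2 features each (make_blobs defaults to n_features=2)
print(X.shape)  # (5000, 2)
# One integer label per sample, one label value per generating center (0-3)
print(y.shape)  # (5000,)
```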

Display the scatter plot of the randomly generated data.

`plt.scatter(X[:, 0], X[:, 1], marker='.')`

Output: *(a scatter plot of the randomly generated points)*

**Setting up K-means**

Now that we have our random data, let's set up our K-Means Clustering.

The KMeans class has many parameters that can be used, but we will be using these three:

**init**: Initialization method of the centroids. Value will be: "k-means++"

k-means++ selects initial cluster centers for k-means clustering in a smart way to speed up convergence.

**n_clusters**: The number of clusters to form as well as the number of centroids to generate. Value will be: 4 (since we have 4 centers)

**n_init**: Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia. Value will be: 12
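To make the "smart" k-means++ seeding concrete, here is a minimal sketch of the idea (illustrative only; `kmeans_pp_init` is a made-up helper, not scikit-learn's actual implementation): each new center is sampled with probability proportional to the squared distance to the nearest center chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    # Sketch of k-means++ seeding (illustrative, not sklearn's exact code).
    rng = np.random.default_rng(seed)
    # First center: one data point chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Next center: sampled with probability proportional to d2,
        # so points far from all existing centers are favored.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Seeding centers far apart this way tends to reduce both the number of iterations needed and the chance of converging to a bad local optimum.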

Initialize K-Means with these parameters, saving the model as **k_means**.

`k_means = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)`

Now let's fit the K-Means model with the feature matrix we created above, **X**

`k_means.fit(X)`

Output (the exact repr depends on your scikit-learn version):

```
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=12, n_jobs=1, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
```

Now let's grab the labels for each point in the model using the KMeans **.labels_** attribute and save it as **k_means_labels**

`k_means_labels = k_means.labels_`

We will also get the coordinates of the cluster centers using the KMeans **.cluster_centers_** attribute and save it as **k_means_cluster_centers**

`k_means_cluster_centers = k_means.cluster_centers_`
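Beyond `.labels_` and `.cluster_centers_`, a fitted KMeans model can also assign brand-new points to the nearest learned centroid with `.predict()`. A self-contained sketch (the dataset and query points here are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two well-separated blobs (illustrative data, not the lab's dataset)
X, _ = make_blobs(n_samples=500, centers=[[4, 4], [-2, -1]],
                  cluster_std=0.5, random_state=0)
k_means = KMeans(init="k-means++", n_clusters=2, n_init=12)
k_means.fit(X)

# New points near each generating center fall into different clusters
labels = k_means.predict(np.array([[4.0, 4.0], [-2.0, -1.0]]))
```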

**Creating the Visual Plot**

So now that we have the random data generated and the KMeans model initialized, let's plot them and see what it looks like!

Please read through the code and comments to understand how to plot the model.

```
# Initialize the plot with the specified dimensions.
fig = plt.figure(figsize=(6, 4))
# Colors uses a color map, which will produce an array of colors based on
# the number of labels there are. We use set(k_means_labels) to get the
# unique labels.
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means_labels))))
# Create a plot with a black background (black so that we can see each
# point's connection to its centroid).
ax = fig.add_subplot(1, 1, 1, facecolor='black')
# For loop that plots the data points and centroids.
# k will range from 0-3, which will match the possible clusters that each
# data point is in.
for k, col in zip(range(len(k_means_cluster_centers)), colors):
    # Create a boolean mask over all data points, where the data points
    # that are in the cluster (ex. cluster 0) are labeled as True, else
    # they are labeled as False.
    my_members = (k_means_labels == k)
    # Define the centroid, or cluster center.
    cluster_center = k_means_cluster_centers[k]
    # Plot the data points with color col.
    ax.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.')
    # Plot the centroid with the same color, but with a darker outline.
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
# Title of the plot
ax.set_title('KMeans')
# Remove x-axis ticks
ax.set_xticks(())
# Remove y-axis ticks
ax.set_yticks(())
# Show the plot
plt.show()
# Display the scatter plot from above for comparison.
plt.scatter(X[:, 0], X[:, 1], marker='.')
```

Output: *(the clustered points colored by cluster with their centroids, followed by the original scatter plot for comparison)*

**Clustering Iris Data**

Import the following libraries:

- Axes3D from mpl_toolkits.mplot3d

- KMeans from sklearn.cluster

- load_iris from sklearn.datasets

*Note: It is presumed that numpy and matplotlib.pyplot are both imported as np and plt respectively from previous imports. If that is not the case, please import them!*

```
#import libraries
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
```

Then we will set the **random seed** and the **centers** for **K-means**.

```
np.random.seed(5)
centers = [[1, 1], [-1, -1], [1, -1]]
```

Using the **load_iris()** function, declare the iris dataset as the variable **iris**

`iris = load_iris()`

Also declare **X** as the iris **data** component, and **y** as the iris **target** component

```
X = iris.data
y = iris.target
```
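As a quick check before clustering (a sketch), the iris data has 150 samples with 4 features each, and three species labels:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

# 150 flowers, 4 measurements each (sepal/petal length and width)
print(X.shape)        # (150, 4)
# Three species, encoded as the integers 0, 1, 2
print(len(set(y)))    # 3
```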

Now let's run the rest of the code and see what **K-Means produces!**

```
estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1,
                                              init='random')}
fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = fig.add_subplot(111, projection='3d')
    est.fit(X)
    labels = est.labels_
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))
    ax.view_init(elev=48, azim=134)
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    ax.set_zticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    fignum = fignum + 1
# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = fig.add_subplot(111, projection='3d')
for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)
ax.view_init(elev=48, azim=134)
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
```

Output: *(four 3D scatter plots: the three clusterings, then the ground truth)*

The following **plots** (1-3) show the different **end results** you obtain by using different **initialization processes**. **Plot 4** shows the ground-truth labels; it is clear that **K-means** is **heavily reliant** on the **initialization** of the **centroids**.
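One way to quantify this reliance on initialization (a sketch, not part of the original lab): compare the final inertia of a single random initialization against k-means++ with several restarts.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# One run from a random initialization can get stuck in a poor local optimum...
bad = KMeans(n_clusters=3, init="random", n_init=1, random_state=0).fit(X)
# ...while k-means++ with 12 restarts keeps the best (lowest-inertia) run.
good = KMeans(n_clusters=3, init="k-means++", n_init=12, random_state=0).fit(X)

# Lower inertia means tighter clusters; the multi-restart run is never worse.
print(good.inertia_ <= bad.inertia_)
```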

**Hierarchical Clustering**
We will be looking at the next clustering technique, **Agglomerative Hierarchical Clustering**. Remember that agglomerative is the bottom-up approach.
In this lab, we will be looking at Agglomerative clustering, which is more popular than Divisive clustering.
We will also be using Complete Linkage as the linkage criterion.
*NOTE: You can also try using Average Linkage wherever Complete Linkage would be used to see the difference!*
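To make the linkage criteria concrete (a sketch with made-up points; `cdist` is from scipy): complete linkage measures the distance between two clusters as their *farthest* pair of points, while average linkage uses the *mean* over all pairs.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny hypothetical clusters on a line
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[4.0, 0.0], [6.0, 0.0]])

d = cdist(a, b)     # all pairwise distances: [[4, 6], [3, 5]]
complete = d.max()  # complete linkage: farthest pair -> 6.0
average = d.mean()  # average linkage: mean of all pairs -> 4.5
single = d.min()    # single linkage, for comparison -> 3.0
```

AgglomerativeClustering's `linkage='complete'` and `linkage='average'` apply exactly these rules when deciding which pair of clusters to merge next.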

Import Libraries:

- numpy as np

- ndimage from scipy

- hierarchy from scipy.cluster

- distance_matrix from scipy.spatial

- pyplot as plt from matplotlib

- manifold and datasets from sklearn

- AgglomerativeClustering from sklearn.cluster

- make_blobs from sklearn.datasets

Also run **%matplotlib inline** if it wasn't run already.

```
#import libraries
import numpy as np
from scipy import ndimage
from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix
from matplotlib import pyplot as plt
from sklearn import manifold, datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
%matplotlib inline
```

**Generating Random Data**
We will be generating another set of data using the **make_blobs** class once again. This time you will input your own values!
Input these parameters into make_blobs:

**n_samples**: The total number of points equally divided among clusters. Choose a number from 10-1500

**centers**: The number of centers to generate, or the fixed center locations. Choose arrays of x,y coordinates for generating the centers. Have 1-10 centers (ex. centers=[[1,1], [2,5]])

**cluster_std**: The standard deviation of the clusters. The larger the number, the more spread out each cluster's points will be. Choose a number between 0.5-1.5

Save the result to **X2** and **y2**.

```
X2, y2 = make_blobs(n_samples=50, centers=[[4,4], [-2, -1], [1, 1], [10,4]], cluster_std=0.9)
```

Plot the scatter plot of the randomly generated data.

`plt.scatter(X2[:, 0], X2[:, 1], marker='.')`

Output: *(a scatter plot of the randomly generated points)*

**Agglomerative Clustering**

We will start by clustering the random data points we just created.

The **AgglomerativeClustering** class will require two inputs:

**n_clusters**: The number of clusters to form as well as the number of centroids to generate. Value will be: 4

**linkage**: Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion. Value will be: 'complete'

**Note**: It is recommended you try everything with 'average' as well

Save the result to a variable called **agglom**

`agglom = AgglomerativeClustering(n_clusters = 4, linkage = 'complete')`

Fit the model with **X2** from the generated data above. (A **y** argument such as **y2** is accepted by `fit` but ignored, since clustering is unsupervised.)

`agglom.fit(X2)`

Output:

```
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='complete', memory=None,
            n_clusters=4, pooling_func=<function mean at 0x2b3e3efb8048>)
```
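After fitting, each point's cluster assignment lives in the `.labels_` attribute, just as with KMeans. A self-contained sketch (its own small dataset, with `random_state` added for reproducibility):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X2, y2 = make_blobs(n_samples=50,
                    centers=[[4, 4], [-2, -1], [1, 1], [10, 4]],
                    cluster_std=0.9, random_state=0)
agglom = AgglomerativeClustering(n_clusters=4, linkage='complete')
agglom.fit(X2)

# One label per point; agglomerative clustering always yields exactly
# n_clusters non-empty clusters.
labels = agglom.labels_
print(len(labels))       # 50
print(len(set(labels)))  # 4
```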

Run the following code to show the clustering!

Remember to read the code and comments to gain more understanding on how the plotting works.

```
# Create a figure of size 6 inches by 4 inches.
plt.figure(figsize=(6, 4))
# These two lines of code are used to scale the data points down,
# or else the data points will be scattered very far apart.
# Find the minimum and maximum of X2 along each feature.
x_min, x_max = np.min(X2, axis=0), np.max(X2, axis=0)
# Min-max scale X2 into the range [0, 1].
X2 = (X2 - x_min) / (x_max - x_min)
# This loop displays all of the datapoints.
for i in range(X2.shape[0]):
    # Replace the data points with their respective cluster value
    # (ex. 0), color coded with a colormap (plt.cm.Spectral)
    plt.text(X2[i, 0], X2[i, 1], str(y2[i]),
             color=plt.cm.Spectral(agglom.labels_[i] / 10.),
             fontdict={'weight': 'bold', 'size': 9})
# Remove the x ticks, y ticks, and the x and y axes
plt.xticks([])
plt.yticks([])
plt.axis('off')
# Display the plot
plt.show()
# Display the plot of the original data before clustering
plt.scatter(X2[:, 0], X2[:, 1], marker='.')
```