# K-Means Clustering Homework Help | What is K-Means Clustering?

**K-Means Clustering**

**Import the following libraries:**

random

numpy as np

matplotlib.pyplot as plt

KMeans from sklearn.cluster

make_blobs from sklearn.datasets

Also run **%matplotlib inline** since we will be plotting in this section.

```
#Import Libraries
import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
%matplotlib inline
```

**Generating Random Data**

So we will be creating our own dataset!

First we need to set up a random seed. Use **numpy's random.seed()** function, where the seed will be set to **0**. make_blobs draws from numpy's random generator, so Python's built-in random.seed() would not make the data reproducible.

ex.

`np.random.seed(0)`

Next we will be making *random clusters* of points by using the **make_blobs** function. The **make_blobs** function can take in many inputs, but we will be using these specific ones.
__Input__

**n_samples**: The total number of points equally divided among clusters. Value will be: 5000

**centers**: The number of centers to generate, or the fixed center locations. Value will be: [[4, 4], [-2, -1], [2, -3], [1, 1]]

**cluster_std**: The standard deviation of the clusters. Value will be: 0.9

__Output__

**X**: Array of shape [n_samples, n_features]. (Feature Matrix) The generated samples.

**y**: Array of shape [n_samples]. (Response Vector) The integer labels for cluster membership of each sample.

`X, y = make_blobs(n_samples=5000, centers=[[4,4], [-2, -1], [2, -3], [1, 1]], cluster_std=0.9)`
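A quick way to check what make_blobs returned is to look at the shapes of the two outputs and at how many points landed in each cluster. The sketch below is only an illustration: the names `X_demo` and `y_demo` and the `random_state=0` argument are assumptions added here so the result is reproducible without touching the global seed.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same inputs as in the text; random_state is an added assumption.
centers = [[4, 4], [-2, -1], [2, -3], [1, 1]]
X_demo, y_demo = make_blobs(n_samples=5000, centers=centers,
                            cluster_std=0.9, random_state=0)

print(X_demo.shape)         # (5000, 2): one row per point, one column per coordinate
print(y_demo.shape)         # (5000,): one integer cluster label per point
print(np.bincount(y_demo))  # [1250 1250 1250 1250]: points are split equally
```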

Display the scatter plot of the randomly generated data.

`plt.scatter(X[:, 0], X[:, 1], marker='.')`

Output:

**Setting up K-means**

Now that we have our random data, let's set up our K-Means Clustering.

The KMeans class has many parameters that can be used, but we will be using these three:

**init**: Initialization method of the centroids. Value will be: "k-means++"

k-means++: Selects initial cluster centers for k-means clustering in a smart way to speed up convergence.

**n\_clusters**: The number of clusters to form as well as the number of centroids to generate. Value will be: 4 (since we have 4 centers)

**n\_init**: Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n\_init consecutive runs in terms of inertia. Value will be: 12

Initialize K-Means with these parameters, where the output parameter is called **k_means**.

`k_means = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)`

Now let's fit the K-Means model with the feature matrix we created above, **X**

`k_means.fit(X)`

Output: KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=4, n_init=12, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)

Now let's grab the labels for each point in the model using KMeans' **.labels_** attribute and save it as **k_means_labels**

`k_means_labels = k_means.labels_`

We will also get the coordinates of the cluster centers using KMeans' **.cluster_centers_** attribute and save it as **k_means_cluster_centers**

`k_means_cluster_centers = k_means.cluster_centers_`
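To get a feel for what these attributes hold, the following sketch fits the same kind of model on freshly generated blobs and compares the fitted centroids with the centers used to generate the data. It is a self-contained illustration, not part of the lab: `X_demo`, `km`, `gaps`, and the `random_state=0` arguments are assumptions introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

true_centers = np.array([[4, 4], [-2, -1], [2, -3], [1, 1]])
X_demo, _ = make_blobs(n_samples=5000, centers=true_centers,
                       cluster_std=0.9, random_state=0)

km = KMeans(init="k-means++", n_clusters=4, n_init=12, random_state=0).fit(X_demo)

print(km.labels_.shape)           # one cluster label per sample
print(km.cluster_centers_.shape)  # one (x, y) row per cluster
print(km.inertia_)                # sum of squared distances to the nearest centroid

# For each generating center, the distance to the closest fitted centroid.
diffs = true_centers[:, None, :] - km.cluster_centers_[None, :, :]
gaps = np.linalg.norm(diffs, axis=2).min(axis=1)
print(gaps)
```

Note that the cluster numbering in `labels_` is arbitrary, so cluster 0 of the model need not correspond to the first generating center; that is why the comparison above matches each center to its nearest centroid.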

**Creating the Visual Plot**

So now that we have the random data generated and the KMeans model initialized, let's plot them and see what it looks like!

Please read through the code and comments to understand how to plot the model.

```
# Initialize the plot with the specified dimensions.
fig = plt.figure(figsize=(6, 4))
# Colors uses a color map, which will produce an array of colors based on
# the number of labels there are. We use set(k_means_labels) to get the
# unique labels.
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means_labels))))
# Create a plot with a black background (the background is black so that
# we can see each point's connection to its centroid).
# facecolor replaces the axisbg argument, which newer matplotlib removed.
ax = fig.add_subplot(1, 1, 1, facecolor='black')
# For loop that plots the data points and centroids.
# k will range from 0-3, which will match the possible clusters that each
# data point is in.
for k, col in zip(range(len(k_means_cluster_centers)), colors):
    # Create a boolean mask over all data points, where the data points that
    # are in the cluster (ex. cluster 0) are labeled as true, else they are
    # labeled as false.
    my_members = (k_means_labels == k)
    # Define the centroid, or cluster center.
    cluster_center = k_means_cluster_centers[k]
    # Plots the datapoints with color col.
    ax.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.')
    # Plots the centroids with specified color, but with a darker outline
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
# Title of the plot
ax.set_title('KMeans')
# Remove x-axis ticks
ax.set_xticks(())
# Remove y-axis ticks
ax.set_yticks(())
# Show the plot
plt.show()
# Display the scatter plot from above for comparison.
plt.scatter(X[:, 0], X[:, 1], marker='.')
```

Output:

**Clustering Iris Data**

Import the following libraries:

- Axes3D from mpl_toolkits.mplot3d

- KMeans from sklearn.cluster

- load_iris from sklearn.datasets

*Note: It is presumed that numpy and matplotlib.pyplot are both imported as np and plt respectively from previous imports. If that is not the case, please import them!*

```
#import libraries
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
```

Then we will set the **random seed** and the **centers** for **K-means**.

```
np.random.seed(5)
centers = [[1, 1], [-1, -1], [1, -1]]
```

Using the **load_iris()** function, declare the iris dataset as the variable **iris**

`iris = load_iris()`

Also declare **X** as the **iris' data component**, and y as **iris' target component**

```
X = iris.data
y = iris.target
```

Now let's run the rest of the code and see what **K-Means produces!**

```
estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1,
                                              init='random')}
fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    # Create the 3D axes through the figure; newer matplotlib no longer
    # attaches Axes3D(fig, ...) to the figure automatically.
    ax = fig.add_subplot(111, projection='3d', elev=48, azim=134)
    est.fit(X)
    labels = est.labels_
    # float replaces np.float, which newer NumPy removed.
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))
    # xaxis/yaxis/zaxis replace the older w_xaxis/w_yaxis/w_zaxis aliases.
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    fignum = fignum + 1
# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = fig.add_subplot(111, projection='3d', elev=48, azim=134)
# Write each species name above the mean position of its points.
for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
```

output:

**Plots 1-3** show the different **end results** you obtain by using different **initialization processes**. **Plot 4** shows the ground truth, that is, what the answer should be. Comparing them makes it clear that **K-means** is **heavily reliant** on the **initialization** of the **centroids**.
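The dependence on initialization can also be seen numerically, without plots. The sketch below runs K-means on the iris data several times, each time from a single random initialization, and records the inertia of every run. The variable names and the choice of ten seeds are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_iris = load_iris().data

# One run per seed, each starting from a single random initialization.
inertias = []
for seed in range(10):
    est = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    est.fit(X_iris)
    inertias.append(est.inertia_)

print(np.round(inertias, 2))
print("best:", min(inertias), "worst:", max(inertias))
```

If the runs do not all reach the same inertia, some of them stopped in a worse local optimum. Setting n_init to a larger value makes KMeans do this kind of repetition internally and keep the run with the lowest inertia.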

**Hierarchical Clustering**
We will be looking at the next clustering technique, which is **Agglomerative Hierarchical Clustering**. Remember that agglomerative is the bottom up approach.
In this lab, we will be looking at Agglomerative clustering, which is more popular than Divisive clustering.
We will also be using Complete Linkage as the Linkage Criteria.
*NOTE: You can also try using Average Linkage wherever Complete Linkage would be used to see the difference!*

Import Libraries:

- numpy as np

- ndimage from scipy

- hierarchy from scipy.cluster

- distance_matrix from scipy.spatial

- pyplot as plt from matplotlib

- manifold from sklearn

- datasets from sklearn

- AgglomerativeClustering from sklearn.cluster

- make_blobs from sklearn.datasets

Also run **%matplotlib inline** if that wasn't run already.

```
#import libraries
import numpy as np
from scipy import ndimage
from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix
from matplotlib import pyplot as plt
from sklearn import manifold, datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
%matplotlib inline
```

**Generating Random Data**
We will be generating another set of data using the **make_blobs** function once again. This time you will input your own values!
Input these parameters into make_blobs:

**n_samples**: The total number of points equally divided among clusters. Choose a number from 10-1500

**centers**: The number of centers to generate, or the fixed center locations. Choose arrays of x,y coordinates for generating the centers. Have 1-10 centers (ex. centers=[[1,1], [2,5]])

**cluster_std**: The standard deviation of the clusters. The larger the number, the more spread out the points within each cluster. Choose a number between 0.5-1.5

Save the result to **X2** and **y2**.

```
X2, y2 = make_blobs(n_samples=50, centers=[[4,4], [-2, -1], [1, 1], [10,4]], cluster_std=0.9)
```

Plot the scatter plot of the randomly generated data.

`plt.scatter(X2[:, 0], X2[:, 1], marker='.') `

Output:

**Agglomerative Clustering**

We will start by clustering the random data points we just created.

The **AgglomerativeClustering** class will require two inputs:

**n_clusters**: The number of clusters to form. Value will be: 4

**linkage**: Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion. Value will be: 'complete' or 'average' (the call and output shown below use 'average')

**Note**: It is recommended you try everything with both 'complete' and 'average' to see the difference

Save the result to a variable called **agglom**

`agglom = AgglomerativeClustering(n_clusters = 4, linkage = 'average')`

Fit the model with **X2** and **y2** from the generated data above. Only **X2** is used for the clustering; **y2** is accepted by fit but ignored.

`agglom.fit(X2,y2)`

Output:

```
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
connectivity=None, linkage='average', memory=None,
n_clusters=4, pooling_func=<function mean at 0x2b3e3efb8048>)
```
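To compare the two linkage criteria side by side, a small sketch like the one below can help. It regenerates comparable data so that it runs on its own; `X_demo`, `sizes`, and `random_state=0` are assumptions introduced for this illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=50,
                       centers=[[4, 4], [-2, -1], [1, 1], [10, 4]],
                       cluster_std=0.9, random_state=0)

sizes = {}
for linkage in ("complete", "average"):
    model = AgglomerativeClustering(n_clusters=4, linkage=linkage).fit(X_demo)
    # labels_ holds one cluster index (0..3) per sample.
    sizes[linkage] = np.bincount(model.labels_)
    print(linkage, sizes[linkage])
```

The printed arrays are the cluster sizes under each criterion. When the blobs are well separated the two criteria tend to agree; differences usually show up where clusters overlap.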

Run the following code to show the clustering!

Remember to read the code and comments to gain more understanding on how the plotting works.

```
# Create a figure of size 6 inches by 4 inches.
plt.figure(figsize=(6,4))
# These two lines of code are used to scale the data points down,
# or else the data points will be scattered very far apart.
# Get the per-feature minimum and maximum of X2.
x_min, x_max = np.min(X2, axis=0), np.max(X2, axis=0)
# Min-max scale X2 so that every feature lies in the range [0, 1].
X2 = (X2 - x_min) / (x_max - x_min)
# This loop displays all of the datapoints.
for i in range(X2.shape[0]):
    # Draw each data point as its true label y2[i] (ex. 0), color coded
    # by its assigned cluster using a colormap (plt.cm.nipy_spectral,
    # which replaces the older plt.cm.spectral)
    plt.text(X2[i, 0], X2[i, 1], str(y2[i]),
             color=plt.cm.nipy_spectral(agglom.labels_[i] / 10.),
             fontdict={'weight': 'bold', 'size': 9})
# Remove the x ticks, y ticks, x and y axis
plt.xticks([])
plt.yticks([])
plt.axis('off')
# Display the plot
plt.show()
# Display the plot of the original data before clustering
plt.scatter(X2[:, 0], X2[:, 1], marker='.')
```

Output:

**Dendrogram**

Remember that a **distance matrix** contains the **distance from each point to every other point of a dataset**.
Use the function **distance_matrix**, which requires **two inputs**. Use the Feature Matrix, **X2**, as both inputs and save the distance matrix to a variable called **dist_matrix**
Remember that the distance values are symmetric, with a diagonal of 0's. This is one way of making sure your matrix is correct.
(print out dist_matrix to make sure it's correct)

```
dist_matrix = distance_matrix(X2,X2)
print(dist_matrix)
```
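Instead of inspecting the printed matrix by eye, the two properties mentioned above (symmetry and a zero diagonal) can be checked directly. This is a minimal sketch on a handful of random points; `points` and `D` are names made up for the example.

```python
import numpy as np
from scipy.spatial import distance_matrix

# A few random 2D points stand in for the feature matrix.
rng = np.random.default_rng(0)
points = rng.random((5, 2))

D = distance_matrix(points, points)
print(D.shape)                     # (5, 5): one row and one column per point
print(np.allclose(D, D.T))         # True: distance i->j equals distance j->i
print(np.allclose(np.diag(D), 0))  # True: each point is at distance 0 from itself
```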

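Next, we will save the dendrogram to a variable called **dendro**. In doing this, the dendrogram will also be displayed. The **dendrogram** function from hierarchy takes a linkage matrix **Z** as its parameter. **Z** can be computed with the **linkage** function from hierarchy, for example from the Feature Matrix **X2** with complete linkage:

`Z = hierarchy.linkage(X2, 'complete')`

`dendro = hierarchy.dendrogram(Z)`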
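`hierarchy.dendrogram` expects a linkage matrix of the kind that `hierarchy.linkage` returns. The sketch below shows one way to go from points to a linkage matrix and then to the dendrogram's leaf order, without drawing anything. The names `points`, `condensed`, `Z_demo`, and `dendro_demo` are assumptions for this example, and complete linkage is used to match the linkage criterion named in this section.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
points = rng.random((8, 2))

# linkage accepts condensed pairwise distances (length n*(n-1)/2).
condensed = pdist(points)
Z_demo = hierarchy.linkage(condensed, method="complete")
print(Z_demo.shape)  # (7, 4): n - 1 merges, 4 values describing each merge

# no_plot=True returns the dendrogram layout without drawing it.
dendro_demo = hierarchy.dendrogram(Z_demo, no_plot=True)
print(dendro_demo["leaves"])  # the order of the points along the axis
```

Each row of the linkage matrix records which two clusters were merged, the distance at which they merged, and the size of the resulting cluster.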

Output:

**Density-based Clustering**

We will be looking at the next clustering technique, which is **density-based clustering** with **DBSCAN**.

Import Libraries:

- numpy as np

- DBSCAN from sklearn.cluster

- make_blobs from sklearn.datasets

- StandardScaler from sklearn.preprocessing

- matplotlib.pyplot as plt

Also run **%matplotlib inline** if that wasn't run already.

```
#import libraries
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline
```

The function below will generate the data points and requires these inputs:

**centroidLocation**: Coordinates of the centroids that will generate the random data. Example: [[4,3], [2,-1], [-1,4]]

**numSamples**: The number of data points we want generated, split over the number of centroids (# of centroids defined in centroidLocation). Example: 1500

**clusterDeviation**: The standard deviation of the clusters. The larger the number, the more spread out the points in each cluster. Example: 0.5

```
def createDataPoints(centroidLocation, numSamples, clusterDeviation):
    # Create random data and store in feature matrix X and response vector y.
    X, y = make_blobs(n_samples=numSamples, centers=centroidLocation,
                      cluster_std=clusterDeviation)
    # Standardize features by removing the mean and scaling to unit variance
    X = StandardScaler().fit_transform(X)
    return X, y
```

The function below will generate the DBSCAN using the input data:

**epsilon**: A float that describes the maximum distance between two samples for them to be considered as in the same neighborhood. Example: 0.3

**minimumSamples**: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. Example: 7

```
def displayDBSCAN(epsilon, minimumSamples):
    # Initialize DBSCAN with specified epsilon and min. samples. Fit the model with the
    # feature matrix X (the global X returned by createDataPoints).
    db = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(X)
    # Create an array of booleans with the same shape as the labels from db.
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    # Set the mask to 'True' for core samples; border points and
    # outliers stay 'False'.
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_
    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    # Black color is removed and used for noise instead.
    # Remove repetition in labels by turning it into a set.
    unique_labels = set(labels)
    # Create colors for the clusters.
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    # Plot the points with colors
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = 'k'
        class_member_mask = (labels == k)
        # Plot the core samples of this cluster with large markers
        xy = X[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=14)
        # Plot the non-core samples (border points, or noise when k == -1)
        # with small markers
        xy = X[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=6)
    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.show()
```
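The same bookkeeping can be done without plotting, which makes it easier to see how DBSCAN splits the points into core samples, border points, and noise. This is a self-contained sketch under the example inputs given above; `X_demo`, the counter names, and `random_state=0` are assumptions added for the illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X_demo, _ = make_blobs(n_samples=1500, centers=[[4, 3], [2, -1], [-1, 4]],
                       cluster_std=0.5, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

db = DBSCAN(eps=0.3, min_samples=7).fit(X_demo)
labels = db.labels_

# Noise points get the label -1 and do not count as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
n_core = len(db.core_sample_indices_)
# Border points belong to a cluster but are not core samples themselves.
n_border = len(labels) - n_noise - n_core

print("clusters:", n_clusters)
print("core:", n_core, "border:", n_border, "noise:", n_noise)
```

Raising eps or lowering min_samples generally turns more points into core samples and leaves fewer labeled as noise.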

Use **createDataPoints** with the **3 inputs** and store the output into variables **X** and **y**.

`X, y = createDataPoints([[4,3], [2,-1], [-1,4]] , 1500, 0.5)`

`X, y`

Output:

(array([[-1.20481012, 0.8947502 ], [-1.33347483, 0.64553883], [ 0.63510302, -1.77670748], ..., [ 0.18916549, -1.41081505], [-1.11560064, 0.80583478], [-0.20255851, -1.16925803]]), array([2, 2, 1, ..., 1, 2, 1]))

`displayDBSCAN(0.3, 7)`

Output:

**Feature Selection**

In this lab exercise, you will learn how to use **Dimensionality Reduction** in the form of **Feature Selection** and **Feature Extraction**.

We will first be looking at Feature Selection with **VarianceThreshold**. VarianceThreshold is a useful tool for removing features whose variance falls below a threshold. It is a simple and basic Feature Selection method.

**Data Set**

Now we will be working with the **skulls dataset** once again. Using the **my_data** variable and **removeColumns** function (both defined below), create a variable called **X** which has the **row number and target columns dropped**.

`!pip install wget`

Output:

Collecting wget Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip Building wheels for collected packages: wget Running setup.py bdist_wheel for wget ... done Stored in directory: /home/dsxuser/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f Successfully built wget Installing collected packages: wget Successfully installed wget-3.2

**Read Data From GitHub**

```
import wget
link_to_data = 'https://github.com/tuliplab/mds/raw/master/Jupyter/data/skulls.csv'
DataSet = wget.download(link_to_data)
```

```
import pandas
my_data = pandas.read_csv("skulls.csv", delimiter=",")
```

```
# Remove the column containing the target name since it doesn't contain numeric values.
# Also remove the column that contains the row number
# axis=1 means we are removing columns instead of rows.
# Function takes in a pandas array and column numbers and returns a numpy array without
# the stated columns
def removeColumns(pandasArray, *column):
    # column is a tuple of positions, so convert it to a list for indexing.
    return pandasArray.drop(pandasArray.columns[list(column)], axis=1).values
```

`X = removeColumns(my_data, 0, 1)`
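To see what dropping columns by position does, here is a small sketch on a made-up table. The DataFrame `toy`, its column names, and the helper name `remove_columns` are hypothetical stand-ins, not the real skulls data.

```python
import pandas as pd

# Hypothetical stand-in: a row-number column, a non-numeric target
# column, and two numeric feature columns.
toy = pd.DataFrame({
    "row": [1, 2, 3],
    "epoch": ["a", "a", "b"],
    "f1": [131, 125, 131],
    "f2": [138, 131, 132],
})

def remove_columns(frame, *column):
    # columns[list(column)] looks up the labels at the given positions.
    return frame.drop(frame.columns[list(column)], axis=1).values

X_toy = remove_columns(toy, 0, 1)
print(X_toy.shape)  # (3, 2): only the two feature columns remain
print(X_toy)
```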

Now use the **target function** to obtain the **Response Vector** of **my_data** and store it as **y**

```
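def target(numpyArray, targetColumnIndex):
    # Map each distinct value in the target column to an integer label,
    # numbered in order of first appearance.
    target_dict = dict()
    target = list()
    count = -1
    # Use the data passed in (a pandas DataFrame) rather than the global my_data.
    values = numpyArray.values
    for i in range(len(values)):
        if values[i][targetColumnIndex] not in target_dict:
            count += 1
            target_dict[values[i][targetColumnIndex]] = count
        target.append(target_dict[values[i][targetColumnIndex]])
    return np.asarray(target)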
```

`y = target(my_data, 1)`
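The target function numbers the distinct values of a column in order of first appearance. pandas offers the same encoding through `pandas.factorize`, shown here as an alternative sketch; the series `epochs` and its values are invented for the example.

```python
import pandas as pd

# Hypothetical target column with three distinct values.
epochs = pd.Series(["early", "early", "middle", "late", "middle"])

codes, uniques = pd.factorize(epochs)
print(codes)          # [0 0 1 2 1]: integer label per row
print(list(uniques))  # ['early', 'middle', 'late']: value behind each label
```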

**Variance Feature Selection**

First import **VarianceThreshold** from sklearn.feature_selection

`from sklearn.feature_selection import VarianceThreshold`

Now let's instantiate **VarianceThreshold** as a variable called **sel**

`sel = VarianceThreshold()`

Now **VarianceThreshold** removes all **zero-variance features** by default, that is, any features with a **constant value**. Let's try to run the **fit_transform** function from **sel** on our dataset **X**.

`sel.fit_transform(X)`

Output:

array([[131, 138, 89, 49], [125, 131, 92, 48], [131, 132, 99, 50], [119, 132, 96, 44], [136, 143, 100, 54], [138, 137, 89, 56], [139, 130, 108, 48], [125, 136, 93, 48], [131, 134, 102, 51], [134, 134, 99, 51], [129, 138, 95, 50], [134, 121, 95, 53],

...

...

As the output shows, all **four features** are still there: none of them has a **variance** of 0, so nothing was removed. You probably won't encounter constant value features very often, therefore you will want to keep a certain **threshold**.
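To see the default behaviour on data that does contain constant columns, consider this small sketch. The matrix `X_toy` is made up for the illustration: its first and last columns never change.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Columns 0 and 3 are constant; columns 1 and 2 vary.
X_toy = np.array([[0, 2, 0, 3],
                  [0, 1, 4, 3],
                  [0, 1, 1, 3]])

sel_toy = VarianceThreshold()
X_reduced = sel_toy.fit_transform(X_toy)

print(sel_toy.variances_)     # variance of each original column
print(sel_toy.get_support())  # [False  True  True False]
print(X_reduced)              # only the two varying columns are kept
```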

We can change the threshold by adding **threshold=<threshold value>** inside the brackets during the instantiation of **VarianceThreshold**. For boolean (0/1) features, whose variance is that of a Bernoulli variable, the **threshold value** is equal to

$$\mathrm{Var}(X) = p(1 - p)$$

Where **'p'** is your threshold % in **decimal format**.

So, for example, if I wanted a threshold of **60%**, I would set **threshold=0.6 * (1 - 0.6)**
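The formula is easiest to follow on boolean features. In the sketch below, a threshold of 80% removes any 0/1 column that takes the same value in more than 80% of the rows. The matrix `X_bool` and the names `p`, `sel80`, and `X_kept` are assumptions for this example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Column 0 is zero in 5 of 6 rows (about 83%), so its variance is
# (1/6) * (5/6), roughly 0.139.
X_bool = np.array([[0, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1],
                   [0, 1, 0],
                   [0, 1, 1]])

p = 0.8
threshold = p * (1 - p)  # about 0.16

sel80 = VarianceThreshold(threshold=threshold)
X_kept = sel80.fit_transform(X_bool)

print(sel80.variances_)  # roughly [0.139 0.222 0.25]
print(X_kept.shape)      # (6, 2): the first column fell below the threshold
```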

Now let's instantiate another **VarianceThreshold** but with a threshold of **90%**. We'll call it **sel90**.

`sel90 = VarianceThreshold(threshold=(0.9 * (1 - 0.9)))`

`sel90.fit_transform(X)`

Output:

array([[131, 138, 89, 49], [125, 131, 92, 48], [131, 132, 99, 50], [119, 132, 96, 44], [136, 143, 100, 54], [138, 137, 89, 56], [139, 130, 108, 48], [125, 136, 93, 48], [131, 134, 102, 51], ... ... ...

Again, all **four features** remain in the output: each one has a variance above the threshold of **0.9 * (1 - 0.9) = 0.09**, so none of them was removed.

**Univariate Feature Selection**

Now let's look at **Univariate Feature Selection**.

We will need to import **SelectKBest** from **sklearn.feature_selection**, **chi2** from **sklearn.feature_selection**, **numpy** as **np**, and **pandas**.

**Univariate** feature selection works by selecting **features** based on **univariate statistical tests**. **chi2** is used as the **univariate scoring function**, which returns a chi-squared score and a **p** value for each feature. We will specify **k=3** so that the **3 best features** are chosen. After that we will move on to **Feature Extraction!**

```
#import Libraries
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import numpy as np
import pandas
```

Now take a look at **X's shape** before the feature selection

`X.shape`

(150, 4)

Now we will use the **fit_transform** function with parameters **X**, **y** of **SelectKBest** with parameters **chi2**, **k=3**. This will be stored as **X_new**.

**Note**: There is a VisibleDeprecationWarning, you can ignore it.

`X_new = SelectKBest(chi2, k=3).fit_transform(X, y)`

Now let's check out the shape of **X_new**, it should have **one less** feature than before!

`X_new.shape`

(150, 3)
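To see which features SelectKBest keeps and why, the fitted selector's scores can be inspected. The sketch below uses the iris data that ships with scikit-learn instead of the skulls data, purely so that it runs on its own; iris happens to have the same (150, 4) shape. The variable names are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X_iris, y_iris = load_iris(return_X_y=True)

selector = SelectKBest(chi2, k=3).fit(X_iris, y_iris)
print(selector.scores_)        # one chi-squared score per feature
print(selector.get_support())  # True for the 3 highest-scoring features

X_iris_new = selector.transform(X_iris)
print(X_iris.shape, X_iris_new.shape)  # (150, 4) (150, 3)
```

Keep in mind that chi2 requires non-negative feature values, which holds for measurements such as these.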

If you need help with a Machine Learning programming assignment, project, or homework, or need a solution to the problem above, we are ready to help you.

Send your request at realcode4you@gmail.com and get instant help with an affordable price.
