### Algorithms

K-means (hard clustering)

Single Link Clustering (hard clustering)

Expectation Maximization (soft clustering)

A. K-means Algorithm

1. Assign

We will choose cluster centers

They are called centroids

Choosing the initial location of the centroids would affect the final clustering results

2. Optimize

We will move the cluster centers to minimize the total bands' length

The web connects the cluster center to the observations like a spider web

So we want to minimize the total length of the web for each centroid

Limitations of K-means

Suppose you use a fixed number of centroids and training set

You would not get the same results

It is highly dependent on where you put your centroids initially

You might reach a local minima

The more centroids you have, the more local minima you will have

B. Single Linkage Clustering (SLC)

Consider each object a cluster (n objects)

Define intercluster distance as the distance between the closest two points in the two two clusters

Merge two closest clusters

Repeat n-k times to make k clusters

In sum, it's just linking up the nearest points

Just connect the dots to the nearest dots in a linear fashion

Running Time of SLC

O(n^3) or slightly lower

Soft Clustering <br >The idea of 'soft' clustering is that, instead of deciding on definite clusters for each data point, you instead assign each data point a probability that it belongs to one of K clusters.

Select one of K Gaussians uniformly

Sample x_i from that Gaussian

Repeat n times

Find a hypothesis that maximizes the probability of the data (maximum likelihood)

C. Expectation Maximization

The general overview of this is similar to K-means, but instead of judging solely by distance, you are judging by probability, and you are never fully assigning one data point fully to one cluster or one cluster fully to one data point

You instead assign each point partially to each cluster based on the probability that it would belong to the cluster if you fully knew the clusters, and then assign the mean of each cluster based on the assumption that your prior probabilities were correct, and repeat until you stop getting significant changes

Properties of Expectation Maximization

Monotonically non-decreasing likelihood

Does not converge (practically does)

Will not diverge

Can get stuck in local optima

Solution: randomly restart

Works with any distribution (if EM solvable)

Clustering Properties <br > No clustering scheme (algorithm) that can achieve all three properties

Richness

For any assignment of objects to clusters, there is some distance matrix D such that P_d, clustering scheme, returns that clustering

There are multiple types of clustering

Scale-invariance

Scaling distances by a positive value does not change the clustering

Changing units (km to m)

Consistency

Shrinking intracluster distances and expanding intercluster distances does not change the clustering

K-means clustering using scikit-learn

Hyperparameters available for tuning

n_clusters=8

This is what you can and should change

max_iter=300

This determines the number of iterations of:

Assign

Optimize (moving the centroids)

n_init=10

Number of times to initialize the algorithm

If you think your data might need more initializations, you can increase this

### Import Libraries

```
# Imports
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.cross_validation import train_test_split
```

Split Dataset:

```
# Set up data
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
```

Show Target Data

```
# Explore target names
iris.target_names
```

Output

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

```
Show Features
# Explore feature names
iris.feature_names
```

Output

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

```
# Instantiate
kmc = KMeans(n_clusters=3, max_iter=1000, n_init=20)
# Fit
kmc.fit(X_train, y_train)
```

Output

KMeans(copy_x=True, init='k-means++', max_iter=1000, n_clusters=3, n_init=20, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)

## Comments