Here we first learn about clustering.

What is Clustering?

How do I group these documents by topic?

How do I group my customers by purchase patterns?

Sort items into groups by similarity:

  • Items in a cluster are more similar to each other than they are to items in other clusters.

  • Requires defining the properties that characterize “similarity”

Not a predictive method; it finds similarities and relationships in the data

Our Example: K-means Clustering

What is Cluster Analysis?

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Types of Clusters: Well-Separated

Well-Separated Clusters:

  • A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
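The well-separated property can be checked directly on a labelled point set. The following is a minimal sketch, assuming Euclidean distance as the similarity measure; the function name `is_well_separated` and the sample points are hypothetical and chosen only for illustration.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def is_well_separated(points, labels):
    """Return True if every point is closer to all points in its own
    cluster than to any point outside its cluster."""
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        other = [dist(p, q) for j, q in enumerate(points)
                 if labels[j] != labels[i]]
        # The farthest same-cluster point must be nearer than the nearest outsider.
        if same and other and max(same) >= min(other):
            return False
    return True

# Illustrative data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 0, 1, 1]

well_separated = is_well_separated(points, labels)            # True
badly_labelled = is_well_separated(points, [0, 1, 0, 1, 0])   # False
```

The same points fail the check under the second labelling, which shows that the property depends on the grouping and not only on the data.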

Types of Clusters: Center-Based


  • A cluster is a set of objects such that each object in the cluster is closer (more similar) to the “center” of its own cluster than to the center of any other cluster

  • The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster
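The difference between a centroid and a medoid is easiest to see on a small cluster with an outlier. This sketch assumes Euclidean distance and takes "most representative" to mean the data point with the smallest total distance to the other points, which is one common definition and not necessarily the one intended here. The function names and sample points are hypothetical.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of the points; it need not be an actual data point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """The data point with the smallest total distance to all points in the cluster."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# Illustrative cluster: four nearby points plus one outlier at (7, 7).
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (7.0, 7.0)]

c = centroid(cluster)  # approximately (2.6, 2.6), pulled toward the outlier
m = medoid(cluster)    # (2.0, 2.0), always one of the original points
```

The centroid is shifted by the outlier and does not coincide with any input point, while the medoid is always a member of the cluster.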

K-Means Clustering - What is it?

Used for clustering numerical data, usually a set of measurements about objects of interest.

Input: numerical. There must be a distance metric defined over the variable space.

  • Euclidean distance

Output: The center of each discovered cluster, and the assignment of each input to a cluster.

  • Centroid
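To make the input and output concrete, here is a minimal sketch of the standard k-means iteration (often called Lloyd's algorithm): assign each point to its nearest center, then move each center to the centroid of its assigned points, and repeat until the assignments stop changing. It assumes Euclidean distance and, as a simplification, uses the first k points as the starting centers; practical implementations usually pick starting centers more carefully. The function name `kmeans` and the sample data are hypothetical.

```python
from math import dist

def kmeans(points, k, max_iter=100):
    """Return (centers, labels) for numerical points given as tuples."""
    centers = [tuple(p) for p in points[:k]]  # assumption: naive initialization
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments unchanged, so we have converged
            break
        labels = new_labels
        # Update step: move each center to the centroid of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels

# Illustrative input: six 2-D measurements forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

centers, labels = kmeans(points, k=2)
# labels is [0, 0, 0, 1, 1, 1]; centers holds one centroid per cluster
```

The two return values correspond to the two outputs listed above: `centers` holds the centroid of each discovered cluster, and `labels` gives the cluster assignment of each input point.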

What is Euclidean Distance?
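The Euclidean distance between two points p and q with n coordinates is the straight-line distance between them: the square root of the sum of the squared coordinate differences, sqrt((p1 - q1)^2 + ... + (pn - qn)^2). A minimal sketch follows; the helper name `euclidean` is hypothetical, and Python's standard library provides the same computation as `math.dist`.

```python
from math import dist, sqrt

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Coordinate differences are 3 and 4, so the distance is sqrt(9 + 16) = 5.
d = euclidean((1.0, 2.0), (4.0, 6.0))

# The standard-library function gives the same value.
d_builtin = dist((1.0, 2.0), (4.0, 6.0))
```

Because every coordinate contributes on its own scale, measurements in very different units are usually rescaled before this distance is used for clustering.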

K-means Clustering