The Beginner’s Guide to Clustering in Machine Learning

Utsav Desai
11 min read · Apr 28, 2023


What is Clustering?

Clustering is a technique in machine learning that involves grouping similar data points together based on their characteristics or features. The goal of clustering is to identify patterns and similarities in data sets without any prior knowledge of the groups or categories that exist within the data.

In clustering, the algorithm tries to find similarities between data points and creates groups, or clusters, of similar data points. The similarity between data points is based on a set of features or attributes that describe each data point. The algorithm aims to minimize the distance between the data points within each cluster while maximizing the distance between different clusters.

Clustering has various applications, including customer segmentation, image segmentation, anomaly detection, and recommendation systems.

There are two main types of clustering techniques:

  • K-Means Clustering
  • Hierarchical Clustering

K-Means Clustering

K-means clustering is a popular unsupervised machine learning algorithm used to group data points into K clusters based on their similarity. The K in K-means represents the number of clusters that the algorithm will create.

The algorithm starts by selecting K random centroids (points in the feature space) as the initial cluster centers. It then assigns each data point to the nearest centroid based on the Euclidean distance between the data point and the centroid. Next, it computes the mean of each cluster, which becomes the new centroid for that cluster. The algorithm repeats these steps until the centroids no longer change or a maximum number of iterations is reached.

K-means clustering aims to minimize the sum of squared distances between each data point and its assigned centroid. This means that the algorithm is trying to find the cluster centroids that minimize the variance within each cluster and maximize the distance between clusters.
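
More formally, the objective K-means minimizes can be written (in standard notation, introduced here for clarity) as

J = Σ_k Σ_{x ∈ C_k} ||x - μ_k||²

where C_k is the set of points assigned to cluster k and μ_k is the mean (centroid) of that cluster.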

How Does K-Means Clustering Work?

K-means clustering works by iteratively assigning data points to the nearest cluster centroid, computing the mean of each cluster, and updating the centroids until convergence is reached.

The following steps summarize the K-means clustering process:

  1. Initialization: Select the number of clusters (K) and randomly initialize K cluster centroids in the feature space.
  2. Assignment: Assign each data point to the nearest cluster centroid based on the Euclidean distance between the data point and each centroid.
  3. Update: Compute the mean of each cluster, which becomes the new centroid for that cluster.
  4. Repeat: Repeat steps 2 and 3 until the centroids no longer change or a maximum number of iterations is reached.
  5. Convergence: Once convergence is reached, the K clusters represent groups of data points that are similar to each other and dissimilar to data points in other clusters.

To explain this process further, let’s consider an example where we want to cluster a dataset of points in a two-dimensional feature space.

  1. Initialization: We randomly select K centroids in the feature space. Let’s say we want to create three clusters, so we randomly select three points as the initial centroids.
  2. Assignment: We calculate the Euclidean distance between each data point and each centroid. We assign each data point to the nearest centroid and form clusters.
  3. Update: We calculate the mean of each cluster, which becomes the new centroid for that cluster.
  4. Repeat: We repeat steps 2 and 3 until the centroids no longer change or a maximum number of iterations is reached.
  5. Convergence: Once convergence is reached, the K clusters represent groups of data points that are similar to each other and dissimilar to data points in other clusters. We can use these clusters to analyze and understand the underlying patterns in the data.

In summary, K-means clustering is an iterative process that partitions data points into K clusters based on their similarity. It is an unsupervised learning algorithm that is widely used in data mining, pattern recognition, and image segmentation.
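
To make the procedure concrete, here is a minimal NumPy sketch of the algorithm described above. The function and variable names are chosen for this illustration; in practice a library implementation such as scikit-learn's KMeans would normally be used.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assignment: each point goes to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage on a small 2-D dataset
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centroids, labels = kmeans(X, k=3)
print(centroids)
print(labels)
```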

Example of K-means clustering

Cluster the following eight points (with (x, y) representing locations) into three clusters:

A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9).

Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).

The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as → Ρ(a, b) = |x2 - x1| + |y2 - y1|

Use K-Means Algorithm to find the three cluster centers after the second iteration.
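
Note that this distance function is the Manhattan (city-block) distance rather than the Euclidean distance. A small Python helper (written here just to make the calculations below easy to reproduce) looks like this:

```python
def manhattan(a, b):
    """Manhattan distance between two 2-D points a = (x1, y1) and b = (x2, y2)."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

print(manhattan((2, 10), (5, 8)))  # 5
```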

We follow the above-discussed K-Means Clustering Algorithm:

→ Iteration-01:

  • We calculate the distance of each point from each of the centers of the three clusters.
  • The distance is calculated by using the given distance function.

The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters:

Calculating the distance between A1(2, 10) and C1(2, 10):

Ρ(A1, C1) = |2 - 2| + |10 - 10| = 0

Calculating the distance between A1(2, 10) and C2(5, 8):

Ρ(A1, C2) = |5 - 2| + |8 - 10| = 3 + 2 = 5

Calculating the distance between A1(2, 10) and C3(1, 2):

Ρ(A1, C3) = |1 - 2| + |2 - 10| = 1 + 8 = 9

In a similar manner, we calculate the distance of other points from each of the centers of the three clusters.

Next,

  • We draw a table showing all the results.
  • Using the table, we decide which point belongs to which cluster.
  • The given point belongs to that cluster whose center is nearest to it.
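
Computed with the given distance function, the full table for Iteration-01 is:

Point        C1(2, 10)   C2(5, 8)   C3(1, 2)   Nearest center
A1(2, 10)    0           5          9          C1
A2(2, 5)     5           6          4          C3
A3(8, 4)     12          7          9          C2
A4(5, 8)     5           0          10         C2
A5(7, 5)     10          5          9          C2
A6(6, 4)     10          5          7          C2
A7(1, 2)     9           10         0          C3
A8(4, 9)     3           2          10         C2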

From here, the new clusters are:

Cluster-01:

The first cluster contains the point:

  • A1(2, 10)

Cluster-02:

The second cluster contains the points:

  • A3(8, 4)
  • A4(5, 8)
  • A5(7, 5)
  • A6(6, 4)
  • A8(4, 9)

Cluster-03:

The third cluster contains the points:

  • A2(2, 5)
  • A7(1, 2)

Now,

  • We re-compute the cluster centers.
  • The new cluster center is computed by taking the mean of all the points contained in that cluster.

For Cluster-01:

  • We have only one point A1(2, 10) in Cluster-01.
  • So, the cluster center remains the same.

For Cluster-02:

Center of Cluster-02

= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)

= (6, 6)

For Cluster-03:

Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)

= (1.5, 3.5)

This completes Iteration-01.

→ Iteration-02:

  • We calculate the distance of each point from each of the centers of the three clusters.
  • The distance is calculated by using the given distance function.

The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters:

Calculating the distance between A1(2, 10) and C1(2, 10):

Ρ(A1, C1) = |2 - 2| + |10 - 10| = 0

Calculating the distance between A1(2, 10) and C2(6, 6):

Ρ(A1, C2) = |6 - 2| + |6 - 10| = 4 + 4 = 8

Calculating the distance between A1(2, 10) and C3(1.5, 3.5):

Ρ(A1, C3) = |1.5 - 2| + |3.5 - 10| = 0.5 + 6.5 = 7

In a similar manner, we calculate the distance of other points from each of the centers of the three clusters.

Next,

  • We draw a table showing all the results.
  • Using the table, we decide which point belongs to which cluster.
  • The given point belongs to that cluster whose center is nearest to it.
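
Computed with the updated centers, the full table for Iteration-02 is:

Point        C1(2, 10)   C2(6, 6)   C3(1.5, 3.5)   Nearest center
A1(2, 10)    0           8          7              C1
A2(2, 5)     5           5          2              C3
A3(8, 4)     12          4          7              C2
A4(5, 8)     5           3          8              C2
A5(7, 5)     10          2          7              C2
A6(6, 4)     10          2          5              C2
A7(1, 2)     9           9          2              C3
A8(4, 9)     3           5          8              C1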

From here, the new clusters are:

Cluster-01:

The first cluster contains the points:

  • A1(2, 10)
  • A8(4, 9)

Cluster-02:

The second cluster contains the points:

  • A3(8, 4)
  • A4(5, 8)
  • A5(7, 5)
  • A6(6, 4)

Cluster-03:

The third cluster contains the points:

  • A2(2, 5)
  • A7(1, 2)

Now,

  • We re-compute the cluster centers.
  • The new cluster center is computed by taking the mean of all the points contained in that cluster.

For Cluster-01:

Center of Cluster-01

= ((2 + 4)/2, (10 + 9)/2)

= (3, 9.5)

For Cluster-02:

Center of Cluster-02

= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)

= (6.5, 5.25)

For Cluster-03:

Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)

= (1.5, 3.5)

This completes Iteration-02.

After the second iteration, the centers of the three clusters are:

  • C1(3, 9.5)
  • C2(6.5, 5.25)
  • C3(1.5, 3.5)
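
The whole worked example can be reproduced with a few lines of NumPy; the sketch below (variable names chosen for this illustration) runs the two assignment and update iterations using the Manhattan distance:

```python
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)  # initial centers: A1, A4, A7

for iteration in range(2):
    # Assignment step: Manhattan distance from every point to every center
    distances = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each center becomes the mean of its assigned points
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    print(f"Centers after iteration {iteration + 1}: {centers.tolist()}")

# Expected final centers: [3.0, 9.5], [6.5, 5.25], [1.5, 3.5]
```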

Hierarchical Clustering

Hierarchical clustering is a type of clustering algorithm used in machine learning and data analysis to group similar objects into clusters. It works by iteratively merging the closest pairs of data points or clusters until all data points are in a single cluster or until a desired number of clusters is reached.

There are two types of hierarchical clustering:

  • Agglomerative
  • Divisive

Agglomerative clustering starts with individual data points as clusters and merges them together, while divisive clustering starts with all data points in a single cluster and recursively divides them into smaller clusters.

Hierarchical clustering is useful in applications such as image segmentation, social network analysis, and gene expression analysis.

How Does Hierarchical Clustering Work?

Hierarchical clustering is a clustering algorithm that works by iteratively merging or dividing clusters of data points based on their similarity. The algorithm can be performed in two ways: agglomerative and divisive.

Agglomerative Hierarchical Clustering: Agglomerative hierarchical clustering begins with each data point as a separate cluster and iteratively merges the closest pairs of clusters until all data points belong to a single cluster or a desired number of clusters is reached.

The steps involved in agglomerative hierarchical clustering are:

  1. Calculate the distance matrix between all pairs of data points.
  2. Assign each data point to its own cluster.
  3. Find the pair of clusters with the smallest distance and merge them into a single cluster.
  4. Update the distance matrix by computing the distances between the new cluster and all other clusters.
  5. Repeat steps 3 and 4 until all data points belong to a single cluster or the desired number of clusters is reached.
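
In practice, agglomerative clustering is rarely coded from scratch; a common approach is SciPy's linkage and fcluster functions. The sketch below uses made-up sample data purely for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D data for illustration
X = np.array([[1, 2], [2, 2], [8, 8], [9, 8], [5, 1], [5, 0]], dtype=float)

# Build the merge tree bottom-up using average linkage and Euclidean distance
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree into 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```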

Divisive Hierarchical Clustering: Divisive hierarchical clustering begins with all data points in a single cluster and recursively divides them into smaller clusters until each data point belongs to its own cluster or a desired number of clusters is reached.

The steps involved in divisive hierarchical clustering are:

  1. Assign all data points to a single cluster.
  2. Compute the distance matrix between all pairs of data points.
  3. Find the data point that is farthest away from all other data points and assign it to a new cluster.
  4. Divide the remaining data points into two clusters based on their distance from the newly created cluster.
  5. Repeat step 4 recursively until each data point belongs to its own cluster or the desired number of clusters is reached.
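
Divisive clustering is less commonly provided by standard libraries. The following rough sketch (not a complete DIANA implementation; all names and data are made up) shows a single split following the idea described above:

```python
import numpy as np

def split_once(X):
    """Split one cluster into two, following the steps above (simplified)."""
    # Pairwise Euclidean distances between all points
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # The point farthest (on average) from all the others seeds the new cluster
    seed = D.mean(axis=1).argmax()
    # Remaining points join whichever side they are closer to
    rest = np.delete(np.arange(len(X)), seed)
    old_center = X[rest].mean(axis=0)
    in_new = np.linalg.norm(X - X[seed], axis=1) < np.linalg.norm(X - old_center, axis=1)
    in_new[seed] = True
    return in_new

# Example usage on made-up data: boolean mask marks the newly split-off cluster
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]], dtype=float)
print(split_once(X))
```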

The choice of distance metric and linkage criteria, which define how distances between clusters are computed, can have a significant impact on the clustering results. Common distance metrics include Euclidean distance, Manhattan distance, and cosine distance, while linkage criteria include single linkage, complete linkage, and average linkage.

Linkage criteria in hierarchical clustering

There are several types of linkage criteria used in hierarchical clustering, including:

Single Linkage: In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between any two points, one from each cluster. For two clusters "r" and "s", it is the length of the shortest link connecting a point in "r" to a point in "s".

Complete Linkage: In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between any two points, one from each cluster. For clusters "r" and "s", it is the length of the longest link connecting a point in "r" to a point in "s".

Average Linkage: In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance from each point in one cluster to every point in the other cluster. For clusters "r" and "s", it is the average length of all links connecting points in "r" to points in "s".
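
All three criteria can be computed directly from the pairwise distances between the points of the two clusters, as in this small sketch with two made-up clusters:

```python
import numpy as np

r = np.array([[0, 0], [0, 1]], dtype=float)          # cluster "r"
s = np.array([[3, 0], [4, 1], [5, 0]], dtype=float)  # cluster "s"

# All pairwise Euclidean distances between points of r and points of s
D = np.linalg.norm(r[:, None, :] - s[None, :, :], axis=2)

print("single linkage:  ", D.min())   # shortest pairwise distance
print("complete linkage:", D.max())   # longest pairwise distance
print("average linkage: ", D.mean())  # average of all pairwise distances
```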

Example of Hierarchical Clustering

Consider clustering seven data points, labeled A through G.

In the resulting dendrogram, the vertical axis represents the distance between the clusters, while the horizontal axis represents the data points or clusters being merged or divided.

The steps involved in this example of hierarchical clustering are:

Step 1: Calculate distances between all data points using the Euclidean distance function. The shortest distance is between data points C and G.

Step 2: We use “Average Linkage” to measure the distance between the “C, G” cluster and other data points.

Steps 3 to 6: Repeat the merging process, at each stage combining the two closest clusters (measured with average linkage), until all data points belong to a single cluster.

Reading the resulting dendrogram from bottom to top shows the order in which the points and clusters were merged and the distances at which each merge occurred.
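
Because the coordinates of the seven points come from a figure that is not reproduced here, the sketch below uses hypothetical coordinates for points A through G (chosen so that C and G are the closest pair, as in Step 1) just to show how such a dendrogram would be built with average linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D", "E", "F", "G"]
# Hypothetical coordinates; the original article reads the points from a figure
X = np.array([[1, 1], [2, 1], [4, 5], [8, 8], [9, 9], [1, 7], [4, 5.5]], dtype=float)

# Build and plot the dendrogram with average linkage and Euclidean distance
Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z, labels=labels)
plt.ylabel("Distance between merged clusters")
plt.show()
```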


Written by Utsav Desai

Utsav Desai is a technology enthusiast with an interest in DevOps, App Development, and Web Development.
