In this article, I compare three well-known techniques for validating the quality of clustering: the Davies-Bouldin Index, the Silhouette Score and the Elbow Method. All the aforementioned techniques are used for determining the optimal number of clusters. If you are not familiar with clustering techniques please do read my previous article where I implemented the K-means algorithm from scratch in Python (9 lines).
✏️ Table of Contents
- Quality of clustering
- Access clustering
- Silhouette Analysis
- The Elbow method
- Davies-Bouldin Index
- Silhouette Analysis vs Elbow Method vs Davies-Bouldin Index
- Python Implementation
🧐 Quality of clustering
We know for a fact that clustering is a very well-known form of unsupervised learning. It consists in partitioning the data samples such that similar instances are grouped together in the same partition. However, in the absence of ground truth values the following questions arise:
- How do we know the correct k (number of clusters)?
- Usually we have a vague idea (less than 20) but there can be many choices of k which look good.
- How do we choose when the data is 4 dimensional or more when it is not use to plot them?
🔍 Access clustering
The process of accessing the quality of clustering can be a tricky task. For that reason, different measures do exist to determine the quality of clustering. To name a few:
- Silhouette Analysis
- The Elbow method
- Davies-Bouldin Index
🟢 Silhouette Analysis
The silhouette score is a measure of the average similarity of the objects within a cluster and their distance to the other objects in the other clusters.
For each data point i, we first define:
which represent the average distance of the point i to all the other points that belongs to the same cluster Ci. Large
a(i) implies that the data point i is dissimilar to its cluster.In other words, if the point i belongs to the zero cluster this is the average distance of point i with all the other points of the zero cluster.
Secondly, we define:
which represent the average distance of the point i to all the other points in the next nearest cluster. Large
b(i) implies that the data point i is dissimilar to its neighbouring cluster.
This involves two steps:
- Find all the average distance of the point i with all the other points that don't belong to another cluster (different from Ci).
- Take the minimum of those averages.
For example, if the point i belongs to the zero cluster, you find the average distance between the point i and all other points of cluster 1, then of 2 etc. Once you have calculated all those averages distances you take the minimum.
Finally, we define the silhouette score of a data point i as:
Please note the following:
-1 ≦ s(i) ≦1
s(i) ≈ 1, then
a(i) ≪ b(i)the point is in the correct cluster; it is much closer to its own cluster than to its neighbour.
s(i) ≈ -1, then
a(i) ≫ b(i)the point should be in another cluster; it is must closer to its neighbour than to its assigned cluster.
- s(i) = 0 if |Ci | = 1 (i.e., the silhouette score of a point in a single-element cluster is 0).
- If s(i) ≈ 0, the point could be in another cluster.
Global silhouette score is define as:
When using the the silhouette analysis we are normally looking for a value of k that provides:
- High average silhouette scores.
- No clusters with a maximum silhouette score less than the average score as this can indicate poor clustering and a bad choice of k.
- Well clustered data tends to have clusters where the silhouettes are similar in size and the silhouettes of each cluster member are similar
🟠 Elbow Method
The most common approach for selecting the optimal number of clusters is the so-called "elbow method". A simple (but not very robust) visual method to estimate the optimal number of clusters k. It involves running the algorithm multiple times with an increasing number of cluster choice and then plotting a clustering score (the within-cluster SSE - “distortion”) as a function of the number of clusters.
The score is usually calculated as the mean squared distance between each instance and its closest centroid.
As the number of clusters increases, the score is decreasing. This is because the samples will be closer to the centroids they are assigned to.
The idea behind the elbow method is to identify the value of k where the score begins to decrease most rapidly before the curve reached a plateau.However, it can get confusing sometimes.
🟡 Davies-Bouldin Index
The Davies–Bouldin (DB) criterion is based on a ratio between “within-cluster” and “between-cluster” distances:
Let's deconstruct this seemingly complicated mathematical formula:
Dijis the "within-to-between cluster distance ratio" for the ith and jth clusters.
d¯ i is the average distance between every data point in cluster i and its centroid, similar for
dij is the Euclidean distance between the centroids of the two clusters.
If two clusters are close together (small
dij ) but have a large spread (large
d¯i+ d¯ j ) then this ratio will be large, indicating that these clusters are not very distinct.
max Dijdenotes the “worst-case within-to-between cluster ratio” for cluster i
For example, assuming that we have four clusters 0,1, 2 and 3 we calculate D between cluster 0 and 1, 0 and 2, 0 and 3 and we take the maximum of those values. We do exactly the same for cluster 1,2 and 3. We then calculate the average of those maximum values.
Remember the optimal clustering has the smallest Davies–Bouldin index value.
🆚 Davies-Bouldin Index vs Silhouette Analysis vs Elbow Method
I have drawn the following conclusions for these techniques:
- Silhouette is the most computationally demanding.
- Silhouette analysis is the most informative.
- Try all! Before deciding which is the best.
For example, the Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences while the Silhouette score measure the distance between each data point, the centroid of the cluster it was assigned to and the closest centroid belonging to another cluster. Depending of what criterion is better for your case select the appropriate metric.
The best way to evaluate the results of your clustering efforts is to start by actually examining -- human inspection -- the clusters formed and making a determination based on an understanding of what the data represents, what a cluster represents, and what the clustering is intended to achieve.
🐍 Python Implementation
Let's cluster a dataset and evaluate its performance by using the above techniques. Below you can screenshots of a working example. The full notebook is available on my GitHub page.
🚀 For people who like video courses and want to kick-start a career in data science today, I highly recommend the below video course from Udacity:
📚 While for book lovers:
- "Python for Data Analysis" by Wes McKinney, best known for creating the Pandas project.
- "Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurelien Geron, currently ranking first in the best sellers Books in AI & Machine Learning on Amazon.
- "Deep Learning" by Ian Goodfellow research scientist at OpenAI.
This brings us to the end of this article. Hope you become aware of the different techniques for validating the quality of clustering.
Thanks for reading; if you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you and also notify you for future articles.