Last week, I was asked to implement the K-Means clustering algorithm from scratch in python as part of my MSc Data Science Degree Apprenticeship from the University of Exeter.
In this article, I present briefly the K-Means clustering algorithm and my Python implementation without using SkLearn.⠀
✏️ Table of Contents
- Python Implementation
Clustering is an unsupervised machine learning method that is used when you do not have labels for your data. The goal of clustering algorithms is to segment similar data points into groups; to extract meaning from the data.
One of the most common applications of clustering in industry is customer segmentation. A company can benefit by targeting each group with different marketing emails/ strategies (i.e relevant offers).
A definition that I found online is:
Clustering is the process of dividing the entire data into groups (also known as clusters) based on the patterns in the data.
K-Means is a very simple unsupervised clustering algorithm which clusters the data into K groups based on similar attributes. It is one of the 'must-haves' algorithms in a Data Scientist's tool box.
Let's see in the picture below a pseudo-code implementation of K-Mean:
Just to clarify the following:
- X is our n-dimension data matrix
- P is an array of
len(P)==len(X)holding the class in which each data point belongs to.
- Representative is the centroid of the cluster; the average point
- The stopping criteria is defined when P doesn't change between two consecutive iterations or we reach the maximum number of iterations
🐍 Python Implementation
Let's implement the above pseudo-code in Python:
def kmeans(X,k=3,max_iterations=100): ''' X: multidimensional data k: number of clusters max_iterations: number of repetitions before clusters are established Steps: 1. Convert data to numpy aray 2. Pick indices of k random point without replacement 3. Find class (P) of each data point using euclidean distance 4. Stop when max_iteration are reached of P matrix doesn't change Return: np.array: containg class of each data point ''' if isinstance(X, pd.DataFrame):X = X.values idx = np.random.choice(len(X), k, replace=False) centroids = X[idx, :] P = np.argmin(distance.cdist(X, centroids, 'euclidean'),axis=1) for _ in range(max_iterations): centroids = np.vstack([X[P==i,:].mean(axis=0) for i in range(k)]) tmp = np.argmin(distance.cdist(X, centroids, 'euclidean'),axis=1) if np.array_equal(P,tmp):break P = tmp return P
Below you can screenshots of a working example. The notebook is also available on my GitHub page.
Note that we normalize the data before using the K-Means.
Standardizing data is recommended because otherwise the range of values in each feature will act as a weight when determining how to cluster data, which is typically undesired.
🚀 For people who like video courses and want to kick-start a career in data science today, I highly recommend the below video course from Udacity:
📚 While for book lovers:
- "Python for Data Analysis" by Wes McKinney, best known for creating the Pandas project.
- "Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurelien Geron, currently ranking first in the best sellers Books in AI & Machine Learning on Amazon.
- "Deep Learning" by Ian Goodfellow research scientist at OpenAI.
This brings us to the end of this article. Hope you learn how to implement the K-Means clustering algorithm from scratch in Python .
Thanks for reading; if you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you and also notify you for future articles.