Last week, I was asked to implement the K-Means clustering algorithm from scratch in python as part of my MSc Data Science Degree Apprenticeship from the University of Exeter.

In this article, I present briefly the K-Means clustering algorithm and my Python implementation without using SkLearn.⠀


✏️ Table of Contents

  • Clustering
  • K-Means
  • Pseudo-code
  • Python Implementation
  • Conclusion

🪓 Clustering

Clustering is an unsupervised machine learning method that is used when you do not have labels for your data. The goal of clustering algorithms is to segment similar data points into groups; to extract meaning from the data.

One of the most common applications of clustering in industry is customer segmentation. A company can benefit by targeting each group with different marketing emails/ strategies (i.e relevant offers).

A definition that I found online is:

Clustering is the process of dividing the entire data into groups (also known as clusters) based on the patterns in the data.

⚙️ K-Means

K-Means is a very simple unsupervised clustering algorithm which clusters the data into K groups based on similar attributes. It is one of the 'must-haves' algorithms in a Data Scientist's tool box.


💻 Pseudo-code

Let's see in the picture below a pseudo-code implementation of K-Mean:

Just to clarify the following:

  • X is our n-dimension data matrix
  • P is an array of len(P)==len(X) holding the class in which each data point belongs to.
  • Representative is the centroid of the cluster; the average point
  • The stopping criteria is defined when P doesn't change between two consecutive iterations or we reach the maximum number of iterations

🐍 Python Implementation

Let's implement the above pseudo-code in Python:

def kmeans(X,k=3,max_iterations=100):
    '''
    X: multidimensional data
    k: number of clusters
    max_iterations: number of repetitions before clusters are established
    
    Steps:
    1. Convert data to numpy aray
    2. Pick indices of k random point without replacement
    3. Find class (P) of each data point using euclidean distance
    4. Stop when max_iteration are reached of P matrix doesn't change
    
    Return:
    np.array: containg class of each data point
    '''
    if isinstance(X, pd.DataFrame):X = X.values
    idx = np.random.choice(len(X), k, replace=False)
    centroids = X[idx, :]
    P = np.argmin(distance.cdist(X, centroids, 'euclidean'),axis=1)
    for _ in range(max_iterations):
        centroids = np.vstack([X[P==i,:].mean(axis=0) for i in range(k)])
        tmp = np.argmin(distance.cdist(X, centroids, 'euclidean'),axis=1)
        if np.array_equal(P,tmp):break
        P = tmp
    return P

Below you can screenshots of a working example. The notebook is also available on my GitHub page.

Note that we normalize the data before using the K-Means.

Standardizing data is recommended because otherwise the range of values in each feature will act as a weight when determining how to cluster data, which is typically undesired.

🚀 For people who like video courses and want to kick-start a career in data science today, I highly recommend the below video course from Udacity:

Learn to Become a Data Scientist Online | Udacity | Udacity
Gain real-world data science experience with projects from industry experts. Take the first step to becoming a data scientist. Learn online, with Udacity.

📚 While for book lovers:


🤖 Conclusion

This brings us to the end of this article. Hope you learn how to implement the K-Means clustering algorithm from scratch in Python .

Thanks for reading; if you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you and also notify you for future articles.‌

💪💪💪💪 As always keep studying, keep creating 🔥🔥🔥🔥