K-Means Clustering Algorithm: ‘Intuition + Use Case’ -#70daysofMLStudy with Data Science Nigeria

Amina Mardiyyah Rufai
8 min read · Aug 20, 2020


K-means is an unsupervised machine learning algorithm. This simply means that, unlike supervised machine learning algorithms, it is given no labels in the data, so it takes a different approach to solving problems. It is a 'data-driven' approach, i.e., the 'data speaks for itself'.

K-means is just one of several unsupervised algorithms; others include association rule mining, hierarchical clustering, dimensionality reduction, et cetera. K-means uses the clustering technique to group data points by similarity. In this article, the focus is a basic intuition around how k-means works behind the scenes.

Source: Edureka

WHAT IS CLUSTERING?

Clustering is the process of dividing data points into groups/classes based on similar or dissimilar features. Let's break that down with an example. In a classroom, there are students with different cultural backgrounds, traits, and even learning patterns. As a recently recruited teacher, by looking at the students in the classroom and closely observing how they interact and learn, and what they like and dislike, I realized I could place these students into different groups based on similar traits. So I came up with the following groups (clusters).

  • Group A- Students who wear glasses
  • Group B- Students who ask a lot of questions (the overactive ones)
  • Group C- Students who are always quiet and avoid eye contact (the introverts, or those with a special medical condition)
  • Group D- Students who easily get distracted and distract others as well
  • Group E- Students who love making jokes in class and play around with everyone (the extroverts)
  • Group F- Students who love a particular set of subjects (Science or Art) and do well in them
  • Et cetera

These are clusters I came up with purely by observation; no one told me anything. The results came from observing the traits and attributes of the students (data-driven results). This is the intuition behind "clustering (an unsupervised approach)".

Clustering is a very important technique in the marketing industry. It is used for customer segmentation (placing customers into similar groups for targeted and personalized marketing/ads, which is more profitable and effective than random ads), for recommendation engines, and for understanding customer behavior and trends to improve services. It is also used as an exploratory analysis technique to derive insights from data and understand its features before moving on to building a model.

How does the algorithm come up with these clusters? How does it assign data points to groups using similar features?

Let’s move on to exploring more concepts.

FACT: Most often, people get confused about the difference between classification and clustering. I was also confused at one time. After all, both end up placing data points in classes, right? Well, not too long ago, a dear friend cleared this up for me. The two are different, though they have their similarities: both lead to pattern identification. However, how they identify those patterns differs; one classifies from the surface, and the other goes beyond it. Take the classification of fruits/vegetables based on color and shape: a classification algorithm could place a peach and an apple in the same class, whereas a clustering algorithm could recognize that these are different objects. I encourage you to do more research to truly understand the difference between these two unique learning approaches in machine learning.

TYPES OF CLUSTERING

Source: Here

There are several types of clustering algorithms. The most common ones are:

  • Partitioning algorithms
  • Hierarchical algorithms
  • Fuzzy clustering algorithms
  • Density-Based algorithms
  • Model-based algorithms

K-means falls under the partitioning clustering algorithms, as it divides data points into several partitions/groups using a center-based method, where data points are assigned to the same cluster based on a shared center or mean (hence the name, k-means).

HOW K-MEANS WORKS

Source: Wikipedia

K-means is the most popular of the partitioning clustering algorithms; the term was first used in 1967 by James MacQueen. 'K' is just a parameter specifying the number of clusters into which data points will be grouped. K-means works through an iterative process to find the best k cluster groups for the data points, such that one data point cannot belong to more than one group/cluster (no overlapping). For each cluster, a centroid (the mean) is calculated; then, each data point is assigned to the cluster whose centroid is closest by distance. Data points are assigned to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is as small as possible, hence the iterative search for the best clusters. Distance is calculated using the Euclidean distance measure.

Fun fact: K + means = Number of clusters + mean of the clusters for grouping data points.

A summary of the steps in pseudocode:

  1. Determine the value of k (e.g., using the Elbow method) for data points to be grouped into k clusters.
  2. Initialize k random centroids.
  3. Determine the distance of each data point to every centroid.
  4. Assign each observation to its closest centroid using minimal distance.
  5. Recalculate each cluster's centroid as the mean of its assigned points.
  6. Repeat steps 3-5 until there is no change in the centroids.
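The steps above can be sketched in plain NumPy. This is a minimal illustration only (the use case below relies on Scikit-learn's `KMeans` instead), and the two-blob toy data is made up for demonstration:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize k random centroids by picking k data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 3-4: Euclidean distance to each centroid, assign to closest
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 5: recalculate each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs of 20 points each
X = np.vstack([
    np.random.default_rng(1).normal(0.0, 0.5, (20, 2)),
    np.random.default_rng(2).normal(5.0, 0.5, (20, 2)),
])
centroids, labels = k_means(X, k=2)
```

With data this well separated, the loop converges in a handful of iterations and recovers the two blobs.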

Just like every other algorithm, the K-means algorithm has objectives for each use case. The two primary objectives are:

  • Calculating the distance between points
  • Reducing the sum of squared distances or errors between centroids and points (SSE).

One important point to note is that choosing the optimal value of k is very important. Too small a k forces dissimilar data points into the same cluster, while too large a k splits naturally similar points across separate clusters. For a given k, the algorithm has converged once there is no change in the centroids, i.e., the centroid values remain constant. One very common method for choosing k is the 'Elbow method'.

The elbow method fits the data points over a range of k values using the 'within-cluster sum of squares' (WCSS) principle, looking for the k beyond which WCSS stops decreasing significantly.

Source: Edureka

As seen in the image above, using the Elbow method, the optimal k value is the point where the curve bends sharply and further increases in k yield no significant change. This point is called the 'knee of the curve'.
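In code, the elbow curve is simply WCSS (which Scikit-learn exposes as `inertia_`) plotted against k. A minimal sketch on made-up data with three natural groups:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Made-up data with 3 natural groups
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0.0, 4.0, 8.0)])

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```

WCSS keeps falling as k grows; the knee of the curve (here, around k = 3) is where adding more clusters stops paying off.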

USE CASE: Customer Segmentation based on Annual income

The dataset is a very simple one, used just to demonstrate with code how k-means works. Find below the link to the dataset and the code, uploaded to the GitHub repository.

REQUIRED PYTHON LIBRARIES

  1. NumPy
  2. Pandas
  3. Matplotlib
  4. Scikit-learn

STEPS:

  1. Import the required Libraries
Importing the required Python Libraries
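Since the original code is shown as a screenshot, here is a sketch of the imports (aliases follow common convention):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
```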

2. Import the dataset

3. Read the dataset

Import and read dataset
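A hedged reconstruction of this step (the original code is a screenshot; the filename and column names are assumptions, and the stand-in frame just mimics the dataset's shape):

```python
import pandas as pd

# In the original notebook the file would be loaded from disk, e.g.:
# df = pd.read_csv("customers.csv")  # filename is an assumption

# Stand-in frame mimicking the dataset (values are made up):
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "AnnualIncome": [15, 16, 17, 78, 80, 82],  # in thousands
})
print(df.head())
```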

4. Define X and Y. X will be all the data points and Y the reference column with which we want the clusters calculated. Take note that Y here is not a target column as in supervised learning; it is more like a reference point for analysis. Y here is the annual income, as that defines the objective of the analysis. For X, the 'CustomerID' column will be omitted because it plays no role here.

X and Y Variables defined
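A sketch of this step; the column names `CustomerID` and `AnnualIncome` are assumptions, and the frame below is a made-up stand-in for the real dataset:

```python
import pandas as pd

# Stand-in for the customer dataset (values made up)
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "AnnualIncome": [15, 16, 17, 78, 80, 82],
})

# Drop CustomerID: it is an identifier and plays no role in clustering
X = df.drop(columns=["CustomerID"]).values
# Y is only a reference column for analysis, not a supervised target
Y = df["AnnualIncome"].values
```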

5. Visualize data points before Clustering.

Data Visualization before k-means
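A sketch of the pre-clustering scatter plot, on made-up income/spending-style data (the real dataset's columns may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Made-up data: two loose groups of customers
income = np.concatenate([rng.normal(20, 3, 30), rng.normal(80, 5, 30)])
spending = np.concatenate([rng.normal(30, 5, 30), rng.normal(70, 5, 30)])

plt.scatter(income, spending, s=20)
plt.xlabel("Annual income (k$)")
plt.ylabel("Spending score")
plt.title("Customers before clustering")
plt.show()
```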

6. Split Dataset into Train and test set for model development.

Splitting data into Train and Test Set
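This step would look roughly as follows (note that a train/test split is optional for unsupervised clustering; the stand-in matrix is made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # stand-in feature matrix

# Hold out 20% of the rows; random_state makes the split reproducible
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
```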

7. Find K using the Elbow method, and plot results using Matplotlib.

K value obtained using Elbow method

As can be seen, the optimal value of k is 5.
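A sketch of the elbow loop with Scikit-learn and Matplotlib; the five-group synthetic data below is made up to echo the k = 5 found in the article:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Made-up income/spending data with 5 natural groups
centers = [(20, 20), (20, 80), (50, 50), (80, 20), (80, 80)]
X = np.vstack([rng.normal(c, 4.0, (40, 2)) for c in centers])

wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 11)]

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```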

8. Fit K-means to dataset(also optionally view centers obtained by model).

K-means algorithm fitted into the dataset
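Fitting could look like this on made-up five-group data; `cluster_centers_` holds the centroids the model found:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Made-up data with 5 natural groups
centers = [(20, 20), (20, 80), (50, 50), (80, 20), (80, 80)]
X = np.vstack([rng.normal(c, 4.0, (40, 2)) for c in centers])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.cluster_centers_)  # one (x, y) centre per cluster
```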

9. Print Labels predicted by the model.

K-means Label Prediction
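A sketch of inspecting the labels on the same kind of made-up data: `labels_` gives the cluster index of each training point, and `predict` assigns new points:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
centers = [(20, 20), (20, 80), (50, 50), (80, 20), (80, 80)]
X = np.vstack([rng.normal(c, 4.0, (40, 2)) for c in centers])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_            # cluster index for every training point
print(labels[:10])
print(kmeans.predict([[22, 78]]))  # cluster for an unseen customer
```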

10. Visualize Clusters by K-means

Cluster prediction by K-means
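A sketch of the cluster plot (made-up data again): colour each point by its assigned cluster and mark the centroids:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
centers = [(20, 20), (20, 80), (50, 50), (80, 20), (80, 80)]
X = np.vstack([rng.normal(c, 4.0, (40, 2)) for c in centers])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Colour each point by its assigned cluster; mark centroids with an X
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="viridis", s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100, label="centroids")
plt.xlabel("Annual income (k$)")
plt.ylabel("Spending score")
plt.legend()
plt.show()
```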

11. Visualize actual clusters versus Predicted Clusters for comparison.

Actual vs Predicted Clusters

BONUS

As a bonus tip, I would love to introduce you to a special library that makes it easier, with fewer lines of code, to calculate and visualize the Elbow method plot and the optimal k value for the model.

This library is called 'Yellowbrick'. It wraps Matplotlib and Scikit-learn, and works best with Scikit-learn version 0.20 or later and Matplotlib version 3.0.1 or later.

Take a look at the plot for the Elbow method and K-value using this Library.

Amazing, right?

To use this library and explore its functionalities, just like every other external Python library, it has to be installed. Use the following methods to achieve this:

  • On Anaconda command prompt:

conda install -c districtdatalabs yellowbrick

Documentation: https://anaconda.org/DistrictDataLabs/yellowbrick

  • Using Pip manager:

pip install yellowbrick

Documentation: https://pypi.org/project/yellowbrick/ , https://www.scikit-yb.org/en/latest/quickstart.html

Thank you for reading!

CONNECT ON SOCIAL MEDIA

Twitter: @diyyah92

LinkedIn: https://www.linkedin.com/in/aminah-mardiyyah-rufa-i
