Understanding K-Means Clustering: A Step-by-Step Guide

1. What is K-Means Clustering?

K-Means is an unsupervised learning algorithm used for clustering data points into groups based on similarity. It is widely used in data segmentation, customer profiling, and image compression.

How Does K-Means Work?

  1. Select K (the number of clusters).
  2. Randomly initialize K centroids.
  3. Assign each data point to the nearest centroid (forming clusters).
  4. Update centroids by computing the mean of all points in a cluster.
  5. Repeat steps 3-4 until the centroids stop moving or another stopping condition is met (steps 3-4 are sketched in code below).
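
To make steps 3-4 concrete, here is a minimal NumPy sketch of a single assign-and-update pass; the toy points and starting centroids are made up purely for illustration, and a full implementation follows in Section 4.

import numpy as np

# Toy data: six 2-D points and K = 2 current centroids (assumed given).
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
centroids = np.array([[1.0, 1.0], [8.0, 8.0]])

# Step 3: assign each point to its nearest centroid (squared Euclidean distance).
distances = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)  # shape (6, 2)
labels = np.argmin(distances, axis=1)

# Step 4: move each centroid to the mean of the points assigned to it.
new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
print(labels)
print(new_centroids)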

2. Intuitive Example: Grouping Customers by Spending Behavior

Imagine a shopping mall wants to group customers into three clusters:

  • Low spenders
  • Medium spenders
  • High spenders

K-Means can analyze past transactions and segment customers into groups with similar spending habits, helping businesses target their marketing efforts efficiently.
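
As a rough sketch of how this might look in practice, the snippet below clusters synthetic customer data with scikit-learn's KMeans; the two features (annual spend and visits per month) and all of the numbers are invented purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic customers: [annual spend in $, visits per month] for three rough groups.
low    = rng.normal(loc=[300, 1],   scale=[80, 0.5],  size=(40, 2))
medium = rng.normal(loc=[1500, 4],  scale=[200, 1.0], size=(40, 2))
high   = rng.normal(loc=[6000, 10], scale=[500, 2.0], size=(40, 2))
customers = np.vstack([low, medium, high])

# Cluster into K = 3 groups with similar spending behavior.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(customers)
print(kmeans.labels_[:5])        # cluster index of the first five customers
print(kmeans.cluster_centers_)   # average spend and visit rate per segment

Because K-Means is distance-based, features on very different scales (dollars vs. visit counts) would normally be standardized first, for example with scikit-learn's StandardScaler.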


3. K-Means Algorithm Explained Mathematically

K-Means minimizes the Within-Cluster Sum of Squares (WCSS):

\[J = \sum_{i=1}^{K} \sum_{x \in C_i} || x - \mu_i ||^2\]

Where:

  • \( K \) = Number of clusters
  • \( x \) = Data point
  • \( \mu_i \) = Centroid of cluster \( C_i \)
  • \( || x - \mu_i ||^2 \) = Squared Euclidean distance between a point and its cluster centroid

The goal is to find cluster assignments that minimize this function.
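
As a quick sanity check of the formula, the sketch below evaluates \( J \) directly from a set of points, their cluster assignments, and the centroids; the tiny arrays are made up for illustration.

import numpy as np

def wcss(X, labels, centroids):
    """Within-Cluster Sum of Squares: total squared distance to assigned centroids."""
    return sum(
        np.sum((X[labels == i] - centroids[i]) ** 2)
        for i in range(len(centroids))
    )

# Tiny example: two clusters whose centroids are the cluster means.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 11.0]])
print(wcss(X, labels, centroids))  # each point is 1 unit from its centroid, so J = 4.0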


4. Learn with Code: Implementing K-Means from Scratch

import numpy as np
import matplotlib.pyplot as plt

class KMeans:
    def __init__(self, k=3, max_iters=100, tol=1e-4):
        """
        K-Means clustering algorithm implementation.
        
        Parameters:
        k (int): Number of clusters.
        max_iters (int): Maximum iterations.
        tol (float): Convergence threshold.
        """
        self.k = k
        self.max_iters = max_iters
        self.tol = tol
        self.centroids = None

    def fit(self, X):
        """
        Train K-Means on dataset X.
        """
        np.random.seed(42)  # For reproducibility
        # Initialize centroids as k distinct data points chosen at random.
        self.centroids = X[np.random.choice(X.shape[0], self.k, replace=False)]
        
        for _ in range(self.max_iters):
            clusters = self.assign_clusters(X)
            # Recompute each centroid as the mean of its assigned points; if a
            # cluster ends up empty, keep its previous centroid to avoid NaNs.
            new_centroids = np.array([
                X[clusters == i].mean(axis=0) if np.any(clusters == i) else self.centroids[i]
                for i in range(self.k)
            ])
            
            if np.max(np.abs(new_centroids - self.centroids)) < self.tol:
                break
            self.centroids = new_centroids

    def assign_clusters(self, X):
        """
        Assign each data point to the nearest centroid.
        """
        # Squared Euclidean distance from every point to every centroid via
        # broadcasting; result has shape (n_samples, k).
        distances = np.sum((X[:, None] - self.centroids) ** 2, axis=2)
        return np.argmin(distances, axis=1)

    def predict(self, X):
        """
        Predict cluster for new data points.
        """
        return self.assign_clusters(X)

# Example Usage:
if __name__ == "__main__":
    from sklearn.datasets import make_blobs
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
    
    kmeans = KMeans(k=3)
    kmeans.fit(X)
    clusters = kmeans.predict(X)
    
    # Plot results
    plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', alpha=0.6)
    plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], s=200, c='red', marker='X')
    plt.title("K-Means Clustering")
    plt.show()

5. Key Takeaways

  • K-Means is an iterative clustering method that minimizes WCSS.
  • Choosing K is crucial; the Elbow Method (sketched below) is often used to pick a sensible number of clusters.
  • K-Means assumes clusters are spherical and evenly sized—it may not work well for non-convex clusters.
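
As a minimal sketch of the Elbow Method, the snippet below fits scikit-learn's KMeans for a range of K values and plots the WCSS (exposed as inertia_); the "elbow" where the curve flattens suggests a reasonable K. The synthetic blobs mirror the data used in Section 4.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 9)
wcss_values = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss_values, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()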

🤖 This post was AI-assisted but reviewed and refined by Dr. Rebecca Li, ensuring clarity and accuracy.