Understanding K-Means Clustering: A Step-by-Step Guide

1. What is K-Means Clustering?

K-Means is an unsupervised learning algorithm used for clustering data points into groups based on similarity. It is widely used in data segmentation, customer profiling, and image compression.

How Does K-Means Work?

  1. Select K (the number of clusters).
  2. Randomly initialize K centroids.
  3. Assign each data point to the nearest centroid (forming clusters).
  4. Update centroids by computing the mean of all points in a cluster.
  5. Repeat steps 3-4 until the centroids stop moving or another stopping condition is met (steps 3-4 are sketched in code below).
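
To make steps 3-4 concrete, here is a minimal NumPy sketch of a single assign-and-update pass; the toy points and starting centroids are made up purely for illustration, and a full implementation follows in Section 4.

import numpy as np

# Toy data: six 2-D points and K = 2 current centroids (assumed given).
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
centroids = np.array([[1.0, 1.0], [8.0, 8.0]])

# Step 3: assign each point to its nearest centroid (squared Euclidean distance).
distances = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)  # shape (6, 2)
labels = np.argmin(distances, axis=1)

# Step 4: move each centroid to the mean of the points assigned to it.
new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
print(labels)
print(new_centroids)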

2. Intuitive Example: Grouping Customers by Spending Behavior

Imagine a shopping mall wants to group customers into three clusters:

  • Low spenders
  • Medium spenders
  • High spenders

K-Means can analyze past transactions and segment customers into groups with similar spending habits, helping businesses target their marketing efforts efficiently.
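
As a rough sketch of how this might look in practice, the snippet below clusters synthetic customer data with scikit-learn's KMeans; the two features (annual spend and visits per month) and all of the numbers are invented purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic customers: [annual spend in $, visits per month] for three rough groups.
low    = rng.normal(loc=[300, 1],   scale=[80, 0.5],  size=(40, 2))
medium = rng.normal(loc=[1500, 4],  scale=[200, 1.0], size=(40, 2))
high   = rng.normal(loc=[6000, 10], scale=[500, 2.0], size=(40, 2))
customers = np.vstack([low, medium, high])

# Cluster into K = 3 groups with similar spending behavior.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(customers)
print(kmeans.labels_[:5])        # cluster index of the first five customers
print(kmeans.cluster_centers_)   # average spend and visit rate per segment

Because K-Means is distance-based, features on very different scales (dollars vs. visit counts) would normally be standardized first, for example with scikit-learn's StandardScaler.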


3. K-Means Algorithm Explained Mathematically

K-Means minimizes the Within-Cluster Sum of Squares (WCSS):

\[J = \sum_{i=1}^{K} \sum_{x \in C_i} || x - \mu_i ||^2\]

Where:

  • \( K \) = Number of clusters
  • \( x \) = Data point
  • \( \mu_i \) = Centroid of cluster \( C_i \)
  • \( || x - \mu_i ||^2 \) = Squared Euclidean distance between a point and its cluster centroid

The goal is to find cluster assignments that minimize this function.
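
As a quick sanity check of the formula, the sketch below evaluates \( J \) directly from a set of points, their cluster assignments, and the centroids; the tiny arrays are made up for illustration.

import numpy as np

def wcss(X, labels, centroids):
    """Within-Cluster Sum of Squares: total squared distance to assigned centroids."""
    return sum(
        np.sum((X[labels == i] - centroids[i]) ** 2)
        for i in range(len(centroids))
    )

# Tiny example: two clusters whose centroids are the cluster means.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 11.0]])
print(wcss(X, labels, centroids))  # each point is 1 unit from its centroid, so J = 4.0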


4. Learn with Code: Implementing K-Means from Scratch

import numpy as np
import matplotlib.pyplot as plt

class KMeans:
    def __init__(self, k=3, max_iters=100, tol=1e-4):
        """
        K-Means clustering algorithm implementation.
        
        Parameters:
        k (int): Number of clusters.
        max_iters (int): Maximum iterations.
        tol (float): Convergence threshold.
        """
        self.k = k
        self.max_iters = max_iters
        self.tol = tol
        self.centroids = None

    def fit(self, X):
        """
        Train K-Means on dataset X.
        """
        np.random.seed(42)  # For reproducibility
        # Initialize centroids as k distinct data points chosen at random.
        self.centroids = X[np.random.choice(X.shape[0], self.k, replace=False)]
        
        for _ in range(self.max_iters):
            clusters = self.assign_clusters(X)
            # Recompute each centroid as the mean of its assigned points; if a
            # cluster ends up empty, keep its previous centroid to avoid NaNs.
            new_centroids = np.array([
                X[clusters == i].mean(axis=0) if np.any(clusters == i) else self.centroids[i]
                for i in range(self.k)
            ])
            
            if np.max(np.abs(new_centroids - self.centroids)) < self.tol:
                break
            self.centroids = new_centroids

    def assign_clusters(self, X):
        """
        Assign each data point to the nearest centroid.
        """
        # Squared Euclidean distance from every point to every centroid via
        # broadcasting; result has shape (n_samples, k).
        distances = np.sum((X[:, None] - self.centroids) ** 2, axis=2)
        return np.argmin(distances, axis=1)

    def predict(self, X):
        """
        Predict cluster for new data points.
        """
        return self.assign_clusters(X)

# Example Usage:
if __name__ == "__main__":
    from sklearn.datasets import make_blobs
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
    
    kmeans = KMeans(k=3)
    kmeans.fit(X)
    clusters = kmeans.predict(X)
    
    # Plot results
    plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', alpha=0.6)
    plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], s=200, c='red', marker='X')
    plt.title("K-Means Clustering")
    plt.show()

5. Key Takeaways

  • K-Means is an iterative clustering method that minimizes WCSS.
  • Choosing K is crucial; the Elbow Method (sketched below) is often used to pick a sensible number of clusters.
  • K-Means assumes clusters are spherical and evenly sized—it may not work well for non-convex clusters.
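
As a minimal sketch of the Elbow Method, the snippet below fits scikit-learn's KMeans for a range of K values and plots the WCSS (exposed as inertia_); the "elbow" where the curve flattens suggests a reasonable K. The synthetic blobs mirror the data used in Section 4.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 9)
wcss_values = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss_values, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()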

🤖 This post was AI-assisted but reviewed and refined by Dr. Rebecca Li, ensuring clarity and accuracy.