Understanding K-Means Clustering: A Step-by-Step Guide
Published:
1. What is K-Means Clustering?
K-Means is an unsupervised learning algorithm used for clustering data points into groups based on similarity. It is widely used in data segmentation, customer profiling, and image compression.
How Does K-Means Work?
- Select K (the number of clusters).
- Randomly initialize K centroids.
- Assign each data point to the nearest centroid (forming clusters).
- Update centroids by computing the mean of all points in a cluster.
- Repeat steps 3-4 until centroids stop moving or a stopping condition is met.
2. Intuitive Example: Grouping Customers by Spending Behavior
Imagine a shopping mall wants to group customers into three clusters:
- Low spenders
- Medium spenders
- High spenders
K-Means can analyze past transactions and segment customers into groups with similar spending habits, helping businesses target their marketing efforts efficiently.
3. K-Means Algorithm Explained Mathematically
K-Means minimizes the Within-Cluster Sum of Squares (WCSS):
\[J = \sum_{i=1}^{K} \sum_{x \in C_i} || x - \mu_i ||^2\]Where:
- \( K \) = Number of clusters
- \( x \) = Data point
- \( \mu_i ) = Centroid of cluster \( C_i \)
\( x - \mu_i ^2 \) = Squared distance between a point and its cluster centroid
The goal is to find cluster assignments that minimize this function.
4. Learn with Code: Implementing K-Means from Scratch
import numpy as np
import matplotlib.pyplot as plt
class KMeans:
def __init__(self, k=3, max_iters=100, tol=1e-4):
"""
K-Means clustering algorithm implementation.
Parameters:
k (int): Number of clusters.
max_iters (int): Maximum iterations.
tol (float): Convergence threshold.
"""
self.k = k
self.max_iters = max_iters
self.tol = tol
self.centroids = None
def fit(self, X):
"""
Train K-Means on dataset X.
"""
np.random.seed(42) # For reproducibility
self.centroids = X[np.random.choice(X.shape[0], self.k, replace=False)]
for _ in range(self.max_iters):
clusters = self.assign_clusters(X)
new_centroids = np.array([X[clusters == i].mean(axis=0) for i in range(self.k)])
if np.max(np.abs(new_centroids - self.centroids)) < self.tol:
break
self.centroids = new_centroids
def assign_clusters(self, X):
"""
Assign each data point to the nearest centroid.
"""
distances = np.sum((X[:, None] - self.centroids) ** 2, axis=2)
return np.argmin(distances, axis=1)
def predict(self, X):
"""
Predict cluster for new data points.
"""
return self.assign_clusters(X)
# Example Usage:
if __name__ == "__main__":
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(k=3)
kmeans.fit(X)
clusters = kmeans.predict(X)
# Plot results
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', alpha=0.6)
plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], s=200, c='red', marker='X')
plt.title("K-Means Clustering")
plt.show()
5. Key Takeaways
- K-Means is an iterative clustering method that minimizes WCSS.
- Choosing K is crucial—Elbow Method is often used to determine the optimal number of clusters.
- K-Means assumes clusters are spherical and evenly sized—it may not work well for non-convex clusters.
🤖 This post was AI-assisted but reviewed and refined by Dr. Rebecca Li, ensuring clarity and accuracy.