Understanding Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is an optimization algorithm used in machine learning and deep learning to minimize the loss function. Unlike standard Gradient Descent, which computes the gradient using the entire dataset, SGD updates the model one sample at a time, making it more efficient for large datasets.
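To make the contrast concrete, here is a minimal sketch (the toy data and variable names are illustrative only, not taken from a real dataset) of one batch update versus one stochastic update for a simple linear model y ≈ w·x with squared-error loss:

import numpy as np

# Toy data: y is roughly 3x (illustrative values only)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 3 * X
w, lr = 0.0, 0.01

# Batch gradient descent: one update averages the gradient over ALL samples
batch_grad = np.mean(-2 * X * (y - w * X))
w_batch = w - lr * batch_grad

# Stochastic gradient descent: one update uses the gradient of a SINGLE sample
i = np.random.randint(len(X))
sgd_grad = -2 * X[i] * (y[i] - w * X[i])
w_sgd = w - lr * sgd_grad

The batch step touches every sample before moving; the stochastic step moves immediately after seeing just one, which is what makes SGD cheap per update on large datasets.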

1. Why Use SGD?

  • Faster for large datasets – Processes one sample per step instead of the entire dataset.
  • Efficient for online learning – Updates the model in real time as new data arrives.
  • Less prone to getting stuck than batch gradient descent – The noise from random single-sample updates can help the optimizer escape saddle points and shallow local minima.


2. How Does SGD Work?

  1. Initialize model parameters (weights and biases).
  2. Randomly shuffle data (to avoid bias in updates).
  3. Pick one sample from the dataset.
  4. Compute the gradient of the loss function for that sample.
  5. Update the parameters using the rule below (a worked single-step example follows this list):

    w = w - η∇L(w)

    Where:

    • w = Model weights
    • η (eta) = Learning rate
    • ∇L(w) = Gradient of the loss function with respect to w
  6. Repeat for all samples until convergence.
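
To see what a single update step looks like, here is a tiny worked example (the sample, weight, and learning-rate values are made up for illustration) for the model y_hat = w * x with squared-error loss L(w) = (y - w*x)^2:

# One SGD update step for y_hat = w * x with loss L(w) = (y - w*x)^2
x_i, y_i = 2.0, 5.0   # a single training sample (illustrative values)
w, eta = 0.5, 0.01    # current weight and learning rate (illustrative values)

grad = -2 * x_i * (y_i - w * x_i)  # dL/dw at this sample: -2 * 2 * (5 - 1) = -16
w = w - eta * grad                 # w = 0.5 - 0.01 * (-16)
print(w)                           # ≈ 0.66: the weight moves toward the data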

3. Intuitive Example: Walking Down a Hill

Imagine you’re trying to reach the lowest point of a hilly landscape.

  • Batch Gradient Descent: Looks at the entire landscape before taking a step.
  • SGD: Takes small, random steps based on immediate surroundings, making it faster but noisier.

SGD is like finding the fastest way down by making quick decisions, rather than analyzing the entire terrain first.


4. Learn with Code: Implementing SGD from Scratch

import numpy as np
import matplotlib.pyplot as plt

def compute_gradient(x, y, w):
    """
    Gradient of the squared-error loss (y - w*x)^2 with respect to w,
    for a single sample under the linear model y_hat = w * x.
    """
    return -2 * x * (y - (w * x))

def stochastic_gradient_descent(X, y, lr=0.01, epochs=100):
    """
    Perform Stochastic Gradient Descent (SGD).
    
    Parameters:
    X (ndarray): Input features.
    y (ndarray): Target values.
    lr (float): Learning rate.
    epochs (int): Number of passes through the dataset.
    
    Returns:
    w (float): Final weight estimate.
    losses (list): Mean squared error on the full dataset after each epoch.
    """
    w = np.random.randn()  # Initialize weight
    losses = []
    
    for epoch in range(epochs):
        # Visit samples in a random order each epoch (step 2 of the algorithm)
        for i in np.random.permutation(len(X)):
            gradient = compute_gradient(X[i], y[i], w)
            w -= lr * gradient  # Parameter update
        
        # Track the mean squared error on the full dataset after each epoch
        loss = np.mean((y - (w * X)) ** 2)
        losses.append(loss)
    
    return w, losses

# Example Usage:
if __name__ == "__main__":
    np.random.seed(42)
    X = np.linspace(0, 10, 50)
    y = 3 * X + np.random.randn(50) * 2  # Linear function with noise
    
    final_weight, loss_history = stochastic_gradient_descent(X, y, lr=0.01, epochs=50)
    
    # Plot loss over time
    plt.plot(loss_history)
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.title("SGD Loss Over Time")
    plt.show()
    
    print(f"Final estimated weight: {final_weight:.2f}")
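
As a quick sanity check, the result can be compared against an off-the-shelf SGD implementation. The following sketch assumes scikit-learn is installed and configures SGDRegressor roughly like the example above (constant learning rate, no intercept); both estimates should land near the slope of 3 used to generate the data:

# Optional comparison with scikit-learn's built-in SGD (assumes scikit-learn is installed)
import numpy as np
from sklearn.linear_model import SGDRegressor

np.random.seed(42)
X = np.linspace(0, 10, 50)
y = 3 * X + np.random.randn(50) * 2

sk_model = SGDRegressor(learning_rate="constant", eta0=0.01,
                        max_iter=50, fit_intercept=False, tol=None)
sk_model.fit(X.reshape(-1, 1), y)  # scikit-learn expects a 2-D feature matrix
print(f"scikit-learn estimated weight: {sk_model.coef_[0]:.2f}")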

5. Key Takeaways

  • SGD updates parameters one sample at a time, making it faster for large datasets.
  • It introduces randomness, which can help escape local minima.
  • The learning rate is crucial: too high makes training unstable, too low slows convergence (see the quick experiment below).
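
To see the learning-rate trade-off in practice, here is a small experiment (a sketch that reuses the stochastic_gradient_descent function and the X, y data defined in the example above) comparing a conservative and an aggressive learning rate:

# Compare a small vs. a large learning rate (reuses the function and data from above)
for lr in (0.001, 0.05):
    w_est, losses = stochastic_gradient_descent(X, y, lr=lr, epochs=50)
    print(f"lr={lr}: final weight = {w_est:.2f}, final loss = {losses[-1]:.2f}")
# Expect slow but steady progress with the tiny rate, and an unstable or exploding
# loss with the large one (overflow warnings are a symptom of divergence).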

🤖 Disclaimer: This post is inspired by an Educative.io AI learning course and was generated with AI assistance, then reviewed and refined by Dr. Rebecca Li, blending AI efficiency with human expertise for a balanced perspective.