Machines learn from their mistakes through optimization: adjusting a model's parameters to reduce its error. No matter how well a model is designed, without effective optimization it won't reach good accuracy, and that makes it ineffective.
In this two-part series, we explore what gradient descent is, walk through popular optimizers, and dive into how these relate to dynamic programming. Part 1 covers the core optimizers — from classic gradient descent through momentum-based methods.
Gradient Descent
The most basic optimization algorithm in machine learning is gradient descent. Imagine you're standing on a mountain and want to reach the lowest point in the valley. As a human, you can look over the terrain and decide where to step next. A machine can't see the whole landscape. It can only measure the slope at its current position, which is the gradient of the loss, and then take a small step in the direction where the loss decreases fastest: the negative gradient.
This repeats until no further step improves the result. That stopping point is a minimum of the loss function: the global minimum when the loss is convex, but possibly only a local minimum (or another point where the gradient is zero) otherwise.
Mathematics Behind Gradient Descent
We minimize some function L(θ), where L is the loss function and θ represents the model's parameters. Each step applies the update
θ ← θ − α·∇L(θ)
where:
- ∇L(θ) is the gradient, the direction in which the function increases fastest
- α is the learning rate (step size)
- −α·∇L(θ) is a step in the direction opposite to the gradient, i.e. downhill
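The pieces above can be sketched in a few lines of Python. This is a minimal illustration, not a production optimizer: the quadratic loss, the function names, the learning rate of 0.1, and the step count are all assumptions chosen so the behavior is easy to follow. Each iteration subtracts α times the gradient from θ.

```python
def loss(theta):
    # Hypothetical example loss with its minimum at theta = 3.
    return (theta - 3.0) ** 2


def grad(theta):
    # Analytic gradient of the example loss above.
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, alpha=0.1, steps=100):
    # Repeatedly step against the gradient: theta <- theta - alpha * grad.
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


theta_final = gradient_descent(theta=0.0)
print(theta_final, loss(theta_final))
```

On this convex example the iterate approaches θ = 3. With a much larger α (for instance above 1.0 here) the same loop would diverge, which is why the step size matters.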
Problems with Basic Gradient Descent
Types of Gradient Descent
- Stochastic GD (SGD) — Updates using one data point at a time. Fast but noisy. Like asking one random person for directions.
- Batch GD — Uses the entire dataset per step. Stable but slow for large datasets. Like surveying the entire city before making one move.
- Mini-Batch GD — Most commonly used. Splits data into batches of 32/64/128. Best balance of speed and stability.
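The three variants differ only in how many examples feed each update, so one loop with a batch_size parameter can sketch all of them. Everything specific here is an assumption for illustration: the one-weight model y = w·x, the noise-free synthetic data with true slope 2, the learning rate, and the helper names.

```python
import random


def make_data(n=256, seed=0):
    # Synthetic, noise-free data for a hypothetical model y = 2 * x.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x) for x in xs]


def batch_gradient(w, batch):
    # Gradient of the mean squared error (w*x - y)^2 with respect to w.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, alpha=0.5, epochs=50, seed=0):
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not shuffled
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        # One parameter update per batch.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= alpha * batch_gradient(w, batch)
    return w


data = make_data()
w_sgd = train(data, batch_size=1)            # stochastic: one example per update
w_batch = train(data, batch_size=len(data))  # batch: whole dataset per update
w_mini = train(data, batch_size=32)          # mini-batch: 32 examples per update
print(w_sgd, w_batch, w_mini)
```

With 256 examples, one epoch performs 256 updates at batch_size=1, 8 updates at batch_size=32, and a single update for full-batch. Because the targets here carry no noise, all three settle near the same weight. On real, noisy data the single-example updates would fluctuate more from step to step.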
Momentum: Speeding Up Gradient Descent
Instead of updating weights purely based on the current gradient, momentum adds "memory" — it remembers the direction it was already moving and keeps pushing that way.
Update Rule
New velocity = "a bit of old velocity" + "new push from current gradient"
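One common way to write this rule, assumed here because conventions vary between libraries, is v ← β·v − α·∇L(θ) followed by θ ← θ + v, where β (often around 0.9) controls how much of the old velocity is kept. The sketch below applies it to a hypothetical quadratic loss. The names v, beta, and alpha and all numeric values are illustrative choices.

```python
def grad(theta):
    # Gradient of a hypothetical loss (theta - 3)^2 with minimum at theta = 3.
    return 2.0 * (theta - 3.0)


def momentum_descent(theta, alpha=0.1, beta=0.9, steps=300):
    v = 0.0  # velocity starts at rest
    for _ in range(steps):
        # "A bit of old velocity" plus "a new push from the current gradient".
        v = beta * v - alpha * grad(theta)
        theta = theta + v
    return theta


theta_momentum = momentum_descent(theta=0.0)
print(theta_momentum)
```

Setting beta=0.0 removes the memory term and reduces the loop to plain gradient descent.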
Nesterov Accelerated Gradient (NAG)
NAG is an improved version of momentum. With regular momentum, you calculate the gradient at your current position and then step using the combined velocity and gradient. With NAG, you first look ahead using the velocity, estimate where you'll land, and calculate the gradient at that future point, which helps avoid overshooting.
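The look-ahead idea can be sketched by changing only where the gradient is evaluated. The formulation assumed here is one common variant: v ← β·v − α·∇L(θ + β·v), then θ ← θ + v. The quadratic loss, the symbol names, and the hyperparameter values are assumptions for illustration.

```python
def grad(theta):
    # Gradient of a hypothetical loss (theta - 3)^2 with minimum at theta = 3.
    return 2.0 * (theta - 3.0)


def nesterov_descent(theta, alpha=0.1, beta=0.9, steps=300):
    v = 0.0
    for _ in range(steps):
        # Estimate where the current velocity alone would carry us.
        lookahead = theta + beta * v
        # Evaluate the gradient at that look-ahead point, not at theta.
        v = beta * v - alpha * grad(lookahead)
        theta = theta + v
    return theta


theta_nag = nesterov_descent(theta=0.0)
print(theta_nag)
```

The only difference from the plain momentum update is the argument passed to grad: the gradient is measured at the predicted position, so the correction arrives one step earlier.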