Vasist Kandagatla

03 Things I Built

Projects

watermelonDB

Disk-backed B-tree–style storage engine in Go using fixed-size 4KB pages. Features leaf node layouts with sorted variable-length key–value cells, binary search–based lookup, and manual binary encoding with direct page-level disk I/O.

AI Terminal

Developer CLI that translates natural language to shell commands. Maintains execution context using recent command history for improved accuracy. Extensible plugin architecture for tooling automation — Git, Docker, and more.

YT Summarizer

Full-stack YouTube video summarizer with a Python/FastAPI backend and a JS frontend. Fetches transcripts and condenses long-form content into concise, readable summaries using LLMs.

URL Shortener

URL shortening service built entirely in Haskell — a deliberate dive into functional programming. Demonstrates type-safe routing, pure functional design, and backend service construction outside the mainstream stack.

lockstep-pong

Deterministic multiplayer Pong using lockstep networking — the same architecture behind StarCraft and modern fighting games. Simulation core in C, networking in Go. Synced across two independent processes via UDP with only 14 bytes per packet. FNV-1a state hashing verified across every tick for guaranteed sync.

Go C CGo UDP Networking
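
As a rough illustration of the per-tick state check described above, here is a minimal Python sketch of 32-bit FNV-1a. The project itself is written in C and Go, and the hash width, the `hash_state` helper, and the state fields are assumptions made for this example, not the project's actual layout.

```python
import struct

FNV_OFFSET_BASIS_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the result in 32 bits
    return h


def hash_state(ball_x, ball_y, ball_vx, ball_vy, paddle_a, paddle_b):
    """Hash a hypothetical Pong state packed as six little-endian int32s."""
    packed = struct.pack("<6i", ball_x, ball_y, ball_vx, ball_vy,
                         paddle_a, paddle_b)
    return fnv1a_32(packed)


# Two peers that simulated the same tick should produce the same hash.
peer_one = hash_state(120, 80, 3, -2, 40, 55)
peer_two = hash_state(120, 80, 3, -2, 40, 55)
desynced = hash_state(121, 80, 3, -2, 40, 55)  # ball_x differs by one
```

If the peers exchange these hashes each tick, a mismatch signals that the two simulations have diverged.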

07 Writing

Blogs & Articles

Medium · Jul 2025 · 10 min read

How Machines Learn: A Beginner's Guide to Gradient Descent & Momentum

Part 1 — Core optimizers from classic gradient descent through momentum-based methods. Covers SGD, Batch GD, Mini-Batch GD, Momentum, and Nesterov Accelerated Gradient.


Machines learn from their mistakes using a technique called optimization. No matter how well you design a model, without proper optimization it won't achieve good accuracy, and that makes it ineffective.

In this two-part series, we explore what gradient descent is, walk through popular optimizers, and dive into how these relate to dynamic programming. Part 1 covers the core optimizers — from classic gradient descent through momentum-based methods.

Gradient Descent

The most basic optimization algorithm in machine learning is gradient descent. Imagine you're standing on a mountain and want to reach the lowest point in the valley. As a human, you can look at the terrain and decide where to step next. A machine can't see the landscape. Instead, it measures the slope at its current position (the gradient of the loss) and takes a small step in the direction where the loss decreases fastest: the negative gradient.

This repeats until no further step improves the result, that is, until the gradient is close to zero. Ideally that point is the global minimum of the loss function, but it can also be a local minimum or a flat region, as discussed below.

Mathematics Behind Gradient Descent

We minimize some function L(θ), where L is the loss function and θ represents the model's parameters.

θ = θ − α · ∇L(θ)

∇L(θ) → gradient (direction in which the function increases fastest)
α → learning rate (step size)
−α · ∇L(θ) → a step in the opposite direction of the gradient, i.e. downhill

You're standing on a slope (your current θ). You figure out the steepest upward direction and go the other way. Take a step of size α. Repeat until the slope flattens.
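
The update rule above can be sketched in a few lines of Python. This is a minimal one-dimensional illustration, not a production optimizer: the function name, the stopping tolerance `tol`, and the toy loss L(θ) = (θ − 3)² are all assumptions chosen for the example.

```python
def gradient_descent(grad, theta, alpha=0.1, steps=100, tol=1e-8):
    """Repeat theta <- theta - alpha * grad(theta) until the slope flattens."""
    for _ in range(steps):
        g = grad(theta)
        if abs(g) < tol:  # slope is almost flat, so stop early
            break
        theta = theta - alpha * g  # step against the gradient (downhill)
    return theta


# Hypothetical loss L(theta) = (theta - 3)**2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
```

Starting from θ = 0, each step shrinks the distance to the minimum at θ = 3 by a constant factor, so `theta_min` ends up very close to 3.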

Problems with Basic Gradient Descent

⚠ 1. Stuck in Local Minima

The machine calculates the gradient, finds it's zero, and assumes it's done — even though a much lower point exists nearby. If GD settles in the wrong dip, the model ends up with suboptimal weights and lower accuracy.

⚠ 2. Slow in Flat Regions

In plateau areas, the slope is nearly zero so gradient descent takes baby steps — convergence becomes painfully slow even when moving in the right direction.

⚠ 3. Oscillations in Steep Areas

In narrow valleys, updates overshoot — instead of smoothly reaching the minimum, the optimizer bounces back and forth, never fully settling.
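
The overshooting in the third problem can be reproduced on a one-dimensional toy loss. The sketch below assumes L(θ) = θ² (gradient 2θ) and three hand-picked learning rates. It only illustrates how step size interacts with curvature and does not model a real loss surface.

```python
def run_gd(alpha, theta=1.0, steps=5):
    """Plain gradient descent on L(theta) = theta**2, returning every iterate."""
    path = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # gradient of theta**2 is 2 * theta
        path.append(theta)
    return path


small = run_gd(alpha=0.1)      # shrinks steadily toward 0, never crosses it
bouncy = run_gd(alpha=0.9)     # sign flips every step, but magnitude shrinks
diverging = run_gd(alpha=1.1)  # sign flips every step and magnitude grows
```

On this loss each update multiplies θ by (1 − 2α), so any α above 0.5 jumps past the minimum on every step, and any α above 1 moves farther away each time.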

Types of Gradient Descent

Stochastic GD (SGD) — Updates using one data point at a time. Fast but noisy. Like asking one random person for directions.
Batch GD — Uses the entire dataset per step. Stable but slow for large datasets. Like surveying the entire city before making one move.
Mini-Batch GD — Most commonly used. Splits data into batches of 32/64/128. Best balance of speed and stability.
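
The three variants differ only in how many data points feed each update, which one function with a `batch_size` parameter can show. This is a sketch under simple assumptions: a one-parameter model y ≈ w · x, mean squared error, and a tiny made-up dataset. The function and variable names are illustrative.

```python
import random


def minibatch_gd(xs, ys, batch_size, alpha=0.05, epochs=200, seed=0):
    """Fit y ~ w * x by gradient descent on mean squared error.

    batch_size = 1        -> stochastic GD
    batch_size = len(xs)  -> batch GD
    anything in between   -> mini-batch GD
    """
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the data in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of mean((w * x - y)**2) over the batch, w.r.t. w.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= alpha * grad
    return w


xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.0, 2.0, 3.0, 4.0]  # exactly y = 2x, so every variant should find w near 2
w_sgd = minibatch_gd(xs, ys, batch_size=1)
w_mini = minibatch_gd(xs, ys, batch_size=2)
w_batch = minibatch_gd(xs, ys, batch_size=len(xs))
```

Because this toy data has no noise, all three settings converge to the same weight. On real data the smaller batches would give noisier but cheaper updates.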

Momentum: Speeding Up Gradient Descent

Instead of updating weights purely based on the current gradient, momentum adds "memory" — it remembers the direction it was already moving and keeps pushing that way.

Think of pushing a ball down a hill. At first it rolls slowly, but it builds speed. Even when the slope becomes shallow, it keeps rolling because of the momentum it has gained.

Update Rule

vₜ = γ · vₜ₋₁ + η · ∇L(θₜ)
θₜ₊₁ = θₜ − vₜ

γ → momentum coefficient (~0.9), how much previous velocity to retain
η → learning rate
vₜ → current velocity (how fast we're moving in parameter space)

New velocity = "a bit of old velocity" + "new push from current gradient"
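
A minimal Python sketch of this update rule follows. It assumes a one-dimensional parameter, an initial velocity of zero, and the same toy loss L(θ) = (θ − 3)² used purely for illustration. The function name and default values are assumptions.

```python
def momentum_gd(grad, theta, eta=0.1, gamma=0.9, steps=300):
    """Gradient descent with momentum: v = gamma * v + eta * grad, theta -= v."""
    v = 0.0  # start at rest
    for _ in range(steps):
        v = gamma * v + eta * grad(theta)  # keep some old velocity, add new push
        theta = theta - v                  # move by the velocity
    return theta


# Hypothetical loss L(theta) = (theta - 3)**2, whose gradient is 2 * (theta - 3).
theta_min = momentum_gd(lambda t: 2 * (t - 3), theta=0.0)
```

With γ = 0.9 the iterate swings past θ = 3 a few times before settling, which is the ball carrying its speed through the bottom of the valley.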

⚠ Issue: Saddle Points

A saddle point is flat but is neither a peak nor a true valley: some directions go up, others go down. Momentum can help escape shallow local minima because it carries velocity forward. Near a saddle point, though, the gradient is close to zero in every direction, so each new push is tiny and the optimizer can drift slowly before it finds a direction to descend.

Nesterov Accelerated Gradient (NAG)

NAG is an improved version of momentum. With regular momentum, you calculate the gradient at your current position and then take the step. With NAG, you first look ahead using the velocity, estimate where you'll land, and calculate the gradient at that future point, which helps avoid overshooting.

Picture running downhill with momentum when you can't see ahead. NAG is like someone whispering "there's a sharp turn coming!": it looks ahead, realizes you're going too fast, and adjusts before you overshoot.

NAG Math

1. Lookahead: θ̃ₜ = θₜ − γvₜ₋₁
2. Gradient: vₜ = γvₜ₋₁ + η∇L(θ̃ₜ)
3. Update: θₜ₊₁ = θₜ − vₜ
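
The three steps translate almost line for line into Python. As with the earlier formulas, this is a one-dimensional sketch with an assumed toy loss L(θ) = (θ − 3)². The function name and default hyperparameters are illustrative.

```python
def nag(grad, theta, eta=0.1, gamma=0.9, steps=300):
    """Nesterov Accelerated Gradient on a single scalar parameter."""
    v = 0.0  # start at rest
    for _ in range(steps):
        lookahead = theta - gamma * v          # 1. where the velocity would take us
        v = gamma * v + eta * grad(lookahead)  # 2. gradient at the lookahead point
        theta = theta - v                      # 3. update the real parameter
    return theta


# Hypothetical loss L(theta) = (theta - 3)**2, whose gradient is 2 * (theta - 3).
theta_min = nag(lambda t: 2 * (t - 3), theta=0.0)
```

The only difference from plain momentum is where the gradient is evaluated: at `lookahead` instead of at `theta`.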

⚠ NAG Limitations

NAG still struggles in flat regions and at saddle points: when gradients are near zero, it stalls. It is also still sensitive to the learning rate, and a high learning rate can cause oscillations on highly curved surfaces. These are among the reasons adaptive optimizers like Adam were developed, which Part 2 covers.

Read on Medium ↗

Notion · Data Analytics

IPL Analytics: Insights from the Cricket Data

A deep-dive data analysis into IPL statistics — uncovering patterns from match results, player performance, run rates, and team dynamics across seasons. Built with data exploration and visualization to surface non-obvious cricket insights.

Cricket Data Analytics IPL Statistics Visualization

Read on Notion ↗


