Understanding Gradient Descent
An intuitive, math‑lite walkthrough of gradient descent: loss functions, optimization goals, why closed‑form solutions fail, and how gradients guide learning.
Machine learning models do not learn in the way humans do. They do not “understand” data or “think” about patterns. Instead, they optimize. At the heart of this optimization process lies one of the most fundamental algorithms in machine learning: Gradient Descent.

This blog takes a theoretical yet intuitive approach to explain:
- What a loss function really represents
- Why we cannot directly compute optimal solutions
- How gradients guide the learning process
Learning as an Optimization Problem
At its core, supervised machine learning has a straightforward goal: find a model that best maps inputs to outputs. But what does “best” really mean, and how do we achieve it?
Mathematically, this means optimizing a set of parameters, called weights and biases, so that our model’s predictions are as close as possible to the true values. To break it down:
- Let 𝑥 be our input data (which could be anything like images, text, or sensor readings).
- Let 𝑦 be the true output or the label we’re trying to predict.
- 𝑦̂ = 𝑓(𝑥;𝜃) represents the predicted output, where 𝑓 is our model and 𝜃 are the parameters (weights and biases) that govern the model’s behavior.
Thus, the task of machine learning is reduced to finding the best parameters 𝜃, which generate predictions closest to the true output 𝑦. But here’s the tricky part: What exactly does “best” mean in this context?
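As a concrete sketch of 𝑦̂ = 𝑓(𝑥;𝜃), here is a simple linear model in NumPy. The function name, weights, and inputs are illustrative choices, not something prescribed by the theory:

```python
import numpy as np

# A minimal sketch of y_hat = f(x; theta): a linear model whose
# parameters theta are a weight vector w and a bias b.
def f(x, theta):
    w, b = theta
    return x @ w + b  # predicted output y_hat

x = np.array([[1.0, 2.0], [3.0, 4.0]])  # two inputs with two features each
theta = (np.array([0.5, -0.2]), 0.1)    # arbitrary initial parameters
y_hat = f(x, theta)                     # one prediction per input row
```

Learning, in this framing, is nothing more than searching for the `theta` that makes `y_hat` track the true labels as closely as possible.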
At a high level, we aim to minimize the difference between the model’s predictions and the true values. To do this effectively, we need a systematic way to measure the “goodness” of our model’s performance, and this is where loss functions come in.
We Need a Loss Function to Measure Mistakes
In machine learning, a model can’t judge whether its predictions are right or wrong on its own. We need a way to quantify how far off the model’s predictions are from the true values. Enter the loss function.
What is a Loss Function?
A loss function is a mathematical tool that measures how “bad” a model’s predictions are. It gives us a numerical value that represents the error between the model’s predicted output 𝑦̂ and the true output 𝑦.
Mathematical Definition:
Loss = 𝐿(𝑦, 𝑦̂)
For example, in regression problems, where the goal is to predict continuous values, Mean Squared Error (MSE) might be used as the loss function. For classification tasks, where the model has to pick one category out of several, Cross-Entropy Loss might be more appropriate. Each type of loss function affects how the model learns and how it interprets its performance.
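Both losses are short to write down. The sketch below computes MSE on a regression pair and cross-entropy on one-hot labels with predicted class probabilities; the helper names are illustrative:

```python
import numpy as np

def mse(y, y_hat):
    # Mean Squared Error: average of squared differences
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, p_hat, eps=1e-12):
    # y: one-hot true labels, p_hat: predicted class probabilities.
    # eps guards against log(0).
    return -np.mean(np.sum(y * np.log(p_hat + eps), axis=1))

# Regression example: errors of -0.5 and +0.5 give MSE of 0.25
y = np.array([2.0, 0.0])
y_hat = np.array([2.5, -0.5])
reg_loss = mse(y, y_hat)

# Classification example: two samples, two classes
y_cls = np.array([[1.0, 0.0], [0.0, 1.0]])
p_hat = np.array([[0.8, 0.2], [0.4, 0.6]])
cls_loss = cross_entropy(y_cls, p_hat)
```

Lower values mean better predictions in both cases, which is exactly what makes “minimize the loss” a well-defined objective.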
Why Loss Functions Are Essential:
- Quantify Performance: Without a numerical measure, it’s impossible to know how good or bad the model’s predictions are.
- Define the Learning Goal: Different loss functions guide the model in different ways. The choice of loss function impacts the model’s learning behavior and what it focuses on improving.
- Make Learning Differentiable: Most loss functions are continuous and differentiable, which is crucial for optimization techniques like gradient descent.
In simple terms, the loss function is the “feedback” that tells the model how far it is from the right answer, which helps it adjust and improve.
How Optimization Uses the Loss Function
Once we have a loss function, the learning process becomes clear: Minimize the loss. In mathematical terms, this means finding the parameters 𝜃 that minimize the loss function 𝐿(𝜃).
Thus, we turn the problem of machine learning into an optimization problem — we are trying to find the best set of parameters that produce the lowest loss.
However, this raises an important question: If we know the loss function, why don’t we just compute the best parameters directly? The answer to this is a little more complicated.
Why Can’t We Compute the Optimal Solution Directly?
While it might seem tempting to calculate the best parameters 𝜃 directly, there are several deep theoretical and practical reasons why this isn’t always possible.
Closed-Form Solutions Rarely Exist
In simple models, like linear regression, we can sometimes calculate a direct solution using algebra. For example, in linear regression, the parameters can be computed using a closed-form equation:
𝜃 = (𝑋^T * 𝑋)^−1 * 𝑋^T * 𝑦
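On a small toy dataset this closed-form solution can be computed directly. The sketch below uses `np.linalg.solve` on the normal equations rather than an explicit matrix inverse, which is the numerically safer way to evaluate the same formula; the data is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Design matrix X: a bias column of ones plus one feature
X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, n)])
true_theta = np.array([1.0, 3.0])  # intercept 1, slope 3
y = X @ true_theta                 # noiseless targets for clarity

# Closed form: theta = (X^T X)^{-1} X^T y,
# computed by solving (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

With noiseless data the recovered `theta` matches `true_theta` exactly (up to floating-point precision), which is what makes linear regression one of the few models where direct computation works at all.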
But in modern machine learning, especially with complex models like neural networks, this closed-form solution doesn’t exist. The models are non-linear, the loss surfaces are high-dimensional, and the number of parameters can reach into the millions or even billions. For such models, there is no easy formula to compute the best parameters directly.
Computational Infeasibility
Even if a solution did exist, calculating it would often be computationally impractical. For example, computing matrix inverses can be very slow and resource-intensive, especially with large datasets. The memory and time required to compute these solutions would explode, making direct optimization infeasible.
This is why we rely on gradient descent, an optimization technique that avoids calculating the entire solution upfront and instead uses local information to make incremental improvements.
Non-Convex Loss Landscapes
In modern models like neural networks, the loss function is often non-convex, meaning it can have multiple local minima, saddle points, and flat regions. There’s no guarantee that any given point on the loss surface will be the lowest possible point (the global minimum). This adds another layer of complexity, as there’s no single equation that leads directly to the global minimum.
Viewing Loss as a Landscape
To better understand how optimization works, imagine the loss function as a surface, or “landscape.”
- The horizontal axes represent the model’s parameters (𝜃).
- The vertical axis represents the loss value.
The goal is to find the lowest point on this surface, where the loss is minimized.
However, there’s a catch: We can’t see the entire surface. We only know the loss value at our current position on the surface, so we need a strategy to decide where to move next.
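This local-information constraint can be made concrete with a finite-difference probe: without seeing the whole surface, evaluating the loss at two nearby points estimates the slope where we stand. The one-parameter loss here is an illustrative toy:

```python
def L(theta):
    # Toy loss with its minimum at theta = 3
    return (theta - 3.0) ** 2

def slope_estimate(L, theta, h=1e-5):
    # Central finite difference: approximate dL/dtheta using only
    # two nearby loss evaluations -- purely local information.
    return (L(theta + h) - L(theta - h)) / (2 * h)

# At theta = 0 the true derivative is 2 * (0 - 3) = -6:
# the loss decreases as theta increases, so we should move right.
s = slope_estimate(L, 0.0)
```

The sign of the slope is all we need to pick a direction, which is precisely the idea the gradient generalizes to many dimensions.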
The Gradient: Direction of Steepest Increase
The gradient is a vector that points in the direction of the steepest increase of the loss function. If you think of it like climbing a mountain, the gradient tells you which direction is uphill.
Mathematically:
∇𝐿(𝜃) gives the direction of steepest increase of the loss.
If the gradient points uphill, moving in the opposite direction will move us downhill. This is the essence of the optimization strategy known as gradient descent: to minimize the loss, we move in the opposite direction of the gradient.
The update rule is:
𝜃_new = 𝜃_old − α * ∇𝐿(𝜃)
Where:
- α (alpha) is the learning rate, determining the size of each step.
- ∇𝐿(𝜃) is the gradient of the loss function.
Interpretation:
- The gradient tells us where the loss increases the fastest.
- The learning rate controls how large a step we take.
- Repeating this process gradually reduces the loss, ideally leading us toward a minimum (and, with luck, the global one).
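The update rule above can be sketched on a one-dimensional quadratic loss (an illustrative choice, not from the original), where the gradient has a simple closed form:

```python
def grad_L(theta):
    # Gradient of L(theta) = (theta - 3)^2, whose minimum is at theta = 3
    return 2.0 * (theta - 3.0)

theta = 0.0   # start far from the minimum
alpha = 0.1   # learning rate

# Repeatedly step opposite the gradient: theta_new = theta_old - alpha * grad
for _ in range(100):
    theta = theta - alpha * grad_L(theta)
```

Each iteration shrinks the distance to the minimum by a constant factor here, so after 100 steps `theta` is essentially 3.0, the bottom of this particular bowl.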
Key Advantages of Gradient Descent:
- It scales well to massive datasets.
- It works for non-linear models.
- It doesn’t require closed-form equations.
- It only needs first-order derivatives, which are easier to compute.
Why Do We Use a Learning Rate?
The learning rate α is a critical factor in gradient descent. If it’s too small, learning will be very slow. If it’s too large, the model might overshoot the optimal point or even fail to converge altogether.
This highlights a key trade-off in optimization: Speed of convergence vs. Stability of learning. To handle this, techniques like adaptive optimizers (e.g., Adam, RMSProp) have been developed to adjust the learning rate dynamically.
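To see the trade-off concretely, here is a sketch comparing three learning rates on the same toy quadratic loss used earlier (all names and values are illustrative):

```python
def grad_L(theta):
    # Gradient of L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

def run(alpha, steps=50, theta=0.0):
    for _ in range(steps):
        theta -= alpha * grad_L(theta)
    return theta

small = run(0.001)  # too small: after 50 steps, still far from 3
good = run(0.1)     # well chosen: converges close to 3
large = run(1.1)    # too large: each step overshoots and diverges
```

With `alpha = 1.1` the error is multiplied by a factor of magnitude greater than one at every step, so `theta` oscillates with growing amplitude instead of settling, which is exactly the instability adaptive optimizers try to avoid.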
Finally, two observations explain why gradient descent works so well in practice:
- Most loss functions are locally smooth, meaning the gradient provides reliable information about which direction to move.
- Perfect optimality isn’t necessary; what matters is generalizing well to new data, not reaching the absolute lowest point of the loss function.