Optimization

Gradient Descent

Iterative parameter updates guided by the gradient of a loss function

Loss function L(θ)

Click a point on the plot to reposition starting point

L per step

Key ideas

  • Gradient descent update: $\hat{\theta} = \theta - \gamma \cdot \nabla_\theta L(x,y:\theta)$. Each step moves θ opposite to the gradient of the loss function
  • Learning rate γ controls step size. Too high and the optimizer overshoots and diverges. Too low and convergence is slow
  • Convex loss functions have a single global minimum. Gradient descent always converges with a small enough γ
  • Non-convex loss functions can trap the optimizer in local minima. The final result depends on the starting point
  • The chain rule enables computing gradients through composed functions: $\frac{\partial L}{\partial \theta_k} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial \theta_k}$