Optimization
Gradient Descent
Iterative parameter updates guided by the gradient of a loss function
Loss function L(θ)
Click a point on the plot to reposition starting point
L per step
Key ideas
- Gradient descent update: $\hat{\theta} = \theta - \gamma \cdot \nabla_\theta L(x,y:\theta)$. Each step moves θ opposite to the gradient of the loss function
- Learning rate γ controls step size. Too high and the optimizer overshoots and diverges. Too low and convergence is slow
- Convex loss functions have a single global minimum. Gradient descent always converges with a small enough γ
- Non-convex loss functions can trap the optimizer in local minima. The final result depends on the starting point
- The chain rule enables computing gradients through composed functions: $\frac{\partial L}{\partial \theta_k} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial \theta_k}$