Motivation. While gradient descent is accurate, it is computationally expensive. We shall consider heuristics that improve its computational speed.

def. Gradient Descent. We will use the following notation/characterization of gradient descent:

$$w_{t+1} = w_t - \eta_t \nabla L(w_t), \qquad L(w) = \frac{1}{n} \sum_{i=1}^{n} \ell_i(w)$$

where

  • $\eta_t$ are learning rates on step $t$
  • $\ell_i(w)$ is the loss function for datapoint $i$ and weights $w$

Motivation. The first step to optimize is to avoid calculating the gradients of all $n$ datapoints on every step.

def. Stochastic Gradient Descent. For each step, we instead randomly choose one datapoint $i_t \sim \mathrm{Unif}\{1, \dots, n\}$ to update on:

$$w_{t+1} = w_t - \eta_t \nabla \ell_{i_t}(w_t)$$

This works because the stochastic gradient is an unbiased estimator of the full gradient:

$$\mathbb{E}_{i_t}\!\left[\nabla \ell_{i_t}(w_t)\right] = \nabla L(w_t)$$

Motivation. This is better, but it leads to lots of noisy steps. We can group or "batch" datapoints together to reduce that noise, while still not calculating all $n$ gradients like classical GD. The gradient used like this is still an unbiased estimator of the true gradient calculated over all datapoints.

def. Batch Gradient Descent. For each step, choose a random batch $B_t \subseteq \{1, \dots, n\}$ of size $b$, and average the gradients over it (a code sketch follows):

$$w_{t+1} = w_t - \eta_t \cdot \frac{1}{b} \sum_{i \in B_t} \nabla \ell_i(w_t)$$
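A minimal NumPy sketch of the three update rules above. The least-squares loss, the data `X`, `y`, and the step size are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed running example: per-datapoint loss l_i(w) = 0.5 * (x_i . w - y_i)^2
X = rng.normal(size=(1000, 10))                 # datapoints x_i
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

def avg_grad(w, idx):
    """Average gradient of the per-datapoint losses indexed by `idx`."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w, eta = np.zeros(10), 0.1

# Classical GD: gradient averaged over all n datapoints each step.
w = w - eta * avg_grad(w, np.arange(len(y)))

# SGD: one randomly chosen datapoint per step.
w = w - eta * avg_grad(w, rng.integers(len(y), size=1))

# Batch GD: a random batch of size b, averaged.
b = 32
w = w - eta * avg_grad(w, rng.choice(len(y), size=b, replace=False))
```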

Momentum

Intuition. Momentum is a way to "dampen" the oscillations that come from taking many tiny steps, and to reduce the number of steps required for convergence (by roughly a square-root factor). This dampening, when applied to stochastic/batch gradient descent, can be viewed as a bias-variance tradeoff of the gradient estimator.

alg. Nesterov's Accelerated Gradient Descent (NAG). This formulation is convoluted, but it is constructed to guarantee a certain rate of convergence (see the theorem and code sketch below). It is also the original formulation of momentum.

  1. let $\lambda_0 = 0$ and $\lambda_t = \frac{1 + \sqrt{1 + 4\lambda_{t-1}^2}}{2}$
  2. let $w_{-1} = w_0$ be the initial weights
  3. for steps $t = 1, \dots, T$, setting $\gamma_t = \frac{\lambda_{t-1} - 1}{\lambda_t}$:
    1. Momentum: $y_t = w_{t-1} + \gamma_t (w_{t-1} - w_{t-2})$
    2. Descent: $w_t = y_t - \eta \nabla L(y_t)$

thm. (informal statement). For a $\beta$-smooth convex loss $L$ with minimizer $w^*$ and $\eta = 1/\beta$, step $T$ of NAG will satisfy:

$$L(w_T) - L(w^*) \le O\!\left(\frac{\beta \, \|w_0 - w^*\|^2}{T^2}\right)$$

i.e., the distance to optimal drops quadratically w.r.t. the number of steps $T$ (versus the $O(1/T)$ rate of plain GD).
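A minimal NumPy sketch of the NAG recursion above. The quadratic objective (matrix `A`, vector `b`) and the step size choice are assumptions for illustration only.

```python
import numpy as np

# Assumed toy objective: L(w) = 0.5 * ||A w - b||^2 (beta-smooth and convex).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
b = rng.normal(size=100)
grad = lambda w: A.T @ (A @ w - b)

beta = np.linalg.norm(A.T @ A, 2)        # smoothness constant (largest eigenvalue of A^T A)
eta = 1.0 / beta                         # step size 1/beta, as in the theorem

w_prev = w = np.zeros(10)
lam_prev = 0.0
for t in range(1, 201):
    lam = (1 + np.sqrt(1 + 4 * lam_prev**2)) / 2
    gamma = (lam_prev - 1) / lam         # momentum coefficient gamma_t
    y = w + gamma * (w - w_prev)         # momentum: extrapolate from the last two iterates
    w_prev, w = w, y - eta * grad(y)     # descent: gradient step at the extrapolated point
    lam_prev = lam
```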

alg. GD with Momentum. A more intuitive way to incorporate momentum. Setting $\mu \in [0, 1)$ to how much you want to accumulate prior momentum, and $v_0 = 0$ (a code sketch follows):

  1. for $t = 1, \dots, T$:
    1. Momentum: $v_t = \mu v_{t-1} + \nabla L(w_{t-1})$
    2. Descent: $w_t = w_{t-1} - \eta v_t$
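A minimal NumPy sketch of GD with momentum, on the same assumed quadratic objective (the objective and hyperparameters are illustrative, not from the notes).

```python
import numpy as np

# Assumed toy objective: L(w) = 0.5 * ||A w - b||^2
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
b = rng.normal(size=100)
grad = lambda w: A.T @ (A @ w - b)

mu = 0.9                                  # how much prior momentum to accumulate
eta = 0.1 / np.linalg.norm(A.T @ A, 2)    # learning rate scaled by the smoothness constant
w = np.zeros(10)
v = np.zeros(10)                          # accumulated momentum

for t in range(200):
    v = mu * v + grad(w)                  # momentum: decaying sum of past gradients
    w = w - eta * v                       # descent: step along the accumulated direction
```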

Higher-order Methods

Intuition. Curvature (second-order) information is useful for gradient descent: it tells us how quickly the gradient changes in each direction, so steps can be scaled accordingly. The adaptive learning rates below can be viewed as a cheap, per-parameter approximation of this idea.

Adaptive Learning Rates

Intuition. We adapt the learning rate for each parameter: a small learning rate for frequently updated parameters, and a large learning rate for infrequently updated parameters. This works because:

  1. Frequently updated parameters can be overfit, and infrequently updated ones underfit
  2. By adjusting learning rates inversely to the frequency/magnitude of updates, we reduce over- and underfitting, while also improving convergence rates

alg. GD with Adaptive Learning Rates (Adagrad). Adagrad comes with provable regret bounds in the online convex optimization setting. With $G_0 = 0$ and a small $\epsilon > 0$ for numerical stability (a code sketch follows):

  1. for $t = 1, \dots, T$:
    1. Sum-of-squares accumulation: $G_t = G_{t-1} + \nabla L(w_{t-1}) \odot \nabla L(w_{t-1})$
    2. Adaptive rate (element-wise divide): $\eta_t = \dfrac{\eta}{\sqrt{G_t} + \epsilon}$
    3. Descent: $w_t = w_{t-1} - \eta_t \odot \nabla L(w_{t-1})$

The sum of squares grows fast for parameters with high gradients and/or frequent updates; since the learning rate is its (square-root) reciprocal, it diminishes for those parameters, and vice versa.
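A minimal NumPy sketch of Adagrad on the same assumed quadratic objective (objective and hyperparameters are illustrative).

```python
import numpy as np

# Assumed toy objective: L(w) = 0.5 * ||A w - b||^2
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
b = rng.normal(size=100)
grad = lambda w: A.T @ (A @ w - b)

eta, eps = 0.5, 1e-8
w = np.zeros(10)
G = np.zeros(10)                          # per-parameter sum of squared gradients

for t in range(200):
    g = grad(w)
    G = G + g * g                         # sum-of-squares accumulation
    w = w - eta / (np.sqrt(G) + eps) * g  # element-wise adaptive rate, then descent
```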

alg. Root-Mean-Squared Propagation (RMSprop). Adaptive learning rates using an exponential moving average instead of a full sum. Set a learning rate sequence $\eta_t$ such that $\eta_t \to 0$, and a decay factor $\rho \in (0, 1)$. Then, with $G_0 = 0$ (a code sketch follows):

  1. for $t = 1, \dots, T$:
    1. Decaying sum-of-squares accumulation (= "preconditioner"): $G_t = \rho G_{t-1} + (1 - \rho) \nabla L(w_{t-1}) \odot \nabla L(w_{t-1})$
    2. Adaptive rate (element-wise divide): $\eta'_t = \dfrac{\eta_t}{\sqrt{G_t} + \epsilon}$
    3. Descent: $w_t = w_{t-1} - \eta'_t \odot \nabla L(w_{t-1})$
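A minimal NumPy sketch of RMSprop on the same assumed quadratic objective; the $1/\sqrt{t}$ learning rate decay and the hyperparameters are illustrative assumptions.

```python
import numpy as np

# Assumed toy objective: L(w) = 0.5 * ||A w - b||^2
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
b = rng.normal(size=100)
grad = lambda w: A.T @ (A @ w - b)

eta0, rho, eps = 0.05, 0.9, 1e-8
w = np.zeros(10)
G = np.zeros(10)                            # exponential moving average of squared gradients

for t in range(1, 201):
    g = grad(w)
    G = rho * G + (1 - rho) * g * g         # decaying sum-of-squares ("preconditioner")
    eta_t = eta0 / np.sqrt(t)               # decaying learning rate sequence
    w = w - eta_t / (np.sqrt(G) + eps) * g  # element-wise adaptive rate, then descent
```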

Adaptive Learning Rate + Momentum

alg. Adam (Adaptive Moment Estimation). Integrates both momentum (as a first moment) and the adaptive learning rate (as a second moment). Set decay factors $\beta_1, \beta_2 \in (0, 1)$ and $m_0 = v_0 = 0$ (a code sketch follows):

  1. for $t = 1, \dots, T$:
    1. First moment (momentum): $m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(w_{t-1})$
      1. Debias: $\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}$
    2. Second moment (adaptive rate): $v_t = \beta_2 v_{t-1} + (1 - \beta_2) \nabla L(w_{t-1}) \odot \nabla L(w_{t-1})$
      1. Debias: $\hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$
    3. Descent (element-wise divide): $w_t = w_{t-1} - \eta \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
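A minimal NumPy sketch of Adam on the same assumed quadratic objective; the hyperparameters are the commonly used defaults, chosen here for illustration.

```python
import numpy as np

# Assumed toy objective: L(w) = 0.5 * ||A w - b||^2
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
b = rng.normal(size=100)
grad = lambda w: A.T @ (A @ w - b)

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w = np.zeros(10)
m = np.zeros(10)                              # first moment (momentum)
v = np.zeros(10)                              # second moment (adaptive rate)

for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g           # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g       # second-moment EMA
    m_hat = m / (1 - beta1**t)                # debias (moments start at zero)
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # element-wise adaptive descent step
```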

Conclusion

Just use ADAM.