**Motivation 1.** Consider the Logistic Model for classification. We observe the following limitations of this model:

  • The decision boundaries are linear, and thus can only solve linearly separable problems.
    • Specifically: it can simulate OR, AND logic gates
    • …but it can’t simulate XOR because it’s not linearly separable.
  • As seen here, we haven’t considered how to transform the ‘input space’ into the ‘feature space.’ Without this ‘magic’ we still perform terribly, even on tasks as simple as classifying digits in MNIST.

**Motivation 2.** Consider how logistic models, with their weights and biases, can be hooked up together. This resembles the neural connections of the human brain.

## Feed Forward Neural Network

def. **Neural Network.** A feed-forward neural network connects the operations of multiple Logistic Models (=neurons) together.

Notation for quantities is as follows:

$$
V^{(\text{layer})}_{\text{row},\text{column}}
$$

Let the number of neurons (=dimensions) of each of the layers be, from the input layer: $D^{(0)},D^{(1)},\dots,D^{(L+1)}$. The components of the neural net are:

1. Input neurons $\mathbf{x}$ in input space $\mathbf{x} \in \mathbb{R}^{D^{(0)}}$ (layer $k=0$) → $h^{(0)}=\mathbf{x}$
2. Hidden neurons (layers $k=1\dots L$) → $h^{(k)}=g(\textcolor{red}{ a^{(k)} })=g(\textcolor{red}{ W^{(k)}h^{(k-1)}+\mathbf{b}^{(k)} })$
3. Output layer neurons (layer $k=L+1$) → $f(\mathbf{x})=h^{(L+1)}=\mathbf{\sigma}(a^{(L+1)})=\mathbf{\sigma}(W^{(L+1)}h^{(L)}+\mathbf{b}^{(L+1)})$

The parameters of the neural net are:

*Weight parameters*. The weights going from layer $k$ into layer $k+1$ are packaged into a matrix $W^{(k+1)}$ of shape $D^{(k+1)}\times D^{(k)}$, where $W_{i,j}^{(k+1)}$ refers to the weight connecting neuron $j$ in layer $k$ to neuron $i$ in layer $k+1$. (This reversed indexing comes from the matrix multiplication shown above.)

*Bias parameters*. Layer $k$ also injects a bias $\mathbf{b}^{(k+1)}$ into each of the neurons in layer $k+1$. In matrix-multiplication terms this amounts to appending a constant neuron with value one to layer $k$, as shown in the image.

*Activation Function.* Each neuron in layer $k+1$, having received its weighted input sum $a^{(k+1)}$, passes it through an activation function $g$. Without this activation function the whole NN would collapse into a single linear map.

**Visualization.** The following visualization lacks the bias term for simplicity.

![[Neural Network-20240923231620015.png]]

### Mathematical Backing

*Intuition*. Why is this supposedly random arrangement of logistic neurons so effective experimentally? Here is a possible reason: *neural networks can simulate any function.*

thm. **Universal Approximation (non-rigorous statement).** Given enough depth and width, and given "normal" activation functions (sigmoid, tanh), a neural network can approximate any function to arbitrary precision.

## Backpropagation

**Motivation** is simple. How do we find $W^{(1)}\dots W^{(L+1)}, \mathbf{b}^{(1)}\dots \mathbf{b}^{(L+1)}$? We must minimize the loss function, which in practice means computing its gradient.

def. **Loss Minimization of a NN.** Let the NN be the function $f:\mathbf{x}\in\mathbb{R}^{D^{(0)}} \mapsto h^{(L+1)}\in \mathbb{R}^{D^{(L+1)}}$, and let $\ell$ be the loss of the neural network with parameters $\theta = (W^{(1)}\dots W^{(L+1)},\mathbf{b}^{(1)}\dots \mathbf{b}^{(L+1)})$. Given training data and labels $\{ (\mathbf{x}_{t},\mathbf{y}_{t}) \text{ for } t \in 1\dots T\}$, the optimal parameters are:

$$
\theta ^{*}=\arg\min_{\theta} \frac{1}{T}\sum_{\forall t}\textcolor{red}{ \ell(}f(\mathbf{x}_{t}\mid\theta), \mathbf{y}_{t}\textcolor{red}{) }+\lambda \Omega(\theta)
$$
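To make the forward pass and this objective concrete, here is a minimal NumPy sketch (not from the slides): it assumes sigmoid hidden activations, a softmax output, the cross-entropy loss introduced just below, and a squared-L2 choice for $\Omega$; all function and variable names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())            # shift for numerical stability
    return e / e.sum()

def forward(x, Ws, bs):
    """Forward pass: h^(0) = x, h^(k) = g(W^(k) h^(k-1) + b^(k)), softmax on the output layer."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]): # hidden layers k = 1 .. L
        h = sigmoid(W @ h + b)
    return softmax(Ws[-1] @ h + bs[-1])  # output layer k = L+1

def objective(Ws, bs, X, Y, lam=1e-3):
    """Average cross-entropy loss over the T training pairs, plus lam * Omega(theta)."""
    losses = [-np.log(forward(x, Ws, bs)[y.argmax()]) for x, y in zip(X, Y)]
    l2 = sum((W ** 2).sum() for W in Ws)  # Omega(theta): squared L2 norm of the weights (one possible choice)
    return np.mean(losses) + lam * l2

# Toy usage with dimensions D^(0), D^(1), D^(2) = 3, 4, 2 (one hidden layer).
rng = np.random.default_rng(0)
dims = [3, 4, 2]
Ws = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(len(dims) - 1)]
bs = [np.zeros(dims[k + 1]) for k in range(len(dims) - 1)]
x, y = rng.normal(size=3), np.array([1.0, 0.0])
print(forward(x, Ws, bs), objective(Ws, bs, [x], [y]))
```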

> [!info] Choice of loss function
> While we can choose any loss function, from now on we will simply use the [[Information Theory#^56f391|cross-entropy]], similar to the loss function [[Logistic Model#^4qw2dk|here]] but with only one datapoint, so the second summation disappears.

For the loss function we have:

$$
\begin{align}
\ell(f(\mathbf{x}\mid\theta),\mathbf{y}) & =- \ln \prod_{\forall k \in [K]}f(\mathbf{x}\mid\theta)_{k}^{\mathbf{y}_{k}} \\
& =- \sum_{\forall k \in [K]}\mathbf{y}_{k}\ln f(\mathbf{x}\mid\theta)_{k} \\
& = -\ln f(\mathbf{x}\mid\theta)_{y^{*}}
\end{align}
$$

where we let $y^{*}=\arg\max_{k} \mathbf{y}_{k}$ (the index of the one-hot label) as above.

def. **Gradient Descent.** When an analytic solution is not possible, we can minimize a function (=the loss function) by numerical methods:

1. Randomly initialize the parameters $\theta$.
2. Find the gradient of the loss function with one training example, $\frac{ \partial }{ \partial \theta } \ell(f(\mathbf{x}\mid\theta),\mathbf{y})$.
3. Move a little bit ($\eta$, the learning rate) in the direction of the negative gradient.
4. Repeat for every training example $(\mathbf{x},\mathbf{y})$.

The question then is how to calculate $\frac{ \partial \ell }{ \partial \theta }$ from step 2. Here we use backpropagation:

alg. **Backpropagation.** We can derive the formulae for the derivatives $\frac{ \partial \ell }{ \partial \theta }$ using the chain rule. First, on the last layer $k=L+1$: [^1]

$$
\frac{ \partial \ell }{ \partial h^{(L+1)} }= -\frac{\mathbf{e}(y^{*})}{h^{(L+1)}_{y^{*}}}
$$
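Here $\mathbf{e}(y^{*})$ is the standard basis (one-hot) vector with a one at index $y^{*}$, so only the correct-class output receives a direct gradient. As a quick worked example with made-up numbers: if $K=3$, $h^{(L+1)}=(0.7,\,0.2,\,0.1)$ and $y^{*}=1$, then

$$
\frac{ \partial \ell }{ \partial h^{(L+1)} } = -\frac{\mathbf{e}(1)}{0.7} = \left( -\tfrac{1}{0.7},\ 0,\ 0 \right) \approx (-1.43,\ 0,\ 0).
$$

The other components of $h^{(L+1)}$ only enter once we multiply by the Jacobian $\frac{ \partial h^{(L+1)} }{ \partial a^{(L+1)} }$ in the next step.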

then for the previous layer $k=L$: [^2]

$$
\begin{align}
\overset{ \textcolor{red}{ 1\times D^{(L)} } }{ \frac{ \partial \ell }{ \partial h^{(L)} } }&= \frac{ \partial \ell }{ \partial h^{(L+1)} }\cdot \underbrace{ \frac{ \partial h^{(L+1)} }{ \partial a^{(L+1)} } }_{ =g'(\cdot) }\cdot \underbrace{ \frac{ \partial a^{(L+1)} }{ \partial h^{(L)} } }_{ =W^{(L+1)} } \\
&=\overset{ \textcolor{red}{ 1\times D^{(L+1)} } }{ - \frac{\mathbf{e}(y^{*})}{h^{(L+1)}_{y^{*}}} }\cdot \overset{ \textcolor{red}{ D^{(L+1)}\times D^{(L+1)} } }{ g'(a^{(L+1)}) }\cdot\overset{ \textcolor{red}{ D^{(L+1)}\times D^{(L)} } }{ W^{(L+1)} }
\end{align}
$$

Meanwhile, we can extract the derivatives with respect to $W^{(L+1)}$ by: ^khyz6i

$$
\begin{align}
\frac{ \partial \ell }{ \partial W^{(L+1)} }&=\frac{ \partial \ell }{ \partial h^{(L+1)} }\cdot \underbrace{ \frac{ \partial h^{(L+1)} }{ \partial a^{(L+1)} } }_{ =g'(\cdot) } \cdot\underbrace{ \frac{ \partial a^{(L+1)} }{ \partial W^{(L+1)} } }_{ =(h^{(L)})^{\top}\text{ per row, for every layer} } \\
&=\overset{ \textcolor{red}{ 1\times D^{(L+1)} } }{ - \frac{\mathbf{e}(y^{*})}{h^{(L+1)}_{y^{*}}} }\cdot \overset{ \textcolor{red}{ D^{(L+1)}\times D^{(L+1)}} }{ g'(a^{(L+1)}) }\cdot\overset{ \textcolor{red}{ D^{(L+1)}\times D^{(L+1)}\times D^{(L)}} }{ (\cdot)}
\end{align}
$$
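To make these chain-rule factors concrete, here is a minimal NumPy sketch of the last-layer gradients only, assuming a softmax output and the cross-entropy loss above (not the slides' code; all names are illustrative). The gradient with respect to $W^{(L+1)}$ is checked against finite differences.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def last_layer_loss(W, b, h_L, y_star):
    """ell = -ln softmax(W h_L + b)[y*], for the last layer only."""
    return -np.log(softmax(W @ h_L + b)[y_star])

def last_layer_grads(W, b, h_L, y_star):
    h = softmax(W @ h_L + b)              # h^(L+1)
    dl_dh = np.zeros_like(h)
    dl_dh[y_star] = -1.0 / h[y_star]      # dl/dh^(L+1) = -e(y*)/h_{y*}
    J = np.diag(h) - np.outer(h, h)       # softmax Jacobian dh^(L+1)/da^(L+1)
    dl_da = J @ dl_dh                     # collapses to h - e(y*)
    dl_dW = np.outer(dl_da, h_L)          # dl/dW^(L+1), shape D^(L+1) x D^(L)
    dl_db = dl_da                         # dl/db^(L+1)
    dl_dhL = W.T @ dl_da                  # dl/dh^(L), to be propagated further back
    return dl_dW, dl_db, dl_dhL

# Finite-difference check of dl/dW^(L+1) on random data.
rng = np.random.default_rng(0)
D_L, D_out, y_star = 4, 3, 1
W, b, h_L = rng.normal(size=(D_out, D_L)), rng.normal(size=D_out), rng.normal(size=D_L)
dl_dW, _, _ = last_layer_grads(W, b, h_L, y_star)
eps, num = 1e-6, np.zeros_like(W)
for i in range(D_out):
    for j in range(D_L):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (last_layer_loss(Wp, b, h_L, y_star) - last_layer_loss(Wm, b, h_L, y_star)) / (2 * eps)
print(np.max(np.abs(num - dl_dW)))        # maximum discrepancy; should be tiny (~1e-7 or below)
```

Note that multiplying $-\mathbf{e}(y^{*})/h^{(L+1)}_{y^{*}}$ by the softmax Jacobian collapses to the familiar $h^{(L+1)}-\mathbf{e}(y^{*})$ for $\partial \ell / \partial a^{(L+1)}$, which is what the sketch exploits.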

The derivatives with respect to $\mathbf{b}^{(L+1)}$ can be extracted similarly, using $\frac{ \partial a^{(L+1)} }{ \partial \mathbf{b}^{(L+1)} }=I$. We can continue this backwards through as many layers as necessary, according to the following visualization.

*Visualization.* ![[Neural Network-20240924174132147.png|512]]

def. **Gradient Descent.** A form of optimization: average the gradient over all the datapoints.

$$
\theta_{k+1} \leftarrow \theta_{k} -\alpha \left(\underbrace{ \frac{1}{N} \sum_{i=1}^{N} }_{ \text{average over all data} }\nabla_{\theta_{k}} \ell(f(\mathbf{x}_{i}\mid \theta_{k}),\mathbf{y}_{i})\right)
$$

where $\alpha$ is a chosen learning rate.

[^1]: [Deep Learning Slides](x-devonthink-item://122B6B2F-8A5E-4DC0-ACB9-976268B4C8C0?page=16)
[^2]: [Deep Learning Slides](x-devonthink-item://122B6B2F-8A5E-4DC0-ACB9-976268B4C8C0?page=22)
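Putting the pieces together, here is a minimal end-to-end sketch of this averaged gradient-descent update on the XOR problem from Motivation 1, with one hidden layer, sigmoid activations, a softmax output, and the cross-entropy loss. The architecture, learning rate, and iteration count are arbitrary illustrative choices (not from the slides), and convergence depends on the random initialization.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# XOR data: inputs and one-hot labels (class 1 = "true").
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)     # hidden layer, D^(1) = 4
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)     # output layer, D^(2) = 2
alpha, N = 1.0, len(X)                            # learning rate, dataset size

for step in range(5000):
    gW1, gb1, gW2, gb2, total = 0.0, 0.0, 0.0, 0.0, 0.0
    for x, y in zip(X, Y):                        # accumulate, then average over all data
        h1 = sigmoid(W1 @ x + b1)                 # hidden activations h^(1)
        f = softmax(W2 @ h1 + b2)                 # output h^(2) = f(x)
        total += -np.log(f[y.argmax()])           # cross-entropy loss
        d_a2 = f - y                              # dl/da^(2) = h^(2) - e(y*)
        d_a1 = (W2.T @ d_a2) * h1 * (1 - h1)      # backpropagate through the sigmoid
        gW2 += np.outer(d_a2, h1); gb2 += d_a2
        gW1 += np.outer(d_a1, x);  gb1 += d_a1
    W2 -= alpha * gW2 / N; b2 -= alpha * gb2 / N  # averaged gradient-descent update
    W1 -= alpha * gW1 / N; b1 -= alpha * gb1 / N

print("final average loss:", total / N)
for x in X:                                       # ideally predicts the XOR pattern 0, 1, 1, 0
    print(x, softmax(W2 @ sigmoid(W1 @ x + b1) + b2).argmax())
```

A stochastic variant would instead update after each training example (steps 2–4 of the first Gradient Descent definition) rather than averaging over all $N$.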