
Backpropagation is a generalization of the Delta rule that can handle non-linear activation functions.
Suppose the output of each unit is a differentiable function of its net input
$$y_{k}^{p}=f(s_{k}^{p})$$
where \(k\) denotes the \(k\)-th unit of the network and \(p\) the \(p\)-th pattern. \(f(s_{k}^{p})\) is called the activation function; its argument is a linear combination of the weights and the outputs of the units in the previous layer of the network, which serve as inputs to unit \(k\). Thus \(y_{k}^{p}\) is the output produced by unit \(k\) of the network, as a function of its net input \(s_{k}^{p}\)
$$s_{k}^{p}=\sum_{j}w_{jk}y_{j}^{p}+\theta_{k}$$
The summation extends over all units \(j\) of the layer immediately preceding the one in which unit \(k\) is located.
\(\theta_{k}\) is the bias term; it is associated with the layer containing unit \(k\), not with the preceding layer of units \(j\).
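As a minimal sketch of this forward computation for a single unit, the net input and output could be evaluated as follows; the logistic sigmoid used for \(f\) is only an illustrative choice, since the derivation does not fix a particular activation function, and the names `unit_forward`, `y_prev`, `w_k` and `theta_k` are hypothetical.

```python
import numpy as np

# Illustrative choice of activation function; any differentiable f would do.
def f(s):
    return 1.0 / (1.0 + np.exp(-s))

def unit_forward(y_prev, w_k, theta_k):
    """Output y_k of unit k for one pattern.

    y_prev  : outputs y_j of the units in the previous layer
    w_k     : weights w_jk feeding unit k
    theta_k : bias term of unit k
    """
    s_k = np.dot(w_k, y_prev) + theta_k   # s_k = sum_j w_jk * y_j + theta_k
    return f(s_k)                          # y_k = f(s_k)
```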
We want to find the weight variation rule that causes a decrease in the error
$$\Delta_{p}w_{jk}=-\gamma\frac{\partial E^{p}}{\partial w_{jk}}$$
where \(\gamma\) is a proportionality constant called the learning rate. Using the squared error as the error measure
$$E^{p}=\frac{1}{2}\sum_{o=1}^{N_{o}}(t_{o}^{p}-y_{o}^{p})^{2}$$
where the error is accumulated over all output units as the difference between the value produced by the network, \(y_{o}^{p}\), and the target value \(t_{o}^{p}\). The factor \(\frac{1}{2}\) is arbitrary and is introduced so that it cancels the exponent 2 during differentiation, which yields a slightly simpler final expression.
The total error is the sum of the squared errors over all patterns
$$E=\sum_{p}E^{p}$$
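A short sketch of these two error measures, assuming targets and outputs are stored as NumPy arrays (the function names are hypothetical):

```python
def pattern_error(t_p, y_p):
    """E^p = 0.5 * sum_o (t_o^p - y_o^p)^2 for a single pattern p."""
    return 0.5 * np.sum((t_p - y_p) ** 2)

def total_error(targets, outputs):
    """E = sum over all patterns p of E^p."""
    return sum(pattern_error(t_p, y_p) for t_p, y_p in zip(targets, outputs))
```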
According to the chain rule
$$\frac{\partial E^{p}}{\partial w_{jk}}=\frac{\partial E^{p}}{\partial s_{k}^{p}}\frac{\partial s_{k}^{p}}{\partial w_{jk}}$$
From the equation that expresses \(s_{k}^{p}\) as the weighted sum of the outputs of the previous layer, we get
$$\frac{\partial s_{k}^{p}}{\partial w_{jk}}=y_{j}^{p}$$
We define the other derivative in the equation as
$$\frac{\partial E^{p}}{\partial s_{k}^{p}}=-\delta_{k}^{p}$$
and we get
$$\Delta_{p}w_{jk}=\gamma\delta_{k}^{p}y_{j}^{p}$$
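Assuming the \(\delta_{k}^{p}\) of a layer are already known, the corresponding weight changes can be written compactly as an outer product. This is only a sketch, and the names are hypothetical:

```python
def weight_updates(gamma, deltas_k, y_prev):
    """Delta_p w_jk = gamma * delta_k * y_j for all weights into a layer.

    deltas_k : vector of delta_k^p for the units of the current layer
    y_prev   : vector of outputs y_j^p of the previous layer
    Returns a matrix whose (j, k) entry is Delta_p w_jk.
    """
    return gamma * np.outer(y_prev, deltas_k)
```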
\(\delta_{k}^{p}\) exists for every unit in the network. It can be computed by a recursive procedure that propagates the errors back through the network, starting from the output units, where the error is directly computable because the target value is known.
To calculate \(\delta_{k}^{p}\) we split the derivative, using the chain rule, into two factors: one that reflects how the error changes with the unit's output and one that reflects how the output changes with the unit's net input
$$\delta_{k}^{p}=-\frac{\partial E^{p}}{\partial s_{k}^{p}}=-\frac{\partial E^{p}}{\partial y_{k}^{p}}\frac{\partial y_{k}^{p}}{\partial s_{k}^{p}}$$
and from the definition of activation function
$$\frac{\partial y_{k}^{p}}{\partial s_{k}^{p}}=f'(s_{k}^{p})$$
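The derivation leaves \(f\) unspecified; as an example, if the logistic sigmoid were chosen, its derivative can be expressed directly in terms of the unit's output
$$f(s)=\frac{1}{1+e^{-s}},\qquad f'(s_{k}^{p})=f(s_{k}^{p})\left(1-f(s_{k}^{p})\right)=y_{k}^{p}\left(1-y_{k}^{p}\right)$$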
To calculate \(\frac{\partial E^{p}}{\partial y_{k}^{p}}\), we consider two cases.
1. \(k\) is an output unit of the network
$$\frac{\partial E^{p}}{\partial y_{o}^{p}}=\frac{\partial}{\partial y_{o}^{p}}\left[\frac{1}{2}\sum_{o=1}^{N_{o}}(t_{o}^{p}-y_{o}^{p})^{2}\right]=-(t_{o}^{p}-y_{o}^{p})$$
and therefore
$$\delta_{o}^{p}=(t_{o}^{p}-y_{o}^{p})f'(s_{o}^{p})$$
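A sketch of this computation for the whole output layer, assuming `f_prime` is the derivative of the chosen activation function (all names hypothetical):

```python
def output_deltas(t_p, y_p, s_p, f_prime):
    """delta_o^p = (t_o^p - y_o^p) * f'(s_o^p) for every output unit o."""
    return (t_p - y_p) * f_prime(s_p)
```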
2. \(k\) is a hidden unit
For a hidden unit \(k = h\) we cannot measure directly the contribution of that unit to the error of the network, because no target value is available for it. However, we can write the error as a function of the net inputs to the output units
$$E^{p}=E^{p}(s_{1}^{p},s_{2}^{p},\ldots,s_{o}^{p},\ldots,s_{N_{o}}^{p})$$
and using the chain rule
$$\begin{array}{rcl} \frac{\partial E^{p}}{\partial y_{h}^{p}} & = & \sum_{o=1}^{N_{o}}\frac{\partial E^{p}}{\partial s_{o}^{p}}\frac{\partial s_{o}^{p}}{\partial y_{h}^{p}}\\ & = & \sum_{o=1}^{N_{o}}\frac{\partial E^{p}}{\partial s_{o}^{p}}\frac{\partial}{\partial y_{h}^{p}}\left(\sum_{j=1}^{N_{h}}w_{jo}y_{j}^{p}\right)\\ & = & \sum_{o=1}^{N_{o}}\frac{\partial E^{p}}{\partial s_{o}^{p}}w_{ho}\\ & = & -\sum_{o=1}^{N_{o}}\delta_{o}^{p}w_{ho} \end{array}$$
we obtain
$$\delta_{h}^{p}=f'(s_{h}^{p})\sum_{o=1}^{N_{o}}\delta_{o}^{p}w_{ho}$$
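A sketch of this backward step for one hidden layer, assuming `W_ho` is the matrix of weights \(w_{ho}\) from hidden to output units, with one row per hidden unit (names hypothetical):

```python
def hidden_deltas(s_h, W_ho, deltas_o, f_prime):
    """delta_h^p = f'(s_h^p) * sum_o delta_o^p * w_ho for every hidden unit h.

    s_h      : net inputs of the hidden units for pattern p
    W_ho     : weight matrix, shape (n_hidden, n_output), entry [h, o] = w_ho
    deltas_o : deltas of the output layer, computed as in case 1
    """
    return f_prime(s_h) * (W_ho @ deltas_o)
```

Applied layer by layer from the output back towards the input, this recursion together with the update rule \(\Delta_{p}w_{jk}=\gamma\delta_{k}^{p}y_{j}^{p}\) constitutes one backpropagation step for pattern \(p\).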