Wednesday, December 14, 2016

back propagation intro... Or what happens to the output when I change a given weight

The question is, if I change a weight in a NN, how much does it affect the output?

In other words, how much is $ \dfrac{\partial E}{\partial w_{jk}}$?

where the error is defined as $ E = \dfrac{(T - L_k)^2}{2} $, with $T$ the expected/target output and $L_k$ the actual output

Given a NN with four layers:

$ L_h = input$
$ L_i = intermediate$
$ L_j = intermediate$
$ L_k = output$

the output of each neuron would be:

$L_{i}=\sigma (\sum w_{hi}L_{h}) = \sigma (net_i)$
$L_{j}=\sigma (\sum w_{ij}L_{i}) = \sigma (net_j)$
$L_{k}=\sigma (\sum w_{jk}L_{j}) = \sigma (net_k)$

The output of the NN given the $weights$ and the $inputs$ is then:

$L_{k}=\sigma(\sum w_{jk}\sigma(\sum w_{ij}\sigma(\sum w_{hi}L_{h})))$
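To make the notation concrete, here is a minimal NumPy sketch of that nested forward pass. The layer sizes, the random weights, and the variable names are all made up for illustration; nothing in the derivation depends on them.

```python
# A made-up forward pass through the h -> i -> j -> k network above.
# Layer sizes (4, 3, 3, 1) and the random weights are arbitrary choices,
# purely for illustration.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

L_h  = rng.random(4)        # input layer L_h
w_hi = rng.random((3, 4))   # weights from layer h to layer i
w_ij = rng.random((3, 3))   # weights from layer i to layer j
w_jk = rng.random((1, 3))   # weights from layer j to the output layer k

L_i = sigmoid(w_hi @ L_h)   # L_i = sigma(sum w_hi * L_h)
L_j = sigmoid(w_ij @ L_i)   # L_j = sigma(sum w_ij * L_i)
L_k = sigmoid(w_jk @ L_j)   # L_k = sigma(sum w_jk * L_j)

print(L_k)                  # the nested expression above, evaluated
```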

Now, just some derivatives that will come in handy later on (a quick numerical check of the first one follows the list):
  • $ \dfrac{\partial \sigma (x)}{\partial x} = \sigma (x)(1-\sigma (x)) $
  • $(\dfrac{\partial net_k}{\partial{  L_j}})=w_{jk}$
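As a quick sanity check of the first identity, a central finite difference matches $\sigma(x)(1-\sigma(x))$ (the test point and step size below are arbitrary):

```python
# Numerically check sigma'(x) = sigma(x) * (1 - sigma(x)) at an arbitrary point.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 0.3, 1e-6
numeric  = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)    # the two values agree to many decimal places
```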

Let's start from the Output neuron:

$ \dfrac{\partial E}{\partial w_{jk}} =
\dfrac{\partial E}{\partial L_k}\dfrac{\partial L_k}{\partial w_{jk}}=
(\dfrac{\partial E}{\partial L_k})(\dfrac{\partial \sigma (net_k)}{\partial w_{jk}})=(\dfrac{\partial E}{\partial{  L_k}}) ( \dfrac{\partial{  \sigma (net_k)}}{\partial net_k})(\dfrac{\partial{net_k}}{\partial w_{jk}})$

where:

  • $(\dfrac{\partial E}{\partial{  L_k}})= \dfrac{\partial{(\dfrac{(T - L_k)^2}{2})}}{\partial{L_k}}=(T - L_k)(-1)=(L_k-T)$  
  • $(\dfrac{\partial{  \sigma (net_k)}}{\partial net_k})=\sigma (net_k)(1-\sigma (net_k))=L_k(1-L_k)$
  • $(\dfrac{\partial{net_k}}{\partial w_{jk}})=L_j$

hence:

$\dfrac{\partial E}{\partial w_{jk}} =(L_k - T)\cdot{L_k(1-L_k)}\cdot{L_j}$
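In code, reusing the toy forward pass from above ($T = 0.5$ is an arbitrary target, and delta_k is just a convenient name for the first two factors):

```python
# dE/dw_jk = (L_k - T) * L_k(1 - L_k) * L_j, with L_j, L_k, w_jk from the sketch above.
T = 0.5                                  # made-up target output
delta_k  = (L_k - T) * L_k * (1 - L_k)   # (dE/dL_k) * (dsigma(net_k)/dnet_k)
dE_dw_jk = np.outer(delta_k, L_j)        # one entry per weight w_jk
```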

Let's look at the previous layer (layer $j$):

$ E = \dfrac{(T - L_k)^2}{2} = \dfrac{(T -\sigma (net_k))^2}{2}  = \dfrac{(T -\sigma (\sum w_{jk}L_{j}))^2}{2} $

$ \dfrac{\partial E}{\partial w_{ij}} = \dfrac{\partial E}{\partial{  L_j}} [L_j(1-L_j)][L_i]$

$\dfrac{\partial E}{\partial{  L_j}}=(\dfrac{\partial E}{\partial{  L_k}})(\dfrac{\partial L_k}{\partial{  L_j}})=(\dfrac{\partial E}{\partial{  L_k}})(\dfrac{\partial \sigma (net_k)}{\partial{  L_j}})=(\dfrac{\partial E}{\partial{  L_k}})(\dfrac{\partial \sigma (net_k)}{\partial{  net_k}})(\dfrac{\partial net_k}{\partial{  L_j}})=(L_k - T) \cdot{L_k(1-L_k)} \cdot{w_{jk}}$

hence:

$ \dfrac{\partial E}{\partial w_{ij}} = [(L_k - T) \cdot{L_k(1-L_k)} \cdot{w_{jk}}] \cdot{[L_j(1-L_j)][L_i]}$
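The same thing one layer back, still reusing the toy variables from the sketches above (dE_dL_j and delta_j are illustrative names):

```python
# dE/dw_ij = [(L_k - T) * L_k(1 - L_k) * w_jk] * L_j(1 - L_j) * L_i
dE_dL_j  = w_jk.T @ delta_k              # (L_k - T) * L_k(1 - L_k) * w_jk
delta_j  = dE_dL_j * L_j * (1 - L_j)
dE_dw_ij = np.outer(delta_j, L_i)        # one entry per weight w_ij
```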

From this we see that for any other intermediate layer we have that:

$ \dfrac{\partial E}{\partial w_{xy}} = \dfrac{\partial E}{\partial L_y} \cdot{[L_y(1-L_y)][L_x]}$

(where $w_{xy}$ is a weight connecting a neuron in intermediate layer $x$ to a neuron in intermediate layer $y$)

Hence for the layer before $j$, that is layer $i$ (weights $w_{hi}$):

$ \dfrac{\partial E}{\partial w_{hi}} = \dfrac{\partial E}{\partial L_i} \cdot{[L_i(1-L_i)][L_h]}$

where the tricky bit is to compute $\dfrac{\partial E}{\partial L_y}$. From the steps above we can see the following recurrence rule (a code sketch of this recursion follows the list):
  • $\dfrac{\partial E}{\partial L_k} =(L_k-T)$
  • $\dfrac{\partial E}{\partial L_j} =(L_k-T) \cdot{L_k(1-L_k)} \cdot{w_{jk}}$
  • $\dfrac{\partial E}{\partial L_i} =(L_k-T) \cdot{L_k(1-L_k)} \cdot{w_{jk}} \cdot{L_j(1-L_j)} \cdot{w_{ij}}$
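Written as code with the toy variables above, the recurrence is a short chain of products; note that with more than one neuron per layer each step also sums over the downstream neurons, which the scalar formulas gloss over:

```python
# dE/dL at one layer = (dE/dL at the next layer) * L(1 - L) of the next layer,
# pushed back through the connecting weights.
dE_dL_k = L_k - T
dE_dL_j = (dE_dL_k * L_k * (1 - L_k)) @ w_jk   # one entry per neuron in layer j
dE_dL_i = (dE_dL_j * L_j * (1 - L_j)) @ w_ij   # one entry per neuron in layer i
```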

Once we have $\dfrac{\partial E}{\partial w_{xy}}$ we can use it to increase/decrease the weight accordingly, scaled by a learning rate $\alpha$:

$\Delta{w_{xy}}= -\alpha \dfrac{\partial E}{\partial w_{xy}}$

and the new weight $w'$ would then be:

$w'_{xy} = w_{xy} + \Delta{w_{xy}}$
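Putting it together with the toy variables from the sketches above ($\alpha = 0.1$ is an arbitrary learning rate; delta_i and dE_dw_hi fill in the one gradient not computed yet):

```python
# One gradient-descent step on each weight matrix: w' = w - alpha * dE/dw.
alpha = 0.1                              # made-up learning rate

delta_i  = dE_dL_i * L_i * (1 - L_i)     # dE/dL_i * L_i(1 - L_i)
dE_dw_hi = np.outer(delta_i, L_h)        # one entry per weight w_hi

w_jk = w_jk - alpha * dE_dw_jk
w_ij = w_ij - alpha * dE_dw_ij
w_hi = w_hi - alpha * dE_dw_hi
```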