By Kroese B., van der Smagt P.

An Introduction to Neural Networks

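$$E^p = \frac{1}{2} \sum_{o=1}^{N_o} (d_o^p - y_o^p)^2, \qquad (4.4)$$

where $d_o^p$ is the desired output for unit $o$ when pattern $p$ is clamped. We further set $E = \sum_p E^p$ as the summed squared error. We can write

$$\frac{\partial E^p}{\partial w_{jk}} = \frac{\partial E^p}{\partial s_k^p} \, \frac{\partial s_k^p}{\partial w_{jk}}. \qquad (4.5)$$

By equation (4.2) we see that the second factor is

$$\frac{\partial s_k^p}{\partial w_{jk}} = y_j^p. \qquad (4.6)$$

When we define

$$\delta_k^p = -\frac{\partial E^p}{\partial s_k^p}, \qquad (4.7)$$

we will get an update rule which is equivalent to the delta rule as described in the previous chapter, resulting in a gradient descent on the error surface if we make the weight changes according to:

$$\Delta_p w_{jk} = \gamma \, \delta_k^p \, y_j^p. \qquad (4.8)$$

The trick is to figure out what $\delta_k^p$ should be for each unit $k$ in the network.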
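The update rule can be checked numerically for a single output unit. The sketch below is illustrative only and rests on assumptions not stated in this passage: a logistic activation for F, and the standard closed form $\delta = (d - y) F'(s)$ for an output unit under the squared error. All function and variable names are hypothetical. It compares the weight change $\gamma \delta y_j$ with $-\gamma \, \partial E^p / \partial w_{jk}$ estimated by central finite differences.

```python
import math

def sigmoid(s):
    """Logistic activation; an assumed choice for the differentiable F."""
    return 1.0 / (1.0 + math.exp(-s))

def pattern_error(w, theta, y_in, d):
    """E^p = 1/2 (d - y)^2 for a single output unit."""
    s = sum(wj * yj for wj, yj in zip(w, y_in)) + theta
    return 0.5 * (d - sigmoid(s)) ** 2

def delta_rule_step(w, theta, y_in, d, gamma):
    """Return the weight changes gamma * delta * y_j for each incoming weight."""
    s = sum(wj * yj for wj, yj in zip(w, y_in)) + theta
    y = sigmoid(s)
    # delta = -dE/ds; for squared error and logistic F this is (d - y) * y * (1 - y)
    delta = (d - y) * y * (1.0 - y)
    return [gamma * delta * yj for yj in y_in]

# Hypothetical values chosen only for illustration.
w, theta, y_in, d, gamma = [0.2, -0.4], 0.1, [1.0, 0.5], 1.0, 0.5
dw = delta_rule_step(w, theta, y_in, d, gamma)

# Finite-difference estimate of -gamma * dE/dw_j for comparison.
eps = 1e-6
numeric = []
for j in range(len(w)):
    w_plus, w_minus = list(w), list(w)
    w_plus[j] += eps
    w_minus[j] -= eps
    grad_j = (pattern_error(w_plus, theta, y_in, d)
              - pattern_error(w_minus, theta, y_in, d)) / (2 * eps)
    numeric.append(-gamma * grad_j)
```

If the two lists agree to within the finite-difference error, the step is indeed a gradient descent step on $E^p$ for this unit.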

Figure 4.6: Slow decrease with conjugate gradient in non-quadratic systems. The hills on the left are very steep, resulting in a large search vector $u_i$. When the quadratic portion is entered the new search direction is constructed from the previous direction and the gradient, resulting in a spiraling minimisation. This problem can be overcome by detecting such spiraling minimisations and restarting the algorithm with $u_0 = -\nabla f$.

Some improvements on back-propagation have been presented based on an independent adaptive learning rate parameter for each weight.
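The idea of a separate learning rate per weight can be illustrated with a small sketch. The passage does not say which adaptation rule is meant, so the sign-based rule below is an assumption: a weight's rate grows while its gradient component keeps its sign and shrinks when the sign flips. The toy error function, the factors `up` and `down`, and all names are hypothetical.

```python
def f(w):
    """Toy ill-conditioned error surface (hypothetical): 0.5 * (100*w0^2 + w1^2)."""
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    """Gradient of f."""
    return [100.0 * w[0], w[1]]

def adaptive_descent(w, gamma0=0.005, up=1.2, down=0.5, steps=100):
    """Gradient descent with one learning rate per weight (assumed sign-based rule)."""
    w = list(w)
    gammas = [gamma0] * len(w)
    prev = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            if g[i] * prev[i] > 0:
                gammas[i] *= up      # same sign as last step: speed up
            elif g[i] * prev[i] < 0:
                gammas[i] *= down    # sign flipped (overshoot): slow down
            w[i] -= gammas[i] * g[i]
        prev = g
    return w

def plain_descent(w, gamma=0.005, steps=100):
    """Baseline: a single fixed learning rate shared by all weights."""
    w = list(w)
    for _ in range(steps):
        g = grad(w)
        w = [wi - gamma * gi for wi, gi in zip(w, g)]
    return w

w_adaptive = adaptive_descent([1.0, 1.0])
w_plain = plain_descent([1.0, 1.0])
```

On this surface the shared rate is limited by the steep direction, so the shallow direction converges slowly. The per-weight rates let each direction settle at its own pace.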

4.2 The generalised delta rule

Since we are now using units with nonlinear activation functions, we have to generalise the delta rule which was presented in chapter 3 for linear functions to the set of non-linear activation functions. The activation is a differentiable function of the total input, given by

$$y_k^p = F(s_k^p), \qquad (4.1)$$

in which

$$s_k^p = \sum_j w_{jk} \, y_j^p + \theta_k. \qquad (4.2)$$

Figure 4.1: A multi-layer network with l layers of units.

Footnote 1: Of course, when linear activation functions are used, a multi-layer network is not more powerful than a single-layer network.
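As a minimal sketch of the two defining equations, the snippet below computes the total input and the activation of a single unit k. The logistic function is only one possible differentiable choice for F and is an assumption here. The function names and the numeric values are hypothetical.

```python
import math

def sigmoid(s):
    """Logistic function; an assumed example of a differentiable F."""
    return 1.0 / (1.0 + math.exp(-s))

def unit_output(weights, inputs, theta, F=sigmoid):
    """Return (s_k, y_k) with s_k = sum_j w_jk * y_j + theta_k and y_k = F(s_k)."""
    s = sum(w * y for w, y in zip(weights, inputs)) + theta
    return s, F(s)

# Two incoming connections whose contributions cancel, with zero bias.
s_k, y_k = unit_output(weights=[0.5, -0.25], inputs=[1.0, 2.0], theta=0.0)
```

Passing F as a parameter keeps the sketch independent of the particular activation function, in line with the text, which only requires F to be differentiable.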


