# Common extensions to Backpropagation

Preconditioning weights

The outcome and speed of the learning process are influenced by the initial state of the network. However, it is generally not possible to tell in advance which initial state will work best. A commonly accepted approach is to initialize the weights with uniformly distributed random numbers on the (0,1) interval.
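As a minimal sketch of this initialization, the snippet below fills one weight matrix per pair of adjacent layers with uniform random values. The function name `init_weights`, the `layer_sizes` argument and the nested-list layout are assumptions made for illustration, not part of any particular library. Python's `random.random()` draws from the half-open interval [0, 1).

```python
import random


def init_weights(layer_sizes, seed=None):
    """Return one weight matrix per pair of adjacent layers.

    Every weight is drawn uniformly from [0, 1).
    `layer_sizes` is a hypothetical description of the network,
    e.g. [3, 4, 2] for 3 inputs, 4 hidden units and 2 outputs.
    """
    rng = random.Random(seed)  # a fixed seed makes runs reproducible
    weights = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # layer[j][i] is the weight from unit i of the previous layer
        # to unit j of the current layer
        layer = [[rng.random() for _ in range(n_in)] for _ in range(n_out)]
        weights.append(layer)
    return weights


weights = init_weights([3, 4, 2], seed=0)
```

Fixing the seed is only a convenience for comparing runs; because the result depends on the initial state, training is often repeated with several different seeds.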

Preconditioning data

Very often the training set fed to the network consists of real-time measurement data expressed in various units and scales. These values must be normalized before they are used to train the network. The normalization interval depends on the activation function of the network: for the sigmoid it should be (0,1), and for the hyperbolic tangent it should be (-1,1).
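One simple way to do this is min-max scaling, sketched below for a single feature. The helper name `min_max_scale` and the sample temperatures are hypothetical. Note that this scheme maps the smallest and largest values exactly onto the interval endpoints, so it produces the closed interval rather than the open one.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly rescale `values` so that min -> lo and max -> hi."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # A constant feature carries no information; map it to the midpoint
        # instead of dividing by zero.
        return [(lo + hi) / 2.0 for _ in values]
    return [lo + (v - v_min) / span * (hi - lo) for v in values]


temperatures_c = [10.0, 20.0, 30.0]  # made-up measurements

for_sigmoid = min_max_scale(temperatures_c, 0.0, 1.0)   # sigmoid range
for_tanh = min_max_scale(temperatures_c, -1.0, 1.0)     # tanh range
```

In practice the minimum and maximum would be taken from the training set and then reused unchanged for any later inputs, so that all data passes through the same transformation.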

Avoiding local minima

Gradient descent algorithms attempt to move towards the global minimum of the error function. If the error surface has no local minima, the situation is very simple, because any step downwards will take us closer to the global minimum point. However, error surfaces of real problems can be very complex. If there are local minimum points in the error surface, the gradient descent algorithm can get trapped in one of them.
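The trapping effect can be illustrated on a toy one-dimensional error function. The polynomial below is chosen purely for illustration (it is not a real network error); it has a shallow local minimum near w ≈ 1.13 and a deeper global minimum near w ≈ -1.30. Plain gradient descent ends up in whichever basin it starts in.

```python
def error(w):
    """Toy error surface with one local and one global minimum."""
    return w**4 - 3 * w**2 + w


def error_gradient(w):
    """Derivative of `error` with respect to w."""
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w, learning_rate=0.01, steps=1000):
    """Repeatedly step against the gradient, starting from `w`."""
    for _ in range(steps):
        w -= learning_rate * error_gradient(w)
    return w


w_from_right = gradient_descent(2.0)   # settles in the local minimum
w_from_left = gradient_descent(-2.0)   # settles in the global minimum
```

Both runs use the same algorithm and learning rate; only the starting point differs, and the run started at 2.0 never reaches the lower error available on the other side of the hump.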

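One common remedy is to add a momentum term, which blends the current gradient with the previous weight change:

$latex \Delta w_{l,i,j}^{(t+1)}=-\alpha(1-\beta)\frac{\partial E^{(t)}}{\partial w_{l,i,j}}+\beta\Delta w_{l,i,j}^{(t)}=-\alpha(1-\beta)\sum\limits_{k=0}^t \beta^k\frac{\partial E^{(t-k)}}{\partial w_{l,i,j}}. &s=-2&bg=ffffff&fg=000000$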
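The update rule above can be sketched for a single weight as follows. Here `alpha` is assumed to be the learning rate and `beta` the momentum coefficient, and the gradient values are made up for illustration. The sketch also checks numerically that the recursive form equals the exponentially weighted sum of past gradients, which holds when the initial weight change is zero.

```python
def momentum_delta(gradient, previous_delta, alpha, beta):
    """Recursive form: blend the new gradient with the previous weight change."""
    return -alpha * (1.0 - beta) * gradient + beta * previous_delta


def unrolled_delta(gradients, alpha, beta):
    """Sum form: exponentially decaying average over all past gradients."""
    t = len(gradients) - 1
    weighted = sum(beta**k * gradients[t - k] for k in range(t + 1))
    return -alpha * (1.0 - beta) * weighted


gradients = [0.5, -0.2, 0.1, 0.4]  # hypothetical dE/dw at steps 0..3
alpha, beta = 0.1, 0.9             # assumed learning rate and momentum

delta = 0.0                        # no weight change before the first step
for g in gradients:
    delta = momentum_delta(g, delta, alpha, beta)

closed_form = unrolled_delta(gradients, alpha, beta)
```

Because older gradients are weighted by increasing powers of `beta`, a single gradient that points the other way does not immediately reverse the direction of the update, which can help the weight carry on through small dips in the error surface.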