If a neural network has multiple layers, the inner (hidden) layers have neither target values nor errors. This problem remained unsolved until the 1970s, when mathematicians found that the backpropagation algorithm could be applied to it.

Backpropagation provides a way to train neural networks with any number of hidden layers. The neurons don’t even have to be organized into layers: the algorithm works with any directed graph of neurons (from inputs to outputs) that doesn’t contain cycles. Such networks are called feedforward networks, and their connection graphs are directed acyclic graphs.
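To illustrate why an acyclic graph is enough, here is a minimal sketch that evaluates neurons in topological order, so every neuron is computed only after all of its predecessors. The dictionary layout `incoming`, the node names, and the sigmoid activation are assumptions made for this example only.

```python
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def evaluate_dag(incoming, inputs):
    """Evaluate every neuron of an acyclic network.

    incoming: {neuron: {predecessor: weight}}  (hypothetical layout)
    inputs:   {input_node: value}
    """
    # Map each neuron to its predecessors; static_order() yields predecessors first.
    graph = {node: set(weights) for node, weights in incoming.items()}
    value = dict(inputs)
    for node in TopologicalSorter(graph).static_order():
        if node in value:  # input node: its value is given
            continue
        s = sum(w * value[p] for p, w in incoming[node].items())
        value[node] = sigmoid(s)
    return value

# A tiny network with a skip connection from x1 straight to the output y.
incoming = {
    "h": {"x1": 0.5, "x2": -0.5},
    "y": {"h": 1.0, "x1": 2.0},
}
values = evaluate_dag(incoming, {"x1": 1.0, "x2": 1.0})
```

If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, which matches the requirement that the network be acyclic.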

First we have to define a so-called error function that represents the difference between the outputs and the target values. Let’s make the error function compute the

values.
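One common choice of error function, assumed here purely for illustration, is half the sum of squared differences between outputs and targets over all patterns. The function name and the nested-list layout are hypothetical.

```python
def sum_squared_error(outputs, targets):
    """E = 1/2 * sum over patterns k and output neurons i of (y - t)^2."""
    total = 0.0
    for y_k, t_k in zip(outputs, targets):
        total += sum((y - t) ** 2 for y, t in zip(y_k, t_k))
    return 0.5 * total

# Two patterns, two output neurons each.
outputs = [[0.8, 0.1], [0.3, 0.9]]
targets = [[1.0, 0.0], [0.0, 1.0]]
error = sum_squared_error(outputs, targets)
```

The factor 1/2 is a convenience: it cancels the 2 that appears when the square is differentiated.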

The error function depends on the weights only through the activation functions, so if those are continuous and differentiable, the error is a continuous and differentiable function of the weights *w_{1}, w_{2}, …, w_{l}* of a network containing *l* weights:

Input vectors are fed to the network through the input layer. When the *k*-th input vector is fed, the neurons of the first layer compute their outputs and propagate them forward to the next layer, and so on. The output of the network is the output of the last layer. Knowing the output vector, the error of the last layer is easy to compute:
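The forward pass described above can be sketched for a layered network as follows. The `weights[l][i][j]` layout, the sigmoid activation, and the omission of bias terms are assumptions of this example.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(weights, x):
    """Return the outputs of every layer, input vector first.

    weights[l][i][j] is the weight from neuron j of the previous layer
    to neuron i of layer l (hypothetical layout, biases omitted).
    """
    activations = [list(x)]
    for layer in weights:
        prev = activations[-1]
        activations.append(
            [sigmoid(sum(w * a for w, a in zip(row, prev))) for row in layer]
        )
    return activations

weights = [
    [[1.0, -1.0], [0.5, 0.5]],  # hidden layer: 2 neurons, 2 inputs each
    [[2.0, -2.0]],              # output layer: 1 neuron, 2 inputs
]
activations = forward(weights, [1.0, 1.0])
network_output = activations[-1]
```

Keeping the outputs of every layer, not just the last one, is deliberate: the backward pass needs them to compute the derivatives.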

The partial derivative of the error function with respect to the weights of the neurons in the output layer:

Introducing the value *δ^{k}_{M,i}* as below:

The partial derivative of the error function for the output layer:
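As a sketch of the output-layer step, assume a squared error $E = \tfrac{1}{2}\sum_k \sum_i (x^k_{M,i} - t^k_i)^2$, where $M$ is the index of the output layer, $s^k_{M,i} = \sum_j w_{M,ij}\, x^k_{M-1,j}$ is the weighted input of output neuron $i$ for pattern $k$, and $x^k_{M,i} = f(s^k_{M,i})$ is its output. The symbols $s$, $x$, $t$, $f$ and $w_{M,ij}$ are notation assumed for this example. Under those assumptions the chain rule gives:

```latex
\begin{align*}
\frac{\partial E}{\partial w_{M,ij}}
  &= \sum_k \left(x^k_{M,i} - t^k_i\right) f'\!\left(s^k_{M,i}\right) x^k_{M-1,j}, \\
\delta^k_{M,i}
  &= \left(x^k_{M,i} - t^k_i\right) f'\!\left(s^k_{M,i}\right), \\
\frac{\partial E}{\partial w_{M,ij}}
  &= \sum_k \delta^k_{M,i}\, x^k_{M-1,j}.
\end{align*}
```

With a different error function only the factor $(x^k_{M,i} - t^k_i)$ would change; the structure of the result stays the same.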

Any neuron in hidden layer *l* directly affects only layer *l+1*, so it affects the output layer only indirectly.

The value of *δ^{k}_{M,i}* can be generalized to all layers:

According to the last equation, the errors of layer *l+1* **propagate back** to the previous layer. The partial derivative of the error function with respect to the weights of any layer can be expressed as:
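A sketch of the generalized step, using assumed notation: $s^k_{l,i}$ is the weighted input and $x^k_{l,i} = f(s^k_{l,i})$ the output of neuron $i$ in layer $l$ for pattern $k$, $w_{l,ij}$ is the weight from neuron $j$ of layer $l-1$ to neuron $i$ of layer $l$, and $\delta^k_{l,i}$ is the derivative of the error for pattern $k$ with respect to $s^k_{l,i}$. Here $l$ is a layer index, as in "hidden layer *l*". Because neuron $i$ of layer $l$ influences the error only through the neurons $h$ of layer $l+1$, the chain rule gives:

```latex
\begin{align*}
\delta^k_{l,i}
  &= f'\!\left(s^k_{l,i}\right) \sum_h \delta^k_{l+1,h}\, w_{l+1,hi}, \\
\frac{\partial E}{\partial w_{l,ij}}
  &= \sum_k \delta^k_{l,i}\, x^k_{l-1,j}.
\end{align*}
```

The recursion starts at the output layer $M$ and moves toward the input, which is where the name backpropagation comes from.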

The last step is to modify the weights in the direction that decreases the error. This direction is opposite to the partial derivative in the last equation:
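Written out as a gradient descent step, with $w_{l,ij}(t)$ as assumed notation for the value of a weight after $t$ updates, the rule would read:

```latex
\[
w_{l,ij}(t+1) = w_{l,ij}(t) - \alpha_t \,
  \left.\frac{\partial E}{\partial w_{l,ij}}\right|_{w(t)}
\]
```

The minus sign is what moves each weight against the slope of the error, and $\alpha_t$ scales the size of the step.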

The value α_{t} is a positive step size called the **learning rate**. The above update must be applied repeatedly. The error converges to a minimum, which is in general a local one and not necessarily 0, if the learning rate α_{t} is chosen as:

The above method is called the **offline** or **batch** version of the backpropagation algorithm because the weights are updated only after all of the input patterns have been processed.
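The whole batch procedure can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: it assumes sigmoid activations, a squared error, a fixed learning rate, and a hypothetical `weights[l][i][j]` layout in which bias terms are replaced by a constant input of 1.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(weights, x):
    """Outputs of every layer; weights[l][i][j] links neuron j to neuron i."""
    acts = [list(x)]
    for layer in weights:
        prev = acts[-1]
        acts.append([sigmoid(sum(w * a for w, a in zip(row, prev))) for row in layer])
    return acts

def batch_error(weights, patterns):
    """Assumed error: half the sum of squared output differences."""
    return 0.5 * sum(
        (y - t) ** 2
        for x, target in patterns
        for y, t in zip(forward(weights, x)[-1], target)
    )

def batch_gradient(weights, patterns):
    """Accumulate dE/dw over all patterns using backpropagated deltas."""
    grad = [[[0.0] * len(row) for row in layer] for layer in weights]
    for x, target in patterns:
        acts = forward(weights, x)
        # Output layer: delta = (y - t) * f'(s), and f'(s) = y * (1 - y) for a sigmoid.
        delta = [(y - t) * y * (1.0 - y) for y, t in zip(acts[-1], target)]
        for l in range(len(weights) - 1, -1, -1):
            for i, d in enumerate(delta):
                for j, a in enumerate(acts[l]):
                    grad[l][i][j] += d * a
            if l > 0:
                # Propagate back: delta_j = f'(s_j) * sum_i delta_i * w_ij.
                delta = [
                    acts[l][j] * (1.0 - acts[l][j])
                    * sum(d * weights[l][i][j] for i, d in enumerate(delta))
                    for j in range(len(acts[l]))
                ]
    return grad

def train(weights, patterns, learning_rate, epochs):
    """Batch gradient descent: one weight update per pass over all patterns."""
    for _ in range(epochs):
        grad = batch_gradient(weights, patterns)
        for l, layer in enumerate(weights):
            for i, row in enumerate(layer):
                for j in range(len(row)):
                    row[j] -= learning_rate * grad[l][i][j]
    return weights

random.seed(0)
# XOR patterns; the constant third input acts as a bias for the hidden layer.
patterns = [([0, 0, 1], [0]), ([0, 1, 1], [1]), ([1, 0, 1], [1]), ([1, 1, 1], [0])]
weights = [
    [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)],  # 3 hidden neurons
    [[random.uniform(-1, 1) for _ in range(3)]],                    # 1 output neuron
]
error_before = batch_error(weights, patterns)
train(weights, patterns, learning_rate=0.5, epochs=2000)
error_after = batch_error(weights, patterns)
```

Note that `batch_gradient` sums the contributions of every pattern before `train` changes any weight, which is exactly what distinguishes the batch version from one that updates after each pattern.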