Lecture 13

Input units  Hidden units  Output units  
Unit  Output  Unit  Weighted Sum Input  Output  Unit  Weighted Sum Input  Output 
I1  10  H1  7  0.999  O1  1.0996  0.750 
I2  30  H2  -5  0.0067  O2  3.1047  0.957
I3  20 
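The values in this table can be reproduced with a short forward pass. The following Python sketch is illustrative only: it takes the weights from the "Old weight" columns of the tables later in this example, and it assumes negative signs on the two connections from I2 (-0.1 and -1.2), since those are the signs that reproduce the weighted sums 7 and -5. The names `sigmoid`, `forward`, `W_IH` and `W_HO` are not from the lecture.

```python
import math

def sigmoid(s):
    """Logistic function applied by every hidden and output unit."""
    return 1.0 / (1.0 + math.exp(-s))

# Assumed weights: W_IH[i][j] is the weight from input unit i to hidden unit j.
W_IH = [[0.2, 0.7],     # from I1 to H1, H2
        [-0.1, -1.2],   # from I2 to H1, H2 (signs assumed)
        [0.4, 1.2]]     # from I3 to H1, H2
# W_HO[j][k] is the weight from hidden unit j to output unit k.
W_HO = [[1.1, 3.1],     # from H1 to O1, O2
        [0.1, 1.17]]    # from H2 to O1, O2

def forward(x, w_ih, w_ho):
    """Return (hidden sums, hidden outputs, output sums, output values)."""
    h_sums = [sum(x[i] * w_ih[i][j] for i in range(len(x)))
              for j in range(len(w_ih[0]))]
    h_out = [sigmoid(s) for s in h_sums]
    o_sums = [sum(h_out[j] * w_ho[j][k] for j in range(len(h_out)))
              for k in range(len(w_ho[0]))]
    o_out = [sigmoid(s) for s in o_sums]
    return h_sums, h_out, o_sums, o_out

h_sums, h_out, o_sums, o_out = forward([10, 30, 20], W_IH, W_HO)
print("hidden sums:   ", h_sums)
print("hidden outputs:", h_out)
print("output sums:   ", o_sums)
print("output values: ", o_out)
```

Because the table rounds the hidden outputs before using them, the output weighted sums computed here agree with the table only to about three decimal places.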
Suppose now that the target categorisation for the example was the one associated with O1. This means that the network miscategorised the example and gives us an opportunity to demonstrate the backpropagation algorithm: we will update the weights in the network according to the weight training calculations provided above, using a learning rate of η = 0.1.
If the target categorisation was associated with O1, this means that the target output for O1 was 1, and the target output for O2 was 0. Hence, using the above notation,
t_{1}(E) = 1; t_{2}(E) = 0; o_{1}(E) = 0.750; o_{2}(E) = 0.957
That means we can calculate the error values for the output units O1 and O2 as follows:
δ_{O1} = o_{1}(E)(1 - o_{1}(E))(t_{1}(E) - o_{1}(E)) = 0.750(1 - 0.750)(1 - 0.750) = 0.0469
δ_{O2} = o_{2}(E)(1 - o_{2}(E))(t_{2}(E) - o_{2}(E)) = 0.957(1 - 0.957)(0 - 0.957) = -0.0394
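As a quick check of the arithmetic, the two output error terms can be computed directly from the formula o(1 - o)(t - o). This is a minimal Python sketch; the function name `output_delta` and the dictionaries are illustrative choices, not part of the lecture.

```python
def output_delta(o, t):
    """Error term for a sigmoid output unit: o(1 - o)(t - o)."""
    return o * (1.0 - o) * (t - o)

outputs = {"O1": 0.750, "O2": 0.957}   # observed outputs for example E
targets = {"O1": 1.0, "O2": 0.0}       # the target categorisation is O1

deltas = {unit: output_delta(outputs[unit], targets[unit]) for unit in outputs}
for unit, d in deltas.items():
    print(unit, f"{d:.6f}")
```

The sign carries the direction of the correction: a positive error term means the unit's output should be pushed up, and a negative one means it should be pushed down.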
We can now propagate this information backwards to calculate the error terms for the hidden nodes H1 and H2. To do this for H1, we multiply the error term for O1 by the weight from H1 to O1, then add this to the product of the error term for O2 and the weight between H1 and O2. This gives us: (1.1*0.0469) + (3.1*-0.0394) = -0.0706. To turn this into the error value for H1, we multiply by h_{1}(E)*(1 - h_{1}(E)), where h_{1}(E) is the output from H1 for example E, as recorded in the table above. This gives us:
δ_{H1} = -0.0706*(0.999 * (1 - 0.999)) = -0.0000705
A similar calculation for H2 gives the first part to be: (0.1*0.0469) + (1.17*-0.0394) = -0.0414, and, using the output h_{2}(E) = 0.0067 recorded in the table above, the overall error value to be:
δ_{H2} = -0.0414 * (0.0067 * (1 - 0.0067)) = -0.000276
We now have all the information required to calculate the weight changes for the network. We will deal with the 6 weights between the input units and the hidden units first:
Input unit  Hidden unit  η  δ_{H}  x_{i}  Δ = η*δ_{H}*x_{i}  Old weight  New weight
I1  H1  0.1  -0.0000705  10  -0.0000705  0.2  0.1999295
I1  H2  0.1  -0.000276  10  -0.000276  0.7  0.699724
I2  H1  0.1  -0.0000705  30  -0.0002115  -0.1  -0.1002115
I2  H2  0.1  -0.000276  30  -0.000828  -1.2  -1.200828
I3  H1  0.1  -0.0000705  20  -0.000141  0.4  0.399859
I3  H2  0.1  -0.000276  20  -0.000552  1.2  1.199448
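The hidden-unit error terms and the six weight updates can be recomputed in a few lines. The sketch below is in Python and rests on stated assumptions: it uses the rounded table values h_{1}(E) = 0.999 and h_{2}(E) = 0.0067, takes the output error terms as 0.0469 for O1 and -0.0394 for O2, and assumes negative weights on the two connections from I2. The names `hidden_delta`, `w_ih` and `w_ho` are illustrative.

```python
ETA = 0.1  # learning rate used in the worked example

def hidden_delta(h, weights_out, deltas_out):
    """h(1 - h) times the weighted sum of the downstream error terms."""
    back = sum(w * d for w, d in zip(weights_out, deltas_out))
    return h * (1.0 - h) * back

delta_out = [0.0469, -0.0394]        # error terms for O1 and O2
h_out = [0.999, 0.0067]              # outputs of H1 and H2 (rounded, from the table)
w_ho = [[1.1, 3.1], [0.1, 1.17]]     # w_ho[j] = weights from hidden unit j to O1, O2

delta_hidden = [hidden_delta(h_out[j], w_ho[j], delta_out) for j in range(2)]

x = [10, 30, 20]                                # input values for example E
w_ih = [[0.2, 0.7], [-0.1, -1.2], [0.4, 1.2]]   # w_ih[i][j]: input i to hidden j (signs assumed)

# New weight = old weight + eta * delta_H * x_i
new_w_ih = [[w_ih[i][j] + ETA * delta_hidden[j] * x[i] for j in range(2)]
            for i in range(3)]

print("hidden error terms:", delta_hidden)
for i, row in enumerate(new_w_ih, start=1):
    print(f"I{i}:", row)
```

Both hidden error terms come out negative, so every weight fed by a positive input is nudged slightly downwards.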
We now turn to the problem of altering the weights between the hidden layer and the output layer. The calculations are similar, but instead of relying on the input values from E, they use the values calculated by the sigmoid functions in the hidden nodes: h_{i}(E). The following table calculates the relevant values:
Hidden unit  Output unit  η  δ_{O}  h_{i}(E)  Δ = η*δ_{O}*h_{i}(E)  Old weight  New weight
H1  O1  0.1  0.0469  0.999  0.00469  1.1  1.10469
H1  O2  0.1  -0.0394  0.999  -0.00394  3.1  3.09606
H2  O1  0.1  0.0469  0.0067  0.0000314  0.1  0.1000314
H2  O2  0.1  -0.0394  0.0067  -0.0000264  1.17  1.1699736
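The same update rule, Δ = η*δ_{O}*h_{i}(E), can be applied mechanically to the four hidden-to-output weights. This Python sketch assumes the error terms 0.0469 for O1 and -0.0394 for O2 and the rounded hidden outputs 0.999 and 0.0067; the dictionary layout is simply a convenient illustration.

```python
ETA = 0.1  # learning rate

delta_out = {"O1": 0.0469, "O2": -0.0394}   # output error terms
h_out = {"H1": 0.999, "H2": 0.0067}         # hidden unit outputs for example E
old_w = {("H1", "O1"): 1.1, ("H1", "O2"): 3.1,
         ("H2", "O1"): 0.1, ("H2", "O2"): 1.17}

new_w = {}
for (h, o), w in old_w.items():
    change = ETA * delta_out[o] * h_out[h]   # eta * delta_O * h_i(E)
    new_w[(h, o)] = w + change
    print(h, o, f"{change:+.7f}", f"{new_w[(h, o)]:.7f}")
```

The weights into O1 increase and the weights into O2 decrease, which is the direction needed to raise the output of O1 and lower the output of O2.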
We note that the weights haven't altered all that much, so it might be a good idea in this situation to use a bigger learning rate. However, remember that, with sigmoid units, small changes in the weighted sum can produce big changes in the output from the unit.
As an exercise, check whether the retrained network performs better with respect to the example than the original network.
The error rate of multilayered networks over a training set could be calculated as the number of misclassified examples. Remembering, however, that there are many output nodes, all of which could potentially misfire (e.g., giving a value close to 1 when it should have output 0, and vice versa), we can be more sophisticated in our error evaluation. In practice the overall network error is calculated as:
Error = Σ_{E ∈ examples} Σ_{k ∈ output units} (t_{k}(E) - o_{k}(E))^{2}
This is not as complicated as it first appears. The calculation simply involves working out the difference between the observed output for each output unit and the target output and squaring this to make sure it is positive, then adding up all these squared differences for each output unit and for each example.
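The description above translates almost word for word into code. The Python sketch below sums the squared differences over every example and every output unit; the function name `network_error` is illustrative, and the single example used is the one from the worked calculation. Some texts scale this sum by a constant such as 1/2, which does not change where the minimum lies; that scaling is left out here.

```python
def network_error(targets, outputs):
    """Sum of squared (target - observed) differences over all examples and output units."""
    return sum((t - o) ** 2
               for t_vec, o_vec in zip(targets, outputs)
               for t, o in zip(t_vec, o_vec))

# One example E with targets (1, 0) and observed outputs (0.750, 0.957)
targets = [[1.0, 0.0]]
outputs = [[0.750, 0.957]]
error = network_error(targets, outputs)
print(error)
```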
Backpropagation can be seen as searching a space of network configurations (weights) in order to find a configuration with the least error, measured in the above fashion. The more complicated network structure means that the error surface which is searched can have local minima. This is a problem for multilayer networks, and we look at ways around it below. Having said that, even if a learned network is in a local minimum, it may still perform adequately, and multilayer networks have been used to great effect in real world situations (see Tom Mitchell's book for a description of an ANN which can drive a car!).
One way around the problem of local minima is to use random restart as described in the lecture on search techniques. Different initial random weightings for the network may mean that it converges to different local minima, and the best of these can be taken for the learned ANN. Alternatively, as described in Mitchell's book, a "committee" of networks could be learned, with the (possibly weighted) average of their decisions taken as an overall decision for a given test example. Another alternative is to try and skip over some of the smaller local minima, as described below.
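The committee idea can be sketched as a (possibly weighted) average of the output vectors produced by several independently trained networks. The Python below is a minimal illustration under assumptions: the function name `committee_decision` and the three output vectors are hypothetical, and the category chosen is simply the output unit with the largest averaged value.

```python
def committee_decision(outputs_per_network, weights=None):
    """Average each output unit across networks; return (averages, index of the winner)."""
    n = len(outputs_per_network)
    if weights is None:
        weights = [1.0] * n          # unweighted average by default
    total = sum(weights)
    n_units = len(outputs_per_network[0])
    avg = [sum(w * out[k] for w, out in zip(weights, outputs_per_network)) / total
           for k in range(n_units)]
    winner = max(range(n_units), key=lambda k: avg[k])
    return avg, winner

# Hypothetical outputs (O1, O2) from three networks trained from different random weights
votes = [[0.75, 0.96], [0.90, 0.20], [0.80, 0.30]]
avg, winner = committee_decision(votes)
print(avg, "-> chosen output unit: O%d" % (winner + 1))
```

Here one network prefers O2, but it is outvoted by the other two once the outputs are averaged.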
Imagine a ball rolling down a hill. As it does so, it gains momentum, so that its speed increases and it becomes more difficult to stop. As it rolls down the hill towards the valley floor (the global minimum), it might occasionally wander into local hollows. However, it may be that the momentum it has obtained keeps it rolling up and out of the hollow and back on track to the valley floor.
This crude analogy describes one heuristic technique for avoiding local minima, called (funnily enough) adding momentum. The method is simple: for each weight, remember the previous value of Δ which was added on to the weight in the last epoch. Then, when updating that weight for the current epoch, add on a little of the previous Δ. How much of the previous Δ to add is controlled by a parameter α called the momentum, which is set to a value between 0 and 1.
To see why this might help bypass local minima, note that if the weight change carries on in the direction it was going in the previous epoch, then the movement will be a little more pronounced in the current epoch. This effect will be compounded as the search continues in the same direction. When the trend finally reverses, then the search may be at the global minimum, in which case it is hoped that the momentum won't be enough to take it anywhere other than where it is. Alternatively, the search may be at a fairly narrow local minimum. In this case, even though the backpropagation algorithm dictates that Δ will change direction, the additional extra from the previous epoch (the momentum) may be enough to counteract this effect for a few steps. These few steps may be all that is needed to bypass the local minimum.
In addition to getting over some local minima, when the gradient is constant in one direction, adding momentum will increase the size of the weight change after each epoch, and the network may converge quicker. Note that it is possible to have cases where (a) the momentum is not enough to carry the search out of a local minimum or (b) the momentum carries the search out of the global minimum into a local minimum. This is why this technique is a heuristic method and should be used somewhat carefully (it is used in practice a great deal).
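The effect of momentum on a constant gradient can be seen in a small numerical sketch. The Python below assumes the update Δ(current) = (usual backpropagation step) + α*Δ(previous), which is one straightforward reading of the method described above; the function name `momentum_step`, the constant step of 0.1 and α = 0.5 are all hypothetical choices for illustration.

```python
def momentum_step(weight, plain_step, prev_delta, alpha):
    """One update: the usual step plus a fraction alpha of the previous weight change."""
    delta = plain_step + alpha * prev_delta
    return weight + delta, delta

w, prev = 0.0, 0.0
steps = []
for epoch in range(5):
    # Hypothetical situation: backpropagation asks for the same step of 0.1 every epoch
    w, prev = momentum_step(w, plain_step=0.1, prev_delta=prev, alpha=0.5)
    steps.append(prev)

print(steps)
```

Each weight change is larger than the one before, and with a constant step s the change approaches s / (1 - α), which is 0.2 in this setting. When the plain step reverses sign, the leftover α*Δ term opposes it for a few epochs, which is the behaviour that may carry the search through a narrow local minimum.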
Left unchecked, backpropagation in multilayer networks can be highly susceptible to overfitting itself to the training examples. The following graph plots the error on the training and test set as the number of weight updates increases. It is typical of networks left to train unchecked.
Alarmingly, even though the error on the training set continues to gradually decrease, the error on the test set actually begins to increase towards the end. This is clearly overfitting, and it relates to the network beginning to find and fine-tune to idiosyncrasies in the data, rather than to general properties. Given this phenomenon, it would be unwise to use some kind of threshold for the error as the termination condition for backpropagation.
In cases where the number of training examples is high, one antidote to overfitting is to split the training examples into a set used to train the weights and a set to hold back as an internal validation set. This is a mini test set, which can be used to keep the network in check: if the error on the validation set reaches a minimum and then begins to increase, then it could be that overfitting is beginning to occur.
Note that (time permitting) it is worth giving the training algorithm the benefit of the doubt as much as possible. That is, the error in the validation set can also go through local minima, and it is not wise to stop training as soon as the validation set error starts to increase, as a better minimum may be achieved later on. Of course, if the minimum is never bettered, then the network which is finally presented by the learning algorithm should be rewound to be the one which produced the minimum on the validation set.
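The "benefit of the doubt" can be expressed as a patience parameter: training continues for a fixed number of epochs after the best validation error so far, and the best epoch is remembered so the network can be rewound to it. The Python sketch below works on a precomputed, entirely hypothetical list of validation errors; in a real training loop a copy of the weights would be saved whenever a new best is found. The name `early_stopping` and the `patience` parameter are illustrative.

```python
def early_stopping(val_errors, patience):
    """Return (best_epoch, best_error), stopping after `patience` epochs without improvement."""
    best_epoch, best_error = None, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err   # a real loop would also save the weights here
        elif epoch - best_epoch >= patience:
            break                                  # waited long enough, give up
    return best_epoch, best_error

# Hypothetical validation errors: a shallow minimum at epoch 2 and a better one at epoch 5
errors = [0.90, 0.60, 0.50, 0.55, 0.52, 0.40, 0.45, 0.50, 0.55, 0.60]
impatient = early_stopping(errors, patience=1)
patient = early_stopping(errors, patience=3)
print(impatient, patient)
```

With a patience of 1 the search stops at the first rise and settles for epoch 2, while a patience of 3 rides out the bump and finds the better minimum at epoch 5.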
Another way around overfitting is to decrease each weight by a small weight decay factor during each epoch. Learned networks with large (positive or negative) weights tend to have overfitted the data, because larger weights are needed to accommodate outliers in the data. Hence, keeping the weights low with a weight decay factor may help to steer the network from overfitting.
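One simple way to realise weight decay is to shrink every weight towards zero by a small fraction once per epoch. The Python sketch below makes that assumption explicitly (multiplying each weight by 1 - decay); the lecture does not specify the exact form, and the function name `decay_weights`, the decay value of 0.01 and the example weights are illustrative.

```python
def decay_weights(weights, decay):
    """Shrink every weight towards zero by the fraction `decay` (an assumed form of decay)."""
    return [[w * (1.0 - decay) for w in row] for row in weights]

# Example weight matrix with both positive and negative entries
w = [[0.2, 0.7], [-0.1, -1.2], [0.4, 1.2]]
decayed = decay_weights(w, decay=0.01)
print(decayed)
```

Because the shrinkage is proportional to the weight, large positive and large negative weights are pulled back the most, while weights near zero are barely affected.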
As we did for decision trees, it's important to know when ANNs are the right representation scheme for the job. The following are some characteristics of learning tasks for which artificial neural networks are an appropriate representation:
The concept (target function) to be learned can be characterised in terms of a real-valued function. That is, there is some translation from the training examples to a set of real numbers, and the output from the function is either real-valued or (if a categorisation) can be mapped to a set of real values. It's important to remember that ANNs are just giant mathematical functions, so the data they play around with are numbers, rather than logical expressions, etc. This may sound restrictive, but many learning problems can be expressed in a way that ANNs can tackle them, especially as real numbers contain booleans (true and false mapped to +1 and -1) and integers, and vectors of these data types can also be used.
Long training times are acceptable. Neural networks generally take a longer time to train than, for example, decision trees. Many factors, including the number of training examples, the value chosen for the learning rate and the architecture of the network, have an effect on the time required to train a network. Training times can vary from a few minutes to many hours.
It is not vitally important that humans be able to understand exactly how the learned network carries out categorisations. As we discussed above, ANNs are black boxes and it is difficult for us to get a handle on what their calculations are doing.
When in use for the actual purpose it was learned for, the evaluation of the target function needs to be quick. While it may take a long time to learn a network to, for instance, decide whether a vehicle is a tank, bus or car, once the ANN has been learned, using it for the categorisation task is typically very fast. This may be very important: if the network was to be used in a battle situation, then a quick decision about whether the object moving hurriedly towards it is a tank, bus, car or old lady could be vital.