*by Christos Stergiou*

- How the Human Brain Learns?
- From Human Neurones to Artificial Neurones.
- How Neural Networks Learn?
- An Example to illustrate the above teaching procedure.
- A description of the Back Propagation Algorithm.
- A Back-Propagation Network Example.

Much is still unknown about how the brain trains itself to process information, so theories abound. In the human brain, a
typical neuron collects signals from others through a host of fine structures called *dendrites*. The neuron sends out
spikes of electrical activity through a long, thin strand known as an *axon*, which splits into thousands of branches. At the
end of each branch, a structure called a *synapse* converts the activity from the axon into electrical effects that inhibit or
excite activity in the connected neurones. When a
neuron receives excitatory input that is sufficiently large compared with its inhibitory input, it sends a spike of electrical
activity down its axon. Learning occurs by changing the effectiveness of the synapses so that the influence of one neuron
on another changes.

Components of a neuron

The synapse

We construct artificial neural networks by first trying to deduce the essential features of neurones and their interconnections. We then typically program a computer to simulate these features. However, because our knowledge of neurones is incomplete and our computing power is limited, our models are necessarily gross idealisations of real networks of neurones.

The neuron model

Artificial neural networks are typically composed of interconnected "units", which serve as model neurones. The
function of the *synapse* is modelled by a modifiable weight, which is associated with each connection. Each unit
converts the pattern of incoming activities that it receives into a single outgoing activity that it broadcasts to other units. It
performs this conversion in two stages:

- It multiplies each incoming activity by the weight on the connection and adds together all these weighted inputs to get a quantity called the *total input*.
- It uses an input-output function that transforms the total input into the outgoing activity.

The behaviour of an ANN (Artificial Neural Network) depends on both the weights and the input-output function (transfer function) that is specified for the units. This function typically falls into one of three categories:

- linear
- threshold
- sigmoid

For **linear units**, the output activity is proportional to the total weighted input.

For **threshold units**, the output is set at one of two levels, depending on whether the total input is greater than or less
than some threshold value.

For **sigmoid units**, the output varies continuously but not linearly as the input changes. Sigmoid units bear a greater
resemblance to real neurones than do linear or threshold units, but all three must be considered rough approximations.
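The two-stage conversion and the three kinds of transfer function can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names, the `slope` and `theta` parameters, and the example numbers are all assumptions, and the logistic function is used as one common choice of sigmoid.

```python
import math

def total_input(activities, weights):
    """Stage 1: multiply each incoming activity by its weight and sum."""
    return sum(a * w for a, w in zip(activities, weights))

def linear(x, slope=1.0):
    """Linear unit: output proportional to the total input (slope is assumed)."""
    return slope * x

def threshold(x, theta=0.0):
    """Threshold unit: one of two levels, depending on the threshold theta."""
    return 1.0 if x > theta else 0.0

def sigmoid(x):
    """Sigmoid unit: logistic function, chosen here as one common sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(activities, weights, transfer):
    """Stage 2: apply the transfer function to the total input."""
    return transfer(total_input(activities, weights))

# Illustrative values only.
incoming = [1.0, 0.5, -1.0]
weights = [0.2, 0.4, 0.1]

x = total_input(incoming, weights)               # 0.2 + 0.2 - 0.1
print(x)
print(unit_output(incoming, weights, linear))
print(unit_output(incoming, weights, threshold))
print(unit_output(incoming, weights, sigmoid))
```

The same total input gives three different outgoing activities, which is why the behaviour of the network depends on the transfer function as well as on the weights.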

To make a neural network that performs some specific task, we must choose how the units are connected to one another, and we must set the weights on the connections appropriately. The connections determine whether it is possible for one unit to influence another. The weights specify the strength of the influence.

The commonest type of artificial neural network consists of three groups, or layers, of units: a layer of "**input**" units is
connected to a layer of "**hidden**" units, which is connected to a layer of "**output**" units.

- The activity of the input units represents the raw information that is fed into the network.
- The activity of each hidden unit is determined by the activities of the input units and the weights on the connections between the input and the hidden units.
- The behaviour of the output units depends on the activity of the hidden units and the weights between the hidden and output units.

This simple type of network is interesting because the hidden units are free to construct their own representations of the input. The weights between the input and hidden units determine when each hidden unit is active, and so by modifying these weights, a hidden unit can choose what it represents.
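The flow of activity through the three layers might be sketched as follows. The layer sizes, the weights, and the use of sigmoid units in both layers are assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(activities, weight_rows):
    """Compute the activities of the next layer.

    Each row of weight_rows holds the weights on the connections
    coming into one unit of the next layer.
    """
    return [sigmoid(sum(a * w for a, w in zip(activities, row)))
            for row in weight_rows]

def forward(inputs, w_input_hidden, w_hidden_output):
    """Input activities -> hidden activities -> output activities."""
    hidden = layer(inputs, w_input_hidden)
    outputs = layer(hidden, w_hidden_output)
    return hidden, outputs

# Hypothetical network: 3 input units, 2 hidden units, 1 output unit.
w_ih = [[0.5, -0.5, 0.0],
        [1.0, 1.0, 1.0]]
w_ho = [[2.0, -1.0]]

hidden, outputs = forward([1.0, 0.0, 1.0], w_ih, w_ho)
print(hidden, outputs)
```

Changing a row of `w_ih` changes which input patterns make the corresponding hidden unit active, which is the sense in which a hidden unit's representation is set by its incoming weights.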

We can teach a three-layer network to perform a particular task by using the following procedure:

- We present the network with training examples, which consist of a pattern of activities for the input units together with the desired pattern of activities for the output units.
- We determine how closely the actual output of the network matches the desired output.
- We change the weight of each connection so that the network produces a better approximation of the desired output.

Assume that we want a network to recognise hand-written digits. We might use an array of, say, 256 sensors, each recording the presence or absence of ink in a small area of a single digit. The network would therefore need 256 input units (one for each sensor), 10 output units (one for each kind of digit) and a number of hidden units.

For each kind of digit recorded by the sensors, the network should produce high activity in the appropriate output unit and low activity in the other output units.

To train the network, we present an image of a digit and compare the actual activity of the 10 output units with the desired activity. We then calculate the error, which is defined as the square of the difference between the actual and the desired activities. Next we change the weight of each connection so as to reduce the error. We repeat this training process for many different images of each kind of digit until the network classifies every image correctly.
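As a small sketch of the error calculation for the digit example, the code below builds the desired activities for one digit and compares them with a made-up set of actual output activities. Summing the squared differences over the 10 output units is an assumption (the text defines the error only as the square of the difference), and the helper names are hypothetical.

```python
def desired_activities(digit, n_outputs=10):
    """High activity (1.0) in the unit for `digit`, low (0.0) elsewhere."""
    return [1.0 if k == digit else 0.0 for k in range(n_outputs)]

def squared_error(actual, desired):
    """Sum over output units of the squared difference (assumed form)."""
    return sum((a - d) ** 2 for a, d in zip(actual, desired))

def classify(actual):
    """The network's answer: index of the most active output unit."""
    return max(range(len(actual)), key=lambda k: actual[k])

# Made-up output activities for an image of the digit 3.
actual = [0.1, 0.0, 0.1, 0.8, 0.0, 0.0, 0.2, 0.0, 0.0, 0.1]
desired = desired_activities(3)

error = squared_error(actual, desired)
print(error, classify(actual))
```

Here the image is already classified correctly, yet the error is not zero, so training would still adjust the weights to push the outputs closer to the desired pattern.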

To implement this procedure we need to calculate the error derivative for the weight (EW) in order to change the weight by an amount that is proportional to the rate at which the error changes as the weight is changed. One way to calculate the EW is to perturb a weight slightly and observe how the error changes. But that method is inefficient because it requires a separate perturbation for each of the many weights.
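The perturbation method can be sketched like this. A single linear unit stands in for the whole network to keep the example short; the function names, the step size `eps`, and the `learning_rate` are illustrative assumptions.

```python
def network_error(weights, inputs, desired):
    """Squared error of one linear unit (a stand-in for a full network)."""
    actual = sum(w * x for w, x in zip(weights, inputs))
    return (actual - desired) ** 2

def ew_by_perturbation(weights, inputs, desired, eps=1e-6):
    """Estimate each EW by nudging one weight at a time.

    Needs one extra error evaluation per weight, which is what makes
    the method inefficient for networks with many weights.
    """
    base = network_error(weights, inputs, desired)
    ews = []
    for i in range(len(weights)):
        nudged = list(weights)
        nudged[i] += eps
        ews.append((network_error(nudged, inputs, desired) - base) / eps)
    return ews

weights = [0.5, -0.3]
inputs = [1.0, 2.0]
desired = 1.0

ews = ew_by_perturbation(weights, inputs, desired)

# Change each weight by an amount proportional to its EW, in the
# direction that reduces the error.
learning_rate = 0.05
updated = [w - learning_rate * ew for w, ew in zip(weights, ews)]
print(ews, updated)
```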

Another way to calculate the EW is to use the back-propagation algorithm, which is described below and has become one of the most important tools for training neural networks. It was developed independently by two teams, one (Fogelman-Soulie, Gallinari and Le Cun) in France, the other (Rumelhart, Hinton and Williams) in the U.S.

To train a neural network to perform some task, we must adjust the weights of each unit in such a way that the error
between the desired output and the actual output is reduced. This process requires that the neural network compute the
error derivative of the weights (**EW**). In other words, it must calculate how the error changes as each weight is
increased or decreased slightly. The back propagation algorithm is the most widely used method for determining the
**EW**.

The back-propagation algorithm is easiest to understand if all the units in the network are linear. The algorithm
computes each **EW** by first computing the **EA**, the rate at which the error changes as the activity level of a unit is
changed. For output units, the **EA** is simply the difference between the actual and the desired output. To compute the
**EA** for a hidden unit in the layer just before the output layer, we first identify all the weights between that hidden unit
and the output units to which it is connected. We then multiply those weights by the **EA**s of those output units and add
the products. This sum equals the **EA** for the chosen hidden unit. After calculating all the **EA**s in the hidden layer just
before the output layer, we can compute in like fashion the **EA**s for other layers, moving from layer to layer in a
direction opposite to the way activities propagate through the network. This is what gives back propagation its name.
Once the **EA** has been computed for a unit, it is straightforward to compute the **EW** for each incoming connection of
the unit. The **EW** is the product of the **EA** and the activity through the incoming connection.
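For a network of linear units, the steps above might be sketched as follows. The network size, weights, and function names are assumptions for illustration. Note that an output EA equal to "actual minus desired" matches an error of half the squared difference; with the plain squared difference every EA and EW would simply be twice as large.

```python
def forward_linear(inputs, w_ih, w_ho):
    """Forward pass through linear units (output = total input)."""
    hidden = [sum(w * x for w, x in zip(row, inputs)) for row in w_ih]
    outputs = [sum(w * h for w, h in zip(row, hidden)) for row in w_ho]
    return hidden, outputs

def backprop_linear(inputs, desired, w_ih, w_ho):
    hidden, outputs = forward_linear(inputs, w_ih, w_ho)
    # EA of each output unit: actual output minus desired output.
    ea_out = [y - d for y, d in zip(outputs, desired)]
    # EA of each hidden unit: its weights to the output units,
    # multiplied by the EAs of those output units, then summed.
    ea_hid = [sum(w_ho[k][j] * ea_out[k] for k in range(len(w_ho)))
              for j in range(len(hidden))]
    # EW of a connection: EA of the receiving unit times the
    # activity arriving through that connection.
    ew_ho = [[ea_out[k] * hidden[j] for j in range(len(hidden))]
             for k in range(len(w_ho))]
    ew_ih = [[ea_hid[j] * inputs[i] for i in range(len(inputs))]
             for j in range(len(w_ih))]
    return ew_ih, ew_ho

# Hypothetical network: 2 inputs, 2 hidden units, 1 output unit.
w_ih = [[0.5, -0.2],
        [0.1, 0.4]]
w_ho = [[1.0, -1.0]]
inputs = [1.0, 2.0]
desired = [0.5]

ew_ih, ew_ho = backprop_linear(inputs, desired, w_ih, w_ho)
print(ew_ih, ew_ho)
```

A single backward sweep yields every EW at once, in contrast to perturbing the weights one at a time.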

Note that for non-linear units, the back-propagation algorithm includes an extra step. Before back-propagating, the **EA**
must be converted into the **EI**, the rate at which the error changes as the total input received by a unit is changed.
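A minimal sketch of this extra step for a single sigmoid output unit is given below. It assumes the logistic sigmoid, for which the rate of change of the output y with respect to the total input is y(1 - y), so that EI = EA * y * (1 - y); the text does not spell this formula out, and the numbers are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ea_to_ei(ea, activity):
    """Convert EA to EI for a logistic unit: EI = EA * y * (1 - y)."""
    return ea * activity * (1.0 - activity)

# One sigmoid output unit with two incoming connections.
incoming = [1.0, 0.5]
weights = [0.4, -0.6]
desired = 1.0

x = sum(w * a for w, a in zip(weights, incoming))  # total input
y = sigmoid(x)                                     # outgoing activity
ea = y - desired                                   # EA of an output unit
ei = ea_to_ei(ea, y)                               # extra step for non-linear units
# With non-linear units the EW uses the EI in place of the EA.
ew = [ei * a for a in incoming]
print(ea, ei, ew)
```

Because y(1 - y) is at most 0.25, the EI is always smaller in magnitude than the EA, and it shrinks towards zero when a unit's output is close to 0 or 1.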

In this example a back-propagation network is used to solve a specific problem: that of an X-OR logic gate. That means that input patterns of (0,0) or (1,1) should produce a value close to zero in the output node, and input patterns of (1,0) or (0,1) should produce a value near one in the output node.

Finding a set of connection weights for this task is not easy; it requires application of the back-propagation algorithm for several thousand iterations to achieve a good set of connection weights and neuron thresholds.

The basic architecture for this problem has two input nodes, two hidden nodes, and a single output node as shown above. This structure has variable thresholds on the two hidden and one output node (unit). This means that there are a total of 9 variables in the system:

- 4 weights connecting the input to the hidden nodes
- 2 weights connecting the hidden to the output node
- 3 thresholds

Suppose we put in a pattern, say (0,1). That means that there is 0 activation in the left-hand neuron on the first layer and an activation of 1 in the neuron on the right.

Now we move our attention to the next layer up. For each neuron in this layer, we calculate an input which is the weighted sum of all the activations from the first layer. The weighted sum is obtained by multiplying the vector of activations in the first layer by a "connection matrix". In our case we get a value of 0*(-11.62) + 1*(10.99) = 10.99 for the neuron on the left in the second layer, and 0*(12.88) + 1*(-13.13) = -13.13 for the neuron on the right.

These are not the activations of these neurones, though. To obtain the activations, we add a "threshold" value (which is found for each neuron using the back-propagation rule), and apply an input-output (transfer) function. The transfer function is defined for each different network. In our case it is a sigmoid.

In this case, the activation of the neuron on the left side of the hidden (middle) layer is the transfer function applied to (10.99 - 6.06) = 4.93. Applying the transfer function yields an activation value close to 1. The activation of the neuron on the right is the transfer function applied to (-13.13 + 7.19) = -5.94. Applying the transfer function yields a value close to 0.

Approximating the next step, we use a value of 1 for the activation of the neuron on the left, and 0 for the neuron on the right, multiply each activation by its appropriate connection weight, and sum the values as input to the topmost neuron. This is approximately 1*(13.34) + 0*(13.13) = 13.34. We add the threshold of -6.56 to obtain a value of 6.78. Applying the transfer function to it will yield a value close to 1 (0.946), which is the desired result. Using the other 3 binary input patterns, we can similarly show that this network yields the desired classification within an acceptable tolerance.
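The forward pass for the pattern (0,1) can be traced in code as a rough check. This sketch rests on several assumptions: numbers are written with decimal points, the transfer function is taken to be the standard logistic sigmoid (the text does not give its formula), and the weight from the right-hand input to the right-hand hidden neuron is taken as negative, following the (-13.13 + 7.19) term in the text. Only the one pattern traced above is reproduced.

```python
import math

def sigmoid(x):
    # Standard logistic function: an assumption, since the text
    # only says the transfer function "is a sigmoid".
    return 1.0 / (1.0 + math.exp(-x))

pattern = (0.0, 1.0)

# Weights from (left input, right input) into each hidden neuron.
w_left = (-11.62, 10.99)
w_right = (12.88, -13.13)        # sign of second weight assumed, see above
theta_left, theta_right = -6.06, 7.19

# Weighted sum plus threshold, then the transfer function.
in_left = sum(w * a for w, a in zip(w_left, pattern)) + theta_left
in_right = sum(w * a for w, a in zip(w_right, pattern)) + theta_right
h_left = sigmoid(in_left)        # close to 1
h_right = sigmoid(in_right)      # close to 0

# Output neuron: weights from (left hidden, right hidden) and threshold.
w_out = (13.34, 13.13)
theta_out = -6.56
output = sigmoid(w_out[0] * h_left + w_out[1] * h_right + theta_out)

print(in_left, in_right, h_left, h_right, output)
```

Under these assumptions the output comes out very close to 1, in line with the desired result for (0,1). The exact figure differs from the value quoted in the text, which may reflect a different form of sigmoid in the original network.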

#### Neural Networks by Eric Davalo and Patrick Naim

#### Learning internal representations by error propagation by Rumelhart, Hinton and Williams (1986).

#### Internet: PNNL, Pacific Northwest National Laboratory