Greedy layer-wise training is a pre-training algorithm that trains each layer of a DBN sequentially, feeding each lower layer's output to the layer above it. This leads to better optimization of the network than traditional training algorithms, i.e. training the whole network at once with stochastic gradient descent.
In terms of computational units, deep structures such as the DBN can be much more efficient (25) than their shallow counterparts, since they require fewer units (23) to represent the same function. Multi-layer deep structures can represent abstract concepts and highly varying functions by stacking many non-linear layers in a hierarchy (25). Moving from lower to higher levels in this hierarchy, the layers become increasingly abstract in terms of the complexity of the objects they represent (in the illustration below, the top layer shows the elementary pixels whereas whole images are kept in the bottom layer). This decomposition of complex objects into simpler ones is achieved by modeling a joint distribution between each visible layer and its hidden layer.
However, training a deep structure can be difficult because of strong dependencies across the parameters of different layers (22), e.g. the relation between the parts of a picture and its pixels. To resolve this problem, two things must be done: first, the lower layers must be adapted so that they feed good input to the final setting of the upper layers (the harder part); second, the upper layers must be adjusted to make good use of that final setting of the lower layers (22).
Greedy layer-wise training was introduced precisely to tackle this issue. It can be used to train the DBN layer by layer, where each layer is an RBM, and it has been shown to bring better generalization by initializing the network near a good local minimum of the training criterion, which helps it build representations of high-level abstractions of the input (25).
Among greedy layer-wise training methods (excluding semi-supervised training, which combines parts of both the supervised and unsupervised objectives), unsupervised layer-wise training generally performs better than supervised layer-wise training, because the supervised method can be, so to speak, "too greedy" and discard useful information in the hidden layers (25). In this report, we examine the unsupervised greedy layer-wise training algorithm.
A DBN is a stack of RBMs trained in a greedy, sequential manner to capture a hierarchical representation of the relationships within the training data. The model of the joint distribution between the observed vector \(x\) and the \(l\) hidden layers \(h_{k}\) is as follows (21):
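In the standard formulation (writing \(h_{0} = x\)), this distribution factorizes as

\[
P(x, h_{1}, \ldots, h_{l}) = \left( \prod_{k=0}^{l-2} P(h_{k} \mid h_{k+1}) \right) P(h_{l-1}, h_{l}),
\]

where each \(P(h_{k} \mid h_{k+1})\) is the conditional distribution of the visible units given the hidden units of the RBM at level \(k+1\), and \(P(h_{l-1}, h_{l})\) is the joint distribution of the top-level RBM.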
In unsupervised training, each layer learns a more abstract representation of the layer below it, and no labels are required in the process since the training criterion does not depend on them (25). The greedy layer-wise unsupervised training algorithm for a DBN can be summarized as follows (21):
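In essence, the first RBM is trained on the raw input; its hidden activations are then used, with its parameters frozen, as the training data for the next RBM, and the procedure is repeated up the stack, one layer at a time. The sketch below illustrates this loop under simple assumptions (a Bernoulli RBM trained with one step of contrastive divergence, arbitrary layer sizes and learning rate); it is not the exact implementation of the cited references.

```python
# Minimal sketch of greedy layer-wise unsupervised pre-training for a DBN.
# The RBM class, layer sizes and learning rate are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RBM:
    """Bernoulli-Bernoulli RBM trained with one step of contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b_vis = np.zeros(n_visible)
        self.b_hid = np.zeros(n_hidden)

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_hid)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_vis)

    def cd1_update(self, v0, lr=0.05):
        # Positive phase: hidden probabilities and a sample given the data.
        ph0 = self.hidden_probs(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step back to the visible units and up again.
        pv1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(pv1)
        # CD-1 approximation to the log-likelihood gradient.
        batch = v0.shape[0]
        self.W += lr * (v0.T @ ph0 - pv1.T @ ph1) / batch
        self.b_vis += lr * (v0 - pv1).mean(axis=0)
        self.b_hid += lr * (ph0 - ph1).mean(axis=0)

def pretrain_dbn(data, hidden_sizes, epochs=10, seed=0):
    """Train one RBM per layer; each layer's input is the frozen output of the layer below."""
    rng = np.random.default_rng(seed)
    rbms, layer_input = [], data
    for n_hidden in hidden_sizes:
        rbm = RBM(layer_input.shape[1], n_hidden, rng)
        for _ in range(epochs):
            rbm.cd1_update(layer_input)
        # Freeze this RBM; its hidden activations become the next layer's training data.
        layer_input = rbm.hidden_probs(layer_input)
        rbms.append(rbm)
    return rbms

# Example: pre-train a three-layer DBN on random binary "data".
data = (np.random.default_rng(1).random((500, 64)) < 0.5).astype(float)
rbms = pretrain_dbn(data, hidden_sizes=[32, 16, 8])
```

Each RBM is trained while the layers below it stay fixed, which is what makes the procedure greedy.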
The advantage of the unsupervised training procedure is that it allows us to use all of our data during training (a shared lower-level representation) and does not require labelled examples, since the training criterion is unsupervised. This unsupervised phase provides a good starting point for supervised training and restricts the range of parameters explored in the subsequent supervised stage (22).
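Continuing the sketch above, one way this starting point might be used (a hypothetical continuation, not prescribed by the cited references) is to copy the pre-trained RBM weights into a feed-forward network before supervised fine-tuning.

```python
# Hypothetical continuation: the pre-trained RBM weights initialize a
# feed-forward network before supervised fine-tuning.
def init_mlp_from_rbms(rbms):
    # Each MLP layer starts from the corresponding RBM's weights and hidden biases.
    return [(rbm.W.copy(), rbm.b_hid.copy()) for rbm in rbms]

def mlp_forward(layers, x):
    for W, b in layers:
        x = sigmoid(x @ W + b)  # same activation as the RBMs' hidden units
    return x

layers = init_mlp_from_rbms(rbms)
features = mlp_forward(layers, data)  # supervised training (e.g. a classifier on top) would start from here
```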