Personal page of Lun Ai

Computing Topics project on Neural Networks and RBMs

Parts I was responsible for in Group g1516326

  • Research on Deep Belief Networks (DBNs), in particular greedy layer-wise pre-training

    What is greedy layer-wise training?

    Greedy layer-wise training is a pre-training algorithm that trains the layers of a DBN one at a time, in sequence, feeding each lower layer's output to the layer above it. This gives a better starting point for optimizing the network than applying a traditional training method, such as stochastic gradient descent, to the whole deep structure at once.

    In terms of computational units, deep structures such as the DBN can be much more efficient (25) than their shallow counterparts, since they require fewer units (23) to perform the same function. Multi-layer deep structures can represent abstract concepts and highly varying functions by stacking many non-linear layers in a hierarchy (25). Moving from a lower level to a higher level in this hierarchy, the layers become more abstract in terms of the complexity of the objects they represent; for example, elementary pixels sit at one end of the hierarchy while whole images sit at the other. This decomposition of complex objects into simpler ones is achieved by modeling the joint distribution between each pair of adjacent (visible and hidden) layers.

    Why do we need greedy layer-wise training?

    Training a deep structure directly is difficult, however, because there can be strong dependencies between the parameters of different layers (22), for example the relation between parts of pictures and individual pixels. To resolve this problem, two things are suggested (22): first, the lower layers must be adapted so that they feed good input to the final setting of the upper layers (the harder part); second, the upper layers must be adjusted to make good use of that final setting of the lower layers.

    Greedy layer-wise training was introduced precisely to tackle this issue. It can be used to train a DBN one layer at a time, where each layer is an RBM, and it has been shown to improve generalization: optimizing a local criterion at each layer initializes the network near a good local minimum and helps it build high-level abstract representations of its input (25).

    Among the greedy layer-wise training approaches (leaving aside semi-supervised training, which combines parts of the supervised and unsupervised objectives), unsupervised layer-wise training generally performs better than supervised layer-wise training, because the supervised variant can be, so to speak, “too greedy” and discard useful information in the hidden layers (25). In this report we therefore examine the unsupervised greedy layer-wise training algorithm.

    How does it work?

    A DBN is a stack of RBMs that is trained in a greedy, sequential manner to capture a hierarchy of relationships within the training data. The joint distribution between the observed vector \(x\) and the \(l\) hidden layers \(h^{(k)}\) is modeled as (21)

    \[ P(x, h^{(1)}, \ldots, h^{(l)}) = \left( \prod_{k=0}^{l-2} P(h^{(k)} \mid h^{(k+1)}) \right) P(h^{(l-1)}, h^{(l)}), \]

    where \(x = h^{(0)}\), each \(P(h^{(k)} \mid h^{(k+1)})\) is the conditional distribution of the visible units given the hidden units of the RBM at level \(k+1\), and \(P(h^{(l-1)}, h^{(l)})\) is the joint distribution of the top-level RBM.
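
    As a concrete instance (assuming binary units, the standard RBM setting; this example is added for illustration and is not taken from the report itself), with \(l = 3\) hidden layers the factorization reads \(P(x, h^{(1)}, h^{(2)}, h^{(3)}) = P(x \mid h^{(1)})\, P(h^{(1)} \mid h^{(2)})\, P(h^{(2)}, h^{(3)})\), and each conditional factorizes over units as \(P(h^{(k)}_{i} = 1 \mid h^{(k+1)}) = \operatorname{sigm}\bigl(b^{(k)}_{i} + \sum_{j} W^{(k+1)}_{ij} h^{(k+1)}_{j}\bigr)\), where \(\operatorname{sigm}\) is the logistic sigmoid and \(b^{(k)}\), \(W^{(k+1)}\) denote (in notation assumed here) the biases of layer \(k\) and the weights of the RBM connecting layers \(k\) and \(k+1\).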

    In unsupervised training, each layer learns a more abstract representation of the layer below it, and no labels are required in the process since the training criterion does not depend on them (25). The greedy layer-wise unsupervised training algorithm for a DBN can be summarized as follows (21); a code sketch of the procedure is given after the list:

    1. Train the first layer as an RBM whose visible layer is the raw input, \(x = h^{(0)}\).
    2. Use the representation obtained from the first layer as input data for the second layer. This representation can be either the mean activations \(p(h^{(1)} = 1 \mid h^{(0)})\) (23) or samples drawn from \(p(h^{(1)} \mid h^{(0)})\) (24).
    3. Train the second layer as an RBM, taking the transformed data (mean activations or samples) from the first layer as training data for its visible layer.
    4. Repeat steps 2 and 3 for the desired number of layers, at each iteration propagating upward either the mean activations or the samples.
    5. Finally, fine-tune all the parameters of this unsupervised network with respect to a supervised training criterion, turning it into a classifier, e.g. by adding an extra logistic regression classifier on top and training the whole network by gradient descent (24).
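
    To make steps 1–4 concrete, the following is a minimal sketch of greedy layer-wise pre-training of a stack of RBMs, written in Python with NumPy and assuming binary units trained with one step of contrastive divergence (CD-1). The layer sizes, learning rate, epoch count and toy data are illustrative assumptions, not values taken from the report or its references.

        # Minimal sketch (not from the report): greedy layer-wise pre-training
        # of a stack of binary RBMs with CD-1 updates, using NumPy only.
        import numpy as np

        rng = np.random.default_rng(0)

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        class RBM:
            def __init__(self, n_visible, n_hidden):
                self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
                self.b_v = np.zeros(n_visible)  # visible biases
                self.b_h = np.zeros(n_hidden)   # hidden biases

            def hidden_probs(self, v):
                return sigmoid(v @ self.W + self.b_h)    # p(h = 1 | v)

            def visible_probs(self, h):
                return sigmoid(h @ self.W.T + self.b_v)  # p(v = 1 | h)

            def cd1_update(self, v0, lr=0.05):
                # One contrastive-divergence (CD-1) step on a mini-batch v0.
                ph0 = self.hidden_probs(v0)
                h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden units
                pv1 = self.visible_probs(h0)                      # reconstruction
                ph1 = self.hidden_probs(pv1)
                n = v0.shape[0]
                self.W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
                self.b_v += lr * (v0 - pv1).mean(axis=0)
                self.b_h += lr * (ph0 - ph1).mean(axis=0)

        def pretrain_dbn(data, layer_sizes, epochs=10, batch_size=32):
            """Train each RBM, in order, on the representation produced by
            the layers below it (greedy layer-wise pre-training)."""
            rbms, layer_input = [], data  # step 1: h(0) is the raw input x
            for n_hidden in layer_sizes:
                rbm = RBM(layer_input.shape[1], n_hidden)
                for _ in range(epochs):
                    perm = rng.permutation(len(layer_input))
                    for start in range(0, len(layer_input), batch_size):
                        rbm.cd1_update(layer_input[perm[start:start + batch_size]])
                rbms.append(rbm)
                # Steps 2-4: feed the mean activations p(h = 1 | v) upward
                # as the training data for the next RBM.
                layer_input = rbm.hidden_probs(layer_input)
            return rbms

        if __name__ == "__main__":
            x = (rng.random((500, 64)) < 0.3).astype(float)  # toy binary "pixel" data
            dbn = pretrain_dbn(x, layer_sizes=[32, 16])
            print("pre-trained", len(dbn), "RBM layers")

    Each RBM is trained only on the representation produced by the layers below it, which matches the bottom-up order of steps 1–4; the supervised fine-tuning of step 5 would then be applied on top of the returned stack.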

    Conclusion

    The advantage of the unsupervised training procedure is that it lets us use all of our data during training (the lower-level representations are shared) and that the training criterion does not require labels (unsupervised). This unsupervised pre-training provides a good starting point for supervised training and restricts the range of parameters that further supervised training needs to explore (22).

  • References: see (21)–(25)