Recursive Autoencoders

As we know, autoencoders learn a (usually reduced dimensional) representation of their inputs. On this page, we will discuss how recursive autoencoders can take a sequence of representation vectors, and return a useful (reduced dimensional) representation for that sequence.

Vector Representations for Sequences Suppose we have a matrix $\mathbf{L}$ of representation vectors and an ordered sequence of $m$ elements, each with an index $k$ which is used to get the element’s vector representation in $\mathbf{L}$. Our look-up operation is then a simple projection, using a binary vector $\mathbf{b}$, with zero in every position except for the $k$th index:


We can now express our ordered sequence of $m$ elements as a sequence of $m$ vector representations, namely $(\mathbf{x}_{1},\dots,\mathbf{x}_{m})$. [1]

1 Unsupervised Recursive Autoencoders


Figure 1: The structure of a simple recursive autoencoder.

See autoencoders.

We will now introduce a binary tree structure that allows for recursion, by introducing a number of hidden representations. Each parent node has two children. In the base case, both children are vector representations for two sequence elements, here we will use $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$. In every further case, we have one child as a hidden representation of the sequence so far, say $\mathbf{y}_{j}$, and the next $\mathbf{x}_{i}$ in our sequence to be processed. Such a tree structure is shown in figure 1. It should be noted that every hidden representation will have the same dimensionality as the elements of the sequence, here called $n$. [2]

In order to compute a parent representation $\mathbf{p}$, we consider its two children $\mathbf{c}_{1}$ and $\mathbf{c}_{2}$:


Here we multiply a matrix of parameters $\mathbf{W}\in\mathbb{R}^{n\times2n}$ by the concatenation of the two children and add a bias term $b$. We finally apply an element-wise activation function $f$, usually a sigmoidal function such as tanh.

We then attempt to reconstruct the children from the parent vector in a reconstruction layer:


We can then calculate the reconstruction error by simply calculating the Euclidean distance between the children and their reconstruction:

$E=0.5\Vert [\mathbf{c}_{1};\mathbf{c}_{2}]-[\mathbf{c}_{1}';\mathbf{c}_{2}']\Vert ^{2}$

We then backpropagate the error as usual, finding a representation that encodes the children as well as possible.

After computing the reconstruction $\left[\mathbf{c}_{1}';\mathbf{c}_{2}'\right]$, if one of the children (say $\mathbf{c}_{2}'$) is a non-terminal node (some $\mathbf{y}_{i}$), we can again compute the reconstruction error of that $\mathbf{y}_{i}$ using the reconstruction $\mathbf{c}_{2}'$.

This process is repeated for each parent node until we have constructed the full tree, and calculated the reconstruction error for each element in the original sequence. [3]

1.1 Optimisations

There are several adjustments we can make to optimise recursive autoencoders:

  1. Choosing a good tree structure: Note that there are many ways of building the binary tree structure. One method of determining a good tree structure is to try a greedy algorithm that tests each possibility at each step, and chooses the children nodes that give the lowest reconstruction error.
  2. Choosing a good reconstruction error: We want our reconstruction error to penalise any representation that looses the meaning of the sequence. As such, it may be a good idea to penalise more heavily a reconstruction error on a node that encodes lots of children (i.e. many elements of the sequence), than a reconstruction error on a node that represents fewer children. We can implement this easily, by altering $E$ to take into account the number of elements $n_{1}$ and $n_{2}$ underneath the child nodes $\mathbf{c}_{1}$ and $\mathbf{c}_{1}$ respectively:
    $E=\frac{n_{1}}{n_{1}+n_{2}}\left\Vert \mathbf{c}_{1}-\mathbf{c}_{1}'\right\Vert ^{2}+\frac{n_{2}}{n_{1}+n_{2}}\left\Vert \mathbf{c}_{2}-\mathbf{c}_{2}'\right\Vert ^{2}$

  3. Length normalisation: Recursive autoencoders compute the hidden representations that they then reconstruct. In order to minimise the reconstruction error, the RAE may compute a hidden representation which is very small in magnitude, which will decrease the reconstruction error, but is clearly undesirable as we will lose the meaning of the sequence. We can alleviate this effect by forcing the length of each node to have length 1, through setting $\mathbf{p}=\mathbf{p}/\left\Vert \mathbf{p}\right\Vert $. [4]

The following page discusses how recursive autoencoders can be applied to the problem of sentiment analysis.


[1]   Pollack JB. Recursive Distributed Representations;. Available from:

[2]   Tan S. Recursive Auto-encoders: An Introduction;. Available from:

[3]   Socher R, Pennington J, Huang EH, Ng AY, Manning CD. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2011. Available from:

[4]   Li P, Liu Y, Sun M. Recursive Autoencoders for ITG-Based Translation. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA: Association for Computational Linguistics; 2013. p. 567–577. Available from: