Vector Representations for Sequences

Suppose we have a matrix $\mathbf{L}$ of representation vectors and an ordered sequence of $m$ elements, each with an index $k$ that is used to retrieve the element’s vector representation from $\mathbf{L}$. Our look-up operation is then a simple projection, using a binary vector $\mathbf{b}_k$ that is zero in every position except the $k$th, where it is one:
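The look-up can be written as follows (a reconstruction consistent with the definitions above; the original equation may differ in notation):

$$\mathbf{x}_i = \mathbf{L}\,\mathbf{b}_k$$

Since $\mathbf{b}_k$ is one-hot, this product simply selects the $k$th column of $\mathbf{L}$.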
We can now express our ordered sequence of $m$ elements as a sequence of $m$ vector representations, namely $(\mathbf{x}_{1},\dots,\mathbf{x}_{m})$. [1]
For the underlying architecture, see the page on autoencoders.
We now define a binary tree structure that allows for recursion by adding a number of hidden representations. Each parent node has two children. In the base case, both children are vector representations of sequence elements; here we will use $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$. In each subsequent case, one child is a hidden representation of the sequence so far, say $\mathbf{y}_{j}$, and the other is the next element $\mathbf{x}_{i}$ of the sequence to be processed. Such a tree structure is shown in figure 1. Note that every hidden representation has the same dimensionality as the elements of the sequence, here called $n$. [2]
In order to compute a parent representation $\mathbf{p}$, we consider its two children $\mathbf{c}_{1}$ and $\mathbf{c}_{2}$:
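The parent is computed as follows (a reconstruction from the description of $\mathbf{W}$, $f$, and the bias given below; the original equation may differ in notation):

$$\mathbf{p} = f\left(\mathbf{W}\left[\mathbf{c}_{1};\mathbf{c}_{2}\right] + \mathbf{b}\right)$$

where $\left[\mathbf{c}_{1};\mathbf{c}_{2}\right]\in\mathbb{R}^{2n}$ denotes the concatenation of the two children.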
Here we multiply a parameter matrix $\mathbf{W}\in\mathbb{R}^{n\times 2n}$ by the concatenation of the two children and add a bias term $\mathbf{b}\in\mathbb{R}^{n}$. We finally apply an element-wise activation function $f$, usually a sigmoidal function such as $\tanh$.
We then attempt to reconstruct the children from the parent vector in a reconstruction layer:
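The reconstruction layer mirrors the encoding step (a reconstruction; the decoder parameters $\mathbf{W}'$ and $\mathbf{b}'$ are the standard choice, though the original notation may differ):

$$\left[\mathbf{c}_{1}';\mathbf{c}_{2}'\right] = f\left(\mathbf{W}'\mathbf{p} + \mathbf{b}'\right)$$

with $\mathbf{W}'\in\mathbb{R}^{2n\times n}$ and $\mathbf{b}'\in\mathbb{R}^{2n}$.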
We can then calculate the reconstruction error as the squared Euclidean distance between the children and their reconstruction:
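Written out (a reconstruction; the factor of $\frac{1}{2}$ follows the formulation in [3] and merely simplifies the gradient):

$$E\left(\left[\mathbf{c}_{1};\mathbf{c}_{2}\right]\right) = \frac{1}{2}\left\lVert\left[\mathbf{c}_{1};\mathbf{c}_{2}\right] - \left[\mathbf{c}_{1}';\mathbf{c}_{2}'\right]\right\rVert^{2}$$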
We then backpropagate the error as usual, finding a representation that encodes the children as well as possible.
After computing the reconstruction $\left[\mathbf{c}_{1}';\mathbf{c}_{2}'\right]$, if one of the children, say $\mathbf{c}_{2}$, is itself a non-terminal node (some $\mathbf{y}_{i}$), we can in turn compute the reconstruction error of that $\mathbf{y}_{i}$ against its reconstruction $\mathbf{c}_{2}'$.
This process is repeated for each parent node until we have constructed the full tree, and calculated the reconstruction error for each element in the original sequence. [3]
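The encode–reconstruct loop above can be sketched in NumPy. This is a minimal forward pass only (no backpropagation), folding the sequence left to right; the parameter names, random initialisation, and helper functions are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # dimensionality of every element and hidden representation

# Hypothetical parameters: W/b encode two children into a parent,
# W_prime/b_prime decode a parent back into two children.
W = rng.normal(scale=0.1, size=(n, 2 * n))
b = np.zeros(n)
W_prime = rng.normal(scale=0.1, size=(2 * n, n))
b_prime = np.zeros(2 * n)

def encode(c1, c2):
    """Parent representation p = f(W [c1; c2] + b), with f = tanh."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

def reconstruct(p):
    """Reconstruction [c1'; c2'] = f(W' p + b'), split back into halves."""
    r = np.tanh(W_prime @ p + b_prime)
    return r[:n], r[n:]

def rae_forward(xs):
    """Fold the sequence left to right, accumulating reconstruction error.

    The first child of each parent is the hidden representation built so
    far (y); the second is the next sequence element to be processed.
    """
    y = xs[0]
    total_error = 0.0
    for x in xs[1:]:
        p = encode(y, x)
        c1_rec, c2_rec = reconstruct(p)
        # Squared Euclidean distance between children and reconstructions.
        total_error += 0.5 * (np.sum((y - c1_rec) ** 2)
                              + np.sum((x - c2_rec) ** 2))
        y = p  # the parent becomes the left child of the next node
    return y, total_error

xs = [rng.normal(size=n) for _ in range(5)]
root, err = rae_forward(xs)
print(root.shape, err)
```

In training, `total_error` would be summed over all nodes and minimised by backpropagation through the tree, as described above.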
There are several adjustments we can make to optimise recursive autoencoders; for example, [3] normalises each hidden representation to unit length (preventing the network from shrinking representations to trivially reduce reconstruction error) and weights each child's reconstruction error by the number of sequence elements it spans.
The following page discusses how recursive autoencoders can be applied to the problem of sentiment analysis.
[1] Pollack JB. Recursive Distributed Representations. Available from: http://www.demo.cs.brandeis.edu/papers/raam.pdf.
[2] Tan S. Recursive Auto-encoders: An Introduction. Available from: https://blog.wtf.sg/2014/05/10/recursive-auto-encoders-an-introduction/.
[3] Socher R, Pennington J, Huang EH, Ng AY, Manning CD. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2011. Available from: http://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf.
[4] Li P, Liu Y, Sun M. Recursive Autoencoders for ITG-Based Translation. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA: Association for Computational Linguistics; 2013. p. 567–577. Available from: https://pdfs.semanticscholar.org/153a/a8b982f0d6db42502a307276fe6323304140.pdf.