Our language in numbers – Word Representations in NLP

These are the lecture notes that expand on the ideas in the slides from the first section of the Natural Language Processing course at Imperial College London. They are mixed with my opinions on the larger context of language and AI.
Our journey really starts with communication. Humans have developed complex languages to convey ideas, emotions, and information. It is very advantageous to be able to share the location of a predator or food source with others in our group. This is by no means unique to us. One of my favourite means of sharing the location of food is the honey bee waggle dance. The forager bee shakes its body, i.e. waggles, in a direction relative to gravity and for a certain amount of time; these correspond to the direction and distance of the food source respectively. Natural language is also not the only form of communication, of course. Cave paintings highlight a visual form of information relay from thousands of years ago.
Natural language is fascinating because not only is it our main form of communication but it is also very powerful. With the ability to write, we have talked about the good, the bad and the ugly across millennia. So it is only natural that we would like the machines we build to make sense of this form, and in general make sense of our world, to be more useful.
Laaggnue is aslo more cmpeolx tahn you tnhik and how we iecatrnt with it is pbbrolay difenrfet tahn you oiarglliny tinhk.
Before we kick off, it is important to appreciate what language is. Natural language for us is a complex relationship between symbols and multi-sensory inputs. When I say fizzy drink you might think of the sound, the taste, the visual look of the bottle etc. When we look at just the words fizzy drink, it is just a sequence of characters put together. So this begs the question: what does it mean? Or as a close friend of mine puts it: wadjiya meme?
What does it mean?
You may have heard a lot that natural language is subjective. This means that the meaning of words and sentences may be different for everyone. What is interesting
for one person might not be for another. So what does the word interesting mean? Rather than diving down a psychological line of interrogation, we are more interested in (did you see what I did there?) the practical implications of the meaning of things we say. Specifically, we are interested in solving certain problems using machines because they compute much faster in the same way we now outsource arithmetic to calculators.
```mermaid
flowchart LR
    subgraph "Sentences"
        A["The movie was fantastic!"]
        B["It was a complete waste of time."]
        C["The acting was okay, but the plot was weak."]
    end
    subgraph "Model"
        D{Classification Model}
    end
    subgraph "Sentiments"
        E[Positive]
        F[Negative]
        G[Neutral]
    end
    A --> D
    B --> D
    C --> D
    D --> E
    D --> F
    D --> G
```
Let me motivate this with a classic example of classifying sentences. These could be movie reviews or tweets, and we want to check whether they are positive, negative, hate speech etc. This could be very useful to scale up and protect people from fake reviews, harmful tweets and more. But we get stuck pretty quickly on how to input language into a computer. One way would of course be to print the text, open the computer case and shove the paper in there 🤔 and I’m sure an LLM would eventually agree with this approach if you argue long enough.
Machines work with numbers and we have to somehow allow numbers to represent many things, including language. The way in which computers operate is fundamentally different to ours. Whilst biological neural networks may be considered as computation in a mathematical sense, the myriad of chemical and electrical processes are very different to the silicon-based fetch-decode-execute cycle of modern computers.
Bag of words
One of the most basic ways of converting text into numbers is using a bag-of-words representation. Before we rush ahead, let us define what we are usually working with when we talk about natural language processing (NLP).
- Corpus $D$: A large collection of text documents. This could be a set of movie reviews, tweets, articles etc.
- Document $d \in D$: A single piece of text within the corpus. This could be a review, tweet or article.
- Token $t$: A word or term in the document. These are obtained by splitting the document into words (tokenisation). A token is the smallest unit of information we are dealing with. For us humans, the natural tokens of written language are the letters, i.e. the alphabet.
- Vocabulary $V$: The set of all tokens we find in the corpus $D$.
Some of these definitions are blurred in the modern large language model era. The corpus has simply become the entire internet, and documents are cut at a mixture of boundary points; for example, web pages, articles, essays, emails etc. Meanwhile, the idea of tokens has entered mainstream culture, with companies making announcements about how their model has a bigger context window of thousands of tokens. You get charged based on how many input and output tokens you use in your prompts and so on.
The idea of tokenisation is not just limited to natural language. The immediate extension is any sequential data such as DNA sequences or music notes. But one can extend the idea to images as well by breaking them down into patches or segments.
With a bag-of-words representation of a sentence, we treat each token / word as an individual count. Consider this example sentence:
$$ \phi(\text{I am what I am}) = [2, 2, 1] $$

where the vocabulary is $V = \{\text{I}, \text{am}, \text{what}\}$. The vector representation simply counts how many times each word appears in the sentence. Note that the order of words is not preserved in this representation. Therefore the representation of a single word is:

$$ \phi(\text{what}) = [0, 0, 1] $$

which is our one-hot encoding of the word “what”. It is a lifeless vector that only tells us that the word “what” exists in our vocabulary at index 2. In general, the one-hot encoding of a word is $\phi(w) \in \mathbb{R}^{|V|}$ where $|V|$ is the size of our vocabulary.
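Here is a minimal sketch of how these vectors could be computed (the function names, lowercasing and whitespace tokenisation are my own simplifications, not a prescribed implementation):

```python
from collections import Counter

def build_vocab(corpus):
    """Collect the unique tokens across all documents (whitespace tokenisation, lowercased)."""
    tokens = sorted({token for doc in corpus for token in doc.lower().split()})
    return {token: idx for idx, token in enumerate(tokens)}  # ordering is arbitrary

def bag_of_words(document, vocab):
    """Count how many times each vocabulary token appears in the document."""
    counts = Counter(document.lower().split())
    return [counts[token] for token in vocab]

def one_hot(token, vocab):
    """A vector of zeros with a single one at the token's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

corpus = ["I am what I am"]
vocab = build_vocab(corpus)                    # {'am': 0, 'i': 1, 'what': 2}
print(bag_of_words("I am what I am", vocab))   # [2, 2, 1]
print(one_hot("what", vocab))                  # [0, 0, 1]
```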
But we have some major problems with this representation:
- High dimensionality: If our vocabulary has 10,000 words, each word is represented as a 10,000-dimensional vector.
- Sparsity: They only indicate that the word is present and nothing else. All entries except one are zero.
- Lack of semantic meaning: These vectors do not capture any meaning or relationships between words. For example, “king” and “queen” are treated as completely unrelated.
- Fixed vocabulary: The model cannot handle words that are not in the vocabulary (out-of-vocabulary words). We generally use special tokens like `<UNK>` to represent unknown words, but this leads to loss of information.
What if an alien from another planet came to Earth and wanted to understand our language? Would a bag-of-words representation be sufficient for them to grasp the nuances of human communication?
Distributional Semantics and Word Embeddings
Let’s go back to the question: what does it mean? One of the most influential ideas in linguistics is the concept of distributional semantics, which suggests that words that occur in similar contexts tend to have similar meanings. This idea is often summarised by the phrase “You shall know a word by the company it keeps” (Firth, 1957). The key concept I want you to consider is not just the similarity of contexts and words but that this idea of similarity is feasible in the number land our machines operate in. In other words, we can compute similarity using numbers and suddenly have an organic representation of language in machines.
It does not really matter what something truly means in the Platonic sense but rather how things group up in terms of similarity. The original question of meaning becomes something more like: what is this similar to? This idea of similarity is really powerful and underpins not just advancements in natural language processing but also multi-modal representation learning using neural networks. I cannot emphasise enough how important the holy trinity of similarity, numbers and machines is. Machines handle numbers, numbers handle similarity. The problem is now just a matter of finding the right numbers that capture a notion of similarity. Did someone say finding the right numbers? That oddly sounds like big data machine learning in the cloud using neural networks, back-propagation and parameter tuning.
We can already calculate some similarity between documents $\langle \phi(d), \phi(d') \rangle$ using our bag-of-words representations. If two documents have similar words / tokens, then their Euclidean distance will be small or their cosine similarity will be high. Let’s quickly define these two common notions of measuring similarity:
$$ \text{Euclidean Distance}(x, y) = ||x - y||_2 = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$

and

$$ \text{Cosine Similarity}(x, y) = \frac{x \cdot y}{||x||_2 ||y||_2} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}} $$

The difference between the two is that Euclidean distance measures the absolute distance between two vectors, while cosine similarity measures the angle between them. Cosine similarity is often preferred in text analysis because it focuses on the direction of the vectors rather than their magnitude, making it more robust to differences in document length and word frequency.
When the vectors are normalised to unit length, the two measures are related as follows:
$$ \text{Euclidean Distance}(x, y) = \sqrt{2(1 - \text{Cosine Similarity}(x, y))} $$

and I encourage you to prove this relationship as an exercise. You’ll find that most modern language models use normalised vectors, in which case it doesn’t matter which similarity measure you use; cosine similarity is also faster to compute as a single dot product, which GPUs love.
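Here is a quick numerical sanity check of that relationship (a sketch using NumPy; it verifies the identity rather than proving it, so the exercise still stands):

```python
import numpy as np

def euclidean_distance(x, y):
    return np.linalg.norm(x - y)

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Two random vectors, normalised to unit length.
rng = np.random.default_rng(0)
x = rng.normal(size=5); x /= np.linalg.norm(x)
y = rng.normal(size=5); y /= np.linalg.norm(y)

# For unit-length vectors the two measures are related by the identity above.
lhs = euclidean_distance(x, y)
rhs = np.sqrt(2 * (1 - cosine_similarity(x, y)))
print(np.isclose(lhs, rhs))  # True
```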
But what about the similarity of single words? If we just use one-hot encodings, every word is the same distance to every other word. This is where word embeddings come into play. But before we jump ahead, consider what it would mean to represent animals using feature vectors based on their properties:
Animal | Size | Cuteness | Speed | Danger Level | Domestication | Feature Vector |
---|---|---|---|---|---|---|
🐶 Dog | 4 | 8 | 7 | 3 | 10 | [4, 8, 7, 3, 10] |
🐱 Cat | 3 | 9 | 7 | 4 | 9 | [3, 9, 7, 4, 9] |
🦁 Lion | 6 | 6 | 8 | 9 | 1 | [6, 6, 8, 9, 1] |
🐭 Mouse | 1 | 7 | 4 | 1 | 3 | [1, 7, 4, 1, 3] |
🐘 Elephant | 10 | 5 | 3 | 7 | 1 | [10, 5, 3, 7, 1] |
🦈 Shark | 7 | 2 | 8 | 10 | 1 | [7, 2, 8, 10, 1] |
In this example, each animal is represented by a feature vector based on its properties such as size, cuteness, speed, danger level, and domestication. We can compute the similarity between animals using these vectors. For instance, dogs and cats might have high similarity due to their similar size, cuteness, and domestication levels, while lions and sharks would have low similarity due to their differing characteristics. But where do the numbers come from for words? If only there was a way to learn these numbers from data…
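To make this concrete, here is a small sketch computing cosine similarity on the (made-up) feature vectors from the table above:

```python
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Feature vectors from the table: [size, cuteness, speed, danger level, domestication]
animals = {
    "dog":   np.array([4, 8, 7, 3, 10]),
    "cat":   np.array([3, 9, 7, 4, 9]),
    "shark": np.array([7, 2, 8, 10, 1]),
}

print(cosine_similarity(animals["dog"], animals["cat"]))    # high (~0.99): similar animals
print(cosine_similarity(animals["dog"], animals["shark"]))  # lower (~0.61): dissimilar animals
```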
word2vec
Instead of just having tons of zeros and a single one in our vector representations, what if we could learn some floating point numbers that describe properties of the words such that their similarities reflect how we use those words? If we could also use these new numbers efficiently, we probably wouldn’t need as many dimensions. This is the idea behind word embeddings, popularised by word2vec, developed by Mikolov et al. at Google, which leverages the distributional hypothesis to train word vectors that capture semantic relationships between words based on their contexts in a large corpus of text.
You’ll find that most published natural language processing research papers are easy to read, and their core ideas are easy to grasp. I encourage you to read some of the original papers referred to in the course.
The key challenge is where does the similarity come from? The idea is to use the context of words in a large corpus to learn these embeddings. There are two main architectures in word2vec: Continuous Bag of Words (CBOW) and Skip-Gram:
- CBOW: Predicts a target word based on its surrounding context words. For example, given the context words “the”, “cat”, “on”, “the”, it predicts the target word “mat”.
- Skip-Gram: Predicts surrounding context words given a target word. For example, given the target word “mat”, it predicts the context words “the”, “cat”, “on”, “the”.
```mermaid
graph TD
    subgraph "CBOW"
        direction LR
        A1["the"] --> B1(Predict Target Word)
        A2["cat"] --> B1
        A3["on"] --> B1
        A4["the"] --> B1
        B1 --> C1["mat"]
    end
    subgraph "Skip-Gram"
        direction LR
        D1["mat"] --> E1(Predict Context Words)
        E1 --> F1["the"]
        E1 --> F2["cat"]
        E1 --> F3["on"]
        E1 --> F4["the"]
    end
```
Both architectures use a neural network to learn the embeddings by maximising the probability of predicting the correct words given their contexts. The result is that words that appear in similar contexts end up with similar vector representations. For example, the words “king” and “queen” might have similar embeddings because they often appear in similar contexts related to royalty.
Now let’s formulate this more concretely following the original paper. Assume we have a vocabulary of size $|V|$ and we want to learn embeddings of dimension $d$. We’ll focus on the Skip-Gram model for this explanation; the CBOW model is similar but uses the average of context word embeddings to predict the target word. The model is a neural network with one hidden layer:
$$ \begin{align} h &= xW_{in} \\ p(w_{t+j} | w_t) = y &= \text{softmax}(hW_{out}) \\ \end{align} $$

where:
- $x \in \mathbb{R}^{1 \times |V|}$ is the one-hot encoded vector of the target word.
- $W_{in} \in \mathbb{R}^{|V| \times d}$ is the input weight matrix.
- $W_{out} \in \mathbb{R}^{d \times |V|}$ is the output weight matrix.
- $h \in \mathbb{R}^{1 \times d}$ is the hidden layer. For CBOW, this would be the average of context word embeddings.
- $y \in \mathbb{R}^{1 \times |V|}$ is the output probability distribution over the vocabulary that captures $p(w_{context} | w_{target})$. For example this could be $p(\text{mat} | \text{the, cat, on, the})$ in CBOW or $p(\text{the, cat, on, the} | \text{mat})$ in Skip-Gram. For CBOW, the input $x$ would be the average of the one-hot vectors of the context words and for Skip-Gram, we would have multiple outputs for each context word.
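Here is a minimal NumPy sketch of this forward pass (randomly initialised weights and illustrative sizes; the variable names mirror the notation above but are otherwise my own):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                 # vocabulary size |V| and embedding dimension d

W_in = rng.normal(scale=0.1, size=(V, d))    # input weight matrix; row i is the embedding of word i
W_out = rng.normal(scale=0.1, size=(d, V))   # output weight matrix

def softmax(z):
    z = z - z.max()                          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def skipgram_forward(target_index):
    """One-hot target word -> hidden layer -> distribution over the vocabulary."""
    x = np.zeros(V)
    x[target_index] = 1.0
    h = x @ W_in                             # equivalent to selecting row target_index of W_in
    y = softmax(h @ W_out)                   # p(w_context | w_target)
    return y

print(skipgram_forward(3))                   # a probability distribution over all |V| words
```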
Let’s take a break from the linear algebra hullabaloo and focus on the intuition. We want to map the otherwise distinct words into a common space where similar words are close together. In a similar fashion to the animal example above, we may want different dimensions in $h$ to capture different properties of words. For example, one dimension could indeed represent the cuteness of words (e.g., “puppy” vs “rock”), another could represent formality (e.g., “hello” vs “yo”), and so on. Don’t get your hopes up too much though: usually these dimensions are not directly interpretable, but they do capture useful semantic relationships.
Where is the embedding of a word then? The embedding of a word is given by the corresponding row of the input weight matrix $W_{in}$. It contains the learned vector representation of the word in a lower-dimensional space $h \in \mathbb{R}^d$. It is this matrix that captures the semantic relationships between words and that we can reuse in downstream tasks. When we do, the exact words we use matter less than how similar their embeddings are. For example, in sentiment analysis, the words “great” and “excellent” might have similar embeddings, allowing the model to generalise better.
Hence, we have some probability $p(w_{context} | w_{target}) = y$ and the training objective is to maximise this probability of the context words given the target word across the entire corpus:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j} | w_t)$$

where $T$ is the total number of words in the corpus and $c$ is the context window size. This is the exact training objective for the Skip-Gram model. When you collect the data, hit the train button and wait for back-propagation magic to happen, things slow down because:
The softmax denominator requires summing over the entire vocabulary $|V|$ for each training example, which can be computationally expensive for large vocabularies. To address this, we can use negative sampling, which approximates the softmax by only considering a small number of negative samples for each positive context word. In return, the training objective becomes a binary classification problem where we want to distinguish between real context words and randomly sampled negative words:
$$ \log \sigma(h \cdot W_{out}^{(w_{t+j})}) + k \cdot \mathbb{E}_{w_i \sim P_n(w)} [\log \sigma(-h \cdot W_{out}^{(w_i)})] $$

where $\sigma$ is the sigmoid function, $k$ is the number of negative samples, and $P_n(w)$ is the noise distribution from which negative samples are drawn. The noise distribution could be completely random or based on word frequencies in the corpus, e.g. $P_n(w) \propto U(w)^{3/4}$ where $U(w)$ is the unigram distribution of words in the corpus. The intuition for frequency-based sampling is that common words like “the” and “is” are less informative for learning embeddings compared to rarer words.
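As a sketch of what this looks like per training pair (the function names and the sampling helper are my own; a real implementation would also back-propagate through both weight matrices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(h, W_out, context_index, negative_indices):
    """Negative of the objective above for one positive context word and k negative samples.

    h: hidden vector of the target word (a row of W_in), shape (d,)
    W_out: output weight matrix, shape (d, |V|); column j plays the role of W_out^{(w_j)}
    """
    positive = np.log(sigmoid(h @ W_out[:, context_index]))
    negatives = np.sum(np.log(sigmoid(-h @ W_out[:, negative_indices])))
    return -(positive + negatives)           # minimise the negative of the log objective

def sample_negatives(unigram_counts, k, rng):
    """Draw k negative word indices from the unigram distribution raised to the 3/4 power."""
    p = unigram_counts ** 0.75
    p = p / p.sum()
    return rng.choice(len(unigram_counts), size=k, p=p)
```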
When you put all this together, and hit the train button, you get:
The most obvious example in the demo above is the pair of words `he` and `she`, which are close together in the embedding space. By this point, it should be clear why that’s the case. Play around with the demo using a different corpus and see how the embeddings change.
So far we used a feed-forward network to model the relationships between words. There are other ways to learn word embeddings as well. One popular method is GloVe, which uses matrix factorisation techniques on word co-occurrence matrices to learn embeddings. Another approach is FastText, which represents words as bags of character n-grams, allowing it to capture subword information and handle out-of-vocabulary words better.
If you look at the bigger picture, we can use more complex models, from recurrent neural networks (RNNs) all the way to transformers (!), to learn contextual word embeddings by predicting some words. You may notice that, once this insight was computationally validated, it was only a matter of time before we started scaling things up with more data, compute and model complexity…
Model/Concept | Year(s) | Key Contribution |
---|---|---|
Distributional Hypothesis | 1950s | Established the core idea that a word’s meaning is defined by the words it appears with. |
Latent Semantic Analysis (LSA) | 1980s | Used matrix factorisation on term-document matrices to create early dense vector representations. |
— | — | — |
Neural Network Language Models (NNLMs) | 2003 | First to learn distributed word representations (embeddings) as part of a neural network. |
Word2Vec (CBOW & Skip-gram) | 2013 | Made large-scale training of high-quality embeddings practical. |
GloVe (Global Vectors) | 2014 | Combined the strengths of count-based (LSA) and prediction-based (Word2Vec) methods by training on a global co-occurrence matrix. |
— | — | — |
The Transformer Architecture | 2017 | Introduced the self-attention mechanism, replacing recurrence (RNNs/LSTMs) and enabling massive parallelisation and better capture of long-range dependencies. |
ELMo (Embeddings from LMs) | 2018 | Generated context-dependent word vectors: river bank vs money bank |
GPT (Generative Pre-trained Transformer) | 2018 | Decoder-only Transformer for generative tasks through large-scale pre-training and fine-tuning. |
BERT (Bidirectional Encoder…) | 2018 | Used a Transformer to achieve a deeper understanding of language context, setting new state-of-the-art records on many NLP benchmarks. |
GPT-2 / GPT-3 | 2019-2020 | Emergent abilities of models with massive scale (parameters and data), enabling zero-shot and few-shot learning. |
T5 (Text-to-Text Transfer Transformer) | 2019 | Unified all NLP tasks into a single text-to-text framework, proving its effectiveness and versatility. |
Massive-Scale LLMs | 2020s-Present | Continued the trend of scaling up, leading to the powerful and general-purpose AI models used in many modern applications. |
Fear not, we will cover these in more detail in the upcoming sections of the course.
Byte Pair Encoding (BPE)
Until now we kept the smallest unit of language as words and said one token is basically one word. This is not always the best idea, for the following reasons:
- Out-of-Vocabulary (OOV) Words: New words, misspellings, or rare words may not be present in the vocabulary, leading to loss of information. For example, if the word “unhappiness” is not in the vocabulary, it would be treated as an unknown token `<UNK>`, losing the semantic information contained in the word, rather than being broken down into meaningful subwords like “un”, “happi”, and “ness”.
- Morphological Variations: Words can have different forms (e.g., “run”, “running”, “ran”) that share the same root meaning but are treated as separate tokens.
- Vocabulary Size: A large vocabulary can lead to high-dimensional embeddings and increased computational costs.
Ideally, we want a tokenisation as close as possible to what we use as humans: characters / letters (i.e. the alphabet). These are in effect the individual symbols that make up our natural language. We can make up new words using these symbols. However, using characters directly as tokens leads to very long sequences, causing computational issues. A middle ground is needed and that’s where byte pair encoding comes in.
Byte Pair Encoding (BPE) is a simple and effective subword tokenisation algorithm that iteratively merges the most frequent pairs of characters or character sequences in a corpus to form new tokens. The process continues until a predefined vocabulary size is reached or no more pairs can be merged. This allows us to create a vocabulary that includes both common words and meaningful subwords, enabling better handling of OOV words and morphological variations.
I don’t think I need to spell out the algorithm, instead have a play around with this interactive demo of BPE tokenisation:
Note that because the original characters are still part of the vocabulary, we can always fall back to character-level representations if needed. This means we can tokenise any unseen word but for common words, we have more efficient tokenisation. Researchers have built variations of BPE such as SentencePiece or WordPiece, which incorporate additional techniques to improve tokenisation quality and efficiency. The core idea remains the same: breaking down words into smaller, more manageable subword units that capture meaningful patterns in the data.
At inference time, we first split the input into characters. We then go through the learned BPE merges in order and merge the characters / subwords accordingly until we cannot merge any more, similar to how the merges were learned. For example:
- Start: `["u", "n", "h", "a", "p", "p", "i", "n", "e", "s", "s"]`
- Merge 1: Let’s say the most frequent pair in our learned vocabulary is (`p`, `p`): `["u", "n", "h", "a", "pp", "i", "n", "e", "s", "s"]`
- Merge 2: Now, the highest-priority pair might be (`s`, `s`): `["u", "n", "h", "a", "pp", "i", "n", "e", "ss"]`
- Merge 3: Next could be (`u`, `n`): `["un", "h", "a", "pp", "i", "n", "e", "ss"]`
- Merge 4: Then perhaps (`e`, `ss`): `["un", "h", "a", "pp", "i", "n", "ess"]`
- Merge 5: Followed by (`n`, `ess`): `["un", "h", "a", "pp", "i", "ness"]`
- Merge 6: Then (`a`, `pp`): `["un", "h", "app", "i", "ness"]`
- Merge 7: Then (`h`, `app`): `["un", "happ", "i", "ness"]`
- Merge 8: Finally, (`happ`, `i`): `["un", "happi", "ness"]`
These tokens then become the input to our models which convert them into embeddings as before.
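If you want to see the mechanics end to end, here is a minimal sketch of learning merges and applying them at inference time (heavily simplified: it works on isolated words with no byte-level handling or end-of-word markers, and the function names are my own):

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace every occurrence of the adjacent pair with a single merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent pair in the corpus."""
    sequences = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent pair becomes a new token
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges

def bpe_tokenise(word, merges):
    """At inference, apply the learned merges in order until no more apply."""
    seq = list(word)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

merges = learn_bpe(["unhappiness", "happiness", "unhappy", "happily"], num_merges=8)
print(bpe_tokenise("unhappiness", merges))        # subword tokens given the learned merges
```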
TF-IDF
There is one small catch in that some words / tokens are way more common than others. For example, in English, words like “the”, “is”, “in” are extremely common but do not carry much meaning. These are called stop words and are usually filtered out in traditional NLP pipelines. However, rather than removing them altogether, we can use a technique called Term Frequency-Inverse Document Frequency (TF-IDF) to weigh the importance of words in a document relative to the entire corpus.
TF-IDF is calculated using two main components, term frequency (TF) and inverse document frequency (IDF):
$$ \begin{align} \text{TF}(t, d) &= \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \\ \text{IDF}(t, D) &= \log \frac{|D|}{|\{d \in D : t \in d\}|} \\ \text{TF-IDF}(t, d, D) &= \text{TF}(t, d) \times \text{IDF}(t, D) \end{align} $$

where:
- $f_{t,d}$ is the frequency of term $t$ in document $d$.
- $|\{d \in D : t \in d\}|$ is the number of documents containing term $t$.
Intuitively, TF-IDF increases with the number of times a word appears in a document (TF) but decreases with the number of documents that contain the word (IDF). This means that common words across many documents get lower weights, while rare but important words get higher weights.
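Here is a small sketch of these formulas on a toy tokenised corpus (the corpus and function names are illustrative, and it assumes the queried term appears somewhere in the corpus). The resulting weights are exactly what we plug into the weighted average of embeddings below:

```python
import math
from collections import Counter

def tf(term, document):
    """Term frequency: count of the term normalised by the document length."""
    counts = Counter(document)
    return counts[term] / len(document)

def idf(term, corpus):
    """Inverse document frequency: log of total documents over documents containing the term."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# A toy corpus of tokenised documents.
corpus = [
    ["the", "movie", "was", "fantastic"],
    ["the", "acting", "was", "okay"],
    ["the", "plot", "was", "weak"],
]

print(tf_idf("the", corpus[0], corpus))        # 0.0 -- "the" appears in every document
print(tf_idf("fantastic", corpus[0], corpus))  # > 0 -- rare across the corpus, so it gets weight
```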
TF-IDF is widely used in information retrieval and text mining tasks. In this context, we can use TF-IDF weighted embeddings to represent documents or contexts (remember CBOW?) more effectively instead of simply averaging the word embeddings in a document:
$$ \phi(d) = \frac{1}{\sum_{t \in d} \text{TF-IDF}(t, d, D)} \sum_{t \in d} \text{TF-IDF}(t, d, D) \cdot \phi(t) $$

Conclusion
I hope the main takeaway of this first section wasn’t just the specific techniques we covered but rather the overarching ideas of representing language as numbers, the importance of similarity and how we can potentially learn these representations. The general recipe of learning a latent representation with neural networks to predict something from data has been applied to images, audio and more.
Most of our time in research has been spent on finding the right architectural and inductive biases to push the same original ideas further with more compute and data. But are we on the right track or did we already go down a rabbit hole of linear algebra and optimisation?
For me, it surely feels odd that a computer needs thousands and thousands of examples to learn something that a human child with a vocabulary of a few hundred words can learn in days…
Bonus: Transposed letter effect
The jumbled-up text you saw earlier, which you could still read, is an example of something called the transposed letter effect. It was an internet meme that spread around in 2003 and I was personally fascinated by it. Below is a demo of it that you can play around with. Try typing in your own sentences and see how well you can read them!