It has been almost twleve (typo for authenticity) years since the publication of word2vec, and the idea of converting language into a statistical soup of numbers has become the standard in natural language processing. However, the underlying biases remain, despite many efforts to mitigate them and raise awareness. Most of the projects I’ve been involved with are aware of these limitations, but the temptation of having an AI system is so strong that most decision makers hope the biases are a fluke rather than a consistent problem. Let me start by stating the current state of the art:
Gemini Embedding, released on March 7, 2025, still considers “man” to be more similar to “doctor” than to “nurse”, and vice versa for “woman”.
The above applies to gemini-embedding-exp-03-07, which attains an impressive score on the MTEB benchmark, which evaluates how well the soup of numbers produced by these models captures semantics in contexts such as sentence similarity. The following code snippet shows how to use the Gemini API to get the embeddings for these words:
from google import genai
from google.genai import types

client = genai.Client()  # assumes the GEMINI_API_KEY environment variable is set

words = ["man", "woman", "doctor", "nurse"]
result = client.models.embed_content(
    model="gemini-embedding-exp-03-07",
    contents=words,
    config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY"),
)
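To check the claim above directly, one can compare the returned vectors with cosine similarity. The sketch below is illustrative and assumes the response exposes each vector as result.embeddings[i].values, as the google-genai SDK did at the time of writing:

import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Map each word to its embedding vector, in the same order as the request.
vectors = {word: emb.values for word, emb in zip(words, result.embeddings)}

for person in ("man", "woman"):
    for profession in ("doctor", "nurse"):
        print(f"{person} ~ {profession}: {cosine(vectors[person], vectors[profession]):.3f}")

If the bias described above is present, “man” scores higher with “doctor” and “woman” scores higher with “nurse”.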
This isn’t to say that researchers, and indeed Google, rather infamously with its “woke” image generator last year, aren’t trying to address these concerns. However, I can’t stop wondering whether there is a more fundamental issue hidden underneath the cacophony of hype.
Why can’t we seem to create a sensible system in this awe-inspiring decade of AI?
Addicted to data
The choice of poison here is data. The core reason we’re so hooked on it is that the algorithms are designed to work that way: pick a cauldron, throw in some compute power and some data, and churn and churn until something comes out the other end. The internet-scale data we produce and feed in is mostly biased, and all our AI algorithms are addicted to it. Even methods for fine-tuning biases out require data validated by humans. Telling LLMs what they should and should not talk about requires data.
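To be concrete about what “fine-tuning biases out” consumes, preference-style datasets used for alignment are just more records written or ranked by humans. The schema below is hypothetical, purely to illustrate the shape of such data:

# Hypothetical preference records of the kind used to steer a model away from bias.
# The field names are illustrative, not from any particular dataset or library.
preference_data = [
    {
        "prompt": "Describe a typical nurse.",
        "chosen": "Nurses are trained healthcare professionals of any gender.",
        "rejected": "She is a caring woman who assists the doctor.",
    },
    # ... thousands more, each one produced or validated by a human annotator
]

The mitigation is still data; it just arrives with a human signature attached.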
The whole endeavour, the enterprise of modern-day AI, is an unregulated addiction to the opium called data.
We can’t fix the biases in our models because we are not creating something new that can break out of the data it is fed. The AI model is literally an optimisation over the data it is fed, and I would argue this is true at a mathematical level. It is a common frustration in the natural intelligence research community that it makes no sense for a reinforcement learning agent to need millions of hours of gameplay to get good at a game. Unless, of course, you realise that millions of hours of gameplay is just more data.
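To spell out the mathematical claim, most supervised and self-supervised training is some instance of the textbook empirical risk minimisation objective (a standard formulation, not anything specific to a particular model):

\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big)

Everything on the right-hand side is either the data (x_i, y_i) or a function of it, so the fitted parameters \hat{\theta} cannot encode anything the dataset does not already contain; at best they reweight it.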
Statistical estimation
The colloquial phrase “garbage in, garbage out” comes from what we do with this data. Most of AI is just machine learning; otherwise we wouldn’t have needed the term Artificial General Intelligence (AGI), but marketing and hype eventually needed a toothbrush that is AI-enabled. Could you imagine if we had a toothbrush that could think for itself? It would actually be brilliantly boring, brushing about. Nonetheless, the core principle of machine learning is to search for a model that explains the data we have. For example, this could be a model of how words follow each other, or a model that better aligns with human preferences. The mathematical framework that kick-starts this process is statistical estimation. The fundamental principles of estimating the average height of a population are no different from machine-learning the distribution of words in a language model (a small sketch after the list below makes this concrete). But there are some pitfalls that keep us looping back to creating systems with inherent biases:
- We’re biased, and so is our data: the source of all of this training data is us. Maybe we’re just stating the obvious here, but let it be stated.
- The measure of success is test data: which is still data, just thought of differently. Just because a model generalises well to some unseen test data doesn’t mean it is doing any better at overcoming biases; the test data comes from the same source.
- The path to betterment is more data: we repeat the same overall process with slightly different models (transformers versus recurrent networks, etc.) and different datasets (filtering, etc.) and hope for a different result. We do get a different result, one that requires intensive post-surgery care before the ultimate AI chat customer assistant decides to sell the company for 1 dollar.
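To make the estimation analogy concrete, here is a toy sketch (the sample and the miniature corpus are made up for illustration) showing that estimating an average height and fitting a unigram language model are the same move: summarise the data you were handed, biases and all.

import numpy as np
from collections import Counter

# Estimating a population's average height: the estimate is just the sample, averaged.
heights = np.array([1.71, 1.65, 1.80, 1.58, 1.92])  # hypothetical sample
mean_height = heights.mean()

# Fitting a unigram "language model": the probabilities are just the corpus, counted.
corpus = "the doctor saw the nurse and the doctor smiled".split()
counts = Counter(corpus)
unigram_model = {word: count / len(corpus) for word, count in counts.items()}

print(mean_height)    # whatever bias is in the sample is in the estimate
print(unigram_model)  # whatever bias is in the corpus is in the model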
Intelligence beyond data
I would like to think that some of our intelligence is independent of the data we’re exposed to. For example, someone who has never seen snow may not be able to imagine it, but that doesn’t make them any less intelligent. Furthermore, a single exposure to snow is often sufficient to understand it. How many examples and articles do modern AI systems need to process about snow to get that snow is actually blue but appears white? Well, that’s because it isn’t blue, and we can reason about what we learn rather than be at the mercy of it.