Vector databases are databases that store and manage data in the form of vectors. In other words, vectors are used to index the data, which can later be searched for relevant items. Vectors are mathematical objects that represent data points in a multidimensional space. Each dimension of a vector represents a different feature or attribute of the data point. For example, a vector representing a customer might have two dimensions, age and height: [24, 180].

I have recently been involved with SemaDB, a new vector database designed for ease-of-use and cost-efficiency. Some of the comments below are based on my experience with the applications we were working on, including their advantages and drawbacks. There are also links to more vector databases at the end of the post.

With the rise of neural networks, these vector representations are learnt instead of engineered. They often consist of hundreds of floating point numbers whose exact meaning we don’t really know:

[0.234, 0.456, 0.123, 0.789, 0.012, 0.987, 0.321, 0.654, 0.890, 0.543]

could be the vector for the word cat according to some neural network model. These days neural networks can be used to learn vector representations of words, sentences, documents, images, videos and more. The advantage they give is that we now have a common representation for these data points despite not knowing what the numbers correspond to. We can use these vectors to find similar items, i.e. similar vectors, cluster items and search across them.
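
To make "finding similar vectors" a bit more concrete, here is a minimal sketch using cosine similarity, one common way of comparing embeddings. The three-dimensional vectors and the names `embeddings` and `most_similar` are made up for illustration; real embeddings come from a model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, for illustration only.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}

def most_similar(query, items):
    """Return the key whose vector is most similar to the query vector."""
    return max(items, key=lambda name: cosine_similarity(query, items[name]))

print(most_similar([1.0, 0.0, 0.0], embeddings))
```

This compares the query against every stored vector, which is fine for a handful of items but is exactly the brute-force scan a vector database tries to avoid at scale.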

Vector databases are designed to store and manage these vectors efficiently. Unlike relational databases, they use the vectors themselves to organise the data into specialised data structures (most commonly graphs).
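
As a rough illustration of how a graph can be used for search, here is a sketch of a greedy walk: start at some node and keep moving to whichever neighbour is closest to the query until no neighbour is closer. The tiny hand-built graph and the names `vectors`, `neighbours` and `greedy_search` are assumptions for the example; this is not how SemaDB or any particular database implements its index, and real graph indices add many refinements on top.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(vectors, neighbours, query, start):
    """Walk the graph, always moving to the neighbour closest to the query.

    Stops when no neighbour improves on the current node, so the result
    may be a local optimum rather than the true nearest vector.
    """
    current = start
    current_dist = euclidean(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for node in neighbours[current]:
            d = euclidean(vectors[node], query)
            if d < best_dist:
                best, best_dist = node, d
        if best == current:
            return current, current_dist
        current, current_dist = best, best_dist

# Hypothetical data: five points on a line, each linked to its neighbours.
vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [2.0, 0.0], 3: [3.0, 0.0], 4: [4.0, 0.0]}
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

node, dist = greedy_search(vectors, neighbours, [3.2, 0.0], start=0)
```

The search only visits nodes along the path it takes, which is why it can be fast, and also why it can miss the best match if the graph leads it somewhere else.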

What are the benefits of using vector databases?

There are several benefits to using vector databases:

  • Similarity search: Vector databases can efficiently perform similarity searches, which is the process of finding data points that are similar to a given query vector. This is a useful feature for applications such as product recommendations, image search, and fraud detection.

  • Clustering: Vector databases can efficiently perform clustering, which is the process of grouping data points together based on their similarity. This is a useful feature for applications such as customer segmentation and market research.

  • Unstructured data: Since we can learn vector representations of complex unstructured inputs such as images, we can also let users search, cluster and store data that is not necessarily tabular.
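
To give a feel for what clustering vectors involves, here is a small sketch of plain k-means on the age and height example from earlier. The data points, the starting centroids and the function names are made up for illustration, and a vector database may well use a different clustering method internally.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def nearest_centroid(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    labels = [nearest_centroid(p, centroids) for p in points]
    return labels, centroids

# Hypothetical customers as [age, height] vectors.
customers = [[24, 180], [26, 178], [25, 182], [61, 165], [63, 168], [59, 166]]
labels, centroids = kmeans(customers, initial_centroids=[customers[0], customers[3]])
```

Here the two groups are obvious by eye; the point of doing this on learnt embeddings is that the same procedure works when the dimensions are not interpretable.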

Most of these advantages come directly from using vectors as our representational medium, as opposed to, for example, plain text. The use of vectors leads to novel applications such as:

  • Retrieval Augmented Generation (RAG): We can store and manage embeddings of words, sentences, documents, images, videos etc. and use them directly in many downstream applications such as chatbots to generate answers that are hopefully grounded in some context. Think of apps that answer questions from a PDF file.

  • Product recommender: Recommending similar products can be augmented to take into account features learnt from a neural network. We can query a vector database to give us similar products based on what’s in the image, description etc. of a product.

  • Fraud detection: Finding similar instances of events, using representations learnt by neural networks, can be used to detect fraudulent activity or for anomaly detection. For example, receiving an event that is dissimilar to anything that is stored in the vector database might indicate something suspicious.

  • Multimodal search: A multimodal neural network such as CLIP can learn vector embeddings of both text and images in the same vector space. Hence, text queries can be used to search for images or vice versa. This is often beneficial in e-commerce to let users search items based on their own text queries as opposed to what is portrayed in the product image.
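
The fraud detection idea above can be sketched as a simple distance check: if a new event is far from everything already stored, flag it. The stored vectors, the threshold value and the name `is_anomalous` are assumptions for the example; in practice the threshold has to be tuned on real data, and the nearest-neighbour lookup would go through the vector database rather than a Python loop.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_anomalous(event, stored, threshold):
    """Flag an event whose nearest stored vector is further away than the threshold."""
    nearest = min(euclidean(event, v) for v in stored)
    return nearest > threshold

# Hypothetical embeddings of previously seen, legitimate events.
known_events = [[0.10, 0.20], [0.15, 0.25], [0.20, 0.10]]

print(is_anomalous([0.12, 0.22], known_events, threshold=0.5))  # close to known events
print(is_anomalous([3.00, 3.00], known_events, threshold=0.5))  # far from everything
```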

What are the common problems of vector databases?

I’ll talk about two main problems with vector databases. The first is that vector databases inherit the problems of neural networks. Since we use vectors from neural networks to index our data, we are relying on those vectors being good enough for what we store and search. This depends on the underlying neural network, and classic problems such as distributional shift and discrepancies between training and test data crop up in your database application. For example, a product recommender may suggest blatantly unrelated items because that’s what the neural network has learnt. This has nothing to do with the vector database itself, and more to do with the strong coupling of neural networks into the application. More often than not, you end up with something that works well enough. It is important to note, though, that more data and training lead to better representations, and things have improved over the years. Whether the increased computational cost of using neural networks is worth the improvements should still be an important business decision.

The second issue is that vector databases are approximate. Especially when you increase the number of items (those billion-scale claims), things start to fall apart while still looking okay. We have to throw more tricks into the mix, such as quantising the vectors so they take up less space, or using multi-level indexing that mixes and matches different algorithms. In effect, you get approximately similar items rather than the best item, which again is good enough for most applications. For example, if we are recommending products to users, we might sometimes miss certain items depending on how the vectors are indexed and traversed during search. These gaps are closed with more logic on the application side, such as ensuring the most popular products always appear together, rather than relying on the neural network and the vector database to yield the desired search result.
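
To show what quantising a vector can look like, here is a minimal sketch of scalar quantisation: each float is mapped to one of 256 integer codes, so it fits in a single byte instead of four or eight. The value range of [-1, 1] and the function names are assumptions for the example; actual databases use a variety of schemes, often more elaborate than this.

```python
def quantise(vector, lo=-1.0, hi=1.0, levels=256):
    """Map each float (clipped to [lo, hi]) to an integer code in [0, levels - 1]."""
    scale = (levels - 1) / (hi - lo)
    codes = []
    for x in vector:
        clipped = min(max(x, lo), hi)
        codes.append(round((clipped - lo) * scale))
    return codes

def dequantise(codes, lo=-1.0, hi=1.0, levels=256):
    """Map integer codes back to approximate float values."""
    step = (hi - lo) / (levels - 1)
    return [lo + c * step for c in codes]

original = [0.234, 0.456, 0.123, 0.789, 0.012, 0.987, 0.321, 0.654, 0.890, 0.543]
codes = quantise(original)
restored = dequantise(codes)

# The restored values are close to the originals, but not identical.
max_error = max(abs(a - b) for a, b in zip(original, restored))
```

The small rounding error per dimension is the price paid for the smaller footprint, and it is one of the reasons search results end up approximately rather than exactly right.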

As promised, here are some of the popular vector databases available. I’m obviously biased towards SemaDB as I worked on it and I think it offers a cleaner solution compared to the alternatives below:

  • SemaDB: is a fully hosted vector database solution with an easy-to-use RESTful API. It is designed to be developer-friendly and low-cost.
  • Pinecone: Pinecone is a cloud-hosted vector database that is easy to use and scales well. It is geared more towards enterprise and you have to rent resources which they then deploy on AWS or Google Cloud. A bit cumbersome in my experience.
  • Milvus: Milvus is an open-source vector database that is highly scalable and performant. Their tech-stack has too many components for individuals, small teams and small businesses to manage reliably. The hosted version can get quite expensive quickly.
  • Weaviate: is also an open-source vector database but requires schema definitions. Similar to Pinecone, their hosted version is about renting resources on AWS and Google Cloud, which can scale into oblivion.
  • FAISS: is a library for efficient similarity search and clustering over a collection of vectors. It can be used to build custom vector databases. It is actually really good but doesn’t have features such as updating points etc. It makes more sense if you have a specific problem in mind and can take full advantage of the research-level solutions this library provides.