Introduction
Humanity has long since passed the point at which it’s possible to “keep up” with the pace of new data being generated. What’s a measly human supposed to do in the face of all this data, equipped with little more than a pathetic meat brain and their own curiosity? We haven’t even considered retention, comprehension, or recall, without which that speed reading course your cousin has been trying to sell you since 2008 isn’t worth much.
One answer is to get organized with your data. There’s far too much data for one individual or even a team to keep everything they need to know in mind. Data-oriented professionals have been managing data in relational databases for about 50 years, and contemporary data scientists and engineers will be familiar with tools like SQL, Postgres, and Mongo. A different approach, the vector database, has been gaining the spotlight through the mainstream emergence of generative AI.
How Vector Databases Organize Data
Vector databases organize data by representing each data point as a vector, which is a mathematical object containing numerical values corresponding to different attributes or features of the data. These vectors are then organized and stored in the database, creating a structured arrangement where each vector occupies a specific location or index. The database uses mathematical operations, such as distance metrics, to efficiently search, retrieve, and manipulate vectors. This organization enables the database to quickly find and analyze similar or related data points by comparing the numerical values in the vectors. As a result, vector databases are well-suited for applications like similarity search, where the goal is to identify and retrieve data points that are closely related to a given query vector based on their mathematical representation.
Vector databases and vector search leverage the capabilities of large neural models like transformers or convolutional networks to extract meaningful numerical features from a data object. These features are represented as embeddings, or feature vectors. Depending on the training of the embedding model, these vectors can encapsulate both concrete properties and semantic information, such as sentiment or the contextual relationship between data.
Vector search typically seeks the nearest neighbors to a query vector. A naive approach calculates the distance from the query to every stored vector, but that quickly becomes impractical for larger applications. Efficient vector search instead relies on approximate nearest neighbor (ANN) strategies, trading a small amount of accuracy for speed.
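As a point of reference, here is a minimal sketch of the naive, exhaustive approach using PyTorch; the random vectors simply stand in for real embeddings.

import torch

# 1,000 random 128-dimensional vectors standing in for real embeddings
vectors = torch.randn(1000, 128)
query = torch.randn(128)

# Euclidean distance from the query to every stored vector (O(n) per query)
distances = torch.linalg.norm(vectors - query, dim=1)

# indices of the 5 nearest neighbors
nearest = torch.topk(distances, k=5, largest=False).indices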
Cutting-edge vector search implementations can handle thousands of queries per second with recall rates exceeding 90%. The http://ann-benchmarks.com page provides comparisons of various ANN implementations. The potential use cases for vector search are vast and continually expanding with the rise of generative AI and the acceleration of new content creation.
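To illustrate what an ANN index looks like in practice, here is a small sketch using the open-source hnswlib library, one of the implementations benchmarked on that page; the data and index parameters below are arbitrary placeholders, not tuned recommendations.

import numpy as np
import hnswlib

# illustrative only: index 10,000 random 128-dimensional vectors
dim, num_vectors = 128, 10_000
data = np.random.rand(num_vectors, dim).astype(np.float32)

# build an HNSW index using L2 (Euclidean) distance
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(data)
index.set_ef(50)  # larger ef improves recall at the cost of query speed

# approximate nearest neighbors for a single query vector
query_vec = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query_vec, k=5)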
Using a Vector Database to Match Idioms to Their Meanings
Understanding idioms can be a special challenge. When learning a new language, trying to keep up with new slang, or even just interacting with a slightly different dialect or culture, these non-literal colloquialisms are one of the ‘final bosses’ of fluency.
If you have an idea of the meaning you want to express, but don’t remember the corresponding idiom, it turns out you can use vector search to find a close match. Consider the phrase “a fatal weakness, especially in the absence of other vulnerabilities.” You may already have in mind the corresponding idiom, but if not, you can easily pick out the match from the lineup below based on the Euclidean distance (lower is better) and cosine similarity (higher positive values are better).
| Phrase | Euclidean Distance | Cosine Similarity |
| --- | --- | --- |
| a bitter pill to swallow | 7.075 | 0.338 |
| a dime a dozen | 7.280 | 0.307 |
| a hot potato | 7.260 | 0.300 |
| Achilles' heel | 6.022 | 0.532 |
| at the drop of a hat | 6.696 | 0.389 |
| dollars to donuts | 7.840 | 0.210 |
When queried with our search phrase, the idiomatic term “Achilles’ heel” has both the lowest Euclidean distance and the highest cosine similarity. In other words, we’ve successfully connected the idiom with its intended meaning via a naive vector search. But how do we extract embedding vectors from strings of text (the query phrase and the idioms)? Euclidean distance may be familiar (in 2D or 3D space), but how do we calculate cosine similarity?
Let’s start with embeddings. We’ll use the Hugging Face ecosystem and a SentenceTransformers model to extract feature vectors from each string of text. We’ll take advantage of the AutoTokenizer and AutoModel classes to instantiate the pre-trained model.
from transformers import AutoTokenizer, AutoModel

# phrases and query in lists
phrases = ["a bitter pill to swallow",
           "a dime a dozen",
           "a hot potato",
           "Achilles' heel",
           "dollars to donuts",
           "at the drop of a hat"]
query = ["a fatal weakness, especially in the absence of other vulnerabilities"]

# load model from sentence-transformers on Hugging Face
my_model = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(my_model)
model = AutoModel.from_pretrained(my_model)

# fixed padding length (any value longer than the longest tokenized input works)
pad_length = 32

# convert strings to tokens and pad
data_tokens = tokenizer(phrases, padding="max_length",
                        max_length=pad_length,
                        return_tensors="pt")
encoded_data = {key: value for key, value in data_tokens.items()}

# we'll use the last hidden state from model_output
model_output = model(**encoded_data)

# same process applied to the query
query_tokens = tokenizer(query, padding="max_length",
                         max_length=pad_length,
                         return_tensors="pt")
encoded_query = {key: value for key, value in query_tokens.items()}
query_output = model(**encoded_query)
SentenceTransformers are “Sentence BERT” (SBERT) models, an extension of the BERT transformer architecture with modifications tailored specifically for embedding sentences. SBERT models are trained with a special CLS (classification) token, as in the original BERT paper. For the model we’re using, the CLS token is the start token ‘<s>’.
tokenizer.special_tokens_map["cls_token"]
# output
'<s>'
One common method to get an embedding from a BERT type model is to take the CLS token representation from the final layer of the model. As our model uses the ‘<s>’ start token as the CLS token, this just means returning the first vector for each sample in a batch as that sample’s embedding.
def cls_pooling(model_output):
    # the embedding is the first token's (CLS / '<s>') vector for each sample
    cls_vectors = model_output.last_hidden_state[:, 0, :]
    return cls_vectors
Another pooling strategy for embeddings is to average all the vectors in the last hidden layer of the model. For our Achilles’ heel example, this doesn’t affect the ranking but it does result in a slightly better Euclidean distance and cosine similarity of 4.747 and 0.692, respectively, for “Achilles’ heel” with our original query.
def mean_pooling(model_output):
    # average every token vector in the last hidden layer
    # (note: this simple version includes padding tokens in the average)
    mean_vectors = model_output.last_hidden_state.mean(1)
    return mean_vectors
Once we’ve fed the idioms and the query through the model, we can use one of the functions defined above to pool the vectors from the last hidden layer using either mean pooling or CLS pooling.
embeddings = cls_pooling(model_output)
query_embedding = cls_pooling(query_output)
With embeddings in hand, we can use similarity and/or distance metrics to match the query embedding vector to one or more vectors from the list of idioms. We’ll use Euclidean distance and cosine similarity. The functions for Euclidean (L2) distance and cosine similarity follow from the equations in a straightforward manner.
def l2_distance(query, embeddings):
    # query is a 1xk vector,
    # embeddings is n vectors in an nxk matrix
    distances = [((query - e)**2).sum().sqrt()
                 for e in embeddings]
    return distances

def cosine_similarity(query, embeddings):
    # dot product of each pair, normalized by the vectors' magnitudes
    q = query
    similarities = [(q @ e.t())
                    / (q @ q.t() * e @ e.t()).sqrt()
                    for e in embeddings]
    return similarities
distances = l2_distance(query_embedding, embeddings)
cosine_similarities = cosine_similarity(query_embedding, embeddings)
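To reproduce the ranking in the table above, we can pair each idiom with its scores and sort; a minimal sketch:

# pair each idiom with its scores and sort by Euclidean distance (ascending)
results = sorted(zip(phrases, distances, cosine_similarities),
                 key=lambda row: row[1].item())
for phrase, dist, sim in results:
    print(f"{phrase}: L2 = {dist.item():.3f}, cosine = {sim.item():.3f}")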
Euclidean distance and cosine similarity are just two of many different options for comparing vectors, any of which can be chosen to help power a vector search. A selection of different distance and similarity metrics are collected in the table below.
Vector comparison metrics
In the table, distance metrics essentially treat embedding vectors like coordinates in a high-dimensional space. L2 distance penalizes small element-wise differences less harshly than L1 (taxicab/Manhattan) distance, and Chebyshev distance takes only the maximum difference between any pair of vector elements, ignoring elements with smaller differences.
Dot product and cosine similarity compare vectors based on their direction, with the dot product also taking magnitude into account. Cosine similarity, on the other hand, is normalized by the geometric mean of the self-dot products of each vector, and therefore only compares vectors by their direction.
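As a sketch, following the same conventions as the functions defined earlier (query is 1xk, embeddings is nxk), the L1, Chebyshev, and dot-product comparisons described here could be implemented like this:

def l1_distance(query, embeddings):
    # sum of absolute element-wise differences (taxicab/Manhattan)
    return [(query - e).abs().sum() for e in embeddings]

def chebyshev_distance(query, embeddings):
    # largest absolute element-wise difference
    return [(query - e).abs().max() for e in embeddings]

def dot_product(query, embeddings):
    # unnormalized similarity: direction and magnitude both matter
    return [(query @ e) for e in embeddings]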
You can experiment with the basic example (swapping out CLS and mean pooling, for example) with the code in this GitHub gist, which also includes extra code for reproducing the figure below. Try adding your own distance/similarity metrics or different pooling strategies, custom queries, etc.
Vector Database Use Cases
This section surveys a very small subset of applications for vector search. These are all examples currently in use, or form the basis for a compelling demo from one of the many vector search companies (or libraries).
Powering Semantic Matches for Employers and Job Seekers
Semantic search can be used to match job searches and candidate profiles to job advertisements and companies, even when keywords and tags don’t line up. For example, there’s no need to manually provide an equivalence mapping between search terms for “frontend engineer” and “web development”; vector embeddings provide that understanding automatically, as the sketch below illustrates. This may help avoid problems arising from mismatches between concrete skills and fashionable jargon used in the business world.
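As a rough illustration of that claim (not any vendor’s actual pipeline), we can embed some job titles and a search query with the same sentence-transformers model used earlier and compare them by cosine similarity; the titles and query are made-up examples.

from sentence_transformers import SentenceTransformer, util

# illustrative job titles and a free-text search query
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
job_titles = ["frontend engineer", "backend developer", "data analyst"]
search_query = "web development"

title_embeddings = model.encode(job_titles)
query_embedding = model.encode(search_query)

# cosine similarity between the query and every job title
scores = util.cos_sim(query_embedding, title_embeddings)[0]
for title, score in zip(job_titles, scores):
    print(f"{title}: {score:.3f}")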
The European “Virtual Recruiting” service MoBerries uses semantic search to match job candidates with open positions for its clients. While the technical details of their implementation aren’t published, the German vector search company Qdrant lists MoBerries as a customer success case. Qdrant is written in Rust with clients in Python and JavaScript/TypeScript, and you can check the nominal performance of their library at ann-benchmarks.com.
Text-to-Image Search: Cleaning up a Demo Dataset
One of the main demos for vector databases comes from Marqo: a text-to-image search mimicking an e-commerce application. The images in the demo, however, are entirely AI-generated (all ~250,000 of them). A side effect of using such a large synthetic dataset is that some of the images turned out a little weird, and a fair number of the generated images were not the kind of thing you’d want popping up in a conference room demo.
Marqo turned their vector search back on the dataset itself: by querying with search terms that prompted unwanted results, they removed about 1,500 bad images from the demo dataset. While this data-cleaning process was not 100% effective, it made a big impact, and finding the right search terms for images that escaped their sanitation efforts is worth trying out. Marqo also demos image-to-image search and a text-based search of a simple Wikipedia dataset.
A visual vector database system like Marqo can also enable retail stores to use vector search to help shoppers find what they need through visually similar products. For example, a vintage clothing store could use vector search to label new vintage items within a certain aesthetic (sporty, punk, preppy). Shoppers could then search an aesthetic in common language without having to specify the type of clothing they want (jerseys & trainers, leather jackets & boots, button-ups & chinos). The sketch below shows the basic idea behind this kind of text-to-image matching.
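This is a minimal sketch of the underlying text-to-image matching, not Marqo’s actual API; it uses a CLIP checkpoint from sentence-transformers, and the image file names are hypothetical placeholders.

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP maps images and text into the same embedding space
model = SentenceTransformer("clip-ViT-B-32")

# hypothetical product photos standing in for a catalog
image_paths = ["leather_jacket.jpg", "tennis_jersey.jpg", "chinos.jpg"]
image_embeddings = model.encode([Image.open(p) for p in image_paths])

# embed a free-form text query and rank the images by cosine similarity
query_embedding = model.encode("punk aesthetic outfit")
scores = util.cos_sim(query_embedding, image_embeddings)[0]
best_match = image_paths[int(scores.argmax())]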
Retrieval Augmented Generation
LLMs make for a tempting replacement for conventional web search. Rather than searching through a page or more of links vaguely related to your search query (including ads and SEO content), wouldn’t it be more convenient to just ask a trusty AI for the answers you want or need?
The rapid adoption of ChatGPT reportedly caused alarm at Google, the world’s leading search engine, as users migrated to the GPT-4-powered Bing Chat (now Copilot). Google quickly followed suit by integrating its LaMDA and PaLM models into its own LLM chatbot, Bard. Even DuckDuckGo launched an experimental feature called DuckAssist, the product of a collaboration involving Anthropic’s Claude LLMs.
However, a major flaw of LLM chat models, especially as a search engine alternative, is their tendency to hallucinate, confidently outputting misinformation. Without citing sources (or worse, making up sources to cite), chat LLMs require a level of expertise to check answers that largely obviates their utility. One way to fix this problem is to combine chat-tuned LLMs with vector search.
Retrieval Augmented Generation (RAG) is a solution to the gaping flaws encountered when using LLMs like GPT-4, Claude 2, or PaLM 2 as a search replacement. LlamaIndex is a RAG resource and data framework for connecting LLMs to data sources, letting you build a RAG question-answering system in just a few lines of code, as sketched below.
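Here is a minimal sketch of the kind of pipeline LlamaIndex enables; the directory path and question are placeholders, the exact import paths vary between LlamaIndex versions, and a configured LLM (such as an OpenAI API key) is assumed for the query step.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# load documents from a local folder (placeholder path)
documents = SimpleDirectoryReader("./my_docs").load_data()

# embed the documents and store the vectors in an in-memory index
index = VectorStoreIndex.from_documents(documents)

# retrieve relevant chunks and pass them to the LLM to ground its answer
query_engine = index.as_query_engine()
response = query_engine.query("What is our refund policy?")
print(response)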
Many of the new vector database companies that have been sprouting up in the midst of major developments in the generative AI space describe themselves as AI native. This is one way to differentiate the new players, for better or for worse, from vector search offerings in more established projects that often have a background in relational search (e.g., pgvector in PostgreSQL).
Significant shortcomings with using un-augmented LLMs as a knowledge base include hallucination, limited context windows, and frozen knowledge arising from cut-off dates in training data. Vector search and vector databases are a powerful tool for overcoming these problems.
Current Vector Database Framework Offerings
The table below includes a selection of various companies and frameworks for powering vector databases. With the rapid rate of new project germination and venture capital flowing into vector search, this list is only a small subset of what’s out there.
While most of the companies listed offer cloud solutions for vector databases at large enterprise scale, most also release their code under an open or permissive license. This makes self-hosting or local instances feasible, and you may be surprised at the scale possible with basic hardware. If your organization needs to maintain data security on owned hardware and has the engineering competence for managing one of the open-source frameworks, self-hosting is a great option.
It’s possible to get a lot of vector search performance out of relatively modest hardware resources. To start with, we can consider the hardware used by ann-benchmarks.com. According to the GitHub repo, they run benchmarks on an AWS r6i.16xlarge instance, which has 64 vCPUs and 512 GB of RAM.
Coupling vector search to other modules, such as an LLM for retrieval augmented chat functionality, will naturally have additional hardware requirements associated with the additional functionality. For a RAG-enabled LLM, add consideration for one or more GPUs with sufficient VRAM for handling the model. There’s no doubt that the ocean of data is only getting bigger, and methods that seem exotic today may become commonplace in the future.
Conclusions
Vector search has received positive buzz amidst recent developments in generative AI, but that doesn’t mean vector search is a new technology. Hierarchical Navigable Small World (HNSW) graphs, published in 2016, are considered by many to be the state-of-the-art approximate nearest neighbor algorithm for vector search.
Vector search doesn’t have to supplant the formal structured queries of relational databases to be useful, but rather provides a more abstract and flexible option for matching queries semantically. A well-integrated vector database strategy can supplement more traditional data operations with relational databases and many of the vector database offerings have the option to combine vector and metadata search.
Vector search enhances the capabilities of the very same deep learning models that enable the extraction of meaningful vector representations of data, e.g., LLMs and image models like CLIP. A vector database can be combined with LLMs to provide something akin to extended memory, and to overcome limitations like context window constraints and confabulation. There may be quite a few vector database startups that end up on the cutting room floor after the recent influx of investment wanes, but we’re sure to see some interesting new use cases and capabilities as a result of all the attention.
SabrePC is a hardware solutions provider with a stock of storage drives to store your data efficiently and effectively. Explore our storage servers and level up your data infrastructure. If you have any questions, contact us today for system quotes, availability, and more.