The Power Duo: Embeddings and RAG in Practice – Part 1: Embeddings

June 23rd, 2024

Introduction

A few weeks ago, Solution Street hosted an in-person AI/Machine Learning meetup that was well attended and overall a great event. One of the speakers, Ryan Gehl, gave a great introduction to the practical application of embeddings using OpenAI’s CLIP model. During this presentation, he demonstrated that by calculating embeddings for 30,000 pairs of glasses, similar pairs can be identified based on their positions in embedding space. In other words, he could show a list of suggested sunglasses based on the image of a single representative pair. I was amazed that all of this could be achieved through the use of something called an embedding.

But wait a minute, what is an embedding? The official definition of an embedding is a numerical representation of real-world objects that machine learning (ML) and artificial intelligence (AI) systems use to understand complex knowledge domains the way humans do. More simply stated, embeddings enable machine learning models to find similar objects. Embeddings are expressed as multi-dimensional vectors, much like a list of numbers [20, 405, 313, 23], where each number indicates where the represented object falls along that particular dimension.

Still trying to wrap my head around this concept, I read another blog post written by Simon Willison called “Embeddings: What they are and why they matter” and this really helped me to better understand the more practical side of embeddings. In that article, I learned that embeddings are created by training a neural network on a large corpus of data which enables the embeddings to capture the semantic meaning of the content they represent. In other words, the embedding numerically captures the meaning and context of the data it represents and this allows the AI to understand and compare this data with other data easily.

Simon Willison goes on to talk about using embeddings to create a recommendation feature on his blog that suggests posts similar to a given article. This was done by creating a vector embedding from the content of each article and then using something called cosine similarity to calculate the mathematical distance between one vector and another (technically, the cosine of the angle between the vectors). The closer this value is to 1, the more similar the two vector embeddings are (and the more similar the objects they represent, as shown in the example below).
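To make this concrete, here is a small, self-contained sketch of cosine similarity in plain Python. The toy 4-dimensional vectors are made up purely for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" -- real models use far more dimensions.
article_a = [0.9, 0.1, 0.3, 0.0]
article_b = [0.8, 0.2, 0.4, 0.1]  # similar in meaning to article_a
article_c = [0.1, 0.9, 0.0, 0.8]  # unrelated content

print(cosine_similarity(article_a, article_b))  # close to 1
print(cosine_similarity(article_a, article_c))  # much lower
```

With the toy values above, the two "similar" articles score roughly 0.98 while the unrelated pair scores roughly 0.16, which is exactly the behavior a recommendation feature relies on.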

Now with my newly acquired “armchair” expertise of embeddings, I began to appreciate the real value of embeddings in practical applications, including:

  • semantic search
  • recommendation engines
  • finding similar/dissimilar documents
  • classification 

So I set out to further explore embeddings on my own, but with a focus on “how would I bring this feature to the types of practical business applications that we develop here at Solution Street?” For that, there are several key requirements that need to be met in order to practically use embeddings in a business application:

  1. An appropriate data set for a given domain that can be represented by the embeddings
  2. A way to generate the embeddings
  3. A way to store/persist the embeddings (as vectors)
  4. A way to quickly find and retrieve the stored vector(s)
  5. A way to quickly calculate the distance/similarity between two vectors
  6. And all of this has to meet the software “-ilities” requirements, such as scalability, reliability, security, usability, etc.

For the first requirement, a data set, I decided to take the same approach as Simon’s post and use a set of blog articles, though this time it was from our own Solution Street blog which contains a wealth of good software construction/development knowledge and tips gained over the 22+ years of Solution Street’s existence. For the initial demo, I would also create a recommendation engine that would select similar blog articles for a given article. Using the Solution Street blog articles as my test data set also brought the additional advantage of my familiarity with this data, so it would be obvious to me when things were working (or were failing miserably).

For the second requirement, generating the embeddings from text data, we need an AI embedding model to encode the text input as a vector. There are many options here, both paid and open source, and the details of each option are beyond the scope of this article. So for simplicity’s sake, I decided to use the well-known OpenAI embedding model API to generate the embeddings here. This is not a free option, but the cost to generate each embedding is very low (about $0.00008 per page of content at the time of this article).

For the next three requirements (storing, finding, and calculating the distance between vector embeddings), this “smelled” awfully similar to the job of the databases/datastores often used in enterprise applications. After looking into this some more, I found that there are indeed general-purpose databases and datastores that can be used or “adapted” to work with vectors, as well as purpose-built vector databases. Given the final requirement that the database meet all of the software “-ilities,” be supported, and be production ready, and not wanting to reinvent the wheel, I decided to look at the purpose-built vector databases for this demo.

There are several (many, actually) purpose-built vector databases out there. Some of the more popular ones include Pinecone, Weaviate, Milvus, and Chroma. For this demo, I decided to use Weaviate, primarily because it is open source, well documented, provides the features I need (search, vector handling, plug-and-play support for different APIs, production readiness, etc.), and is easy to get started with and use.

Embeddings Code Demo

Before I get into the details of the demo, all of the code and data that is discussed here is available in this GitHub repo. I will be using Python for this demo. If you want to try this yourself, please have a recent version of Python installed (I am using 3.12.3 here) and sign up for an OpenAI account to get access to the API and an API key. Finally, although not absolutely required, I will also use Visual Studio Code with the Jupyter notebook plugin for this demo with the code from the GitHub repo. This makes it easier to demonstrate each step along the way, and also makes it easier to experiment with the code.

Start by cloning the repo above to your local environment. Now open Visual Studio Code and open the folder where you cloned the repo. It should look like the following:

Next, open the Terminal view in Visual Studio Code, verify that the correct version of Python is installed (I am using 3.12.3 in this case), and install the two required libraries, python-dotenv and weaviate-client:

> python --version
Python 3.12.3
> pip install python-dotenv && pip install -U weaviate-client
....

Create a new file called .env in the main folder.

This is where you can store any environment variables that need to be loaded to run the demos. The Python library you installed earlier, python-dotenv, will be used to automatically load these environment variables in the code. For now, the only variable we need to add to this file is for the OpenAI API key that you set up earlier. Your .env file should look like the following (replace the “xxxxxxxxxxxx” with your API key value):

#  OpenAI API Key

OPENAI_APIKEY=xxxxxxxxxxxx

Once the .env file is set up, now open the Jupyter notebook blog-embeddings.ipynb.

Make sure you have selected the correct Python kernel for the notebook. In my case, this is the Python 3.12.3 kernel that I installed the Python libraries into earlier. Let’s now walk through each code block in the notebook one by one so I can explain what is happening and discuss the output. 

After running the following code block, the environment variables you saved in the .env file earlier will be loaded for use by the program. Run this code block now.

Next, we will initialize and load the Weaviate client. The call to connect_to_embedded() will actually load an embedded version of the Weaviate database to be used with this program. The embedded version is just for local development. In “real life” you can run the Weaviate database in several ways, including as a Docker container, the Weaviate Cloud (WCD) service, self-managed Kubernetes, or Hybrid SaaS. For more information, consult the Weaviate documentation.

import weaviate
import os

client = weaviate.connect_to_embedded(
    headers = {
        "X-OpenAI-Api-Key": os.getenv("OPENAI_APIKEY")  # API key loaded from .env
    }
)

The embedded server will run on port 8079 by default. The output for a successful launch will look something like this: 

Started /Users/ghodum/.cache/weaviate-embedded: process ID 60012
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-06-11T15:20:41-04:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-06-11T15:20:41-04:00"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-06-11T15:20:41-04:00"}
...

Note: if you get the following error, you are probably already running a Weaviate embedded instance. Make sure to shut down or restart any Jupyter notebook kernels that may already be executing Weaviate and try again.

WeaviateStartUpError: Embedded DB did not start because processes are already listening on ports http:8079 and grpc:50050. Use weaviate.connect_to_local(port=8079, grpc_port=50050) to connect to the existing instance.

Now that we have created an instance of the Weaviate database, we will create a collection to hold our blog data. Collections are groups of objects that share a schema definition. You can think of a collection as similar to a table in a relational database. The schema for a collection can be explicitly defined or generated automatically based on the incoming data. Each object in the collection consists of the properties of that object as well as a vector embedding representation of the object. In this example, I will create a new collection named BlogArticles and configure it to use the OpenAI text2vec vectorizer (you can also configure Weaviate to use other embedding models and model providers to generate the vector embedding, or “bring your own” vector data generated outside of Weaviate) and to support OpenAI generative AI features (to be used later on).

# create a new collection to hold the vectors
# we are using OpenAI here, but this can be changed to another AI API
import weaviate.classes as wvc

collection_name = "BlogArticles"

# If the collection already exists, delete it
if client.collections.exists(collection_name):
    client.collections.delete(collection_name)

blog_articles = client.collections.create(
    name = collection_name,
    vectorizer_config = wvc.config.Configure.Vectorizer.text2vec_openai(),
    generative_config = wvc.config.Configure.Generative.openai()
)

After creating the collection, you should see something similar to this output.

{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-06-11T17:57:41-04:00","took":81146}
{"level":"info","msg":"Created shard blogarticles_2BLFhmmda6Lf in 14.831813ms","time":"2024-06-11T17:57:41-04:00"}

Now we get to the important part: we will load each blog article, add it to the collection, and have Weaviate call OpenAI (using the text-embedding-ada-002 embedding model by default) to create a vector embedding that represents the properties stored in the object. Weaviate also indexes the properties of each object you add to the collection to make it easy to filter the data in a query. It supports both approximate nearest neighbor (ANN) indexes and inverted indexes. All of this happens automatically when you add or update an object in the collection.

The following code will load each blog article from the blogs directory, create an object with two properties, filename and content, and then add it to the collection.

import os
import glob

blog_articles = list()
blog_dir = glob.glob('blogs/*.txt')

for blog_file in blog_dir:
    blog_filename = os.path.basename(blog_file)
    with open(blog_file, mode = 'r') as file:
        blog_articles.append({
            "filename": blog_filename,
            "content": file.read().replace('\n', ' ')
        })

blog_articles_collection = client.collections.get(collection_name)
blog_articles_collection.data.insert_many(blog_articles)

After you run this code, you should see something similar to this (with a different UUID for each object added to your collection).

BatchObjectReturn(all_responses=[UUID('cc810581-ed90-459e-8fd9-ddad478ea726'), UUID('e94231ce-ec59-4f3d-9d0c-4eb5e01d02ac'), UUID('93572f32-08fb-46ab-869d-a693ed195524')...

Ok, now the moment we have all been waiting for: all of the blog articles are loaded into the collection, and each article has a vector embedding associated with it, so we are ready to query the data and find the similar articles. Remember from earlier that we are going to find articles that are similar to a selected article by comparing the distance between the selected article’s vector embedding and each other article’s vector embedding, and then selecting the top n articles with the smallest distances. Weaviate makes this simple by querying the collection using the near_text() operator, which finds data objects based on their vector similarity to a natural language query.
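Conceptually, that ranking step can be sketched in a few lines of plain Python, using toy precomputed vectors (made-up values for illustration; in the demo, Weaviate does all of this for us at scale, with proper indexing):

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1 - dot / norm

# Toy precomputed embeddings keyed by filename (hypothetical values).
stored = {
    "react-article.txt":   [0.9, 0.1, 0.2],
    "hiring-article.txt":  [0.1, 0.9, 0.1],
    "testing-react.txt":   [0.8, 0.2, 0.3],
}

selected = "react-article.txt"
query_vector = stored[selected]

# Rank every *other* article by distance to the selected one, smallest first.
ranked = sorted(
    ((name, cosine_distance(query_vector, vec))
     for name, vec in stored.items() if name != selected),
    key=lambda pair: pair[1],
)
for name, dist in ranked[:5]:  # keep the top n
    print(f"\t{name} (distance: {dist})")
```

Even with these toy values, the React-related article ranks first because its vector points in nearly the same direction as the selected article's.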

The following code loads each blog article in the collection, and for each article calls the near_text() operator with that article’s content as the query term. The near_text() operator will internally generate a vector embedding out of the content you provided in the query and will then find the other top n articles that are similar (in this case configured to a limit of 5 articles). We also configure the near_text() operator to return the distance values of each article so we can see how similar the article is to the selected article. The distance value metric is the cosine distance by default, where a value of 0 means identical vectors and a value of 2 means opposing vectors. In reality, the values are usually somewhere in between these two values. Run this code now to generate a listing of similar articles. Note: this may take some time to generate the full listing (on my machine it took about 40 seconds).

from weaviate.classes.query import MetadataQuery

# given a blog article, let's find the top 5 similar articles using the weaviate client
for item in blog_articles_collection.iterator():
    filename = item.properties['filename']
    content = item.properties['content']

    response = blog_articles_collection.query.near_text(
        query = content,
        limit = 6,  # we want the top 5 similar articles, but we also get the same article back, so we ask for 6
        return_metadata = MetadataQuery(distance = True)
    )

    print(f"Similar articles to {filename}:")
    for obj in response.objects:
        similar_filename = obj.properties['filename']
        # skip if the same file
        if similar_filename == filename:
            continue
        distance = obj.metadata.distance
        print(f"\t{similar_filename} (distance: {distance})")

    print("\n\n")

An example of my listing showing some of the similar articles:

Similar articles to react-bloglet-volume-1.txt:
	react-bloglet-series-volume-2.txt (distance: 0.08556556701660156)
	testing-react.txt (distance: 0.12350666522979736)
	learning-react-js-by-example.txt (distance: 0.12637197971343994)
	learning-react-js-by-example-part-2.txt (distance: 0.1288825273513794)
	getting-react-on-the-rails.txt (distance: 0.1297537088394165)

Similar articles to my-tech-predictions-that-will-change-your-life.txt:
	introduction-to-modern-ai-2024-edition-part-1.txt (distance: 0.16261422634124756)
	practical-artificial-intelligence.txt (distance: 0.1755232810974121)
	introduction-to-modern-ai-2024-edition-part-2.txt (distance: 0.1785755753517151)
	dipping-your-toe-into-machine-learning.txt (distance: 0.18109631538391113)
	software-engineer-resolutions-for-2018.txt (distance: 0.19106817245483398)

Similar articles to how-do-you-find-great-software-developers.txt:
	10-secrets-to-hiring-and-retaining-great-software-developers.txt (distance: 0.10145145654678345)
	5-key-capabilities-the-best-problem-solvers-have.txt (distance: 0.1152956485748291)
	what-your-software-development-vendor-isnt-telling-you.txt (distance: 0.13328921794891357)
	see-the-forest-for-the-trees-achieving-the-best-in-software-development.txt (distance: 0.13543027639389038)
	lets-not-forget-about-the-other-50.txt (distance: 0.1396358609199524)

...

And finally, here is the amazing and magical part: the tool is finding articles that are semantically similar to one another without being given any other information or metadata about the articles, just the raw content (and the vector embeddings that represent that content).

For example, look at the first blog article above: react-bloglet-volume-1.txt. This is obviously an article about the JavaScript UI library React. The query correctly returns the other articles that are most similar to it and also about React. Look at the distance value of the first article returned in the list, react-bloglet-series-volume-2.txt, compared to the other articles. This article ranked highest (I presume) because they are both part of the same series of articles and are therefore most similar. Fantastic stuff!

The next article, my-tech-predictions-that-will-change-your-life.txt, is equally interesting. This is an article, written in July 2023 by Solution Street’s founder Arthur Frankel, that made a few predictions about technology that will change your life in the future (and no, Arthur is not also a soothsayer in real life). Not surprisingly, AI and machine learning played a big role in some of his predictions, and so we see that a lot of the articles in the list also have content related to AI and/or machine learning. But it is the last article in the list, software-engineer-resolutions-for-2018.txt, that in my opinion is most interesting. This was another article written by Arthur, but this time back in 2018, about software engineer resolutions for that year. It contains only one small mention of machine learning and no mention of AI at all (boy, how times have changed in six short years!). The article is all about predictions of what technologies might be used in the future, and therefore it was also selected as a similar article. Cool!

Wrap Up

Let’s recap what we have learned so far regarding AI embeddings and their practical use. We began by understanding that embeddings are numerical representations of real-world objects, expressed as multi-dimensional vectors, enabling machine learning models to identify similar objects based on these vector representations. We discussed some practical uses of embeddings including:

  • semantic search
  • recommendation engines
  • identifying similar content
  • classification

In order for “real world” business applications to start to leverage embeddings, we first identified several key requirements that must be met including:

  • a relevant data set
  • a mechanism to generate vector embeddings
  • a way to store and retrieve vector embeddings
  • a way to calculate similarity
  • and being production ready

Finally, we demonstrated a “real world” use case of vector embeddings that can be quickly implemented using the Weaviate vector database to create a related article recommendation feature for blog posts based on vector cosine similarity.

I hope you have enjoyed this quick introduction to the fascinating world of AI embeddings and their practical applications. Now that we have a better understanding of the capability and power of what AI embeddings can provide, Part 2 of this article will explore another practical application of AI embeddings and generative AI called Retrieval-Augmented Generation (RAG), coming soon!