https://technicalwriting.dev/data/embeddings.html

Embeddings are underrated#

Machine learning (ML) has the potential to advance the state of the art in technical writing. No, I'm not talking about text generation models like Claude, Gemini, LLaMa, GPT, etc. The ML technology that might end up having the biggest impact on technical writing is embeddings.

Embeddings aren't exactly new, but they have become much more widely accessible in the last couple of years. What embeddings offer to technical writers is the ability to discover connections between texts at previously impossible scales.

Building intuition about embeddings#

Here's an overview of how you use embeddings and how they work. It's geared towards technical writers who are learning about embeddings for the first time.

Input and output#

Someone asks you to "make some embeddings". What do you input? You input text.^1 You don't need to provide the same amount of text every time. E.g. sometimes your input is a single paragraph while at other times it's a few sections, an entire document, or even multiple documents.

What do you get back? If you provide a single word as the input, the output will be an array of numbers like this:

    [-0.02387, -0.0353, 0.0456]

Now suppose your input is an entire set of documents. The output turns into this:

    [0.0451, -0.0154, 0.0020]

One input was drastically smaller than the other, yet they both produced an array of 3 numbers. Curiouser and curiouser. (When you work with real embeddings, the arrays will have hundreds or thousands of numbers, not 3. More on that later.)

Here's the first key insight. Because a given model always gives us back the same number of values no matter how big or small the input text, we now have a way to mathematically compare any two pieces of arbitrary text to each other.

But what do those numbers MEAN?

^1 Some embedding models are "multimodal", meaning you can also provide images, videos, and audio as input.
This post focuses on text since that's the medium that we work with the most as technical writers. Haven't seen a multimodal model support taste, touch, or smell yet!

First, how to literally make the embeddings#

The big service providers have made it easy. Here's how it's done with Gemini:

    import google.generativeai as gemini

    gemini.configure(api_key='...')

    text = 'Hello, world!'
    response = gemini.embed_content(
        model='models/text-embedding-004',
        content=text,
        task_type='SEMANTIC_SIMILARITY'
    )
    embedding = response['embedding']

The size of the array depends on what model you're using. Gemini's text-embedding-004 model returns an array of 768 numbers whereas Voyage AI's voyage-3 model returns an array of 1024 numbers. This is one of the reasons why you can't use embeddings from different providers interchangeably. (The main reason is that the numbers from one model mean something completely different than the numbers from another model.)

Does it cost a lot of money?#

No.

Is it terrible for the environment?#

I don't know. After the model has been created (trained), I'm pretty sure that generating embeddings is much less computationally intensive than generating text. But it also seems to be the case that embedding models are trained in similar ways to text generation models^2, with all the energy usage that implies. I'll update this section when I find out more.

^2 From You Should Probably Pay Attention to Tokenizers: "Embeddings are byproduct of transformer training and are actually trained on the heaps of tokenized texts. It gets better: embeddings are what is actually fed as the input to LLMs when we ask it to generate text."

What model is best?#

Ideally, your embedding model can accept a huge amount of input text, so that you can generate embeddings for complete pages. If you try to provide more input than a model can handle, you usually get an error.
As of October 2024 voyage-3 seems to be the clear winner in terms of input size^3:

Organization  Model name              Input limit (tokens)
Voyage AI     voyage-3                32000
Nomic         Embed                   8192
OpenAI        text-embedding-3-large  8191^4
Mistral       Embed                   8000
Google        text-embedding-004      2048
Cohere        embed-english-v3.0      512

For my particular use cases as a technical writer, large input size is an important factor. However, your use cases may not need large input size, or there may be other factors that are more important. See the Massive Text Embedding Benchmark (MTEB) leaderboard.

^3 These input limits are based on tokens, and each service calculates tokens differently, so don't put too much weight on these exact numbers. E.g. a token for one model may be approximately 3 characters, whereas for another one it may be approximately 4 characters.

^4 Previously, I incorrectly listed this model's input limit as 3072. Sorry for the mistake.

Very weird multi-dimensional space#

Back to the big mystery. What the hell do these numbers MEAN?!?!?!

Let's begin by thinking about coordinates on a map. Suppose I give you three points and their coordinates:

Point  X-Coordinate  Y-Coordinate
A      3             2
B      1             1
C      -2            -2

There are 2 dimensions to this map: the X-Coordinate and the Y-Coordinate. Each point lives at the intersection of an X-Coordinate and a Y-Coordinate. Is A closer to B or C?

../_images/embeddings-1.png

A is much closer to B.

Here's the mental leap. Embeddings are similar to points on a map. Each number in the embedding array is a dimension, similar to the X-Coordinates and Y-Coordinates from earlier. When an embedding model sends you back an array of 1000 numbers, it's telling you the point where that text semantically lives in its 1000-dimension space, relative to all other texts. When we compare the distance between two embeddings in this 1000-dimension space, what we're really doing is figuring out how semantically close or far apart those two texts are from each other.
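To make the map analogy concrete, here is a minimal sketch (not from the original experiment) that measures the distance between the three map points and then applies the same idea to the two 3-number example arrays from earlier. The function names are my own, and cosine similarity is included because it is a common way to compare embeddings; which measure suits a given model is an assumption you should check against that model's docs.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_similarity(p, q):
    """Similarity of direction: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# The three points from the map table.
points = {"A": (3, 2), "B": (1, 1), "C": (-2, -2)}
dist_ab = euclidean_distance(points["A"], points["B"])
dist_ac = euclidean_distance(points["A"], points["C"])
print(f"A to B: {dist_ab:.3f}")  # A to B: 2.236
print(f"A to C: {dist_ac:.3f}")  # A to C: 6.403

# The same functions work unchanged on longer arrays, such as the
# illustrative 3-number "embeddings" shown earlier in this post.
word = [-0.02387, -0.0353, 0.0456]
docs = [0.0451, -0.0154, 0.0020]
similarity = cosine_similarity(word, docs)
print(f"cosine similarity: {similarity:.3f}")
```

Nothing in these functions cares whether the arrays have 2 numbers or 1000, which is why the map intuition carries over even though we can't picture the space.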
../_images/mindblown.gif

The concept of positioning items in a multi-dimensional space like this, where related items are clustered near each other, goes by the wonderful name of latent space.

The most famous example of the weird utility of this technology comes from the Word2vec paper, the foundational research that kickstarted interest in embeddings 11 years ago. In the paper they shared this anecdote:

    embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")

Starting with the embedding for king, subtract the embedding for man, then add the embedding for woman. When you look around this vicinity of the latent space, you find the embedding for queen nearby. In other words, embeddings can represent semantic relationships in ways that feel intuitive to us humans. If you asked a human "what's the female equivalent of a king?" that human would probably answer "queen", the same answer we get from embeddings. For more explanation of the underlying theories, see Distributional semantics.

The 2D map analogy was a nice stepping stone for building intuition but now we need to cast it aside, because embeddings operate in hundreds or thousands of dimensions. It's impossible for us lowly 3-dimensional creatures to visualize what "distance" looks like in 1000 dimensions. Also, we don't know what each dimension represents, hence the section heading "Very weird multi-dimensional space".^5 One dimension might represent something close to color. The king - man + woman ≈ queen anecdote suggests that these models contain a dimension with some notion of gender. And so on. Well Dude, we just don't know.

The mechanics of converting text into very weird multi-dimensional space are complex, as you might imagine. They are teaching machines to LEARN, after all. The Illustrated Word2vec is a good way to start your journey down that rabbit hole.

^5 I borrowed this phrase from Embeddings: What they are and why they matter.
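The king - man + woman arithmetic from the Word2vec anecdote can be tried out with toy vectors. This is only a sketch: the 3-number vectors below are invented by hand for illustration and do not come from Word2vec or any real model, and the `analogy` helper is a hypothetical name. As noted above, real models don't give us neatly labeled dimensions like these.

```python
import math


def cosine_similarity(p, q):
    """Similarity of direction between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Hypothetical hand-made "embeddings". Loosely: (royalty, gender, fruitiness).
# Invented for this example; real embeddings are not interpretable like this.
toy = {
    "king": (0.9, 0.8, 0.1),
    "queen": (0.9, -0.7, 0.1),
    "man": (0.1, 0.8, 0.0),
    "woman": (0.1, -0.8, 0.0),
    "apple": (0.0, 0.0, 0.9),
}


def analogy(a, b, c, vectors):
    """Return the word whose vector is nearest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Skip the three input words so the answer is a different word.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))


nearest = analogy("king", "man", "woman", toy)
print(nearest)  # queen
```

The result lands near queen rather than exactly on it, which matches the "≈" in the anecdote: you look around the vicinity and take the closest neighbor.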
Comparing embeddings#

After you've generated your embeddings, you'll need some kind of "database" to keep track of what text each embedding is associated with. In the experiment discussed later, I got by with just a local JSON file:

    {
        "authors": {
            "embedding": [...]
        },
        "changes/0.1": {
            "embedding": [...]
        },
        ...
    }

authors is the name of a page. embedding is the embedding for that page.

Comparing embeddings involves a lot of linear algebra. I learned the basics from Linear Algebra for Machine Learning and Data Science. The big math and ML libraries like NumPy and scikit-learn can do the heavy lifting for you (i.e. very little math code on your end).

Applications#

I could tell you exactly how I think we might advance the state of the art in technical writing with embeddings, but where's the fun in that? You now know why they're such an interesting and useful new tool in the technical writer toolbox... go connect the rest of the dots yourself!

Let's cover a basic example to put the intuition-building ideas into practice and then wrap up this post.

Related pages#

Some docs sites have a recommendation system that makes you aware of other relevant docs. The system looks at whatever page you're currently on, finds other pages related to this one, and then recommends other pages to visit. Embeddings provide a new way to support this feature, probably at a fraction of the cost of previous methods. Here's how it works:

1. Generate an embedding for each page on your docs site.
2. For each page, compare its embedding against all other page embeddings. If the two embeddings are mathematically similar, then the contents on the two pages are probably related to each other.

This can be done as a batch operation. A page's embedding only needs to change when the page's content changes. I ran this experiment on the Sphinx docs. The results were pretty good. Implementation and Results have the details.

See Related content using embeddings for another example of this approach.
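The point that a page's embedding only needs to change when the page's content changes can be sketched as a small caching step. This is an assumption-laden illustration, not what the experiment does: the `hash` field is an addition to the JSON layout shown earlier, `update_embeddings` is a hypothetical helper, and `fake_embed` is a stand-in so the example runs without calling any provider.

```python
import hashlib


def content_hash(text):
    """Fingerprint of a page's text; changes whenever the text changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def update_embeddings(pages, store, embed):
    """Re-embed only new or changed pages.

    pages: dict of page name -> page text
    store: dict of page name -> {"hash": ..., "embedding": ...} (updated in place)
    embed: callable that turns text into a list of numbers
    Returns the names of the pages that were (re)embedded.
    """
    changed = []
    for name, text in pages.items():
        digest = content_hash(text)
        entry = store.get(name)
        if entry is not None and entry.get("hash") == digest:
            continue  # content is unchanged, keep the stored embedding
        store[name] = {"hash": digest, "embedding": embed(text)}
        changed.append(name)
    # Forget pages that no longer exist on the site.
    for name in list(store):
        if name not in pages:
            del store[name]
    return changed


def fake_embed(text):
    # Stand-in for a real embedding API call; NOT a meaningful embedding.
    return [float(len(text)), float(text.count(" "))]


store = {}
pages = {"authors": "Sphinx authors", "changes/0.1": "Release 0.1"}
first_run = update_embeddings(pages, store, fake_embed)
pages["authors"] = "Sphinx authors and contributors"
second_run = update_embeddings(pages, store, fake_embed)
print(first_run)   # ['authors', 'changes/0.1']
print(second_run)  # ['authors']
```

With a real provider, swapping `fake_embed` for the API call would mean the second build only pays for the one page that changed.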
Let a thousand embeddings bloom?#

As docs site owners, I wonder if we should start freely providing embeddings for our content to anyone who wants them, via REST APIs or well-known URIs. Who knows what kinds of cool stuff our communities can build with this extra type of data about our docs?

Parting words#

Three years ago, if you had asked me what 768-dimensional space is, I would have told you that it's just some abstract concept that physicists and mathematicians need for unfathomable reasons, probably something related to string theory. Embeddings gave me a reason to think about this idea more deeply, and actually apply it to my own work. I think that's pretty cool.

Order-of-magnitude improvements in our ability to maintain our docs may very well still be possible after all... perhaps we just need an order of magnitude more dimensions!!

Appendix#

Implementation#

I created a Sphinx extension to generate an embedding for each doc. Sphinx automatically invokes this extension as it builds the docs.

    import json
    import os

    import voyageai


    VOYAGE_API_KEY = os.getenv('VOYAGE_API_KEY')
    voyage = voyageai.Client(api_key=VOYAGE_API_KEY)


    def on_build_finished(app, exception):
        # Write all collected embeddings to disk once the build is done
        with open(srcpath, 'w') as f:
            json.dump(data, f, indent=4)


    def embed_with_voyage(text):
        try:
            embedding = voyage.embed([text], model='voyage-3', input_type='document').embeddings[0]
            return embedding
        except Exception:
            # If the API call fails, store None for this doc instead of crashing the build
            return None


    def on_doctree_resolved(app, doctree, docname):
        text = doctree.astext()
        embedding = embed_with_voyage(text)  # Generate an embedding for each document!
        data[docname] = {
            'embedding': embedding
        }


    # Use some globals because this is just an experiment and you can't stop me
    def init_globals(srcdir):
        global filename
        global srcpath
        global data
        filename = 'embeddings.json'
        srcpath = f'{srcdir}/{filename}'
        data = {}


    def setup(app):
        init_globals(app.srcdir)
        # https://www.sphinx-doc.org/en/master/extdev/appapi.html#sphinx-core-events
        app.connect('doctree-resolved', on_doctree_resolved)  # This event fires on every doc that's processed
        app.connect('build-finished', on_build_finished)
        return {
            'version': '0.0.1',
            'parallel_read_safe': True,
            'parallel_write_safe': True,
        }

When the build finishes, the embeddings data is stored in embeddings.json like this:

    {
        "authors": {
            "embedding": [...]
        },
        "changes/0.1": {
            "embedding": [...]
        },
        ...
    }

authors and changes/0.1 are docs. embedding contains the embedding for that doc.

The last step is to find the closest neighbor for each doc, i.e. the other page that is considered most relevant to the page you're currently on. As mentioned earlier, Linear Algebra for Machine Learning and Data Science was the class that taught me the basics.
    import json

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity


    def find_docname(data, target):
        # Map an embedding back to the doc it belongs to
        for docname in data:
            if data[docname]['embedding'] == target:
                return docname
        return None


    # Adapted from the Voyage AI docs
    # https://web.archive.org/web/20240923001107/https://docs.voyageai.com/docs/quickstart-tutorial
    def k_nearest_neighbors(target, embeddings, k=5):
        # Convert to numpy array
        target = np.array(target)
        embeddings = np.array(embeddings)
        # Reshape the query vector embedding to a matrix of shape (1, n) to make it
        # compatible with cosine_similarity
        target = target.reshape(1, -1)
        # Calculate the similarity for each item in data
        cosine_sim = cosine_similarity(target, embeddings)
        # Sort the data by similarity in descending order and take the top k items
        sorted_indices = np.argsort(cosine_sim[0])[::-1]
        # Take the top k related embeddings
        top_k_related_embeddings = embeddings[sorted_indices[:k]]
        top_k_related_embeddings = [
            list(row[:]) for row in top_k_related_embeddings
        ]  # convert to list
        return top_k_related_embeddings


    with open('doc/embeddings.json', 'r') as f:
        data = json.load(f)
    embeddings = [data[docname]['embedding'] for docname in data]
    print('.. csv-table::')
    print('   :header: "Target", "Neighbor"')
    print()
    for target in embeddings:
        neighbors = k_nearest_neighbors(target, embeddings, k=3)
        # ignore neighbors[0] because that is always the target itself
        nearest_neighbor = neighbors[1]
        target_docname = find_docname(data, target)
        target_cell = f'`{target_docname} `_'
        neighbor_docname = find_docname(data, nearest_neighbor)
        neighbor_cell = f'`{neighbor_docname} `_'
        print(f'   "{target_cell}", "{neighbor_cell}"')

As you may have noticed, I did not actually implement the recommendation UI in this experiment. My main goal was to get basic data on whether the embeddings approach generates decent recommendations or not.
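The script above recovers doc names by comparing whole embedding lists and relies on the first neighbor always being the target itself. A possible variant, sketched here under my own assumptions, keeps the doc names attached throughout and skips the target by name. It uses plain Python math instead of NumPy and scikit-learn so that it runs without extra dependencies; the `nearest_neighbors` helper and the tiny 2-number embeddings are made up for illustration.

```python
import math


def cosine_similarity(p, q):
    """Similarity of direction between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def nearest_neighbors(data, k=1):
    """For each doc, return its k most similar other docs as (docname, score) pairs.

    data uses the same layout as embeddings.json:
    {docname: {"embedding": [...]}, ...}
    """
    result = {}
    for target in data:
        scores = [
            (other, cosine_similarity(data[target]["embedding"], data[other]["embedding"]))
            for other in data
            if other != target  # never recommend the page you're already on
        ]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        result[target] = scores[:k]
    return result


# Made-up embeddings; real ones would be loaded from embeddings.json.
data = {
    "authors": {"embedding": [1.0, 0.0]},
    "changes/0.1": {"embedding": [0.0, 1.0]},
    "changes/0.2": {"embedding": [0.1, 0.9]},
}

neighbors = nearest_neighbors(data, k=1)
for target, matches in neighbors.items():
    print(target, "->", matches[0][0])
```

Comparing every page against every other page this way is quadratic in the number of pages, which is fine for a docs site of a few hundred pages but is a design choice worth revisiting for much larger sets.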
Results#

How to interpret the data: Target would be the page that you're currently on. Neighbor would be the recommended page.

Target                                     Neighbor
authors                                    changes/0.6
changes/0.1                                changes/0.5
changes/0.2                                changes/1.2
changes/0.3                                changes/0.4
changes/0.4                                changes/1.2
changes/0.5                                changes/0.6
changes/0.6                                changes/1.6
changes/1.0                                changes/1.3
changes/1.1                                changes/1.2
changes/1.2                                changes/1.1
changes/1.3                                changes/1.4
changes/1.4                                changes/1.3
changes/1.5                                changes/1.6
changes/1.6                                changes/1.5
changes/1.7                                changes/1.8
changes/1.8                                changes/1.6
changes/2.0                                changes/1.8
changes/2.1                                changes/1.2
changes/2.2                                changes/1.2
changes/2.3                                changes/2.1
changes/2.4                                changes/3.5
changes/3.0                                changes/4.3
changes/3.1                                changes/3.3
changes/3.2                                changes/3.0
changes/3.3                                changes/3.1
changes/3.4                                changes/4.3
changes/3.5                                changes/1.3
changes/4.0                                changes/3.0
changes/4.1                                changes/4.4
changes/4.2                                changes/4.4
changes/4.3                                changes/3.0
changes/4.4                                changes/7.4
changes/4.5                                changes/4.4
changes/5.0                                changes/3.5
changes/5.1                                changes/5.0
changes/5.2                                changes/3.5
changes/5.3                                changes/5.2
changes/6.0                                changes/6.2
changes/6.1                                changes/6.2
changes/6.2                                changes/6.1
changes/7.0                                extdev/deprecated
changes/7.1                                changes/7.2
changes/7.2                                changes/7.4
changes/7.3                                changes/7.4
changes/7.4                                changes/7.3
changes/8.0                                changes/8.1
changes/8.1                                changes/1.8
changes/index                              changes/8.0
development/howtos/builders                usage/extensions/index
development/howtos/index                   development/tutorials/index
development/howtos/setup_extension         usage/extensions/index
development/html_themes/index              usage/theming
development/html_themes/templating         development/html_themes/index
development/index                          usage/index
development/tutorials/adding_domain        extdev/domainapi
development/tutorials/autodoc_ext          usage/extensions/autodoc
development/tutorials/examples/README      tutorial/end
development/tutorials/extending_build      usage/extensions/todo
development/tutorials/extending_syntax     extdev/markupapi
development/tutorials/index                development/howtos/index
examples                                   index
extdev/appapi                              extdev/index
extdev/builderapi                          usage/builders/index
extdev/collectorapi                        extdev/envapi
extdev/deprecated                          changes/1.8
extdev/domainapi                           usage/domains/index
extdev/envapi                              extdev/collectorapi
extdev/event_callbacks                     extdev/appapi
extdev/i18n                                usage/advanced/intl
extdev/index                               extdev/appapi
extdev/logging                             extdev/appapi
extdev/markupapi                           development/tutorials/extending_syntax
extdev/nodes                               extdev/domainapi
extdev/parserapi                           extdev/appapi
extdev/projectapi                          extdev/envapi
extdev/testing                             internals/contributing
extdev/utils                               extdev/appapi
faq                                        usage/configuration
glossary                                   usage/quickstart
index                                      usage/quickstart
internals/code-of-conduct                  internals/index
internals/contributing                     usage/advanced/intl
internals/index                            usage/index
internals/organization                     internals/contributing
internals/release-process                  extdev/deprecated
latex                                      usage/configuration
man/index                                  usage/index
man/sphinx-apidoc                          man/sphinx-autogen
man/sphinx-autogen                         usage/extensions/autosummary
man/sphinx-build                           usage/configuration
man/sphinx-quickstart                      tutorial/getting-started
support                                    tutorial/end
tutorial/automatic-doc-generation          usage/extensions/autosummary
tutorial/deploying                         tutorial/first-steps
tutorial/describing-code                   usage/domains/index
tutorial/end                               usage/index
tutorial/first-steps                       tutorial/getting-started
tutorial/getting-started                   tutorial/index
tutorial/index                             tutorial/getting-started
tutorial/more-sphinx-customization         usage/theming
tutorial/narrative-documentation           usage/quickstart
usage/advanced/intl                        internals/contributing
usage/advanced/websupport/api              usage/advanced/websupport/quickstart
usage/advanced/websupport/index            usage/advanced/websupport/quickstart
usage/advanced/websupport/quickstart       usage/advanced/websupport/api
usage/advanced/websupport/searchadapters   usage/advanced/websupport/api
usage/advanced/websupport/storagebackends  usage/advanced/websupport/api
usage/builders/index                       usage/configuration
usage/configuration                        changes/1.2
usage/domains/c                            usage/domains/cpp
usage/domains/cpp                          usage/domains/c
usage/domains/index                        extdev/domainapi
usage/domains/javascript                   usage/domains/python
usage/domains/mathematics                  usage/referencing
usage/domains/python                       extdev/domainapi
usage/domains/restructuredtext             extdev/markupapi
usage/domains/standard                     usage/domains/index
usage/extensions/autodoc                   tutorial/automatic-doc-generation
usage/extensions/autosectionlabel          usage/quickstart
usage/extensions/autosummary               tutorial/automatic-doc-generation
usage/extensions/coverage                  usage/extensions/autodoc
usage/extensions/doctest                   tutorial/describing-code
usage/extensions/duration                  tutorial/more-sphinx-customization
usage/extensions/example_google            usage/extensions/example_numpy
usage/extensions/example_numpy             usage/extensions/example_google
usage/extensions/extlinks                  usage/extensions/intersphinx
usage/extensions/githubpages               tutorial/deploying
usage/extensions/graphviz                  usage/extensions/math
usage/extensions/ifconfig                  usage/extensions/doctest
usage/extensions/imgconverter              usage/extensions/math
usage/extensions/index                     development/index
usage/extensions/inheritance               usage/extensions/graphviz
usage/extensions/intersphinx               usage/quickstart
usage/extensions/linkcode                  usage/extensions/viewcode
usage/extensions/math                      usage/configuration
usage/extensions/napoleon                  usage/extensions/example_google
usage/extensions/todo                      development/tutorials/extending_build
usage/extensions/viewcode                  usage/extensions/linkcode
usage/index                                tutorial/end
usage/installation                         tutorial/getting-started
usage/markdown                             extdev/parserapi
usage/quickstart                           index
usage/referencing                          usage/restructuredtext/roles
usage/restructuredtext/basics              usage/restructuredtext/directives
usage/restructuredtext/directives          usage/restructuredtext/basics
usage/restructuredtext/domains             usage/domains/index
usage/restructuredtext/field-lists         usage/restructuredtext/directives
usage/restructuredtext/index               usage/restructuredtext/basics
usage/restructuredtext/roles               usage/referencing
usage/theming                              development/html_themes/index

© 2024 Kayce Basques