https://technicalwriting.dev/data/embeddings.html

Embeddings are underrated#

Machine learning (ML) has the potential to advance the state of the
art in technical writing. No, I'm not talking about text generation
models like Claude, Gemini, LLaMa, GPT, etc. The ML technology that
might end up having the biggest impact on technical writing is
embeddings.

Embeddings aren't exactly new, but they have become much more widely
accessible in the last couple of years. What embeddings offer to
technical writers is the ability to discover connections between
texts at previously impossible scales.

Building intuition about embeddings#

Here's an overview of how you use embeddings and how they work. It's
geared towards technical writers who are learning about embeddings
for the first time.

Input and output#

Someone asks you to "make some embeddings". What do you input? You
input text.^1 You don't need to provide the same amount of text every
time. E.g. sometimes your input is a single paragraph while at other
times it's a few sections, an entire document, or even multiple
documents.

What do you get back? If you provide a single word as the input, the
output will be an array of numbers like this:

[-0.02387, -0.0353, 0.0456]

Now suppose your input is an entire set of documents. The output
turns into this:

[0.0451, -0.0154, 0.0020]

One input was drastically smaller than the other, yet they both
produced an array of 3 numbers. Curiouser and curiouser. (When you
work with real embeddings, the arrays will have hundreds or thousands
of numbers, not 3. More on that later.)

Here's the first key insight. Because we always get back the same
number of values no matter how big or small the input text, we now
have a way to mathematically compare any two pieces of arbitrary text
to each other.

But what do those numbers MEAN?

^1 Some embedding models are "multimodal", meaning you can also
provide images, videos, and audio as input. This post focuses on text
since that's the medium that we work with the most as technical
writers. Haven't seen a multimodal model support taste, touch, or
smell yet!

First, how to literally make the embeddings#

The big service providers have made it easy. Here's how it's done
with Gemini:

import google.generativeai as gemini


gemini.configure(api_key='...')

text = 'Hello, world!'
response = gemini.embed_content(
    model='models/text-embedding-004',
    content=text,
    task_type='SEMANTIC_SIMILARITY'
)
embedding = response['embedding']

The size of the array depends on what model you're using. Gemini's
text-embedding-004 model returns an array of 768 numbers whereas
Voyage AI's voyage-3 model returns an array of 1024 numbers. This is
one of the reasons why you can't use embeddings from different
providers interchangeably. (The main reason is that the numbers from
one model mean something completely different than the numbers from
another model.)
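
Here's a minimal sketch of the array-size problem. The vectors are made-up stand-ins with the sizes mentioned above, and check_comparable is a hypothetical helper, not part of any provider's SDK.

```python
def check_comparable(a, b):
    """Raise ValueError if two embeddings have different lengths."""
    if len(a) != len(b):
        raise ValueError(
            f'Cannot compare embeddings of length {len(a)} and {len(b)}'
        )


# Made-up stand-ins for embeddings from two different models.
from_model_a = [0.1] * 768
from_model_b = [0.1] * 1024

try:
    check_comparable(from_model_a, from_model_b)
    comparable = True
except ValueError as error:
    comparable = False
    print(error)
```

Matching lengths is necessary but not sufficient. Two models that happen to return arrays of the same size still assign different meanings to each number.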

Does it cost a lot of money?#

No.

Is it terrible for the environment?#

I don't know. After the model has been created (trained), I'm pretty
sure that generating embeddings is much less computationally
intensive than generating text. But it also seems to be the case that
embedding models are trained in ways similar to text generation
models^2, with all the energy usage that implies. I'll update this
section when I find out more.

^2 From You Should Probably Pay Attention to Tokenizers: "Embeddings
are byproduct of transformer training and are actually trained on the
heaps of tokenized texts. It gets better: embeddings are what is
actually fed as the input to LLMs when we ask it to generate text."

What model is best?#

Ideally, your embedding model can accept a huge amount of input text,
so that you can generate embeddings for complete pages. If you try to
provide more input than a model can handle, you usually get an error.
As of October 2024, voyage-3 seems to be the clear winner in terms of
input size^3:

Organization Model name             Input limit (tokens)

Voyage AI    voyage-3               32000

Nomic        Embed                  8192

OpenAI       text-embedding-3-large 8191^4

Mistral      Embed                  8000

Google       text-embedding-004     2048

Cohere       embed-english-v3.0     512

For my particular use cases as a technical writer, large input size
is an important factor. However, your use cases may not need large
input size, or there may be other factors that are more important.
See the Massive Text Embedding Benchmark (MTEB) leaderboard.

^3 These input limits are based on tokens, and each service
calculates tokens differently, so don't put too much weight into
these exact numbers. E.g. a token for one model may be approximately
3 characters, whereas for another one it may be approximately 4
characters.

^4 Previously, I incorrectly listed this model's input limit as 3072.
Sorry for the mistake.
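
Here's a rough sketch of checking whether a page is likely to fit within a model's input limit. The limits come from the table above. The 4-characters-per-token figure is only the ballpark from footnote 3, and estimate_tokens and probably_fits are hypothetical helpers. A real check would use each provider's own tokenizer.

```python
# Input limits (in tokens) copied from the table above.
INPUT_LIMITS = {
    'voyage-3': 32000,
    'text-embedding-3-large': 8191,
    'text-embedding-004': 2048,
    'embed-english-v3.0': 512,
}


def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers will give different counts."""
    return -(-len(text) // chars_per_token)  # ceiling division


def probably_fits(text, model, chars_per_token=4):
    """Compare the rough estimate against the model's input limit."""
    return estimate_tokens(text, chars_per_token) <= INPUT_LIMITS[model]


page = 'word ' * 2000  # 10,000 characters of stand-in page text
for model in INPUT_LIMITS:
    print(model, probably_fits(page, model))
```

Because the estimate is so rough, treat a result near the limit as "maybe" rather than "yes".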

Very weird multi-dimensional space#

Back to the big mystery. What the hell do these numbers MEAN?!?!?!

Let's begin by thinking about coordinates on a map. Suppose I give
you three points and their coordinates:

Point X-Coordinate Y-Coordinate

A     3            2

B     1            1

C     -2           -2

There are 2 dimensions to this map: the X-Coordinate and the
Y-Coordinate. Each point lives at the intersection of an X-Coordinate
and a Y-Coordinate.

Is A closer to B or C?

[Figure: points A, B, and C plotted on the map]

A is much closer to B.
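
You can also check this with arithmetic instead of eyeballing the map. This sketch uses straight-line (Euclidean) distance, which is one common way to measure distance and my choice for this example.

```python
import math

# The three points from the table above.
points = {'A': (3, 2), 'B': (1, 1), 'C': (-2, -2)}

a_to_b = math.dist(points['A'], points['B'])
a_to_c = math.dist(points['A'], points['C'])

print(f'A to B: {a_to_b:.2f}')  # A to B: 2.24
print(f'A to C: {a_to_c:.2f}')  # A to C: 6.40
```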

Here's the mental leap. Embeddings are similar to points on a map.
Each number in the embedding array is a dimension, similar to the
X-Coordinates and Y-Coordinates from earlier. When an embedding model
sends you back an array of 1000 numbers, it's telling you the point
where that text semantically lives in its 1000-dimension space,
relative to all other texts. When we compare the distance between two
embeddings in this 1000-dimension space, what we're really doing is
figuring out how semantically close or far apart those two texts are
from each other.

The concept of positioning items in a multi-dimensional space like
this, where related items are clustered near each other, goes by the
wonderful name of latent space.

The most famous example of the weird utility of this technology comes
from the Word2vec paper, the foundational research that kickstarted
interest in embeddings in 2013. In the paper they shared this
anecdote:

embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")

Starting with the embedding for king, subtract the embedding for man,
then add the embedding for woman. When you look around this vicinity
of the latent space, you find the embedding for queen nearby. In
other words, embeddings can represent semantic relationships in ways
that feel intuitive to us humans. If you asked a human "what's the
female equivalent of a king?" that human would probably answer
"queen", the same answer we get from embeddings. For more explanation
of the underlying theories, see Distributional semantics.
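
Here's a toy version of that arithmetic. The vectors are hand-made for illustration and are not output from Word2vec or any other model. I gave each dimension an explicit meaning (royalty, gender) so that the math is easy to follow, which real embeddings don't do.

```python
import math

# Hand-made toy vectors: [royalty, gender]. Real embeddings are learned,
# have hundreds of dimensions, and those dimensions don't carry labels.
toy = {
    'king': [0.9, 0.9],
    'man': [0.1, 0.9],
    'woman': [0.1, -0.9],
    'queen': [0.85, -0.95],
    'prince': [0.7, 0.8],
    'apple': [0.0, 0.1],
}

# king - man + woman, computed one dimension at a time.
result = [k - m + w for k, m, w in zip(toy['king'], toy['man'], toy['woman'])]

# Look for the closest word, skipping the three words used in the arithmetic.
candidates = {w: v for w, v in toy.items() if w not in ('king', 'man', 'woman')}
nearest = min(candidates, key=lambda w: math.dist(result, candidates[w]))
print(nearest)  # queen
```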

The 2D map analogy was a nice stepping stone for building intuition
but now we need to cast it aside, because embeddings operate in
hundreds or thousands of dimensions. It's impossible for us lowly
3-dimensional creatures to visualize what "distance" looks like in
1000 dimensions. Also, we don't know what each dimension represents,
hence the section heading "Very weird multi-dimensional space".^5 One
dimension might represent something close to color. The king - man +
woman ≈ queen anecdote suggests that these models contain a dimension
with some notion of gender. And so on. Well Dude, we just don't know.

The mechanics of converting text into very weird multi-dimensional
space are complex, as you might imagine. They are teaching machines
to LEARN, after all. The Illustrated Word2vec is a good way to start
your journey down that rabbithole.

^5 I borrowed this phrase from Embeddings: What they are and why they
matter.

Comparing embeddings#

After you've generated your embeddings, you'll need some kind of
"database" to keep track of what text each embedding is associated
with. In the experiment discussed later, I got by with just a local
JSON file:

{
    "authors": {
        "embedding": [...]
    },
    "changes/0.1": {
        "embedding": [...]
    },
    ...
}

authors is the name of a page. embedding is the embedding for that
page.

Comparing embeddings involves a lot of linear algebra. I learned the
basics from Linear Algebra for Machine Learning and Data Science. The
big math and ML libraries like NumPy and scikit-learn can do the
heavy lifting for you (i.e. very little math code on your end).
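
To show how little math is involved, here's a plain-Python sketch of cosine similarity, the comparison that the code in the appendix gets from scikit-learn. The three vectors are made-up stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-number vectors standing in for real embeddings.
page_a = [0.2, 0.1, 0.7]
page_b = [0.25, 0.05, 0.65]
page_c = [-0.6, 0.7, -0.1]

print(cosine_similarity(page_a, page_b))  # close to 1: similar direction
print(cosine_similarity(page_a, page_c))  # below 0: pointing away
```

A score near 1 means the two vectors point in nearly the same direction. The library versions do the same calculation, just much faster and across many vectors at once.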

Applications#

I could tell you exactly how I think we might advance the state of
the art in technical writing with embeddings, but where's the fun in
that? You now know why they're such an interesting and useful new
tool in the technical writer toolbox... go connect the rest of the dots
yourself!

Let's cover a basic example to put the intuition-building ideas into
practice and then wrap up this post.

Related pages#

Some docs sites have a recommendation system that makes you aware of
other relevant docs. The system looks at whatever page you're
currently on, finds other pages related to this one, and then
recommends other pages to visit. Embeddings provide a new way to
support this feature, probably at a fraction of the cost of previous
methods. Here's how it works:

 1. Generate an embedding for each page on your docs site.

 2. For each page, compare its embedding against all other page
    embeddings. If the two embeddings are mathematically similar,
    then the contents on the two pages are probably related to each
    other.

This can be done as a batch operation. A page's embedding only needs
to change when the page's content changes.
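
Here's one way the "only when the content changes" idea could be sketched: store a hash of each page's text next to its embedding and skip pages whose hash hasn't changed. This is not what I did in the experiment. The cache layout and the update_embeddings helper are assumptions, and fake_embed stands in for a real embedding API call so that the sketch runs offline.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


def update_embeddings(pages, cache, embed):
    """Re-embed only the pages whose text changed since the last run.

    pages: {docname: text}
    cache: {docname: {'hash': str, 'embedding': list}} (mutated in place)
    embed: any function that maps text to an embedding
    Returns the docnames that were re-embedded.
    """
    changed = []
    for docname, text in pages.items():
        digest = content_hash(text)
        entry = cache.get(docname)
        if entry is None or entry['hash'] != digest:
            cache[docname] = {'hash': digest, 'embedding': embed(text)}
            changed.append(docname)
    return changed


# Stand-in for a real embedding API call, so the sketch runs offline.
def fake_embed(text):
    return [float(len(text)), float(text.count(' '))]


cache = {}
pages = {'authors': 'List of authors', 'faq': 'Common questions'}
print(update_embeddings(pages, cache, fake_embed))  # ['authors', 'faq']
pages['faq'] = 'Common questions and answers'
print(update_embeddings(pages, cache, fake_embed))  # ['faq']
```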

I ran this experiment on the Sphinx docs. The results were pretty
good. Implementation and Results have the details.

See Related content using embeddings for another example of this
approach.

Let a thousand embeddings bloom?#

As docs site owners, I wonder if we should start freely providing
embeddings for our content to anyone who wants them, via REST APIs or
well-known URIs. Who knows what kinds of cool stuff our communities
can build with this extra type of data about our docs?

Parting words#

Three years ago, if you had asked me what 768-dimensional space is, I
would have told you that it's just some abstract concept that
physicists and mathematicians need for unfathomable reasons, probably
something related to string theory. Embeddings gave me a reason to
think about this idea more deeply, and actually apply it to my own
work. I think that's pretty cool.

Order-of-magnitude improvements in our ability to maintain our docs
may very well still be possible after all... perhaps we just need an
order-of-magnitude-more dimensions!!

Appendix#

Implementation#

I created a Sphinx extension to generate an embedding for each doc.
Sphinx automatically invokes this extension as it builds the docs.

import json
import os


import voyageai


VOYAGE_API_KEY = os.getenv('VOYAGE_API_KEY')
voyage = voyageai.Client(api_key=VOYAGE_API_KEY)


def on_build_finished(app, exception):
    with open(srcpath, 'w') as f:
        json.dump(data, f, indent=4)


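def embed_with_voyage(text):
    try:
        embedding = voyage.embed([text], model='voyage-3', input_type='document').embeddings[0]
        return embedding
    except Exception as e:
        # Report the failure instead of silently storing a null embedding
        print(f'Failed to generate embedding: {e}')
        return None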


def on_doctree_resolved(app, doctree, docname):
    text = doctree.astext()
    embedding = embed_with_voyage(text)  # Generate an embedding for each document!
    data[docname] = {
        'embedding': embedding
    }


# Use some globals because this is just an experiment and you can't stop me
def init_globals(srcdir):
    global filename
    global srcpath
    global data
    filename = 'embeddings.json'
    srcpath = f'{srcdir}/{filename}'
    data = {}


def setup(app):
    init_globals(app.srcdir)
    # https://www.sphinx-doc.org/en/master/extdev/appapi.html#sphinx-core-events
    app.connect('doctree-resolved', on_doctree_resolved)  # This event fires on every doc that's processed
    app.connect('build-finished', on_build_finished)
    return {
        'version': '0.0.1',
        'parallel_read_safe': True,
        'parallel_write_safe': True,
    }

When the build finishes, the embeddings data is stored in 
embeddings.json like this:

{
    "authors": {
        "embedding": [...]
    },
    "changes/0.1": {
        "embedding": [...]
    },
    ...
}

authors and changes/0.1 are docs. embedding contains the embedding
for that doc.

The last step is to find the closest neighbor for each doc, i.e. the
other page that is considered most relevant to the page you're
currently on. As mentioned earlier, Linear Algebra for Machine
Learning and Data Science was the class that taught me the basics.

import json


import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def find_docname(data, target):
    for docname in data:
        if data[docname]['embedding'] == target:
            return docname
    return None


# Adapted from the Voyage AI docs
# https://web.archive.org/web/20240923001107/https://docs.voyageai.com/docs/quickstart-tutorial
def k_nearest_neighbors(target, embeddings, k=5):
    # Convert to numpy array
    target = np.array(target)
    embeddings = np.array(embeddings)
    # Reshape the query vector embedding to a matrix of shape (1, n) to make it
    # compatible with cosine_similarity
    target = target.reshape(1, -1)
    # Calculate the similarity for each item in data
    cosine_sim = cosine_similarity(target, embeddings)
    # Sort the data by similarity in descending order and take the top k items
    sorted_indices = np.argsort(cosine_sim[0])[::-1]
    # Take the top k related embeddings
    top_k_related_embeddings = embeddings[sorted_indices[:k]]
    top_k_related_embeddings = [
        list(row[:]) for row in top_k_related_embeddings
    ]  # convert to list
    return top_k_related_embeddings


with open('doc/embeddings.json', 'r') as f:
    data = json.load(f)
embeddings = [data[docname]['embedding'] for docname in data]
print('.. csv-table::')
print('   :header: "Target", "Neighbor"')
print()
for target in embeddings:
    neighbors = k_nearest_neighbors(target, embeddings, k=3)
    # ignore neighbors[0] because that is always the target itself
    nearest_neighbor = neighbors[1]
    target_docname = find_docname(data, target)
    target_cell = f'`{target_docname} <https://www.sphinx-doc.org/en/master/{target_docname}.html>`_'
    neighbor_docname = find_docname(data, nearest_neighbor)
    neighbor_cell = f'`{neighbor_docname} <https://www.sphinx-doc.org/en/master/{neighbor_docname}.html>`_'
    print(f'   "{target_cell}", "{neighbor_cell}"')

As you may have noticed, I did not actually implement the
recommendation UI in this experiment. My main goal was to get basic
data on whether the embeddings approach generates decent
recommendations or not.
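
If I were to revisit the script, one simplification would be to work with docnames directly instead of looking them up from embedding values afterwards. Here's a plain-Python sketch of that idea. It's a variation, not the code that produced the results below. The docnames are real Sphinx pages but the 3-number embeddings are made up, and the sketch assumes every doc has a non-null embedding.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest_neighbors(data):
    """Map each docname to the docname of its most similar other doc."""
    result = {}
    for target, entry in data.items():
        # Exclude the target so a doc is never recommended to itself.
        others = (name for name in data if name != target)
        result[target] = max(
            others,
            key=lambda name: cosine(entry['embedding'], data[name]['embedding']),
        )
    return result


# Tiny made-up dataset in the same shape as embeddings.json.
data = {
    'usage/domains/c': {'embedding': [0.9, 0.1, 0.0]},
    'usage/domains/cpp': {'embedding': [0.8, 0.2, 0.1]},
    'faq': {'embedding': [0.0, 0.3, 0.9]},
}
print(nearest_neighbors(data))
```

Comparing every doc against every other doc is fine for a site with a few hundred pages. A much larger site would probably want the vectorized NumPy or scikit-learn approach shown above.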

Results#

How to interpret the data: Target would be the page that you're
currently on. Neighbor would be the recommended page.

Target                              Neighbor

authors                             changes/0.6

changes/0.1                         changes/0.5

changes/0.2                         changes/1.2

changes/0.3                         changes/0.4

changes/0.4                         changes/1.2

changes/0.5                         changes/0.6

changes/0.6                         changes/1.6

changes/1.0                         changes/1.3

changes/1.1                         changes/1.2

changes/1.2                         changes/1.1

changes/1.3                         changes/1.4

changes/1.4                         changes/1.3

changes/1.5                         changes/1.6

changes/1.6                         changes/1.5

changes/1.7                         changes/1.8

changes/1.8                         changes/1.6

changes/2.0                         changes/1.8

changes/2.1                         changes/1.2

changes/2.2                         changes/1.2

changes/2.3                         changes/2.1

changes/2.4                         changes/3.5

changes/3.0                         changes/4.3

changes/3.1                         changes/3.3

changes/3.2                         changes/3.0

changes/3.3                         changes/3.1

changes/3.4                         changes/4.3

changes/3.5                         changes/1.3

changes/4.0                         changes/3.0

changes/4.1                         changes/4.4

changes/4.2                         changes/4.4

changes/4.3                         changes/3.0

changes/4.4                         changes/7.4

changes/4.5                         changes/4.4

changes/5.0                         changes/3.5

changes/5.1                         changes/5.0

changes/5.2                         changes/3.5

changes/5.3                         changes/5.2

changes/6.0                         changes/6.2

changes/6.1                         changes/6.2

changes/6.2                         changes/6.1

changes/7.0                         extdev/deprecated

changes/7.1                         changes/7.2

changes/7.2                         changes/7.4

changes/7.3                         changes/7.4

changes/7.4                         changes/7.3

changes/8.0                         changes/8.1

changes/8.1                         changes/1.8

changes/index                       changes/8.0

development/howtos/builders         usage/extensions/index

development/howtos/index            development/tutorials/index

development/howtos/setup_extension  usage/extensions/index

development/html_themes/index       usage/theming

development/html_themes/templating  development/html_themes/index

development/index                   usage/index

development/tutorials/adding_domain extdev/domainapi

development/tutorials/autodoc_ext   usage/extensions/autodoc

development/tutorials/examples/README  tutorial/end

development/tutorials/extending_build  usage/extensions/todo

development/tutorials/extending_syntax  extdev/markupapi

development/tutorials/index         development/howtos/index

examples                            index

extdev/appapi                       extdev/index

extdev/builderapi                   usage/builders/index

extdev/collectorapi                 extdev/envapi

extdev/deprecated                   changes/1.8

extdev/domainapi                    usage/domains/index

extdev/envapi                       extdev/collectorapi

extdev/event_callbacks              extdev/appapi

extdev/i18n                         usage/advanced/intl

extdev/index                        extdev/appapi

extdev/logging                      extdev/appapi

extdev/markupapi                    development/tutorials/extending_syntax

extdev/nodes                        extdev/domainapi

extdev/parserapi                    extdev/appapi

extdev/projectapi                   extdev/envapi

extdev/testing                      internals/contributing

extdev/utils                        extdev/appapi

faq                                 usage/configuration

glossary                            usage/quickstart

index                               usage/quickstart

internals/code-of-conduct           internals/index

internals/contributing              usage/advanced/intl

internals/index                     usage/index

internals/organization              internals/contributing

internals/release-process           extdev/deprecated

latex                               usage/configuration

man/index                           usage/index

man/sphinx-apidoc                   man/sphinx-autogen

man/sphinx-autogen                  usage/extensions/autosummary

man/sphinx-build                    usage/configuration

man/sphinx-quickstart               tutorial/getting-started

support                             tutorial/end

tutorial/automatic-doc-generation   usage/extensions/autosummary

tutorial/deploying                  tutorial/first-steps

tutorial/describing-code            usage/domains/index

tutorial/end                        usage/index

tutorial/first-steps                tutorial/getting-started

tutorial/getting-started            tutorial/index

tutorial/index                      tutorial/getting-started

tutorial/more-sphinx-customization  usage/theming

tutorial/narrative-documentation    usage/quickstart

usage/advanced/intl                 internals/contributing

usage/advanced/websupport/api       usage/advanced/websupport/quickstart

usage/advanced/websupport/index     usage/advanced/websupport/quickstart

usage/advanced/websupport/quickstart  usage/advanced/websupport/api

usage/advanced/websupport/searchadapters  usage/advanced/websupport/api

usage/advanced/websupport/storagebackends  usage/advanced/websupport/api

usage/builders/index                usage/configuration

usage/configuration                 changes/1.2

usage/domains/c                     usage/domains/cpp

usage/domains/cpp                   usage/domains/c

usage/domains/index                 extdev/domainapi

usage/domains/javascript            usage/domains/python

usage/domains/mathematics           usage/referencing

usage/domains/python                extdev/domainapi

usage/domains/restructuredtext      extdev/markupapi

usage/domains/standard              usage/domains/index

usage/extensions/autodoc            tutorial/automatic-doc-generation

usage/extensions/autosectionlabel   usage/quickstart

usage/extensions/autosummary        tutorial/automatic-doc-generation

usage/extensions/coverage           usage/extensions/autodoc

usage/extensions/doctest            tutorial/describing-code

usage/extensions/duration           tutorial/more-sphinx-customization

usage/extensions/example_google     usage/extensions/example_numpy

usage/extensions/example_numpy      usage/extensions/example_google

usage/extensions/extlinks           usage/extensions/intersphinx

usage/extensions/githubpages        tutorial/deploying

usage/extensions/graphviz           usage/extensions/math

usage/extensions/ifconfig           usage/extensions/doctest

usage/extensions/imgconverter       usage/extensions/math

usage/extensions/index              development/index

usage/extensions/inheritance        usage/extensions/graphviz

usage/extensions/intersphinx        usage/quickstart

usage/extensions/linkcode           usage/extensions/viewcode

usage/extensions/math               usage/configuration

usage/extensions/napoleon           usage/extensions/example_google

usage/extensions/todo               development/tutorials/extending_build

usage/extensions/viewcode           usage/extensions/linkcode

usage/index                         tutorial/end

usage/installation                  tutorial/getting-started

usage/markdown                      extdev/parserapi

usage/quickstart                    index

usage/referencing                   usage/restructuredtext/roles

usage/restructuredtext/basics       usage/restructuredtext/directives

usage/restructuredtext/directives   usage/restructuredtext/basics

usage/restructuredtext/domains      usage/domains/index

usage/restructuredtext/field-lists  usage/restructuredtext/directives

usage/restructuredtext/index        usage/restructuredtext/basics

usage/restructuredtext/roles        usage/referencing

usage/theming                       development/html_themes/index

(c) 2024 Kayce Basques