Indexing all of Wikipedia, on a laptop

Jonathan Ellis, May 29, 2024

In November, Cohere released a dataset containing all of Wikipedia, chunked and embedded to vectors with their multilingual-v3 model. Computing this many embeddings yourself would cost in the neighborhood of $5,000, so the public release of this dataset makes creating a semantic, vector-based index of Wikipedia practical for an individual for the first time.

Here's what we're building:

[Screenshot: the completed Wikipedia search demo]

You can try searching the completed index on a public demo instance here.

Why this is hard

Sure, the dataset is big (180GB for the English corpus), but that's not the obstacle per se; we've been able to build full-text indexes on larger datasets for a long time. The obstacle is that, until now, off-the-shelf vector databases could not index a dataset larger than memory, because both the full-resolution vectors and the index (edge lists) had to be kept in memory during index construction.

Larger datasets can be split into segments, but then at query time each segment must be searched separately and the results combined, turning an O(log N) search per segment into O(N) work overall. (In their latest release, Lucene attempts to mitigate this by processing segments in parallel with multiple threads, but obviously (1) this only gives you a constant factor of improvement before you run out of CPU cores, and (2) it does not improve throughput.)

Specifically, if you're indexing 1536-dimension vectors (the size of ada002 or openai-v3-small), you can fit about 5M vectors and their associated edge lists in a 32GB index-construction RAM budget.
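Here's a rough back-of-envelope version of that estimate. The graph degree of 32 below is an assumption (a typical DiskANN-style default, not a figure from this project), and JVM per-object overhead is ignored, so treat it as an order-of-magnitude check:

```java
// Back-of-envelope check of the "about 5M vectors in 32GB" estimate.
// Assumes a graph degree of 32 (an assumption, not a figure from this project)
// and ignores JVM object overhead.
public class MemoryBudget {
    public static void main(String[] args) {
        long dims = 1536;                              // ada002 / openai-v3-small dimensionality
        long vectorBytes = dims * Float.BYTES;         // 6,144 bytes per full-resolution vector
        long edgeListBytes = 32L * Integer.BYTES;      // ~128 bytes of neighbor ids per node
        long perVector = vectorBytes + edgeListBytes;  // ~6.3KB per indexed vector
        long budgetBytes = 32L * 1024 * 1024 * 1024;   // 32GB construction budget
        System.out.printf("~%,d vectors fit in the budget%n", budgetBytes / perVector);
        // prints roughly 5.5 million; real per-entry overhead pushes this down toward ~5M
    }
}
```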
JVector, the library that powers DataStax Astra vector search, now supports indexing larger-than-memory datasets by performing construction-related searches with compressed vectors. The edge lists still need to fit in memory, but the uncompressed vectors do not, which gives us enough headroom to index Wikipedia-en on a laptop.

Requirements

1. Linux or macOS. It will not work on Windows, because ChronicleMap, which we are going to use for the non-vector data, is limited to a 4GB size there. (If you are interested enough, you could shard the Map by vector id to keep each shard under 4GB and still have O(1) lookup times.)
2. About 180GB of free space for the dataset, and another 90GB for the completed index.
3. Enough RAM to run a JVM with 36GB of heap space during construction (~28GB for the index, 8GB for GC headroom).
4. Disable swap before building the index. Linux will aggressively cache the index being constructed, to the point of swapping out parts of the JVM heap, which is obviously counterproductive. In my test, building with swap enabled was almost twice as slow as with it off.

Building and searching the index

1. Check out the project:

   $ git clone https://github.com/jbellis/coherepedia-jvector
   $ cd coherepedia-jvector

2. Edit config.properties to set the locations for the dataset and the index.
3. Run pip install datasets. (Setting up a venv or conda environment first is recommended but not strictly necessary.)
4. Run python download.py. This downloads the 180GB dataset to the location you configured; for me that took about half an hour.
5. Run ./mvnw compile exec:exec@buildindex. This took about five and a half hours on my machine (with an i9-12900 CPU).
6. Run ./mvnw compile exec:exec@serve and open a browser to http://localhost:4567. Search away!

How it works

We're using JVector for the vector index and Chronicle Map for the article data. There are several things I don't love about Chronicle Map, but nothing else touches it for simple disk-based key/value performance. The full source of the index construction class is here; I'll explain it next in pieces.

Compression parameters

JVector is based on the DiskANN vector index design, which performs an initial search using vectors compressed lossily with product quantization (PQ) in memory, then reranks the results using higher-resolution vectors from disk. However, while DiskANN stores full, uncompressed vectors for reranking, JVector improves on that by using Locally-Adaptive Vector Quantization (LVQ) compression.

To set this up, we first load some vectors into a RandomAccessVectorValues (RAVV) instance. RAVV is a JVector interface for a vector container; it can be List- or Map-based, in-memory or on-disk. In this case we use a simple List-backed RAVV. We compute the parameters for both compressions (k-means clustering for PQ, the global mean for LVQ) from a single shard of the dataset; at about 110k rows, this is enough data to give a statistically valid sample.

Next, we compute the PQ compression codebook. We compress the vectors by a factor of 64, because the Cohere v3 embeddings can be PQ-compressed that much without losing accuracy after reranking. (Binary Quantization only gives us 32x compression and is less accurate.)

Finally, we set up LVQ. LVQ gives us 4x compression while losing no measurable accuracy compared to the full uncompressed vectors, resulting in both a smaller footprint on disk and faster searches. (I thank the vector search team at Intel Research for pointing this out to us.)

GraphIndexBuilder

Next, we instantiate and configure our GraphIndexBuilder. This creates a JVector GraphIndexBuilder, connects it to an OnDiskGraphIndexWriter, and tells it (via the BuildScoreProvider) to use the PQ-compressed vectors list during construction; that list starts empty and grows as we add vectors to the index.

Chronicle Map and RowData

We store article contents in RowData records. This content is what was encoded as the corresponding vector in the dataset, and it is what we want to return to the user in our search results.

To turn the vector index's search results (a list of integer vector ids) into RowData, we store the RowData in a Map keyed by vector id. This will be a lot of data, so we use ChronicleMap to keep it on disk with a minimal in-memory footprint.

We need to tell ChronicleMap how large it's going to be, in both entry count and entry size. Undersizing these will cause it to crash (my primary complaint about ChronicleMap), so we deliberately use a high estimate. We do not need to explicitly tell ChronicleMap how to read and write RowData objects; we just have RowData implement Serializable. While ChronicleMap supports custom de/serialization code, it's perfectly happy to use plain out-of-the-box serialization, and since profiling shows that's not a bottleneck for us, we'll leave it at that.
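To make that concrete, here is a minimal sketch of what such a ChronicleMap setup can look like. The entry count, average value size, map name, and RowData fields below are illustrative placeholders, not the values used in coherepedia-jvector:

```java
// Illustrative sketch only -- not the actual coherepedia-jvector code.
// Entry count, average value size, map name, and RowData fields are guesses.
import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;
import java.io.IOException;
import java.io.Serializable;

public class ContentMapSetup {
    // Simplified stand-in for the article-chunk record returned to the user
    public record RowData(String url, String title, String text) implements Serializable {}

    public static ChronicleMap<Integer, RowData> open(File file) throws IOException {
        return ChronicleMap.of(Integer.class, RowData.class)
                .name("coherepedia-content")
                .entries(50_000_000L)      // deliberate overestimate: undersizing makes ChronicleMap fail
                .averageValueSize(2_048)   // rough guess at the serialized size of one chunk
                .createPersistedTo(file);  // backed by disk, minimal heap footprint
    }
}
```

During ingestion each chunk goes in with contentMap.put(vectorId, rowData), and at query time the top results are turned back into article chunks with contentMap.get(vectorId).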
Ingesting the data

We use Java's parallel Streams to process the shards in parallel. For each row in each shard, we

1. Add it to pqVectorsList.
2. Call writer.writeInline to add the LVQ-compressed vector to disk.
3. Call builder.addGraphNode. Order is important here, because both (1) and (2) are used when we call addGraphNode.
4. Call contentMap.put with the article chunk data.

You can look at the full source if you're curious about forEachRow; it's just standard "pull data out of Arrow" stuff.

When the build completes, you should see files like this:

$ ls -lh ~/coherepedia
-rw-rw-r-- 1 jonathan jonathan  48G May 20 15:53 coherepedia.ann
-rw-rw-r-- 1 jonathan jonathan  36G May 20 18:05 coherepedia.map
-rw-rw-r-- 1 jonathan jonathan 2.5G May 20 15:53 coherepedia.pqv
-rw-rw-r-- 1 jonathan jonathan 4.1K May 17 23:04 coherepedia.lvq
-rw-rw-r-- 1 jonathan jonathan 1.1M May 17 23:04 coherepedia.pq

These are, respectively:

* ANN: the vector index, containing the edge lists and the LVQ-compressed vectors used for reranking.
* MAP: the map of article data, keyed by vector id.
* PQV: the PQ-compressed vectors, read into memory and used for the approximate search pass.
* LVQ: the LVQ global mean, used during construction.
* PQ: the PQ codebooks, used during construction.

Loading the index (after construction)

The code for serving queries is found in the WebSearch class. We're using Spark (the web framework, not the big data engine) to serve a simple search form.

Construction needed a relatively large heap to keep the edge lists in memory. With that complete, we only need enough heap to keep the PQ-compressed vectors in memory; exec@serve is configured to use a 4GB heap. WebSearch (the class behind exec@serve) first has a static initializer to load the PQ vectors and open the ChronicleMap. We also create a reusable GraphSearcher instance.

Performing a search

Executing a search and turning it into RowData for the user takes four "paragraphs" of code (sketched below):

1. The call to getVectorEmbedding, which calls Cohere's API to turn the search query (a String) into a vector embedding.
2. Creating the approximate and reranking score functions. Approximate scoring is done with the product-quantized vectors, and reranking with the LVQ vectors in the index. Since the LVQ vectors are encapsulated in the index itself, we never need to explicitly deal with LVQ decoding; the index does it for us.
3. The call to searcher.search that actually runs the query.
4. Retrieving the RowData associated with the top vector neighbors using contentMap.
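Here is a condensed sketch of how those four pieces fit together. This is not the actual WebSearch code: getVectorEmbedding is left abstract, the static fields stand in for what WebSearch initializes at startup, the COSINE similarity function is assumed, and the JVector score-provider and reranker calls are written against the 3.0 API as I understand it, so exact package and method names may differ:

```java
// A sketch of the query path, not the actual WebSearch code.
// JVector calls reflect one reading of the 3.0 API and may not match exactly.
import io.github.jbellis.jvector.graph.GraphSearcher;
import io.github.jbellis.jvector.graph.SearchResult;
import io.github.jbellis.jvector.graph.disk.OnDiskGraphIndex;
import io.github.jbellis.jvector.graph.similarity.SearchScoreProvider;
import io.github.jbellis.jvector.pq.PQVectors;
import io.github.jbellis.jvector.util.Bits;
import io.github.jbellis.jvector.vector.VectorSimilarityFunction;
import io.github.jbellis.jvector.vector.types.VectorFloat;

import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class SearchSketch {
    // Initialized at startup, as WebSearch does in its static initializer
    static PQVectors pqVectors;                // PQ-compressed vectors held in memory
    static OnDiskGraphIndex.View view;         // view over the on-disk index (the LVQ vectors live here)
    static GraphSearcher searcher;             // reusable searcher
    static Map<Integer, RowData> contentMap;   // the ChronicleMap of article chunks

    record RowData(String url, String title, String text) {}  // same stand-in record as the sketch above

    static List<RowData> search(String query, int topK) {
        // 1. Turn the query string into an embedding via Cohere's API (left abstract here)
        VectorFloat<?> q = getVectorEmbedding(query);

        // 2. Approximate scoring with PQ, reranking with the LVQ vectors inside the index
        //    (COSINE is an assumed similarity function, not taken from the article)
        var approximate = pqVectors.precomputedScoreFunctionFor(q, VectorSimilarityFunction.COSINE);
        var reranker = view.rerankerFor(q, VectorSimilarityFunction.COSINE);
        var ssp = new SearchScoreProvider(approximate, reranker);

        // 3. Run the graph search
        SearchResult result = searcher.search(ssp, topK, Bits.ALL);

        // 4. Map the winning vector ids back to article chunks
        return Arrays.stream(result.getNodes())
                     .map(ns -> contentMap.get(ns.node))
                     .toList();
    }

    static VectorFloat<?> getVectorEmbedding(String query) {
        throw new UnsupportedOperationException("call Cohere's embed endpoint here");
    }
}
```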
That's it! We've indexed all of Wikipedia with high-performance, parallel code in about 150 lines, and created a simple search server in another 100. On my machine, searches (each of which runs in a single thread) take about 50ms. We would expect them to take over twice as long if the index were split across multiple segments, and to lose significant accuracy if searches were performed with compressed vectors alone, without reranking.

Conclusion

Indexing the entirety of English Wikipedia on a laptop has become a practical reality thanks to recent advances in the JVector library that will be part of the imminent 3.0 release. (Star the repo and stand by!) This article demonstrates how to do exactly that using JVector in conjunction with Chronicle Map, and showcases the use of LVQ to reduce index size while preserving accurate reranking.

To combine the power of JVector with rich indexing of non-vector data, rolled into a document API with support for real-time inserts, updates, and deletes, check out the DataStax Astra service. Enjoy hacking with JVector and Astra!