Indexing all of Wikipedia, on a laptop

Jonathan Ellis, May 29, 2024

In November, Cohere released a dataset containing all of Wikipedia, chunked and embedded to vectors with their multilingual-v3 model. Computing this many embeddings yourself would cost in the neighborhood of $5,000, so the public release of this dataset makes creating a semantic, vector-based index of Wikipedia practical for an individual for the first time.

Here's what we're building:

[Screenshot: the completed Wikipedia search demo]

You can try searching the completed index on a public demo instance here.

Why this is hard

Sure, the dataset is big (180GB for the English corpus), but that's not the obstacle per se; we've been able to build full-text indexes on larger datasets for a long time. The obstacle is that, until now, off-the-shelf vector databases could not index a dataset larger than memory, because both the full-resolution vectors and the index (edge lists) had to be kept in memory during index construction.

Larger datasets can be split into segments, but then at query time each segment must be searched separately and the results combined, turning an O(log N) search per segment into O(N) work overall. (In their latest release, Lucene attempts to mitigate this by processing segments in parallel with multiple threads, but obviously (1) this only gives you a constant factor of improvement before you run out of CPU cores, and (2) it does not improve throughput.)

Specifically, if you're indexing 1536-dimension vectors (the size of ada002 or openai-v3-small), you can fit about 5M vectors and their associated edge lists in a 32GB index-construction RAM budget.
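Here's a rough back-of-envelope version of that estimate. The graph degree of 32 below is an assumption (a typical DiskANN-style default, not a figure from this project), and JVM per-object overhead is ignored, so treat it as an order-of-magnitude check:

```java
// Back-of-envelope check of the "about 5M vectors in 32GB" estimate.
// Assumes a graph degree of 32 (an assumption, not a figure from this project)
// and ignores JVM object overhead.
public class MemoryBudget {
    public static void main(String[] args) {
        long dims = 1536;                              // ada002 / openai-v3-small dimensionality
        long vectorBytes = dims * Float.BYTES;         // 6,144 bytes per full-resolution vector
        long edgeListBytes = 32L * Integer.BYTES;      // ~128 bytes of neighbor ids per node
        long perVector = vectorBytes + edgeListBytes;  // ~6.3KB per indexed vector
        long budgetBytes = 32L * 1024 * 1024 * 1024;   // 32GB construction budget
        System.out.printf("~%,d vectors fit in the budget%n", budgetBytes / perVector);
        // prints roughly 5.5 million; real per-entry overhead pushes this down toward ~5M
    }
}
```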
JVector, the library that powers DataStax Astra vector search, now supports indexing larger-than-memory datasets by performing construction-related searches with compressed vectors. The edge lists still need to fit in memory, but the uncompressed vectors do not, which gives us enough headroom to index Wikipedia-en on a laptop.

Requirements

1. Linux or macOS. It will not work on Windows, because ChronicleMap, which we are going to use for the non-vector data, is limited to a 4GB size there. (If you are interested enough, you could shard the Map by vector id to keep each shard under 4GB and still have O(1) lookup times.)
2. About 180GB of free space for the dataset, and another 90GB for the completed index.
3. Enough RAM to run a JVM with 36GB of heap space during construction (~28GB for the index, 8GB for GC headroom).
4. Disable swap before building the index. Linux will aggressively cache the index being constructed, to the point of swapping out parts of the JVM heap, which is obviously counterproductive. In my test, building with swap enabled was almost twice as slow as with it off.

Building and searching the index

1. Check out the project:

   $ git clone https://github.com/jbellis/coherepedia-jvector
   $ cd coherepedia-jvector

2. Edit config.properties to set the locations for the dataset and the index.
3. Run pip install datasets. (Setting up a venv or conda environment first is recommended but not strictly necessary.)
4. Run python download.py. This downloads the 180GB dataset to the location you configured; for me that took about half an hour.
5. Run ./mvnw compile exec:exec@buildindex. This took about five and a half hours on my machine (with an i9-12900 CPU).
6. Run ./mvnw compile exec:exec@serve and open a browser to http://localhost:4567. Search away!

How it works

We're using JVector for the vector index and Chronicle Map for the article data. There are several things I don't love about Chronicle Map, but nothing else touches it for simple disk-based key/value performance. The full source of the index construction class is here; I'll explain it next in pieces.

Compression parameters

JVector is based on the DiskANN vector index design, which performs an initial search using vectors compressed lossily with product quantization (PQ) in memory, then reranks the results using higher-resolution vectors from disk. However, while DiskANN stores full, uncompressed vectors for reranking, JVector improves on that by using Locally-Adaptive Vector Quantization (LVQ) compression.

To set this up, we first load some vectors into a RandomAccessVectorValues (RAVV) instance. RAVV is a JVector interface for a vector container; it can be List- or Map-based, in-memory or on-disk. In this case we use a simple List-backed RAVV. We compute the parameters for both compressions (k-means clustering for PQ, the global mean for LVQ) from a single shard of the dataset; at about 110k rows, this is enough data to give a statistically valid sample.

Next, we compute the PQ compression codebook. We compress the vectors by a factor of 64, because the Cohere v3 embeddings can be PQ-compressed that much without losing accuracy after reranking. (Binary Quantization only gives us 32x compression and is less accurate.)

Finally, we set up LVQ. LVQ gives us 4x compression while losing no measurable accuracy compared to the full uncompressed vectors, resulting in both a smaller footprint on disk and faster searches. (I thank the vector search team at Intel Research for pointing this out to us.)

GraphIndexBuilder

Next, we instantiate and configure our GraphIndexBuilder. This creates a JVector GraphIndexBuilder, connects it to an OnDiskGraphIndexWriter, and tells it (via the BuildScoreProvider) to use the PQ-compressed vectors list during construction; that list starts empty and grows as we add vectors to the index.

Chronicle Map and RowData

We store article contents in RowData records. This content is what was encoded as the corresponding vector in the dataset, and it is what we want to return to the user in our search results.

To turn the vector index's search results (a list of integer vector ids) into RowData, we store the RowData in a Map keyed by vector id. This will be a lot of data, so we use ChronicleMap to keep it on disk with a minimal in-memory footprint.

We need to tell ChronicleMap how large it's going to be, in both entry count and entry size. Undersizing these will cause it to crash (my primary complaint about ChronicleMap), so we deliberately use a high estimate. We do not need to explicitly tell ChronicleMap how to read and write RowData objects; we just have RowData implement Serializable. While ChronicleMap supports custom de/serialization code, it's perfectly happy to use plain out-of-the-box serialization, and since profiling shows that's not a bottleneck for us, we'll leave it at that.
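To make that concrete, here is a minimal sketch of what such a ChronicleMap setup can look like. The entry count, average value size, map name, and RowData fields below are illustrative placeholders, not the values used in coherepedia-jvector:

```java
// Illustrative sketch only -- not the actual coherepedia-jvector code.
// Entry count, average value size, map name, and RowData fields are guesses.
import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;
import java.io.IOException;
import java.io.Serializable;

public class ContentMapSetup {
    // Simplified stand-in for the article-chunk record returned to the user
    public record RowData(String url, String title, String text) implements Serializable {}

    public static ChronicleMap<Integer, RowData> open(File file) throws IOException {
        return ChronicleMap.of(Integer.class, RowData.class)
                .name("coherepedia-content")
                .entries(50_000_000L)      // deliberate overestimate: undersizing makes ChronicleMap fail
                .averageValueSize(2_048)   // rough guess at the serialized size of one chunk
                .createPersistedTo(file);  // backed by disk, minimal heap footprint
    }
}
```

During ingestion each chunk goes in with contentMap.put(vectorId, rowData), and at query time the top results are turned back into article chunks with contentMap.get(vectorId).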
Ingesting the data

We use Java's parallel Streams to process the shards in parallel. For each row in each shard, we

1. Add it to pqVectorsList.
2. Call writer.writeInline to add the LVQ-compressed vector to disk.
3. Call builder.addGraphNode. Order is important here, because both (1) and (2) are used when we call addGraphNode.
4. Call contentMap.put with the article chunk data.

You can look at the full source if you're curious about forEachRow; it's just standard "pull data out of Arrow" stuff.

When the build completes, you should see files like this:

$ ls -lh ~/coherepedia
-rw-rw-r-- 1 jonathan jonathan  48G May 20 15:53 coherepedia.ann
-rw-rw-r-- 1 jonathan jonathan  36G May 20 18:05 coherepedia.map
-rw-rw-r-- 1 jonathan jonathan 2.5G May 20 15:53 coherepedia.pqv
-rw-rw-r-- 1 jonathan jonathan 4.1K May 17 23:04 coherepedia.lvq
-rw-rw-r-- 1 jonathan jonathan 1.1M May 17 23:04 coherepedia.pq

These are, respectively:

* ANN: the vector index, containing the edge lists and the LVQ-compressed vectors used for reranking.
* MAP: the map of article data, keyed by vector id.
* PQV: the PQ-compressed vectors, read into memory and used for the approximate search pass.
* LVQ: the LVQ global mean, used during construction.
* PQ: the PQ codebooks, used during construction.

Loading the index (after construction)

The code for serving queries is found in the WebSearch class. We're using Spark (the web framework, not the big data engine) to serve a simple search form.

Construction needed a relatively large heap to keep the edge lists in memory. With that complete, we only need enough heap to keep the PQ-compressed vectors in memory; exec@serve is configured to use a 4GB heap. WebSearch (the class behind exec@serve) first has a static initializer to load the PQ vectors and open the ChronicleMap. We also create a reusable GraphSearcher instance.

Performing a search

Executing a search and turning it into RowData for the user takes four "paragraphs" of code (sketched below):

1. The call to getVectorEmbedding, which calls Cohere's API to turn the search query (a String) into a vector embedding.
2. Creating the approximate and reranking score functions. Approximate scoring is done with the product-quantized vectors, and reranking with the LVQ vectors in the index. Since the LVQ vectors are encapsulated in the index itself, we never need to explicitly deal with LVQ decoding; the index does it for us.
3. The call to searcher.search that actually runs the query.
4. Retrieving the RowData associated with the top vector neighbors using contentMap.
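Here is a condensed sketch of how those four pieces fit together. This is not the actual WebSearch code: getVectorEmbedding is left abstract, the static fields stand in for what WebSearch initializes at startup, the COSINE similarity function is assumed, and the JVector score-provider and reranker calls are written against the 3.0 API as I understand it, so exact package and method names may differ:

```java
// A sketch of the query path, not the actual WebSearch code.
// JVector calls reflect one reading of the 3.0 API and may not match exactly.
import io.github.jbellis.jvector.graph.GraphSearcher;
import io.github.jbellis.jvector.graph.SearchResult;
import io.github.jbellis.jvector.graph.disk.OnDiskGraphIndex;
import io.github.jbellis.jvector.graph.similarity.SearchScoreProvider;
import io.github.jbellis.jvector.pq.PQVectors;
import io.github.jbellis.jvector.util.Bits;
import io.github.jbellis.jvector.vector.VectorSimilarityFunction;
import io.github.jbellis.jvector.vector.types.VectorFloat;

import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class SearchSketch {
    // Initialized at startup, as WebSearch does in its static initializer
    static PQVectors pqVectors;                // PQ-compressed vectors held in memory
    static OnDiskGraphIndex.View view;         // view over the on-disk index (the LVQ vectors live here)
    static GraphSearcher searcher;             // reusable searcher
    static Map<Integer, RowData> contentMap;   // the ChronicleMap of article chunks

    record RowData(String url, String title, String text) {}  // same stand-in record as the sketch above

    static List<RowData> search(String query, int topK) {
        // 1. Turn the query string into an embedding via Cohere's API (left abstract here)
        VectorFloat<?> q = getVectorEmbedding(query);

        // 2. Approximate scoring with PQ, reranking with the LVQ vectors inside the index
        //    (COSINE is an assumed similarity function, not taken from the article)
        var approximate = pqVectors.precomputedScoreFunctionFor(q, VectorSimilarityFunction.COSINE);
        var reranker = view.rerankerFor(q, VectorSimilarityFunction.COSINE);
        var ssp = new SearchScoreProvider(approximate, reranker);

        // 3. Run the graph search
        SearchResult result = searcher.search(ssp, topK, Bits.ALL);

        // 4. Map the winning vector ids back to article chunks
        return Arrays.stream(result.getNodes())
                     .map(ns -> contentMap.get(ns.node))
                     .toList();
    }

    static VectorFloat<?> getVectorEmbedding(String query) {
        throw new UnsupportedOperationException("call Cohere's embed endpoint here");
    }
}
```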
That's it! We've indexed all of Wikipedia with high-performance, parallel code in about 150 lines, and created a simple search server in another 100. On my machine, searches (each of which runs in a single thread) take about 50ms. We would expect them to take over twice as long if the index were split across multiple segments, and to lose significant accuracy if searches were performed with compressed vectors alone, without reranking.

Conclusion

Indexing the entirety of English Wikipedia on a laptop has become a practical reality thanks to recent advances in the JVector library that will be part of the imminent 3.0 release. (Star the repo and stand by!) This article demonstrates how to do exactly that using JVector in conjunction with Chronicle Map, and showcases the use of LVQ to reduce index size while preserving accurate reranking.

To combine the power of JVector with rich indexing of non-vector data, rolled into a document API with support for real-time inserts, updates, and deletes, check out the DataStax Astra service. Enjoy hacking with JVector and Astra!