[HN Gopher] Vector indexing all of Wikipedia on a laptop
       ___________________________________________________________________
        
       Vector indexing all of Wikipedia on a laptop
        
       Author : tjake
       Score  : 245 points
       Date   : 2024-05-29 17:05 UTC (5 hours ago)
        
 (HTM) web link (foojay.io)
 (TXT) w3m dump (foojay.io)
        
       | gfourfour wrote:
        | Maybe I'm missing something, but I've created vector embeddings
        | for all of English Wikipedia about a dozen times and it costs
        | maybe $10 of compute on Colab, not $5000.
        
         | emmelaich wrote:
         | Got any details?
        
           | gfourfour wrote:
            | Nothing too crazy: just downloading a dump, splitting it into
            | manageable batch sizes, and using a lightweight embedding
            | model to vectorize each article. Using the best GPU available
            | on Colab, it takes maybe 8 hours, if I remember correctly.
            | Vectors can be saved as NPY files and loaded into something
            | like FAISS for fast querying.
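            | 
            | A minimal sketch of that kind of pipeline (the model name and
            | the load_article_batch helper are illustrative assumptions,
            | not the exact setup described above):
            | 
            |     import faiss
            |     import numpy as np
            |     from sentence_transformers import SentenceTransformer
            | 
            |     # lightweight open embedding model (assumed choice)
            |     model = SentenceTransformer("all-MiniLM-L6-v2")
            |     texts = load_article_batch("enwiki_000.jsonl")  # hypothetical
            | 
            |     vecs = model.encode(texts, batch_size=256,
            |                         normalize_embeddings=True).astype("float32")
            |     np.save("enwiki_000.npy", vecs)  # persist per-batch vectors
            | 
            |     # cosine similarity == inner product on normalized vectors
            |     index = faiss.IndexFlatIP(vecs.shape[1])
            |     index.add(vecs)
            |     query = model.encode(["some query"],
            |                          normalize_embeddings=True).astype("float32")
            |     scores, ids = index.search(query, 10)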
        
             | j0hnyl wrote:
             | How big is the resulting vector data?
        
               | gfourfour wrote:
                | Like 8 GB, roughly.
        
             | abetusk wrote:
             | This probably deserves its own article and might be of
             | interest to the HN community.
        
               | gfourfour wrote:
               | I will probably make a post when I launch my app! For now
               | I'm trying to figure out how I can host the whole system
               | for cheap because I don't anticipate generating much
               | revenue
        
             | riku_iki wrote:
             | What is the end task(e.g. RAG, or just vector search for
             | question answering), are you satisfied with results in
             | terms of quality?
        
               | gfourfour wrote:
               | The end result is a recommendation algorithm, so
               | basically just vector similarity search (with a bunch of
               | other logic too ofc). The quality is great, and if
               | anything a little bit of underfitting is desirable to
               | avoid the "we see you bought a toilet seat, here's 50
               | other toilet seats you might like" effect.
        
             | thomasfromcdnjs wrote:
             | Did you chunk the articles? If so, in what way?
        
               | gfourfour wrote:
                | Yes. I split the text into sentences and append sentences
               | to a chunk until the max context window is reached. The
               | context window size is dynamic for each article so that
               | each chunk is roughly the same size. Then I just do a
               | mean pool of the chunks for each article.
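                | 
                | Roughly like this (a sketch; count_tokens and model are
                | assumed helpers, and it uses a fixed max_tokens rather
                | than the dynamic per-article sizing described above):
                | 
                |     def chunk_article(sentences, max_tokens=256):
                |         # greedily pack sentences into ~max_tokens chunks
                |         chunks, current, used = [], [], 0
                |         for s in sentences:
                |             n = count_tokens(s)  # assumed tokenizer helper
                |             if current and used + n > max_tokens:
                |                 chunks.append(" ".join(current))
                |                 current, used = [], 0
                |             current.append(s)
                |             used += n
                |         if current:
                |             chunks.append(" ".join(current))
                |         return chunks
                | 
                |     chunk_vecs = model.encode(chunk_article(sents))
                |     article_vec = chunk_vecs.mean(axis=0)  # mean pool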
        
               | thomasfromcdnjs wrote:
               | Thanks for the answer.
               | 
               | Wikipedia has a lot of tables so I was wondering if
               | content-aware sentence chunking would be good enough for
               | Wikipedia.
               | 
               | https://www.pinecone.io/learn/chunking-strategies/
        
         | hivacruz wrote:
          | Did you use the same method, i.e. split each article into
          | chunks and vectorize each chunk?
        
           | gfourfour wrote:
           | Yes
        
           | dudus wrote:
           | That's the only way to do it. You can't index the whole
           | thing. The challenge is chunking. There are several different
           | algorithms to chunk content for vectorization with different
           | pros and cons.
        
             | minimaxir wrote:
             | You can do much bigger chunks with models that support RoPE
             | embeddings, such as nomic-embed-text-1.5 which has a 8192
             | context length: https://huggingface.co/nomic-ai/nomic-
             | embed-text-v1.5
             | 
             | In theory this would be an efficiency boost but the
             | performance math can be tricky.
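              | 
              | For example (assuming the standard sentence-transformers
              | loading path for this model; check the model card for the
              | exact task prefixes):
              | 
              |     from sentence_transformers import SentenceTransformer
              | 
              |     model = SentenceTransformer(
              |         "nomic-ai/nomic-embed-text-v1.5",
              |         trust_remote_code=True)
              |     # nomic models expect a task prefix on each input;
              |     # long_chunk is a placeholder for a large text chunk
              |     doc_vec = model.encode(["search_document: " + long_chunk])
              |     query_vec = model.encode(["search_query: what is RoPE?"])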
        
               | qudat wrote:
                | As far as I understand it, long context degrades LLM
                | performance, so even if an LLM "supports" a large
                | context length, in practice it mostly attends to a chunk
                | at the top and bottom and skips over the middle bits.
        
               | rahimnathwani wrote:
               | Why would you want chunks that big for vector search?
               | Wouldn't there be too much information in each chunk,
               | making it harder to match a query to a concept within the
               | chunk?
        
         | DataDaemon wrote:
         | How?
        
         | bunderbunder wrote:
          | This is covering 300+ languages, not just English, and it's
          | specifically using Cohere's Embed v3 embeddings, which are
          | provided as a service and currently priced at US$0.10 per
          | million tokens [1]. (At that price, $5000 works out to roughly
          | 50 billion tokens.) I assume if you're running on Colab you're
          | using an open model, and possibly a relatively lighter-weight
          | one as well?
         | 
         | [1]: https://cohere.com/pricing
        
           | traverseda wrote:
           | This is pretty early in the game to be relying on proprietary
            | embeddings, don't you think? Even if they are 20% better,
            | blink and there will be a new normal.
            | 
            | It's insane to me that someone, this early in the gold rush,
            | would be mining in someone else's mine, so to speak.
        
             | bunderbunder wrote:
             | I have no idea. But that wasn't the question I was
              | answering. It was, "how did the article's author estimate
              | that this would cost $5000?" And I think that's how. Or at
             | least, that gets to a number that's in the same ballpark as
             | what the author was suggesting.
             | 
             | That said, first guess, if you _do_ want to evaluate Cohere
             | embeddings for a commercial application, using this dataset
             | could be a decent basis for a lower-cost spike.
        
               | jbellis wrote:
               | Yes, that is how I came up with that number.
        
             | janalsncm wrote:
             | It's not just that. Embeddings aren't magic. If you're
             | going to be creating embeddings for similarity search, the
             | first thing you need to ask yourself is what makes two
             | vectors similar such that two embeddings should even be
             | close together?
             | 
             | There are a lot of related sources of similarity, but
             | they're slightly different. And I have no idea what Cohere
             | is doing. Additionally, it's not clear to me how queries
             | can and should be embedded. Queries are typically much
              | shorter than their associated documents, so query and
              | document encoders typically need to be trained jointly.
             | 
             | Selling "embeddings as a service" is a bit like selling
              | hashing as a service. There are a lot of different hash
              | functions: cryptographic hashes, locality-sensitive hashes,
              | hashes for checksums, etc.
        
               | bunderbunder wrote:
               | You probably won't find out exactly what they're doing
               | any more than anyone's going to find out a whole lot of
               | details on what OpenAI is doing with GPT. But, as the
               | popularity of GPT demonstrates, it seems that many
               | business customers are now comfortable embracing closed
               | models.
               | 
               | There is more information here, though:
               | https://cohere.com/blog/introducing-embed-v3
        
           | gfourfour wrote:
            | Ah, didn't realize it was every language. Yes, I'm using a
            | lightweight open model - but also my use case doesn't
            | require anything super heavyweight. Wikipedia articles are
            | very feature-dense and easy to distinguish from one another.
            | It doesn't require a massive feature vector to create
            | meaningful embeddings.
        
             | tjake wrote:
              | It's 35M 1024-dimensional vectors, plus the text.
        
               | gfourfour wrote:
               | Still, $5000 is kind of insane. That makes no sense to
               | me.
        
           | janalsncm wrote:
           | I also don't quite understand the value of embedding all
           | languages into the same database. If I search for "dog" do I
           | really need to see the same article 300 times?
           | 
           | As a first step they are using PQ anyways. It seems natural
           | to just assume all English docs have the same centroid and
           | search that subspace with hnswlib.
        
             | jbellis wrote:
             | It's split by language. TFA builds an index on the English
             | language subset.
        
               | janalsncm wrote:
               | Ah, missed that.
        
         | janalsncm wrote:
         | Also, if you're spending $5000 to compute embeddings, why are
         | you indexing them on a laptop?
        
           | syllogistic wrote:
            | He's not, though, because Cohere stuck the already-embedded
            | dataset on Hugging Face:
           | https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-
           | em...
        
         | bytearray wrote:
         | Do you have a link to the notebook?
        
       | traverseda wrote:
       | >Disable swap before building the index. Linux will aggressively
       | try to cache the index being constructed to the point of swapping
       | out parts of the JVM heap, which is obviously counterproductive.
       | In my test, building with swap enabled was almost twice as slow
       | as with it off.
       | 
       | This is an indication to me that something has gone very wrong in
       | your code base.
        
         | dekhn wrote:
          | As a workaround, you can mlock your process's memory, which
          | should prevent the application's pages from being swapped out.
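          | 
          | For a non-JVM process the underlying syscall is easy to reach;
          | a sketch in Python via ctypes (constants are the Linux
          | man-page values):
          | 
          |     import ctypes, os
          | 
          |     libc = ctypes.CDLL("libc.so.6", use_errno=True)
          |     MCL_CURRENT, MCL_FUTURE = 1, 2  # Linux values
          |     if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
          |         err = ctypes.get_errno()
          |         raise OSError(err, os.strerror(err))
          |     # all current and future pages are now pinned in RAM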
        
           | jbellis wrote:
           | FWIW this is what Cassandra does on startup, but it didn't
           | seem worth it to go to the trouble of dealing with Unsafe for
           | a demo project.
        
             | nextaccountic wrote:
                | Can't mlock be wrapped in a safe API?
        
               | jbellis wrote:
               | Probably, but I'm not aware of any Java library that
               | provides that functionality.
        
         | nsingh2 wrote:
          | Yeah, this was strange; I've only ever seen swapping when my
          | main memory was nearly full. Maybe they're storing all the
          | embeddings in memory?
        
           | jbellis wrote:
           | I routinely see Linux page memory out to swap while having
           | 10+GB free. I can only guess that it really really _really_
           | likes to cache recently used data from disk.
        
             | hot_gril wrote:
             | I've seen the same. It will sometimes prefer to swap rather
             | than evict the disk cache.
             | 
             | I don't know how Linux does this in particular, but
             | intuitively swapping can make sense if part of your
             | allocated RAM isn't being accessed often and the disk is.
              | The kernel isn't going to know for sure, of course, and it
              | seems in my case it guessed wrong.
        
         | hot_gril wrote:
         | Linux swap has some fuzzy logic that I never fully understood.
         | There have been times I've disabled it because it wasn't doing
         | what I wanted.
        
           | lynx23 wrote:
            | I almost always reduce swappiness on a new install; the
            | default (60?) has never served me well.
        
         | nisa wrote:
         | > This is an indication to me that something has gone very
         | wrong in your code base.
         | 
          | I'm not sure what planet all of these people live on where
          | they have success with Linux swap. It's been broken for me
          | forever, and the first thing I do is disable it everywhere.
        
           | zero_k wrote:
           | 100% agree
        
           | funcDropShadow wrote:
           | On a planet where the heap size of the JVM is properly set.
        
             | hot_gril wrote:
             | I don't use Java often, but when I do, I have to look up
             | again how the heap settings work.
        
               | taspeotis wrote:
               | It's just -xXxXxX360NoScopexXxXxHeapSize=69G
        
               | Izkata wrote:
               | There should be at least one : in there.
               | 
               | (IIRC these options are used like -XX:foo=bar)
               | 
               | (Edit - no, but some do use : instead of = I guess)
        
           | nolist_policy wrote:
            | Linux swap has been fixed on Chromebooks for years thanks to
            | MGLRU. It's upstream now and you can try it on a recent
            | enough kernel with:
            | 
            |     echo y > /sys/kernel/mm/lru_gen/enabled
        
         | jbellis wrote:
         | [article author]
         | 
         | TBH this was sloppy on my part. I tested multiple runs of the
         | index build and early on kswapd was super busy. I assumed Linux
         | was just caching recently read parts of the source dataset, but
         | it's also possible it was something external to the index build
         | since it's my daily driver machine. After I turned off swap I
         | had no issues and didn't look into it harder.
        
         | AnotherGoodName wrote:
          | Memory-managed languages won't do long-term garbage collection
          | until memory is nearly full.
          | 
          | I suspect they just need to pass an -Xmx option to the JVM to
          | avoid this.
        
           | marginalia_nu wrote:
            | Even if this weren't factually incorrect, the default max
            | heap size on the JVM is 25% of system RAM.
        
           | menacingly wrote:
           | I actually wish this were true because of more predictable gc
           | pauses
        
             | threeseed wrote:
             | With G1GC there is a setting called MaxGCPauseMillis which
             | can give you predictability.
        
           | threeseed wrote:
            | Not sure what long-term garbage collection means, because the
            | JVM will GC any unused objects at any point.
            | 
            | If you're referring to full GC, you can configure how often
            | that happens, and by default it doesn't just wait until
            | memory is nearly full.
        
         | spullara wrote:
         | All the major web companies disable swap so no work has been
         | done to optimize it.
        
           | hot_gril wrote:
           | Yeah, in most situations I'd rather kill processes with
           | excessive mem usage (possibly due to memleak) than have the
           | machine grind to a halt by swapping. Sometimes my headless
           | lab machines would become practically inaccessible over SSH
           | if I didn't swapoff before running something that
           | accidentally chews up mem.
           | 
           | I'll let my personal laptop swap, though. Especially if my
           | wife is also logged in and has tons of idle stuff open.
        
       | tjake wrote:
       | You can demo this here: https://jvectordemo.com:8443/
       | 
       | GH Project: https://github.com/jbellis/jvector
        
         | jbellis wrote:
         | [article author]
         | 
          | The source to build and serve the index is at
         | https://github.com/jbellis/coherepedia-jvector
        
         | xandrius wrote:
          | Internal server error :(
        
           | bytearray wrote:
           | Same.
        
           | jbellis wrote:
           | Oops, HN maxed out the free Cohere API key it was using.
           | Fixed.
        
       | hot_gril wrote:
       | How many dimensions are in the original vectors? Something in the
       | millions?
        
         | jbellis wrote:
          | 1024 dimensions per vector x 41M vectors
        
           | hot_gril wrote:
           | 1024-dim vectors would fit into pgvector in Postgres, which
           | can do cosine similarity indexing and doesn't require
           | everything to fit into memory. Wonder how the performance of
           | that would compare to this.
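            | 
            | A rough sketch of what that would look like with pgvector
            | (table and column names here are made up; query_vec is
            | assumed to be a 1024-dim embedding):
            | 
            |     import psycopg2
            | 
            |     conn = psycopg2.connect("dbname=wiki")
            |     cur = conn.cursor()
            |     cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
            |     cur.execute("""CREATE TABLE IF NOT EXISTS chunks (
            |                      id bigserial PRIMARY KEY,
            |                      title text,
            |                      embedding vector(1024))""")
            |     cur.execute("""CREATE INDEX IF NOT EXISTS chunks_hnsw
            |                    ON chunks USING hnsw
            |                    (embedding vector_cosine_ops)""")
            | 
            |     # nearest neighbours by cosine distance (<=> operator)
            |     q = "[" + ",".join(str(x) for x in query_vec) + "]"
            |     cur.execute("""SELECT title FROM chunks
            |                    ORDER BY embedding <=> %s::vector
            |                    LIMIT 10""", (q,))
            |     print(cur.fetchall())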
        
             | jbellis wrote:
             | It's been a while since I read the source to pgvector but
             | at the time it was a straightforward HNSW implementation
             | that implicitly assumes your index fits in page cache. Once
             | that's not true, your search performance will fall off a
             | cliff. Which in turn means your insert performance also
             | hits a wall since each new vector requires a search.
             | 
             | I haven't seen any news that indicates this has changed,
             | but by all means give it a try!
        
               | hot_gril wrote:
               | Thanks, I didn't know that. Last time I was dealing with
               | these kinds of problems, pgvector didn't exist yet.
        
       | isoprophlex wrote:
       | $5000?! I indexed all of HN for ... $50 I think. And that's tens
       | of millions of posts.
        
         | xandrius wrote:
         | To be fair Wikipedia has over 60 million pages and this is for
         | 300+ languages. But yeah, the value shows that they might not
         | be using the cheapest service out there.
        
       | peter_l_downs wrote:
       | > JVector, the library that powers DataStax Astra vector search,
       | now supports indexing larger-than-memory datasets by performing
       | construction-related searches with compressed vectors. This means
       | that the edge lists need to fit in memory, but the uncompressed
       | vectors do not, which gives us enough headroom to index
       | Wikipedia-en on a laptop.
       | 
       | It's interesting to note that JVector accomplishes this
       | differently than how DiskANN described doing it. My understanding
       | (based on the links below, but I didn't read the full diff in
       | #244) is that JVector will incrementally compress the vectors it
       | is using to construct the index; whereas DiskANN described
       | partitioning the vectors into subsets small enough that indexes
       | can be built in-memory using uncompressed vectors, building those
       | indexes independently, and then merging the results into one
       | larger index.
       | 
       | OP, have you done any quality comparisons between an index built
       | with JVector using the PQ approach (small RAM machine) vs. an
       | index built with JVector using the raw vectors during
       | construction (big RAM machine)? I'd be curious to understand what
       | this technique's impact is on the final search results.
       | 
       | I'd also be interested to know if any other vector stores support
       | building indexes in limited memory using the partition-then-merge
       | approach described by DiskANN.
       | 
       | Finally, it's been a while since I looked at this stuff, so if I
       | mis-wrote or mis-understood please correct me!
       | 
       | - DiskANN: https://dl.acm.org/doi/10.5555/3454287.3455520
       | 
       | - Anisotropic Vector Quantization (PQ Compression):
       | https://arxiv.org/abs/1908.10396
       | 
       | - JVector/#168: How to support building larger-than-memory
       | indexes https://github.com/jbellis/jvector/issues/168
       | 
       | - JVector/#244: Build indexes using compressed vectors
       | https://github.com/jbellis/jvector/pull/244
        
         | jbellis wrote:
         | That's correct!
         | 
         | I've tested the build-with-compression approach used here with
         | all the datasets in JVector's Bench [1] and there's near zero
         | loss in accuracy.
         | 
         | I suspect that the reason the DiskANN authors used the approach
         | they did is that in 2019 Deep1B was about the only very large
         | public dataset around, and since the vectors themselves are
         | small your edge lists end up dominating your memory usage. So
         | they came up with a clever solution, at the cost of making
         | construction 2.5x as expensive. (Educated guess: 2x is from
         | adding each vector to multiple partitions and the extra 50% to
         | merge the results.)
         | 
         | So JVector is just keeping edge lists in memory today. When
         | that becomes a bottleneck we may need to do something similar
         | to DiskANN but I'm hoping we can do better because it's frankly
         | a little inelegant.
         | 
         | [1] https://github.com/jbellis/jvector/blob/main/jvector-
         | example...
        
         | pbadams wrote:
         | It's not mentioned in the original paper, but DiskANN also
         | supports PQ at build-time via `--build_PQ_bytes`, though it's a
         | tradeoff with the graph quality as you mention.
         | 
         | One interesting property in benchmarking is that the distance
         | comparison implementations for full-dim vectors can often be
         | more efficient than those for PQ-compressed vectors (straight-
         | line SIMD execution vs table lookups), so on some systems
         | cluster-and-merge is relatively competitive in terms of build
         | performance.
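          | 
          | For anyone unfamiliar with the tradeoff: with PQ each vector
          | is stored as a handful of centroid codes, and query distances
          | are assembled from per-subspace lookup tables instead of one
          | straight SIMD pass over the full vector. A toy sketch with
          | numpy and scikit-learn (tiny, illustrative parameters; not
          | JVector's or DiskANN's actual implementation):
          | 
          |     import numpy as np
          |     from sklearn.cluster import KMeans
          | 
          |     D, M, K = 1024, 64, 256  # dims, subspaces, centroids each
          |     sub = D // M
          | 
          |     def pq_train(X):
          |         return [KMeans(n_clusters=K, n_init=1)
          |                 .fit(X[:, i*sub:(i+1)*sub]) for i in range(M)]
          | 
          |     def pq_encode(cbs, X):
          |         # each vector becomes M one-byte codes, not D floats
          |         return np.stack([cb.predict(X[:, i*sub:(i+1)*sub])
          |                          for i, cb in enumerate(cbs)],
          |                         axis=1).astype(np.uint8)
          | 
          |     def pq_distances(cbs, codes, q):
          |         # asymmetric distance: per-subspace lookup tables for
          |         # q, then sum the entries picked by each stored code
          |         tables = np.stack(
          |             [((cb.cluster_centers_ - q[i*sub:(i+1)*sub])**2).sum(1)
          |              for i, cb in enumerate(cbs)])  # shape (M, K)
          |         return tables[np.arange(M), codes].sum(axis=1)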
        
       | localhost wrote:
       | This is a giant dataset of 536GB of embeddings. I wonder how much
       | compression is possible by training or fine-tuning a transformer
       | model directly using these embeddings, i.e., no
       | tokenization/decoding steps? Could a 7B or 14B model "memorize"
       | Wikipedia?
        
       | arnaudsm wrote:
        | In expert topics, is vector search finally competitive with
        | BM25-like algorithms? Or do we still need to mix the two
        | together?
        
         | jbellis wrote:
          | ColBERT gives you the best of both worlds.
         | 
         | https://arxiv.org/abs/2004.12832
         | 
         | https://thenewstack.io/overcoming-the-limits-of-rag-with-col...
        
       | danaugrs wrote:
       | Loosely related: https://www.quantamagazine.org/computer-
       | scientists-invent-an...
        
       | jl6 wrote:
       | The source files appear to include pages from all namespaces,
       | which is good, because a lot of the value of Wikipedia articles
       | is held in the talk page discussions, and these sometimes get
       | stripped from projects that use Wikipedia dumps.
        
         | worldsayshi wrote:
            | I'm curious: what's the main value you see in the talk
            | pages? I almost never look at them myself.
        
           | jl6 wrote:
           | They're not so interesting for mundane topics, but for
           | anything remotely controversial, they are essential for
           | understanding what perspectives _aren't_ included in the
           | article.
        
       | lilatree wrote:
       | "... turning an O(log N) search per segment into O(N) overall."
       | 
       | Can someone explain why?
        
       ___________________________________________________________________
       (page generated 2024-05-29 23:00 UTC)