[HN Gopher] Vector indexing all of Wikipedia on a laptop
___________________________________________________________________
Vector indexing all of Wikipedia on a laptop
Author : tjake
Score : 245 points
Date : 2024-05-29 17:05 UTC (5 hours ago)
(HTM) web link (foojay.io)
(TXT) w3m dump (foojay.io)
| gfourfour wrote:
| Maybe I'm missing something but I've created vector embeddings
| for all of English Wikipedia about a dozen times and it costs
| maybe $10 of compute on Colab, not $5000
| emmelaich wrote:
| Got any details?
| gfourfour wrote:
| Nothing too crazy, just downloading a dump, splitting it into
| manageable batch sizes, and using a lightweight embedding
| model to vectorize each article. Using the best GPU available
| on Colab, it takes maybe 8 hours if I remember correctly?
| Vectors can be saved as NPY files and loaded into something
| like FAISS for fast querying.
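|
| Roughly, the pipeline looks like this sketch (model choice, batch
| size, and file names are illustrative placeholders, not necessarily
| what was actually used):
|
|   import numpy as np
|   import faiss
|   from sentence_transformers import SentenceTransformer
|
|   # lightweight open embedding model; anything similar works
|   model = SentenceTransformer("all-MiniLM-L6-v2")
|
|   # stand-in for articles parsed out of the Wikipedia dump
|   articles = ["First article text...", "Second article text..."]
|
|   embeddings = model.encode(
|       articles, batch_size=256, show_progress_bar=True,
|       normalize_embeddings=True,
|   ).astype("float32")
|
|   # save as NPY, then load into FAISS for fast querying
|   np.save("wiki_embeddings.npy", embeddings)
|   index = faiss.IndexFlatIP(embeddings.shape[1])  # cosine on unit vectors
|   index.add(np.load("wiki_embeddings.npy"))
|
|   query = model.encode(
|       ["history of the printing press"], normalize_embeddings=True,
|   ).astype("float32")
|   scores, ids = index.search(query, 10)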
| j0hnyl wrote:
| How big is the resulting vector data?
| gfourfour wrote:
| Like 8 GB, roughly
| abetusk wrote:
| This probably deserves its own article and might be of
| interest to the HN community.
| gfourfour wrote:
| I will probably make a post when I launch my app! For now
| I'm trying to figure out how I can host the whole system
| for cheap because I don't anticipate generating much
| revenue
| riku_iki wrote:
| What is the end task(e.g. RAG, or just vector search for
| question answering), are you satisfied with results in
| terms of quality?
| gfourfour wrote:
| The end result is a recommendation algorithm, so
| basically just vector similarity search (with a bunch of
| other logic too ofc). The quality is great, and if
| anything a little bit of underfitting is desirable to
| avoid the "we see you bought a toilet seat, here's 50
| other toilet seats you might like" effect.
| thomasfromcdnjs wrote:
| Did you chunk the articles? If so, in what way?
| gfourfour wrote:
| Yes. I split the text into sentences and append sentences
| to a chunk until the max context window is reached. The
| context window size is dynamic for each article so that
| each chunk is roughly the same size. Then I just do a
| mean pool of the chunks for each article.
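|
| A rough sketch of that chunking plus mean pooling (the sentence
| splitting, word-based length counting, and model are simplified
| stand-ins for what a real pipeline would use):
|
|   import math
|   from sentence_transformers import SentenceTransformer
|
|   model = SentenceTransformer("all-MiniLM-L6-v2")
|   MAX_WINDOW = 256  # illustrative max chunk length, in words
|
|   def chunk_article(text):
|       sents = [s.strip() + "." for s in text.split(".") if s.strip()]
|       n_words = sum(len(s.split()) for s in sents)
|       # dynamic window: pick a target so chunks come out roughly equal
|       n_chunks = max(1, math.ceil(n_words / MAX_WINDOW))
|       target = math.ceil(n_words / n_chunks)
|       chunks, cur, cur_len = [], [], 0
|       for s in sents:
|           cur.append(s)
|           cur_len += len(s.split())
|           if cur_len >= target:
|               chunks.append(" ".join(cur))
|               cur, cur_len = [], 0
|       if cur:
|           chunks.append(" ".join(cur))
|       return chunks
|
|   def embed_article(text):
|       vecs = model.encode(chunk_article(text), normalize_embeddings=True)
|       return vecs.mean(axis=0)  # one mean-pooled vector per article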
| thomasfromcdnjs wrote:
| Thanks for the answer.
|
| Wikipedia has a lot of tables so I was wondering if
| content-aware sentence chunking would be good enough for
| Wikipedia.
|
| https://www.pinecone.io/learn/chunking-strategies/
| hivacruz wrote:
| Did you use the same method, i.e. split each article into
| chunks and vectorize each chunk?
| gfourfour wrote:
| Yes
| dudus wrote:
| That's the only way to do it. You can't index the whole
| thing. The challenge is chunking. There are several different
| algorithms to chunk content for vectorization with different
| pros and cons.
| minimaxir wrote:
| You can do much bigger chunks with models that support RoPE
| embeddings, such as nomic-embed-text-v1.5, which has an
| 8192-token context length: https://huggingface.co/nomic-ai/nomic-
| embed-text-v1.5
|
| In theory this would be an efficiency boost but the
| performance math can be tricky.
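|
| For reference, a sketch of using it through sentence-transformers
| (the trust_remote_code flag and the task prefixes follow the model
| card as I recall; double-check there before relying on them):
|
|   from sentence_transformers import SentenceTransformer
|
|   model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
|                               trust_remote_code=True)
|
|   # documents and queries use different task prefixes
|   doc = "search_document: " + "a long article chunk ... " * 500
|   doc_vec = model.encode(doc)
|   query_vec = model.encode("search_query: history of printing")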
| qudat wrote:
| As far as I understand it, long context degrades LLM
| performance, so even if an LLM "supports" a large context
| length, in practice it basically just attends to a top and
| bottom chunk and skips over the middle bits.
| rahimnathwani wrote:
| Why would you want chunks that big for vector search?
| Wouldn't there be too much information in each chunk,
| making it harder to match a query to a concept within the
| chunk?
| DataDaemon wrote:
| How?
| bunderbunder wrote:
| This is covering 300+ languages, not just English, and it's
| specifically using Cohere's Embed v3 embeddings, which are
| provided as a service and currently priced at US$0.10 per
| million tokens [1]. I assume if you're running on Colab you're
| using an open model, and possibly a relatively lightweight
| one as well?
|
| [1]: https://cohere.com/pricing
| traverseda wrote:
| This is pretty early in the game to be relying on proprietary
| embeddings, don't you think? Even if they are 20% better, blink
| and there will be a new normal.
|
| It's insane to me that someone, this early in the gold rush,
| would be mining in someone else's mine, so to speak
| bunderbunder wrote:
| I have no idea. But that wasn't the question I was
| answering. It was, "how does the article's author estimate
| that would cost $5000?" And I think that's how. Or at
| least, that gets to a number that's in the same ballpark as
| what the author was suggesting.
|
| That said, first guess, if you _do_ want to evaluate Cohere
| embeddings for a commercial application, using this dataset
| could be a decent basis for a lower-cost spike.
| jbellis wrote:
| Yes, that is how I came up with that number.
| janalsncm wrote:
| It's not just that. Embeddings aren't magic. If you're
| going to be creating embeddings for similarity search, the
| first thing you need to ask yourself is what makes two
| documents similar, such that their embeddings should even be
| close together?
|
| There are a lot of related notions of similarity, but
| they're all slightly different. And I have no idea what Cohere
| is doing. Additionally, it's not clear to me how queries
| can and should be embedded. Queries are typically much
| shorter than their associated documents, so they typically
| need to be trained jointly.
|
| Selling "embeddings as a service" is a bit like selling
| hashing as a service. There are a lot of different hash
| functions. Cryptographic hashes, locality sensitive hashes,
| hashes for checksum, etc.
| bunderbunder wrote:
| You probably won't find out exactly what they're doing
| any more than anyone's going to find out a whole lot of
| details on what OpenAI is doing with GPT. But, as the
| popularity of GPT demonstrates, it seems that many
| business customers are now comfortable embracing closed
| models.
|
| There is more information here, though:
| https://cohere.com/blog/introducing-embed-v3
| gfourfour wrote:
| Ah, didn't realize it was every language. Yes, I'm using a
| lightweight open model - but also my use case doesn't
| require anything super heavyweight. Wikipedia articles are
| very feature-dense and distinguishable from one another. It
| doesn't require a massive feature vector to create meaningful
| embeddings.
| tjake wrote:
| It's 35M 1024-dimensional vectors, plus the text
| gfourfour wrote:
| Still, $5000 is kind of insane. That makes no sense to
| me.
| janalsncm wrote:
| I also don't quite understand the value of embedding all
| languages into the same database. If I search for "dog" do I
| really need to see the same article 300 times?
|
| As a first step they are using PQ anyways. It seems natural
| to just assume all English docs have the same centroid and
| search that subspace with hnswlib.
| jbellis wrote:
| It's split by language. TFA builds an index on the English
| language subset.
| janalsncm wrote:
| Ah, missed that.
| janalsncm wrote:
| Also, if you're spending $5000 to compute embeddings, why are
| you indexing them on a laptop?
| syllogistic wrote:
| He's not, though, because Cohere put the already-embedded
| dataset on Hugging Face:
| https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-
| em...
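|
| A sketch of streaming it (the full dataset id is a guess from the
| truncated link above, and the field names are assumptions; check
| the dataset card):
|
|   from datasets import load_dataset
|
|   ds = load_dataset("Cohere/wikipedia-22-12-en-embeddings",
|                     split="train", streaming=True)
|   first = next(iter(ds))
|   print(first["title"], len(first["emb"]))  # assumed field names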
| bytearray wrote:
| Do you have a link to the notebook?
| traverseda wrote:
| >Disable swap before building the index. Linux will aggressively
| try to cache the index being constructed to the point of swapping
| out parts of the JVM heap, which is obviously counterproductive.
| In my test, building with swap enabled was almost twice as slow
| as with it off.
|
| This is an indication to me that something has gone very wrong in
| your code base.
| dekhn wrote:
| As a workaround, you can mlock your process, which should
| prevent the application's pages from being swapped out.
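|
| A minimal sketch of that from Python via ctypes on Linux (needs
| CAP_IPC_LOCK or a high enough RLIMIT_MEMLOCK; the constants are
| the Linux values of MCL_CURRENT and MCL_FUTURE):
|
|   import ctypes
|
|   libc = ctypes.CDLL("libc.so.6", use_errno=True)
|   MCL_CURRENT, MCL_FUTURE = 1, 2
|
|   # lock all current and future pages of this process into RAM
|   if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
|       raise OSError(ctypes.get_errno(), "mlockall failed")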
| jbellis wrote:
| FWIW this is what Cassandra does on startup, but it didn't
| seem worth it to go to the trouble of dealing with Unsafe for
| a demo project.
| nextaccountic wrote:
| Can't mlock be wrapped in a safe API?
| jbellis wrote:
| Probably, but I'm not aware of any Java library that
| provides that functionality.
| nsingh2 wrote:
| Yeah, this was strange; I've only ever seen swapping when my
| main memory was nearly full. Maybe they're storing all the
| embeddings in memory?
| jbellis wrote:
| I routinely see Linux page memory out to swap while having
| 10+GB free. I can only guess that it really really _really_
| likes to cache recently used data from disk.
| hot_gril wrote:
| I've seen the same. It will sometimes prefer to swap rather
| than evict the disk cache.
|
| I don't know how Linux does this in particular, but
| intuitively swapping can make sense if part of your
| allocated RAM isn't being accessed often and the disk is.
| The kernel isn't going to know for sure, of course, and it
| seems in my case it guessed wrong.
| hot_gril wrote:
| Linux swap has some fuzzy logic that I never fully understood.
| There have been times I've disabled it because it wasn't doing
| what I wanted.
| lynx23 wrote:
| I almost always reduce swappiness on a new install; the
| default (60?) never served me well.
| nisa wrote:
| > This is an indication to me that something has gone very
| wrong in your code base.
|
| I'm not sure on what planet all of these people here live that
| they have success with Linux swap. It's been broken for me
| forever and the first thing I do is disable it everywhere.
| zero_k wrote:
| 100% agree
| funcDropShadow wrote:
| On a planet where the heap size of the JVM is properly set.
| hot_gril wrote:
| I don't use Java often, but when I do, I have to look up
| again how the heap settings work.
| taspeotis wrote:
| It's just -xXxXxX360NoScopexXxXxHeapSize=69G
| Izkata wrote:
| There should be at least one : in there.
|
| (IIRC these options are used like -XX:foo=bar)
|
| (Edit - no, but some do use : instead of = I guess)
| nolist_policy wrote:
| Linux swap has been fixed on Chromebooks for years thanks to
| MGLRU. It's upstream now and you can try it on a recent
| enough kernel with:
|   echo y > /sys/kernel/mm/lru_gen/enabled
| jbellis wrote:
| [article author]
|
| TBH this was sloppy on my part. I tested multiple runs of the
| index build and early on kswapd was super busy. I assumed Linux
| was just caching recently read parts of the source dataset, but
| it's also possible it was something external to the index build
| since it's my daily driver machine. After I turned off swap I
| had no issues and didn't look into it harder.
| AnotherGoodName wrote:
| Memory-managed languages won't do long term garbage collection
| till memory is nearly full.
|
| I suspect they just need to pass -Xmx options to the JVM to
| avoid this.
| marginalia_nu wrote:
| Even if this wasn't factually incorrect, default max heap
| size on the JVM is 25% of system RAM.
| menacingly wrote:
| I actually wish this were true because of more predictable gc
| pauses
| threeseed wrote:
| With G1GC there is a setting called MaxGCPauseMillis which
| can give you predictability.
| threeseed wrote:
| Not sure what long term garbage collection means. Because the
| JVM will GC all objects at any point if they are unused.
|
| If you're referring to full GC you can configure how often
| that happens and by default it doesn't just wait until memory
| is nearly full.
| spullara wrote:
| All the major web companies disable swap so no work has been
| done to optimize it.
| hot_gril wrote:
| Yeah, in most situations I'd rather kill processes with
| excessive mem usage (possibly due to memleak) than have the
| machine grind to a halt by swapping. Sometimes my headless
| lab machines would become practically inaccessible over SSH
| if I didn't swapoff before running something that
| accidentally chews up mem.
|
| I'll let my personal laptop swap, though. Especially if my
| wife is also logged in and has tons of idle stuff open.
| tjake wrote:
| You can demo this here: https://jvectordemo.com:8443/
|
| GH Project: https://github.com/jbellis/jvector
| jbellis wrote:
| [article author]
|
| The source to build and serve the index is at
| https://github.com/jbellis/coherepedia-jvector
| xandrius wrote:
| Internal server error :(
| bytearray wrote:
| Same.
| jbellis wrote:
| Oops, HN maxed out the free Cohere API key it was using.
| Fixed.
| hot_gril wrote:
| How many dimensions are in the original vectors? Something in the
| millions?
| jbellis wrote:
| 1024 per vector x 41M vectors
| hot_gril wrote:
| 1024-dim vectors would fit into pgvector in Postgres, which
| can do cosine similarity indexing and doesn't require
| everything to fit into memory. Wonder how the performance of
| that would compare to this.
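|
| For what it's worth, a sketch of the pgvector side (connection
| string, table, and column names are placeholders; the HNSW index
| type needs pgvector 0.5+):
|
|   import psycopg2
|
|   conn = psycopg2.connect("dbname=wiki user=postgres")
|   cur = conn.cursor()
|   cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
|   cur.execute("""
|       CREATE TABLE IF NOT EXISTS articles (
|           id bigserial PRIMARY KEY,
|           title text,
|           embedding vector(1024)
|       )
|   """)
|   cur.execute("""
|       CREATE INDEX IF NOT EXISTS articles_embedding_idx
|       ON articles USING hnsw (embedding vector_cosine_ops)
|   """)
|   conn.commit()
|
|   query_vec = [0.0] * 1024  # stand-in for a real query embedding
|   vec_literal = "[" + ",".join(map(str, query_vec)) + "]"
|   cur.execute(
|       "SELECT title FROM articles "
|       "ORDER BY embedding <=> %s::vector LIMIT 10",
|       (vec_literal,),
|   )
|   print(cur.fetchall())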
| jbellis wrote:
| It's been a while since I read the source to pgvector but
| at the time it was a straightforward HNSW implementation
| that implicitly assumes your index fits in page cache. Once
| that's not true, your search performance will fall off a
| cliff. Which in turn means your insert performance also
| hits a wall since each new vector requires a search.
|
| I haven't seen any news that indicates this has changed,
| but by all means give it a try!
| hot_gril wrote:
| Thanks, I didn't know that. Last time I was dealing with
| these kinds of problems, pgvector didn't exist yet.
| isoprophlex wrote:
| $5000?! I indexed all of HN for ... $50 I think. And that's tens
| of millions of posts.
| xandrius wrote:
| To be fair, Wikipedia has over 60 million pages and this is for
| 300+ languages. But yeah, the price suggests that they might not
| be using the cheapest service out there.
| peter_l_downs wrote:
| > JVector, the library that powers DataStax Astra vector search,
| now supports indexing larger-than-memory datasets by performing
| construction-related searches with compressed vectors. This means
| that the edge lists need to fit in memory, but the uncompressed
| vectors do not, which gives us enough headroom to index
| Wikipedia-en on a laptop.
|
| It's interesting to note that JVector accomplishes this
| differently than how DiskANN described doing it. My understanding
| (based on the links below, but I didn't read the full diff in
| #244) is that JVector will incrementally compress the vectors it
| is using to construct the index; whereas DiskANN described
| partitioning the vectors into subsets small enough that indexes
| can be built in-memory using uncompressed vectors, building those
| indexes independently, and then merging the results into one
| larger index.
|
| OP, have you done any quality comparisons between an index built
| with JVector using the PQ approach (small RAM machine) vs. an
| index built with JVector using the raw vectors during
| construction (big RAM machine)? I'd be curious to understand what
| this technique's impact is on the final search results.
|
| I'd also be interested to know if any other vector stores support
| building indexes in limited memory using the partition-then-merge
| approach described by DiskANN.
|
| Finally, it's been a while since I looked at this stuff, so if I
| mis-wrote or mis-understood please correct me!
|
| - DiskANN: https://dl.acm.org/doi/10.5555/3454287.3455520
|
| - Anisotropic Vector Quantization (PQ Compression):
| https://arxiv.org/abs/1908.10396
|
| - JVector/#168: How to support building larger-than-memory
| indexes https://github.com/jbellis/jvector/issues/168
|
| - JVector/#244: Build indexes using compressed vectors
| https://github.com/jbellis/jvector/pull/244
| jbellis wrote:
| That's correct!
|
| I've tested the build-with-compression approach used here with
| all the datasets in JVector's Bench [1] and there's near zero
| loss in accuracy.
|
| I suspect that the reason the DiskANN authors used the approach
| they did is that in 2019 Deep1B was about the only very large
| public dataset around, and since the vectors themselves are
| small your edge lists end up dominating your memory usage. So
| they came up with a clever solution, at the cost of making
| construction 2.5x as expensive. (Educated guess: the 2x is from
| adding each vector to multiple partitions and the extra 50% from
| merging the results.)
|
| So JVector is just keeping edge lists in memory today. When
| that becomes a bottleneck we may need to do something similar
| to DiskANN but I'm hoping we can do better because it's frankly
| a little inelegant.
|
| [1] https://github.com/jbellis/jvector/blob/main/jvector-
| example...
| pbadams wrote:
| It's not mentioned in the original paper, but DiskANN also
| supports PQ at build-time via `--build_PQ_bytes`, though it's a
| tradeoff with the graph quality as you mention.
|
| One interesting property in benchmarking is that the distance
| comparison implementations for full-dim vectors can often be
| more efficient than those for PQ-compressed vectors (straight-
| line SIMD execution vs table lookups), so on some systems
| cluster-and-merge is relatively competitive in terms of build
| performance.
| localhost wrote:
| This is a giant dataset of 536GB of embeddings. I wonder how much
| compression is possible by training or fine-tuning a transformer
| model directly using these embeddings, i.e., no
| tokenization/decoding steps? Could a 7B or 14B model "memorize"
| Wikipedia?
| arnaudsm wrote:
| In expert topics, is vector search finally competitive with
| BM25-like algorithms? Or do we still need to mix the two together?
| jbellis wrote:
| ColBERT gives you the best of both worlds.
|
| https://arxiv.org/abs/2004.12832
|
| https://thenewstack.io/overcoming-the-limits-of-rag-with-col...
| danaugrs wrote:
| Loosely related: https://www.quantamagazine.org/computer-
| scientists-invent-an...
| jl6 wrote:
| The source files appear to include pages from all namespaces,
| which is good, because a lot of the value of Wikipedia articles
| is held in the talk page discussions, and these sometimes get
| stripped from projects that use Wikipedia dumps.
| worldsayshi wrote:
| I'm curious: what's the main value you see in the talk pages?
| I almost never look at them myself.
| jl6 wrote:
| They're not so interesting for mundane topics, but for
| anything remotely controversial, they are essential for
| understanding what perspectives _aren't_ included in the
| article.
| lilatree wrote:
| "... turning an O(log N) search per segment into O(N) overall."
|
| Can someone explain why?
___________________________________________________________________
(page generated 2024-05-29 23:00 UTC)