[HN Gopher] 32k context length text embedding models
       ___________________________________________________________________
        
       32k context length text embedding models
        
       Author : fzliu
       Score  : 92 points
       Date   : 2024-11-24 00:42 UTC (22 hours ago)
        
 (HTM) web link (blog.voyageai.com)
 (TXT) w3m dump (blog.voyageai.com)
        
       | johnfn wrote:
       | What on earth is "OpenAI V3"? Just to be sure I wasn't being
       | obtuse, I Googled it, only to get a bunch of articles pointing
       | back at this post.
        
         | chipgap98 wrote:
         | It is OpenAI's vector embedding model
        
         | refulgentis wrote:
         | https://openai.com/index/new-embedding-models-and-api-update...
         | 
         | API constant is text-embedding-3
        
         | qeternity wrote:
          | You missed the "large", which adds context: "OpenAI V3
          | large" is their SOTA large embedding model.
        
       | throwup238 wrote:
       | What's the benefit of generating embeddings for such large
       | chunks? Do people use these large contexts to include lots of
       | document specific headers/footers or are they actually generating
       | embeddings of single large documents?
       | 
       | I don't understand how the math works out on those vectors
        
         | Arctic_fly wrote:
         | > What's the benefit of generating embeddings for such large
         | chunks?
         | 
          | Not an expert, but I believe that now that we can fit more
          | tokens into an LLM's context window, we can avoid a number of
          | problems by providing additional context around any chunk of
          | text that might be useful to the LLM. That helps keep the LLM
          | from misinterpreting the important bit.
        
         | lsorber wrote:
         | You don't have to reduce a long context to a single embedding
         | vector. Instead, you can compute the token embeddings of a long
         | context and then pool those into say sentence embeddings.
         | 
         | The benefit is that each sentence's embedding is informed by
         | all of the other sentences in the context. So when a sentence
         | refers to "The company" for example, the sentence embedding
         | will have captured which company that is based on the other
         | sentences in the context.
         | 
         | This technique is called 'late chunking' [1], and is based on
         | another technique called 'late interaction' [2].
         | 
         | And you can combine late chunking (to pool token embeddings)
         | with semantic chunking (to partition the document) for even
         | better retrieval results. For an example implementation that
         | applies both techniques, check out RAGLite [3].
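          | 
          | Roughly, in code it looks like this (a minimal sketch; the
          | model and the sentence splitting below are my own
          | assumptions, not Voyage's or RAGLite's API):
          | 
          |     import torch
          |     from transformers import AutoModel, AutoTokenizer
          | 
          |     name = "jinaai/jina-embeddings-v2-base-en"
          |     tok = AutoTokenizer.from_pretrained(name)
          |     model = AutoModel.from_pretrained(
          |         name, trust_remote_code=True)
          | 
          |     def late_chunk(doc, sentences):
          |         # `sentences`: in-order verbatim spans of `doc`.
          |         # One forward pass over the whole document, so
          |         # every token embedding "sees" the full context.
          |         enc = tok(doc, return_tensors="pt",
          |                   return_offsets_mapping=True)
          |         offsets = enc.pop("offset_mapping")[0]
          |         with torch.no_grad():
          |             toks = model(**enc).last_hidden_state[0]
          |         # Then pool the token embeddings per sentence.
          |         out, pos = [], 0
          |         for s in sentences:
          |             start = doc.index(s, pos)
          |             end = pos = start + len(s)
          |             m = ((offsets[:, 0] >= start)
          |                  & (offsets[:, 1] <= end)
          |                  & (offsets[:, 1] > 0))
          |             out.append(toks[m].mean(dim=0))
          |         return out  # one vector per sentence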
         | 
         | [1] https://weaviate.io/blog/late-chunking
         | 
         | [2] https://jina.ai/news/what-is-colbert-and-late-interaction-
         | an...
         | 
         | [3] https://github.com/superlinear-ai/raglite
        
           | visarga wrote:
            | You can achieve the same effect by using an LLM to do
            | question answering prior to embedding. It's much more
            | flexible but slower, and you can use CoT or even graph RAG.
            | Late chunking is a faster, implicit alternative.
        
           | voiper1 wrote:
           | I read both those articles, but I still don't get how to do
           | it. It seems the idea is that more of the embedding is
            | informed by context, but how do I _do_ late chunking?
           | 
           | My best guess so far is that somehow I embed a long text and
           | then I break up the returned embedding into multiple parts
           | and search each separately? But that doesn't sound right.
        
             | _hl_ wrote:
             | You'd need to go a level below the API that most embedding
             | services expose.
             | 
             | A transformer-based embedding model doesn't just give you a
             | vector for the entire input string, it gives you vectors
             | _for each token_. These are then "pooled" together (eg
             | averaged, or max-pooled, or other strategies) to reduce
             | these many vectors down into a single vector.
             | 
             | Late chunking means changing this reduction to yield many
             | vectors instead of just one.
        
             | lsorber wrote:
             | The name 'late chunking' is indeed somewhat of a misnomer
             | in the sense that the technique does not partition
             | documents into document chunks. What it actually does is to
             | pool token embeddings (of a large context) into say
             | sentence embeddings. The result is that your document is
             | now represented as a sequence of sentence embeddings, each
             | of which is informed by the other sentences in the
             | document.
             | 
              | Then, you want to partition the document into chunks. Late
             | chunking pairs really well with semantic chunking because
             | it can use late chunking's improved sentence embeddings to
             | find semantically more cohesive chunks. In fact, you can
             | cast this as a binary integer programming problem and find
             | the 'best' chunks this way. See RAGLite [1] for an
             | implementation of both techniques including the formulation
             | of semantic chunking as an optimization problem.
             | 
             | Finally, you have a sequence of document chunks, each
             | represented as a multi-vector sequence of sentence
             | embeddings. You could choose to pool these sentence
             | embeddings into a single embedding vector per chunk. Or,
             | you could leave the multi-vector chunk embeddings as-is and
             | apply a more advanced querying technique like ColBERT's
             | MaxSim [2].
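              | 
              | For reference, MaxSim itself is tiny: score each
              | query vector against its best-matching chunk vector
              | and sum (plain NumPy sketch, names are mine):
              | 
              |     import numpy as np
              | 
              |     def maxsim(query_embs, chunk_embs):
              |         # Normalize so dot products are cosine sims.
              |         q = query_embs / np.linalg.norm(
              |             query_embs, axis=1, keepdims=True)
              |         c = chunk_embs / np.linalg.norm(
              |             chunk_embs, axis=1, keepdims=True)
              |         sims = q @ c.T  # (n_query, n_chunk_vecs)
              |         # Best match per query vector, summed.
              |         return sims.max(axis=1).sum()
              | 
              |     # Rank chunks by their score for the query:
              |     # scores = [maxsim(q_embs, c) for c in chunks]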
             | 
             | [1] https://github.com/superlinear-ai/raglite
             | 
             | [2] https://huggingface.co/blog/fsommers/document-
             | similarity-col...
        
               | causal wrote:
               | What does it mean to "pool" embeddings? The first article
               | seems to assume the reader is familiar
        
               | deepsquirrelnet wrote:
               | "Pooling" is just aggregation methods. It could mean
               | taking max or average values, or more exotic methods like
               | attention pooling. It's meant to reduce the one-per-token
               | dimensionality to one per passage or document.
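                | 
                | E.g., given a (tokens x dims) matrix of token
                | embeddings (toy shapes, just to show the
                | reduction):
                | 
                |     import numpy as np
                | 
                |     token_embs = np.random.rand(128, 768)
                | 
                |     # Pool the whole passage into one vector:
                |     doc_mean = token_embs.mean(axis=0)  # (768,)
                |     doc_max = token_embs.max(axis=0)    # (768,)
                | 
                |     # Or pool per span ("late chunking"): one
                |     # vector per sentence instead of per doc.
                |     spans = [(0, 40), (40, 90), (90, 128)]
                |     sent_embs = [token_embs[a:b].mean(axis=0)
                |                  for a, b in spans]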
        
         | voiper1 wrote:
         | I thought embedding large chunks would "dilute" the ideas,
         | since large chunks tend to have multiple disparate ideas?
         | 
          | Does it somehow capture _all_ of the ideas, so that querying
          | for a single one would somehow match?
         | 
         | Isn't that the point of breaking down into sentences?
         | 
          | Someone mentioned adding context -- but doesn't it calculate
          | the embedding on the whole thing? The API docs list `input`
          | but no separate `context`.
         | https://docs.voyageai.com/reference/embeddings-api
        
       | OutOfHere wrote:
       | I would like to see an independent benchmark.
        
         | benreesman wrote:
         | This looks quite serious (which would be unsurprising given
         | that Fei-Fei Li and Christopher Re are involved).
         | 
         | I'm also quite interested in the nuts and bolts: does anyone
         | know what the current accepted leaderboard on this is? I was
         | screwing around with GritLM [1] a few months back and I seem to
         | remember the MTEB [2] was kind of the headline thing at that
         | time, but I might be out of date.
         | 
         | [1] https://arxiv.org/pdf/2402.09906 [2]
         | https://huggingface.co/blog/mteb
        
       | ldjkfkdsjnv wrote:
        | I built a RAG system with Voyage and it crushed OpenAI
        | embeddings; the difference in retrieval quality was noticeable.
        
         | Oras wrote:
         | What evaluation metrics did you use?
        
       | ChrisArchitect wrote:
       | https://hn.algolia.com/?q=https%3A%2F%2Fblog.voyageai.com%2F...
        
         | albert_e wrote:
         | very interesting observation
         | 
          | so the same link has been posted ~10 times in the last month?
         | 
         | and this is the first time the post got any attention
         | 
         | mixed feelings there
        
       | Oras wrote:
        | Not related, but why don't they have a pricing page? Last time
        | I checked VoyageAI I had to Google their pricing to find the
        | page, as it's not in the nav menu.
        
       | dtjohnnyb wrote:
        | I've found good results from summarizing my documents using a
        | large-context model, then embedding those summaries using a
        | standard embedding model (e.g. e5).
       | 
       | This way I can tune what aspects of the doc I want to focus
       | retrieval on, it's easier to determine when there are any data
       | quality issues that need to be fixed, and the summaries have
       | turned out to be useful for other use cases in the company.
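        | 
        | In code it's basically two calls (a sketch; the summarizer,
        | prompt, and embedding model here are just what I'd reach
        | for, not a recommendation):
        | 
        |     from openai import OpenAI
        |     from sentence_transformers import SentenceTransformer
        | 
        |     client = OpenAI()
        |     embedder = SentenceTransformer("intfloat/e5-base-v2")
        | 
        |     def summarize_then_embed(doc):
        |         # The prompt is where you steer which aspects of
        |         # the doc retrieval should focus on.
        |         prompt = ("Summarize the key facts, entities and "
        |                   "figures in this document:\n" + doc)
        |         resp = client.chat.completions.create(
        |             model="gpt-4o-mini",
        |             messages=[{"role": "user", "content": prompt}])
        |         summary = resp.choices[0].message.content
        |         # e5 expects a "passage: " prefix on documents.
        |         return summary, embedder.encode(
        |             "passage: " + summary)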
        
         | tinyhouse wrote:
          | Agreed. Especially if you're going to call an API, you can
          | call something cheaper than this embedding model, like
          | 4o-mini, to summarize, then use a small embedding model
          | fine-tuned for your needs locally.
          | 
          | I was critical of these guys before (not about the quality of
          | their work but rather about building a business around
          | embeddings). This work, though, seems interesting and I might
          | even give it a try, especially if they provide a fine-tuning
          | API (is that on the roadmap?)
        
       | albert_e wrote:
       | Related question:
       | 
       | One year ago simonw said this in a post about embeddings:
       | 
       | [https://news.ycombinator.com/item?id=37985489]
       | 
       | > Lots of startups are launching new "vector databases"--which
       | are effectively databases that are custom built to answer
       | nearest-neighbour queries against vectors as quickly as possible.
       | 
       | > I'm not convinced you need an entirely new database for this:
       | I'm more excited about adding custom indexes to existing
       | databases. For example, SQLite has sqlite-vss and PostgreSQL has
       | pgvector.
       | 
        | Do we still feel specialized vector databases are overkill?
       | 
        | We have AWS promoting Amazon OpenSearch as the default vector
        | database for a RAG knowledge base, and that service is not cheap.
       | 
        | Also, I would like to understand a bit more about how to pre-
        | process and chunk the data properly in a way that optimizes the
        | vector embeddings, storage, and retrieval... any good guides on
        | this I can refer to? Thanks!
        
         | marcyb5st wrote:
         | Will try to respond in order:
         | 
          | 1. It depends on how many embeddings we are talking about. A
          | few million, probably yes; in the hundreds-of-millions or
          | billions range you likely need something custom.
          | 
          | 2. Vectors are only one way to search for things. If your
          | corpus contains stuff that doesn't carry semantic weight
          | (think part numbers) and you want to find the chunk that
          | contains that information, you'll likely need something that
          | uses tf-idf (sketch at the end of this comment).
         | 
          | 3. Regarding chunk size, it really depends on your data and the
          | queries your users will run. The denser the content, the
          | smaller the chunk size.
         | 
          | 4. Preprocessing - again, it depends. If it's PDFs with just
          | text, try to remove footers / headers from the extracted text.
          | If it contains tables, look at something like TableFormer to
          | extract a good HTML representation. Clean up other artifacts
          | from the text (like dashes from line breaking, square brackets
          | with reference numbers for scientific papers, ... ).
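          | 
          | Sketch for point 2, mixing tf-idf scores with embedding
          | scores (scikit-learn here; the weighting is arbitrary, and
          | `embed` stands for whatever embedding model you use):
          | 
          |     from sklearn.feature_extraction.text import (
          |         TfidfVectorizer)
          |     from sklearn.metrics.pairwise import cosine_similarity
          | 
          |     def hybrid_scores(query, chunks, chunk_embs, embed,
          |                       alpha=0.5):
          |         # Lexical scores: catch part numbers, IDs, etc.
          |         vec = TfidfVectorizer().fit(chunks)
          |         lex = cosine_similarity(vec.transform([query]),
          |                                 vec.transform(chunks))[0]
          |         # Semantic scores from the embedding model.
          |         sem = cosine_similarity([embed(query)],
          |                                 chunk_embs)[0]
          |         return alpha * sem + (1 - alpha) * lex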
        
         | dotinvoke wrote:
          | I had the same idea, but now I have a Postgres database that
          | has very high latency for simple queries because the CPU is
          | busy building large HNSW indexes.
         | 
         | My impression is that it might be best to do vector index
         | construction separately from the rest of the data, for
         | performance reasons. It seems vector indexes are several orders
         | of magnitude more compute intensive than most other database
         | operations.
        
         | pjot wrote:
         | Personally I feel they are overkill.
         | 
          | For example, I'm using DuckDB as a vector store for similarity
          | search and RAG. It works really well.
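          | 
          | The query side is just a few lines (assuming a recent
          | DuckDB with the fixed-size ARRAY functions; 384 dims is
          | only an example):
          | 
          |     import duckdb
          | 
          |     con = duckdb.connect("vectors.duckdb")
          |     con.execute("""
          |         CREATE TABLE IF NOT EXISTS docs (
          |             id INTEGER, text VARCHAR, emb FLOAT[384])
          |     """)
          | 
          |     q_emb = [0.1] * 384  # stand-in for a query embedding
          |     rows = con.execute("""
          |         SELECT id, text, array_cosine_similarity(
          |             emb, CAST(? AS FLOAT[384])) AS score
          |         FROM docs ORDER BY score DESC LIMIT 5
          |     """, [q_emb]).fetchall()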
         | 
         | https://github.com/patricktrainer/duckdb-embedding-search
        
         | woodson wrote:
         | Using a built-in vector extension is convenient if you want to
         | integrate vector similarity ("semantic search") with faceted
         | search. Some vector stores (e.g., qdrant) support vector
         | attributes that can be matched against, though.
         | 
          | As mentioned in another comment, an advantage of using a
          | separate vector store (on different hardware) is that
          | (re-)building vector indices can cause high CPU load and
          | therefore make latency for regular queries go up.
        
         | doctorpangloss wrote:
         | I work in image diffusion rather than "LLMs."
         | 
         | RAGs are the ControlNet of image diffusion. They exist for many
         | reasons, some of those are that context windows are small,
         | instruct-style frontier models haven't been adequately trained
         | on search tasks, and reason #1: people say they need RAGs so an
         | industry sprouts up to give it to them.
         | 
         | Do we need RAGs? I guess for now yes, but in the near future
         | no: 2/3 reasons will be solved by improvements to frontier
         | models that are eminently doable and probably underway already.
         | So let's answer the question for controlnets instead to
         | illuminate why just because someone asks for something, doesn't
         | mean it makes any sense.
         | 
          | If you're Marc Andreessen and you call Mike Ovitz, your
         | conversation about AI art generation is going to go like this:
         | "Hollywood people tell me that they don't want the AI to make
         | creative decisions, they want AI for VFX or to make short
         | TikTok videos or" something something, "the most important
         | people say they want tools that do not obsolete them." This
         | trickles down to the lowly art director, who may have an art
         | illustration background but who is already sending stuff
         | overseas to be done in something that resembles functionally a
         | dehumanized art generator. Everybody up and down this value
         | chain has no math or English Lit background so to them, the
         | simplest, most visual UX that doesn't threaten their livelihood
         | is what they want: Sketch To Image.
         | 
         | Does Sketch to image make sense? No. I mean it makes sense for
         | people who cannot be fucked to do the absolutely minimal amount
         | of lift to write prompts, which is many art professionals who,
         | for the worse, have adopted "I don't write" as an _identity_ ,
         | not merely some technical skill specialization. But once you
         | overcome this incredibly small obstacle of writing 25 words to
         | Ideogram instead of 3 words to Stable Diffusion, it's obvious:
         | nobody needs to draw something and then have a computer finish
         | it. Of course it's technologically and scientifically tractable
         | to have all the benefits of controlnets like, well control and
         | consistency, but with ordinary text. But people who buy
         | software want something that is finished, they are not waiting
         | around for R&D projects. They want some other penniless
         | creative to make a viral video using Ideogram or they want
         | their investor's daughter's heir boyfriend who is their boss to
         | shove it down their throats.
         | 
         | This is all meant to illustrate that you should not be asking
         | people who don't know anything what technology they want. They
         | absolutely positively will say "faster horses." RAGs are faster
         | horses!
        
       | antirez wrote:
        | I wonder if random projections or other similar dimensionality
        | reduction techniques work as well as using a model specialized
        | in smaller embeddings that capture the same amount of semantic
        | information. This way we could use the larger embeddings of
        | open models that work very well and yet enjoy faster node-to-
        | node similarity during searches.
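        | 
        | The cheap baseline is a Gaussian random projection, e.g.
        | (NumPy sketch; whether it matches a model trained to output
        | small embeddings natively is exactly the open question):
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     d_in, d_out = 3072, 512   # e.g. shrink 3072-dim vectors
        |     P = rng.standard_normal((d_in, d_out)) / np.sqrt(d_out)
        | 
        |     def project(embs):
        |         # JL-style projection: cosines / inner products are
        |         # approximately preserved, with some quality loss.
        |         x = embs @ P
        |         return x / np.linalg.norm(x, axis=-1, keepdims=True)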
        
       ___________________________________________________________________
       (page generated 2024-11-24 23:01 UTC)