[HN Gopher] 32k context length text embedding models
___________________________________________________________________
32k context length text embedding models
Author : fzliu
Score : 92 points
Date : 2024-11-24 00:42 UTC (22 hours ago)
(HTM) web link (blog.voyageai.com)
(TXT) w3m dump (blog.voyageai.com)
| johnfn wrote:
| What on earth is "OpenAI V3"? Just to be sure I wasn't being
| obtuse, I Googled it, only to get a bunch of articles pointing
| back at this post.
| chipgap98 wrote:
| It is OpenAI's vector embedding model
| refulgentis wrote:
| https://openai.com/index/new-embedding-models-and-api-update...
|
| API constant is text-embedding-3
| qeternity wrote:
| You missed the "large" which adds context: "OpenAI V3 large"
| which is their SOTA large embedding model.
| throwup238 wrote:
| What's the benefit of generating embeddings for such large
| chunks? Do people use these large contexts to include lots of
| document specific headers/footers or are they actually generating
| embeddings of single large documents?
|
| I don't understand how the math works out on those vectors
| Arctic_fly wrote:
| > What's the benefit of generating embeddings for such large
| chunks?
|
| Not an expert, but I believe now that we can fit more tokens
| into an LLM's context window, we can avoid a number of problems
| by providing additional context around any chunk of text that
| might be useful to the LLM. That solves the problem of the LLM
| misinterpreting the important bit.
| lsorber wrote:
| You don't have to reduce a long context to a single embedding
| vector. Instead, you can compute the token embeddings of a long
| context and then pool those into say sentence embeddings.
|
| The benefit is that each sentence's embedding is informed by
| all of the other sentences in the context. So when a sentence
| refers to "The company" for example, the sentence embedding
| will have captured which company that is based on the other
| sentences in the context.
|
| This technique is called 'late chunking' [1], and is based on
| another technique called 'late interaction' [2].
|
| And you can combine late chunking (to pool token embeddings)
| with semantic chunking (to partition the document) for even
| better retrieval results. For an example implementation that
| applies both techniques, check out RAGLite [3].
|
| [1] https://weaviate.io/blog/late-chunking
|
| [2] https://jina.ai/news/what-is-colbert-and-late-interaction-
| an...
|
| [3] https://github.com/superlinear-ai/raglite
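|
| To make the mechanics concrete, here is a rough late-chunking
| sketch (not RAGLite's or Jina's actual code; the model name,
| the prompt-free setup, and the naive sentence splitting are
| placeholder assumptions):
|
|     # pip install torch transformers
|     import torch
|     from transformers import AutoModel, AutoTokenizer
|
|     name = "jinaai/jina-embeddings-v2-small-en"  # example long-context encoder
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModel.from_pretrained(name, trust_remote_code=True)
|
|     text = "Acme posted record revenue. The company expects growth."
|     sentences = text.split(". ")  # placeholder; use a real splitter
|
|     # 1. Encode the WHOLE document once so every token embedding is
|     #    informed by the full context (this is the "late" part).
|     enc = tok(text, return_tensors="pt", return_offsets_mapping=True)
|     offsets = enc.pop("offset_mapping")[0].tolist()
|     with torch.no_grad():
|         token_vecs = model(**enc).last_hidden_state[0]  # (tokens, dim)
|
|     # 2. Pool token embeddings per sentence span instead of collapsing
|     #    the whole document into a single vector.
|     sent_vecs, start = [], 0
|     for sent in sentences:
|         end = start + len(sent)
|         idx = [i for i, (s, e) in enumerate(offsets)
|                if s >= start and e <= end and e > s]
|         sent_vecs.append(token_vecs[idx].mean(dim=0))
|         start = end + 2  # skip the ". " separator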
| visarga wrote:
| You can achieve the same effect by using an LLM to do question
| answering prior to embedding. It's much more flexible but
| slower, and you can use CoT or even graph RAG. Late chunking is
| a faster, implicit alternative.
| voiper1 wrote:
| I read both those articles, but I still don't get how to do
| it. It seems the idea is that more of the embedding is
| informed by context, but how do I _do_ late chunking?
|
| My best guess so far is that somehow I embed a long text and
| then I break up the returned embedding into multiple parts
| and search each separately? But that doesn't sound right.
| _hl_ wrote:
| You'd need to go a level below the API that most embedding
| services expose.
|
| A transformer-based embedding model doesn't just give you a
| vector for the entire input string, it gives you vectors
| _for each token_. These are then "pooled" together (eg
| averaged, or max-pooled, or other strategies) to reduce
| these many vectors down into a single vector.
|
| Late chunking means changing this reduction to yield many
| vectors instead of just one.
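|
| Schematically, with made-up numbers and spans (numpy only):
|
|     import numpy as np
|
|     # Per-token vectors from the encoder: (num_tokens, dim)
|     token_vecs = np.random.rand(12, 4)
|
|     # Standard pooling: one vector for the whole input.
|     doc_vec = token_vecs.mean(axis=0)                 # shape (4,)
|
|     # Late chunking: pool per span to keep many vectors, e.g.
|     # tokens 0-5 are sentence 1 and tokens 6-11 are sentence 2.
|     spans = [(0, 6), (6, 12)]
|     sent_vecs = [token_vecs[a:b].mean(axis=0) for a, b in spans]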
| lsorber wrote:
| The name 'late chunking' is indeed somewhat of a misnomer
| in the sense that the technique does not partition
| documents into document chunks. What it actually does is to
| pool token embeddings (of a large context) into say
| sentence embeddings. The result is that your document is
| now represented as a sequence of sentence embeddings, each
| of which is informed by the other sentences in the
| document.
|
| Then, you want to partition the document into chunks. Late
| chunking pairs really well with semantic chunking because
| it can use late chunking's improved sentence embeddings to
| find semantically more cohesive chunks. In fact, you can
| cast this as a binary integer programming problem and find
| the 'best' chunks this way. See RAGLite [1] for an
| implementation of both techniques including the formulation
| of semantic chunking as an optimization problem.
|
| Finally, you have a sequence of document chunks, each
| represented as a multi-vector sequence of sentence
| embeddings. You could choose to pool these sentence
| embeddings into a single embedding vector per chunk. Or,
| you could leave the multi-vector chunk embeddings as-is and
| apply a more advanced querying technique like ColBERT's
| MaxSim [2].
|
| [1] https://github.com/superlinear-ai/raglite
|
| [2] https://huggingface.co/blog/fsommers/document-
| similarity-col...
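|
| For reference, ColBERT-style MaxSim scoring is small enough to
| sketch directly (random unit vectors standing in for real
| query/chunk embeddings):
|
|     import numpy as np
|
|     def maxsim(query_vecs, chunk_vecs):
|         # For each query vector, take its best match in the chunk,
|         # then sum those maxima to score the chunk.
|         sims = query_vecs @ chunk_vecs.T   # (q_len, c_len)
|         return sims.max(axis=1).sum()
|
|     rng = np.random.default_rng(0)
|     q = rng.normal(size=(5, 64))
|     q /= np.linalg.norm(q, axis=1, keepdims=True)
|     c = rng.normal(size=(40, 64))
|     c /= np.linalg.norm(c, axis=1, keepdims=True)
|     print(maxsim(q, c))  # higher = more relevant chunk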
| causal wrote:
| What does it mean to "pool" embeddings? The first article
| seems to assume the reader is already familiar with the term.
| deepsquirrelnet wrote:
| "Pooling" is just aggregation methods. It could mean
| taking max or average values, or more exotic methods like
| attention pooling. It's meant to reduce the one-per-token
| dimensionality to one per passage or document.
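|
| A quick sketch of the common variants (toy numpy, not any
| particular library's implementation):
|
|     import numpy as np
|
|     token_vecs = np.random.rand(10, 8)   # (num_tokens, dim)
|
|     mean_pooled = token_vecs.mean(axis=0)
|     max_pooled = token_vecs.max(axis=0)
|
|     # Attention pooling: score each token (here with a random
|     # vector standing in for learned weights), softmax the
|     # scores, and take a weighted sum of the token vectors.
|     scores = token_vecs @ np.random.rand(8)           # (num_tokens,)
|     weights = np.exp(scores) / np.exp(scores).sum()
|     attn_pooled = weights @ token_vecs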
| voiper1 wrote:
| I thought embedding large chunks would "dilute" the ideas,
| since large chunks tend to have multiple disparate ideas?
|
| Does it somehow capture _all_ of the ideas, and querying for a
| single one would somehow match?
|
| Isn't that the point of breaking down into sentences?
|
| Someone mentioned adding context -- but doesn't it calculate
| the embedding over the whole thing? The API docs list `input`
| but no separate `context`.
| https://docs.voyageai.com/reference/embeddings-api
| OutOfHere wrote:
| I would like to see an independent benchmark.
| benreesman wrote:
| This looks quite serious (which would be unsurprising given
| that Fei-Fei Li and Christopher Re are involved).
|
| I'm also quite interested in the nuts and bolts: does anyone
| know what the current accepted leaderboard on this is? I was
| screwing around with GritLM [1] a few months back and I seem to
| remember the MTEB [2] was kind of the headline thing at that
| time, but I might be out of date.
|
| [1] https://arxiv.org/pdf/2402.09906 [2]
| https://huggingface.co/blog/mteb
| ldjkfkdsjnv wrote:
| I built a RAG system with Voyage and it crushed OpenAI
| embeddings; the difference in retrieval quality was noticeable.
| Oras wrote:
| What evaluation metrics did you use?
| ChrisArchitect wrote:
| https://hn.algolia.com/?q=https%3A%2F%2Fblog.voyageai.com%2F...
| albert_e wrote:
| very interesting observation
|
| so the same link has been posted ~10 times in the past month?
|
| and this is the first time the post got any attention
|
| mixed feelings there
| Oras wrote:
| Not related, but why don't they have a pricing page? Last time
| I checked VoyageAI I had to google their pricing to find the
| page, as it's not in the nav menu.
| dtjohnnyb wrote:
| I've found good results from summarizing my documents using a
| large-context model, then embedding those summaries using a
| standard embedding model (e.g. e5)
|
| This way I can tune what aspects of the doc I want to focus
| retrieval on, it's easier to determine when there are any data
| quality issues that need to be fixed, and the summaries have
| turned out to be useful for other use cases in the company.
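|
| In case it helps, the pipeline is roughly this (sketch only;
| the prompt, the model names, and e5's "passage: " / "query: "
| prefix convention are my assumptions, not the parent's exact
| setup):
|
|     # pip install openai sentence-transformers
|     from openai import OpenAI
|     from sentence_transformers import SentenceTransformer
|
|     llm = OpenAI()  # needs OPENAI_API_KEY
|     embedder = SentenceTransformer("intfloat/e5-base-v2")
|
|     def summarize(doc: str) -> str:
|         # Steer the summary toward the aspects retrieval should
|         # key on (entities, decisions, numbers, ...).
|         resp = llm.chat.completions.create(
|             model="gpt-4o-mini",
|             messages=[{"role": "user",
|                        "content": "Summarize for retrieval, keeping "
|                                   "key entities and facts:\n\n" + doc}],
|         )
|         return resp.choices[0].message.content
|
|     docs = ["<long document 1>", "<long document 2>"]
|     summaries = [summarize(d) for d in docs]
|     # e5 models expect "passage: " / "query: " prefixes.
|     doc_vecs = embedder.encode(["passage: " + s for s in summaries])
|     query_vec = embedder.encode("query: what changed in Q3?")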
| tinyhouse wrote:
| Agreed. Esp. if you're gonna call an API, you can call
| something cheaper than this embedding model, like 4o-mini,
| summarize, then use a small embedding model fine-tuned for your
| needs locally.
|
| I was critical of these guys before (not of the quality of
| their work but rather of building a business around
| embeddings). This work though seems interesting and I might
| even give it a try, esp if they provide a fine-tuning API (is
| that on the roadmap?)
| albert_e wrote:
| Related question:
|
| One year ago simonw said this in a post about embeddings:
|
| [https://news.ycombinator.com/item?id=37985489]
|
| > Lots of startups are launching new "vector databases"--which
| are effectively databases that are custom built to answer
| nearest-neighbour queries against vectors as quickly as possible.
|
| > I'm not convinced you need an entirely new database for this:
| I'm more excited about adding custom indexes to existing
| databases. For example, SQLite has sqlite-vss and PostgreSQL has
| pgvector.
|
| Do we still feel specialized vector databases are overkill?
|
| We have AWS promoting Amazon OpenSearch as the default vector
| database for a RAG knowledge base and that service is not cheap.
|
| Also I would like to understand a bit more about how to pre-
| process and chunk the data properly in a way that optimizes the
| vector embeddings, storage, and retrieval ... any good guides
| on this that I can refer to? Thanks!
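|
| For context on the pgvector route mentioned in the quote, the
| moving parts are small. A sketch only: the table/column names
| and the 1024-dim size are made up, and the query vector is
| passed as a plain string literal rather than via the pgvector
| client library:
|
|     # pip install psycopg2-binary
|     # assumes the pgvector extension is installed on the server
|     import psycopg2
|
|     conn = psycopg2.connect("dbname=rag")
|     cur = conn.cursor()
|     cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
|     cur.execute("""
|         CREATE TABLE IF NOT EXISTS chunks (
|             id bigserial PRIMARY KEY,
|             body text,
|             embedding vector(1024)
|         )""")
|     # Optional ANN index (this is the CPU-hungry part other
|     # comments mention).
|     cur.execute("CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
|                 "ON chunks USING hnsw (embedding vector_cosine_ops)")
|     conn.commit()
|
|     query_vec = [0.01] * 1024  # stand-in for a real query embedding
|     literal = "[" + ",".join(str(x) for x in query_vec) + "]"
|     cur.execute(
|         "SELECT id, body FROM chunks "
|         "ORDER BY embedding <=> %s::vector LIMIT 5",
|         (literal,),
|     )
|     print(cur.fetchall())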
| marcyb5st wrote:
| Will try to respond in order:
|
| 1. It depends on how many embeddings we are talking about. A
| few million? Probably yes. Hundreds of millions or billions?
| You likely need something custom.
|
| 2. Vectors are only one way to search for things. If your
| corpus contains stuff that doesn't carry semantic weight (think
| part numbers) and you want to find the chunk that contains that
| information, you'll likely need something that uses tf-idf (a
| small sketch follows after this list).
|
| 3. Regarding chunk size, it really depends on your data and the
| queries your users will run. The denser the content, the
| smaller the chunk size.
|
| 4. Preprocessing - again, it depends. If it's PDFs with just
| text, try to remove footers / headers from the extracted text.
| If it contains tables, look at something like TableFormer to
| extract a good HTML representation. Clean up other artifacts
| from the text (like dashes for line breaking, square brackets
| with reference numbers for scientific papers, ...).
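|
| A small sketch of point 2 with scikit-learn (the chunks and
| query are made up; in practice you'd blend this score with the
| vector-similarity score rather than use it alone):
|
|     # pip install scikit-learn
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.metrics.pairwise import cosine_similarity
|
|     chunks = [
|         "Replace filter cartridge P/N 4411-A every 6 months.",
|         "The pump assembly ships with gasket kit GK-77.",
|     ]
|     vectorizer = TfidfVectorizer()
|     chunk_mat = vectorizer.fit_transform(chunks)
|
|     query = "where is part 4411-A used?"
|     scores = cosine_similarity(vectorizer.transform([query]), chunk_mat)[0]
|     print(scores.argmax(), scores)  # the exact token match wins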
| dotinvoke wrote:
| I had the same idea, but now I have a Postgres database that has
| very high latency for simple queries because the CPU is busy
| building large HNSW indexes.
|
| My impression is that it might be best to do vector index
| construction separately from the rest of the data, for
| performance reasons. It seems vector indexes are several orders
| of magnitude more compute intensive than most other database
| operations.
| pjot wrote:
| Personally I feel they are overkill.
|
| For example I'm using duckDB as a vector store for similarity
| search and RAG. It works really well.
|
| https://github.com/patricktrainer/duckdb-embedding-search
| woodson wrote:
| Using a built-in vector extension is convenient if you want to
| integrate vector similarity ("semantic search") with faceted
| search. Some vector stores (e.g., qdrant) do support attaching
| attributes to vectors that can be filtered on, though.
|
| As mentioned by another comment, an advantage of using a
| separate vector store (on different hardware) is that
| (re-)building vector indices can cause high CPU load and
| therefore make latency for regular queries go up.
| doctorpangloss wrote:
| I work in image diffusion rather than "LLMs."
|
| RAGs are the ControlNet of image diffusion. They exist for many
| reasons, some of which are that context windows are small,
| instruct-style frontier models haven't been adequately trained
| on search tasks, and reason #1: people say they need RAGs so an
| industry sprouts up to give it to them.
|
| Do we need RAGs? I guess for now yes, but in the near future
| no: 2/3 reasons will be solved by improvements to frontier
| models that are eminently doable and probably underway already.
| So let's answer the question for controlnets instead to
| illuminate why just because someone asks for something, doesn't
| mean it makes any sense.
|
| If you're Marc Andreessen and you call Mike Ovitz, your
| conversation about AI art generation is going to go like this:
| "Hollywood people tell me that they don't want the AI to make
| creative decisions, they want AI for VFX or to make short
| TikTok videos or" something something, "the most important
| people say they want tools that do not obsolete them." This
| trickles down to the lowly art director, who may have an art
| illustration background but who is already sending stuff
| overseas to be done in something that resembles functionally a
| dehumanized art generator. Everybody up and down this value
| chain has no math or English Lit background so to them, the
| simplest, most visual UX that doesn't threaten their livelihood
| is what they want: Sketch To Image.
|
| Does Sketch to image make sense? No. I mean it makes sense for
| people who cannot be fucked to do the absolutely minimal amount
| of lift to write prompts, which is many art professionals who,
| for the worse, have adopted "I don't write" as an _identity_ ,
| not merely some technical skill specialization. But once you
| overcome this incredibly small obstacle of writing 25 words to
| Ideogram instead of 3 words to Stable Diffusion, it's obvious:
| nobody needs to draw something and then have a computer finish
| it. Of course it's technologically and scientifically tractable
| to have all the benefits of controlnets like, well control and
| consistency, but with ordinary text. But people who buy
| software want something that is finished, they are not waiting
| around for R&D projects. They want some other penniless
| creative to make a viral video using Ideogram or they want
| their investor's daughter's heir boyfriend who is their boss to
| shove it down their throats.
|
| This is all meant to illustrate that you should not be asking
| people who don't know anything what technology they want. They
| absolutely positively will say "faster horses." RAGs are faster
| horses!
| antirez wrote:
| I wonder if random projections or other similar dimensionality
| reduction techniques work as well as using a model specialized
| in smaller embeddings that capture the same amount of semantic
| information. This way we could use the larger embeddings of
| open models that work very well, and yet enjoy faster node-to-
| node similarity during searches.
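|
| A minimal sketch of the random-projection idea (purely
| illustrative dimensions, not tied to any particular model):
|
|     import numpy as np
|
|     rng = np.random.default_rng(42)
|     d_in, d_out = 1024, 256
|
|     # Johnson-Lindenstrauss style projection: a fixed random
|     # Gaussian matrix shared by every vector you index and query.
|     proj = rng.normal(size=(d_in, d_out)) / np.sqrt(d_out)
|
|     big_vecs = rng.normal(size=(1000, d_in))   # "open model" embeddings
|     small_vecs = big_vecs @ proj               # (1000, 256)
|     small_vecs /= np.linalg.norm(small_vecs, axis=1, keepdims=True)
|     # Cosine similarities in the 256-dim space approximate those
|     # in the original 1024-dim space, so search gets cheaper.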
___________________________________________________________________
(page generated 2024-11-24 23:01 UTC)