[HN Gopher] Gemini Embedding: Powering RAG and context engineering
       ___________________________________________________________________
        
       Gemini Embedding: Powering RAG and context engineering
        
       Author : simonpure
       Score  : 138 points
       Date   : 2025-07-31 16:47 UTC (6 hours ago)
        
 (HTM) web link (developers.googleblog.com)
 (TXT) w3m dump (developers.googleblog.com)
        
       | bryan0 wrote:
       | The Matryoshka embeddings seem interesting:
       | 
       | > The Gemini embedding model, gemini-embedding-001, is trained
       | using the Matryoshka Representation Learning (MRL) technique
       | which teaches a model to learn high-dimensional embeddings that
       | have initial segments (or prefixes) which are also useful,
       | simpler versions of the same data. Use the output_dimensionality
       | parameter to control the size of the output embedding vector.
       | Selecting a smaller output dimensionality can save storage space
       | and increase computational efficiency for downstream
       | applications, while sacrificing little in terms of quality. By
       | default, it outputs a 3072-dimensional embedding, but you can
       | truncate it to a smaller size without losing quality to save
       | storage space. We recommend using 768, 1536, or 3072 output
       | dimensions. [0]
       | 
       | looks like even the 256-dim embeddings perform really well.
       | 
       | [0]: https://ai.google.dev/gemini-api/docs/embeddings#quality-
       | for...
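        | 
        | For illustration, the truncation itself is trivial on the
        | client side (a minimal numpy sketch, not the actual API; in
        | practice you'd just set output_dimensionality and let the API
        | return the smaller vector):
        | 
        |   import numpy as np
        | 
        |   def truncate(vec: np.ndarray, dim: int) -> np.ndarray:
        |       # Keep the first `dim` values of an MRL-trained
        |       # embedding and re-normalize so cosine similarity
        |       # still behaves.
        |       prefix = vec[:dim]
        |       return prefix / np.linalg.norm(prefix)
        | 
        |   full = np.random.rand(3072)   # stand-in for a full embedding
        |   small = truncate(full, 768)   # 4x less storage
        |   print(small.shape)            # (768,)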
        
         | simonw wrote:
         | The Matryoshka trick is really neat - there's a good
         | explanation here: https://huggingface.co/blog/matryoshka
         | 
         | I've seen it in a few models now - Nomic Embed 1.5 was the
         | first https://www.nomic.ai/blog/posts/nomic-embed-matryoshka
        
           | alach11 wrote:
           | OpenAI did it a few weeks earlier when they released text-
           | embedding-3-large, right?
        
             | simonw wrote:
             | Huh, yeah you're right: that was January 25th 2024
             | https://openai.com/index/new-embedding-models-and-api-
             | update...
             | 
             | Nomic 1.5 was February 14th 2024:
             | https://www.nomic.ai/blog/posts/nomic-embed-matryoshka
        
         | ACCount36 wrote:
         | Google teams seem to be in love with that Matryoshka tech. I
         | wonder how far that scales.
        
           | OutOfHere wrote:
            | It's a practical feature. Scaling isn't really the question
            | here: it only goes up to the full embedding length, with
            | useful truncation points at a few fixed sizes.
        
         | thefourthchime wrote:
         | It's interesting, but the improvement they're claiming isn't
         | that groundbreaking.
        
         | OutOfHere wrote:
         | Does OpenAI's text-embedding-3-large or text-embedding-3-small
         | embedding model have the Matryoshka property?
        
           | minimaxir wrote:
           | They do, they just don't advertise it well (and only
           | confirmed it with a footnote after criticism of its
           | omission): https://openai.com/index/new-embedding-models-and-
           | api-update...
           | 
           | > Both of our new embedding models were trained with a
           | technique that allows developers to trade-off performance and
           | cost of using embeddings. Specifically, developers can
           | shorten embeddings (i.e. remove some numbers from the end of
           | the sequence) without the embedding losing its concept-
           | representing properties by passing in the dimensions API
           | parameter. For example, on the MTEB benchmark, a text-
           | embedding-3-large embedding can be shortened to a size of 256
           | while still outperforming an unshortened text-embedding-
           | ada-002 embedding with a size of 1536.
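            | 
            | For anyone who wants to try it, the shortening is just the
            | dimensions parameter on the embeddings endpoint (rough
            | sketch with the openai Python SDK; the input string is a
            | placeholder):
            | 
            |   from openai import OpenAI
            | 
            |   client = OpenAI()  # uses OPENAI_API_KEY env var
            | 
            |   # Ask for a 256-dim vector directly instead of
            |   # truncating the 3072-dim embedding yourself.
            |   resp = client.embeddings.create(
            |       model="text-embedding-3-large",
            |       input="the quick brown fox",
            |       dimensions=256,
            |   )
            |   vec = resp.data[0].embedding
            |   print(len(vec))  # 256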
        
       | mvieira38 wrote:
       | To anyone working in these types of applications, are embeddings
       | still worth it compared to agentic search for text? If I have a
       | directory of text files, for example, is it better to save all of
       | their embeddings in a VDB and use that, or are LLMs now good
       | enough that I can just let them use ripgrep or something to
       | search for themselves?
        
         | philip1209 wrote:
         | Semantic search is still important. I'd say that regex search
         | is also quickly rising in importance, especially for coding
         | agents.
        
         | pjm331 wrote:
         | With the caveat that I have not spent a serious amount of time
         | trying to get RAG to work - my brief attempt to use it via AWS
         | knowledge base to compare it vs agentic search resulted in me
         | sticking with agentic search (via Claude code SDK)
         | 
         | My impression was there's lots of knobs you can tune with RAG
         | and it's just more complex in general - so maybe there's a
         | point where the amount of text I have is large enough that that
         | complexity pays off - but right now agentic search works very
         | well and is significantly simpler to get started with
        
         | simonw wrote:
         | If your LLM is good enough you'll likely get better results
          | from tool calling with grep or an FTS engine - the better models
         | can even adapt their search patterns to search for things like
         | "dog OR canine" where previously vector similarity may have
         | been a bigger win.
         | 
         | Getting embeddings working takes a bunch of work: you need to
         | decide on a chunking strategy, then run the embeddings, then
         | decide how best to store them for fast retrieval. You often end
         | up having to keep your embedding store in memory which can add
         | up for larger volumes of data.
         | 
         | I did a whole lot of work with embeddings last year but I've
         | mostly lost interest now that tool-based-search has become so
         | powerful.
         | 
         | Hooking up tool-based-search that itself uses embeddings is
         | worth exploring, but you may find that the results you get from
         | ripgrep are good enough that it's not worth the considerable
         | extra effort.
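          | 
          | The tool itself can be tiny. A hand-wavy sketch (the function
          | name is mine, assumes ripgrep is installed; you register it
          | as a tool with whatever LLM SDK you're using):
          | 
          |   import subprocess
          | 
          |   def search_files(pattern: str, path: str = ".") -> str:
          |       # Return matching lines with file names and line
          |       # numbers so the model can pick files to read fully.
          |       out = subprocess.run(
          |           ["rg", "--line-number", "--max-count", "5",
          |            pattern, path],
          |           capture_output=True, text=True,
          |       ).stdout
          |       return out[:10_000] or "no matches"
          | 
          |   # The model then issues queries like "dog|canine" itself.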
        
         | whinvik wrote:
          | Curious, but how do we take care of non-text files? What if we
          | had a lot of PDF files?
        
           | minimaxir wrote:
           | You can extract text from PDF files. (there's a number of
           | dedicated models for that, but even the humble pandoc can do
           | it)
        
           | sergiotapia wrote:
           | Use pymupdf to extract the PDF text. Hell, run that nasty
           | business through an LLM as step-2 to get a beautiful clean
           | markdown version of the text. Lord knows the PDF format is
           | horribly complex!
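            | 
            | Something like this gets you most of the way (minimal
            | sketch; pymupdf imports as fitz, and the path is made up):
            | 
            |   import fitz  # pymupdf
            | 
            |   doc = fitz.open("contracts/2024-msa.pdf")
            |   text = "\n".join(page.get_text() for page in doc)
            |   doc.close()
            | 
            |   # Optional step 2: hand `text` to an LLM and ask for
            |   # clean markdown before you chunk/embed it.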
        
       | morkalork wrote:
       | Question to other GCP users, how are you finding Google's
       | aggressive deprecation of older embedding models? Feels like you
       | have to pay to rerun your data through every 12 months.
        
         | adregan wrote:
         | This is precisely the risk I've been wondering about with
         | vectorization. I've considered that an open source model might
         | be valuable as you could always find someone to host it for you
         | and control the deprecation rate yourself.
        
         | throwaway-blaze wrote:
         | You know of lots of LLM-using apps that don't need to re-run
         | their fine tunings or embeddings because of improvements or new
         | features at least annually? Things are moving so fast that
         | "every 12 months" seems kinda slow...
        
         | BoorishBears wrote:
         | My costs for embedding are so small compared to inference I
         | don't generally notice.
         | 
         | But am I crazy or did the pre-production version of gemini-
         | embedding-001 have a much larger max context length?
         | 
          | Edit: It seems like it did? 8k -> 2k? Huge downgrade if true;
          | I was really excited about the experimental model reaching GA
          | before that.
        
       | asdev wrote:
        | I feel like tool calling killed RAG; however, you have less
        | control over how the retrieved data is injected into the context.
        
         | OutOfHere wrote:
         | How would you use tool-calling to filter through millions of
         | documents? You need some search functionality, whether old-
         | school search or embedding search. If you have only thousands
         | of documents, then sure, you don't need search, as you can feed
         | them all to the LLM.
        
           | kfajdsl wrote:
           | You give the LLM search tools.
        
             | OutOfHere wrote:
             | That's missing the point. You are hiding the search behind
             | the tool, but it's still search. Whether you use a tool or
             | a hardcoded workflow is irrelevant.
        
           | kridsdale1 wrote:
            | I haven't built either system but it seems clear that tool
            | calling will be 'O(num_targets * cost(search tool))', while
            | RAG will be 'O(embed_query * num_targets)'.
           | 
           | RAG looks linear (constant per lookup) while tools look
           | polynomial. And tools will possibly fill up the limited LLM
           | context too.
        
         | billmalarky wrote:
         | Search tool calling is RAG. Maybe we should call it a "RAG
         | Agent" to be more en vogue heh. But RAG is not just similarity
         | search on embeddings in vector DBs. RAG is any type of a
         | retrieval + context injection step prior to inference.
         | 
         | Heck, the RAG Agent could run cosign diff on your vector db in
         | addition to grep, FTS queries, KB api calls, whatever, to do
         | wide recall (candidate generation) then rerank (relevance
         | prioritization) all the results.
         | 
         | You are probably correct that for most use cases search tool
         | calling makes more practical sense than embeddings similarity
         | search to power RAG.
        
           | visarga wrote:
           | > could run cosign diff on your vector db
           | 
           | or maybe even "cosine similarity"
        
       | stillpointlab wrote:
       | > Embeddings are crucial here, as they efficiently identify and
       | integrate vital information--like documents, conversation
       | history, and tool definitions--directly into a model's working
       | memory.
       | 
       | I feel like I'm falling behind here, but can someone explain this
       | to me?
       | 
       | My high-level view of embedding is that I send some text to the
       | provider, they tokenize the text and then run it through some NN
       | that spits out a vector of numbers of a particular size (looks to
       | be variable in this case including 768, 1536 and 3072). I can
       | then use those embeddings in places like a vector DB where I
       | might want to do some kind of similarity search (e.g. cosine
       | difference). I can also use them to do clustering on that
       | similarity which can give me some classification capabilities.
       | 
       | But how does this translate to these things being "directly into
       | a model's working memory'? My understanding is that with RAG I
       | just throw a bunch of the embeddings into a vector DB as keys but
       | the ultimate text I send in the context to the LLM is the source
       | text that the keys represent. I don't actually send the
       | embeddings themselves to the LLM.
       | 
        | So what is this marketing stuff about "directly into a model's
        | working memory"? Is my mental view wrong?
        
         | NicholasD43 wrote:
          | You're right on this. "Advanced" RAG techniques are all
          | complete marketing BS; in the end all you're doing is passing
          | the text into the model's context window.
        
         | letitgo12345 wrote:
          | LLMs can use search engines as a tool. One possibility is that
          | Google embeds the search query through these embeddings, does
          | retrieval using them, and then pastes the retrieved result into
          | the model's chain of thought (which, unless they have an
          | external memory module in their model, is basically the model's
          | only working memory).
        
           | stillpointlab wrote:
           | I'm reading the docs and it does not appear Google keeps
           | these embeddings at all. I send some text to them, they
           | return the embedding for that text at the size I specified.
           | 
           | So the flow is something like:
           | 
           | 1. Have a text doc (or library of docs)
           | 
           | 2. Chunk it into small pieces
           | 
           | 3. Send each chunk to <provider> and get an embedding vector
           | of some size back
           | 
           | 4. Use the embedding to:
           | 
           | 4a. Semantic search / RAG: put the embeddings in a vector DB
           | and do some similarity search on the embedding. The ultimate
           | output is the source _chunk_
           | 
           | 4b. Run a cluster algorithm on the embedding to generate some
           | kind of graph representation of my data
           | 
           | 4c. Run a classifier algorithm on the embedding to allow me
           | to classify new data
           | 
           | 5. The output of all steps in 4 is crucially text
           | 
           | 6. Send that text to an LLM
           | 
            | At no point is the embedding directly in the model's memory.
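            | 
            | Concretely, the whole loop is something like this (toy
            | sketch; embed() here is a stand-in for the provider call,
            | and the search is brute-force numpy):
            | 
            |   import numpy as np
            | 
            |   def embed(text: str) -> np.ndarray:
            |       # Stand-in for step 3 (the provider API call).
            |       # Toy bag-of-letters vector so the sketch runs.
            |       v = np.zeros(26)
            |       for ch in text.lower():
            |           if ch.isalpha():
            |               v[ord(ch) - ord("a")] += 1
            |       return v / (np.linalg.norm(v) or 1.0)
            | 
            |   chunks = ["the cat sat on the mat",      # step 2
            |             "embeddings are just vectors"]
            |   matrix = np.stack([embed(c) for c in chunks])  # step 3
            | 
            |   def retrieve(query: str, k: int = 1) -> list[str]:
            |       scores = matrix @ embed(query)   # cosine sim
            |       top = np.argsort(scores)[::-1][:k]
            |       return [chunks[i] for i in top]  # 4a: text out
            | 
            |   # Step 6: paste the retrieved *text* into the LLM
            |   # prompt; the vectors never reach the model.
            |   print(retrieve("vector embeddings"))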
        
         | Voloskaya wrote:
          | > So what is this marketing stuff about "directly into a
          | model's working memory"? Is my mental view wrong?
         | 
          | Context is sometimes called working memory. But no, your
          | understanding is right: find the right document through cosine
          | similarity (and thus through embeddings), then add the content
          | of those docs to the context.
        
           | greymalik wrote:
           | One of the things I find confusing about this article is that
           | the author positions RAG as being unrelated to both context
           | engineering and vector search.
        
         | yazaddaruvala wrote:
         | At least in theory. If the model is the same, the embeddings
         | can be reused by the model rather than recomputing them.
         | 
         | I believe this is what they mean.
         | 
          | In practice, how fast will the model change (including the
          | tokenizer)? How fast will the vector db be fully backfilled to
          | match the model version?
         | 
         | That would be the "cache hit rate" of sorts and how much it
         | helps likely depends on some of those variables for your
         | specific corpus and query volumes.
        
           | stillpointlab wrote:
           | > the embeddings can be reused by the model
           | 
           | I can't find any evidence that this is possible with Gemini
           | or any other LLM provider.
        
             | yazaddaruvala wrote:
              | Yeah, given what you're saying is true and continues to be,
              | it seems the embeddings would just be useful as a "nice
              | corpus search" mechanism for some regular RAG.
        
         | fine_tune wrote:
          | RAG is taking a bunch of docs, chunking them into text blocks
          | of a certain length (how best to do this is up for debate),
          | and creating a search API that takes a query (like a google
          | search) and compares it to the document chunks (very much how
          | you're describing). Take the returned chunks, ignore the score
          | from vector search, and feed those chunks into a re-ranker
          | with the original query (this step is important, vector
          | search mostly sucks). Filter the re-ranked results down to
          | the top 1 or 2, and then format a prompt like:
          | 
          | The user asked 'long query', we fetched some docs (see below),
          | answer the query based on the docs (reference the docs if u
          | feel like it)
          | 
          | Doc1.pdf - Chunk N: Eat cheese
          | 
          | Doc2.pdf - Chunk Y: Dont eat cheese
          | 
          | You then expose the search API as a "tool" for the LLM to
          | call, slightly reformatting the prompt above into a multi-turn
          | convo, and suddenly you're in ze money.
          | 
          | But once your users are happy with those results they'll want
          | something dumb like the latest football scores, then you need
          | a web tool - and then it never ends.
          | 
          | To be fair though, it's pretty powerful once you've got it in
          | place.
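          | 
          | The re-rank step is usually a cross-encoder. Rough sketch
          | with sentence-transformers (the model name is just a common
          | default, swap in whatever you like):
          | 
          |   from sentence_transformers import CrossEncoder
          | 
          |   # Scores (query, chunk) pairs jointly - a much better
          |   # signal than the raw vector-search distance.
          |   reranker = CrossEncoder(
          |       "cross-encoder/ms-marco-MiniLM-L-6-v2")
          | 
          |   query = "can I eat cheese?"
          |   chunks = ["Doc1.pdf - Chunk N: Eat cheese",
          |             "Doc2.pdf - Chunk Y: Dont eat cheese"]
          | 
          |   scores = reranker.predict([(query, c) for c in chunks])
          |   ranked = [c for _, c in
          |             sorted(zip(scores, chunks), reverse=True)]
          | 
          |   prompt = (f"The user asked: {query}\n"
          |             "Answer based on the docs below:\n\n"
          |             + "\n\n".join(ranked[:2]))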
        
           | criddell wrote:
           | Is RAG how I would process my 20+ year old bug list for a
           | piece of software I work on?
           | 
           | I've been thinking about this because it would be nice to
           | have a fuzzier search.
        
             | fine_tune wrote:
              | Yes and no, for human search - it's kinda neat, you might
             | find some duplicates, or some nearby neighbour bugs that
             | help you solve a whole class of issues.
             | 
              | But the cool kids? They'd do something worse:
              | 
              | They'd define some complicated agentic setup that clones
              | your code base into containers firewalled off from the
              | world, and give prompts like:
             | 
              | You're an expert software dev in MY_FAVE_LANG, here's a bug
              | description 'LONG BUG DESCRIPTION', explore the code and
              | write a solution. Here's some tools (read_file, write_file,
              | ETC)
             | 
              | You'd then spawn as many of these as you can, per task, and
              | have them all generate pull requests for the tasks. Review
              | them with an LLM, then manually, and accept the PRs you
              | want. Now you're in the ultra money.
             | 
             | You'd use RAG to guide an untuned LLM on your code base for
             | styles and how to write code. You'd write docs like "how to
             | write an API, how to write a DB migration, ETC" and give
              | that as a tool to the agents writing the code.
             | 
             | With time and effort, you can write agents to be specific
             | to your code base through fine tuning, but who's got that
             | kind of money?
        
               | CartwheelLinux wrote:
               | You'd be surprised how many people are actually doing
               | this exact kind of solutioning.
               | 
                | It's also not that costly to do if you think about the
                | problem correctly.
                | 
                | If you continue down the brute-forcing route, you can do
                | mischievous things like sign up for thousands and
                | thousands of free accounts across numerous network
                | connections to LLM APIs and plug away.
        
         | visarga wrote:
          | Oh, what you don't understand is that LLMs also use embeddings
          | inside; it's how they represent tokens. It's just that you
          | don't get to see those embeddings, they're inner workings.
        
         | tcdent wrote:
         | Your mental model is correct.
         | 
         | They're listing applications of that by third parties to
         | demonstrate the use-case, but this is just a model for
         | generating those vectors.
        
         | rao-v wrote:
         | The directly into working memory bit is nonsense of course, but
         | it does point to a problem that is probably worth solving.
         | 
         | What would it take to make the KV cache more portable and
         | cut/paste vs. highly specific to the query?
         | 
         | In theory today, I should be able to process <long quote from
         | document> <specific query> and just stop after the long
          | document and save the KV cache, right? The next time around, I
         | can just load it in, and continue from <new query>?
         | 
         | To keep going, you should be able to train the model to operate
          | so that you can have discontinuous KV cache segments that are
         | unrelated, so you can drop in <cached KV from doc 1> <cached KV
         | from doc 2> with <query related to both> and have it just work
         | ... but I don't think you can do that today.
         | 
          | I seem to remember seeing some papers that tried to "unRoPE"
          | the KV and then "re-RoPE" it, so it can be reused ... but I
          | have not seen the latest. Anybody know what the current state
          | is?
         | 
         | Seems crazy to have to re-process the same context multiple
         | times just to ask it a new query.
        
       | mijoharas wrote:
        | What open embedding models would people recommend? Still Nomic?
        
         | christina97 wrote:
         | The Qwen3 embedding models were released recently and do very
         | well on benchmarks.
        
       | dmezzetti wrote:
       | It's always worth checking out the MTEB leaderboard:
       | https://huggingface.co/spaces/mteb/leaderboard
       | 
       | There are some good open models there that have longer context
       | limits and fewer dimensions.
       | 
       | The benchmarks are just a guide. It's best to build a test
       | dataset with your own data. This is a good example of that:
       | https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...
       | 
        | Another benefit of having your own test dataset is that it can
        | grow as your data grows. And you can quickly test new models to
        | see how they perform with YOUR data.
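        | 
        | For the custom-dataset route, the BEIR loader expects a
        | corpus/queries/qrels layout, roughly (sketch; the folder name
        | is made up, details are in the wiki linked above):
        | 
        |   from beir.datasets.data_loader import GenericDataLoader
        | 
        |   # folder has corpus.jsonl, queries.jsonl, qrels/test.tsv
        |   corpus, queries, qrels = GenericDataLoader(
        |       "my_eval_dataset").load(split="test")
        | 
        |   # Embed corpus + queries with the model under test, then
        |   # score the retrieved results against qrels (nDCG@10 etc.).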
        
       | miohtama wrote:
       | > Everlaw, a platform providing verifiable RAG to help legal
       | professionals analyze large volumes of discovery documents,
       | requires precise semantic matching across millions of specialized
       | texts. Through internal benchmarks, Everlaw found gemini-
       | embedding-001 to be the best, achieving 87% accuracy in surfacing
       | relevant answers from 1.4 million documents filled with industry-
       | specific and complex legal terms, surpassing Voyage (84%) and
       | OpenAI (73%) models. Furthermore, Gemini Embedding's Matryoshka
       | property enables Everlaw to use compact representations, focusing
       | essential information in fewer dimensions. This leads to minimal
       | performance loss, reduced storage costs, and more efficient
       | retrieval and search.
       | 
        | This will make a lot of junior lawyers, or at least their work,
        | obsolete.
        | 
        | Here is a good podcast on how AI will affect the legal
        | industry:
       | 
       | https://open.spotify.com/episode/4IAHG68BeGZzr9uHXYvu5z?si=q...
        
         | dlojudice wrote:
         | It's really cool to see Odd Lots being mentioned here on HN.
         | It's one of my favorite podcasts. However, I think the guest
         | for this particular episode wasn't up to the task of answering
         | questions and exploring the possibilities of using AI in the
         | legal world.
        
       | jcims wrote:
       | I'm short on vocabulary here but it seems that using content
       | embedding similarity to find relevant (chunks of) content to feed
       | an LLM is orthogonal to the use of LLMs to take automatically
       | curated content chunks and use them to enrich a context.
       | 
       | Is that correct?
       | 
        | I'm just curious why this type of content selection seems to have
        | been popularized and in many ways become the de facto standard
        | for RAG, and (as far as I know, but I haven't looked at 'search'
        | in a long time) not generally used for general-purpose search?
        
       | djoldman wrote:
       | It may be worth pointing out that a few open weights models score
       | higher than gemini-embedding-001 on MTEB:
       | 
       | https://huggingface.co/spaces/mteb/leaderboard
       | 
       | Particularly Qwen3-Embedding-8B and Qwen3-Embedding-4B:
       | 
       | https://huggingface.co/Qwen/Qwen3-Embedding-8B
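        | 
        | They're easy to try locally too, e.g. (sketch with
        | sentence-transformers; the 0.6B variant is just the smallest
        | one to download):
        | 
        |   from sentence_transformers import SentenceTransformer
        | 
        |   model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
        |   docs = ["Gemini embeddings power RAG.",
        |           "Matryoshka lets you truncate vectors."]
        |   emb = model.encode(docs, normalize_embeddings=True)
        |   print(emb.shape)  # (2, embedding_dim)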
        
       ___________________________________________________________________
       (page generated 2025-07-31 23:00 UTC)