[HN Gopher] Gemini Embedding: Powering RAG and context engineering
___________________________________________________________________
Gemini Embedding: Powering RAG and context engineering
Author : simonpure
Score : 138 points
Date : 2025-07-31 16:47 UTC (6 hours ago)
(HTM) web link (developers.googleblog.com)
(TXT) w3m dump (developers.googleblog.com)
| bryan0 wrote:
| The Matryoshka embeddings seem interesting:
|
| > The Gemini embedding model, gemini-embedding-001, is trained
| using the Matryoshka Representation Learning (MRL) technique
| which teaches a model to learn high-dimensional embeddings that
| have initial segments (or prefixes) which are also useful,
| simpler versions of the same data. Use the output_dimensionality
| parameter to control the size of the output embedding vector.
| Selecting a smaller output dimensionality can save storage space
| and increase computational efficiency for downstream
| applications, while sacrificing little in terms of quality. By
| default, it outputs a 3072-dimensional embedding, but you can
| truncate it to a smaller size without losing quality to save
| storage space. We recommend using 768, 1536, or 3072 output
| dimensions. [0]
|
| looks like even the 256-dim embeddings perform really well.
|
| [0]: https://ai.google.dev/gemini-api/docs/embeddings#quality-
| for...
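|
| Something like this seems to be all it takes (a rough sketch,
| assuming the google-genai Python SDK; the model name and
| output_dimensionality parameter are the ones from the docs
| quoted above):
|
|     from google import genai
|     from google.genai import types
|
|     client = genai.Client()  # assumes GEMINI_API_KEY is set
|
|     # Ask for the 768-dim Matryoshka prefix instead of the
|     # full 3072 dims.
|     result = client.models.embed_content(
|         model="gemini-embedding-001",
|         contents="What is the meaning of life?",
|         config=types.EmbedContentConfig(output_dimensionality=768),
|     )
|     vector = result.embeddings[0].values
|     print(len(vector))  # 768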
| simonw wrote:
| The Matryoshka trick is really neat - there's a good
| explanation here: https://huggingface.co/blog/matryoshka
|
| I've seen it in a few models now - Nomic Embed 1.5 was the
| first https://www.nomic.ai/blog/posts/nomic-embed-matryoshka
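|
| The trick is also easy to apply client-side: keep the first k
| components of a full-size vector and re-normalize before doing
| cosine similarity. A minimal numpy sketch (the dimensions here
| are arbitrary):
|
|     import numpy as np
|
|     def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
|         """Keep the Matryoshka prefix and re-normalize to unit length."""
|         prefix = vec[:dims]
|         return prefix / np.linalg.norm(prefix)
|
|     full = np.random.rand(3072)   # stand-in for a full embedding
|     small = truncate_embedding(full, 256)
|     # cosine similarity between truncated vectors is then just a
|     # dot product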
| alach11 wrote:
| OpenAI did it a few weeks earlier when they released text-
| embedding-3-large, right?
| simonw wrote:
| Huh, yeah you're right: that was January 25th 2024
| https://openai.com/index/new-embedding-models-and-api-
| update...
|
| Nomic 1.5 was February 14th 2024:
| https://www.nomic.ai/blog/posts/nomic-embed-matryoshka
| ACCount36 wrote:
| Google teams seem to be in love with that Matryoshka tech. I
| wonder how far that scales.
| OutOfHere wrote:
| It's a practical feature. Scaling is irrelevant in this
| context because it scales to the length of the embedding,
| although in batches of k-length embeddings.
| thefourthchime wrote:
| It's interesting, but the improvement they're claiming isn't
| that groundbreaking.
| OutOfHere wrote:
| Does OpenAI's text-embedding-3-large or text-embedding-3-small
| embedding model have the Matryoshka property?
| minimaxir wrote:
| They do, they just don't advertise it well (and only
| confirmed it with a footnote after criticism of its
| omission): https://openai.com/index/new-embedding-models-and-
| api-update...
|
| > Both of our new embedding models were trained with a
| technique that allows developers to trade-off performance and
| cost of using embeddings. Specifically, developers can
| shorten embeddings (i.e. remove some numbers from the end of
| the sequence) without the embedding losing its concept-
| representing properties by passing in the dimensions API
| parameter. For example, on the MTEB benchmark, a text-
| embedding-3-large embedding can be shortened to a size of 256
| while still outperforming an unshortened text-embedding-
| ada-002 embedding with a size of 1536.
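|
| For comparison, a minimal sketch of that dimensions parameter
| with the openai Python SDK (the parameter is the one described
| in the quote above):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # assumes OPENAI_API_KEY is set
|
|     # Shorten the embedding server-side to 256 dims.
|     resp = client.embeddings.create(
|         model="text-embedding-3-large",
|         input="The quick brown fox",
|         dimensions=256,
|     )
|     print(len(resp.data[0].embedding))  # 256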
| mvieira38 wrote:
| To anyone working in these types of applications, are embeddings
| still worth it compared to agentic search for text? If I have a
| directory of text files, for example, is it better to save all of
| their embeddings in a VDB and use that, or are LLMs now good
| enough that I can just let them use ripgrep or something to
| search for themselves?
| philip1209 wrote:
| Semantic search is still important. I'd say that regex search
| is also quickly rising in importance, especially for coding
| agents.
| pjm331 wrote:
| With the caveat that I have not spent a serious amount of time
| trying to get RAG to work - my brief attempt to use it via AWS
| knowledge base to compare it vs agentic search resulted in me
| sticking with agentic search (via Claude code SDK)
|
| My impression was there's lots of knobs you can tune with RAG
| and it's just more complex in general - so maybe there's a
| point where the amount of text I have is large enough that that
| complexity pays off - but right now agentic search works very
| well and is significantly simpler to get started with
| simonw wrote:
| If your LLM is good enough you'll likely get better results
| from tool calling with grep or a FTS engine - the better models
| can even adapt their search patterns to search for things like
| "dog OR canine" where previously vector similarity may have
| been a bigger win.
|
| Getting embeddings working takes a bunch of work: you need to
| decide on a chunking strategy, then run the embeddings, then
| decide how best to store them for fast retrieval. You often end
| up having to keep your embedding store in memory which can add
| up for larger volumes of data.
|
| I did a whole lot of work with embeddings last year but I've
| mostly lost interest now that tool-based-search has become so
| powerful.
|
| Hooking up tool-based-search that itself uses embeddings is
| worth exploring, but you may find that the results you get from
| ripgrep are good enough that it's not worth the considerable
| extra effort.
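|
| For illustration, the kind of search tool I mean is tiny - a
| rough sketch wrapping ripgrep (the function name and JSON schema
| here are made up for the example, not any particular API):
|
|     import subprocess
|
|     def search_files(pattern: str, directory: str = ".") -> str:
|         """Run ripgrep and return matching lines for the model."""
|         proc = subprocess.run(
|             ["rg", "--line-number", "--max-count", "20",
|              pattern, directory],
|             capture_output=True, text=True,
|         )
|         return proc.stdout or "no matches"
|
|     # A tool definition in the JSON-schema style most LLM APIs
|     # accept, exposed to the model alongside the function above.
|     search_tool = {
|         "name": "search_files",
|         "description": "Full-text search over local documents.",
|         "parameters": {
|             "type": "object",
|             "properties": {"pattern": {"type": "string"}},
|             "required": ["pattern"],
|         },
|     }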
| whinvik wrote:
| Curious, but how do we take care of non-text files? What if we
| had a lot of PDF files?
| minimaxir wrote:
| You can extract text from PDF files (there are a number of
| dedicated models for that, but even the humble pandoc can do
| it).
| sergiotapia wrote:
| Use pymupdf to extract the PDF text. Hell, run that nasty
| business through an LLM as step-2 to get a beautiful clean
| markdown version of the text. Lord knows the PDF format is
| horribly complex!
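|
| A minimal sketch of that first step (the file name is made up
| and the LLM cleanup pass is left out):
|
|     import fitz  # pymupdf
|
|     def pdf_to_text(path: str) -> str:
|         """Concatenate the plain text of every page in the PDF."""
|         with fitz.open(path) as doc:
|             return "\n".join(page.get_text() for page in doc)
|
|     text = pdf_to_text("contract.pdf")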
| morkalork wrote:
| Question to other GCP users, how are you finding Google's
| aggressive deprecation of older embedding models? Feels like you
| have to pay to rerun your data through every 12 months.
| adregan wrote:
| This is precisely the risk I've been wondering about with
| vectorization. I've considered that an open source model might
| be valuable as you could always find someone to host it for you
| and control the deprecation rate yourself.
| throwaway-blaze wrote:
| You know of lots of LLM-using apps that don't need to re-run
| their fine tunings or embeddings because of improvements or new
| features at least annually? Things are moving so fast that
| "every 12 months" seems kinda slow...
| BoorishBears wrote:
| My costs for embedding are so small compared to inference I
| don't generally notice.
|
| But am I crazy or did the pre-production version of gemini-
| embedding-001 have a much larger max context length?
|
| Edit: It seems like it did? 8k -> 2k? Huge downgrade if true, I
| was really excited about the experimental model reaching GA
| before that
| asdev wrote:
| I feel like tool calling killed RAG, though you have less
| control over how the retrieved data is injected into the
| context.
| OutOfHere wrote:
| How would you use tool-calling to filter through millions of
| documents? You need some search functionality, whether old-
| school search or embedding search. If you have only thousands
| of documents, then sure, you don't need search, as you can feed
| them all to the LLM.
| kfajdsl wrote:
| You give the LLM search tools.
| OutOfHere wrote:
| That's missing the point. You are hiding the search behind
| the tool, but it's still search. Whether you use a tool or
| a hardcoded workflow is irrelevant.
| kridsdale1 wrote:
| I haven't built either system but it seems clear that tool
| calling will be 'O(num_targets * O(search tool))', while RAG
| will be 'O(embed_query * num_targets)'.
|
| RAG looks linear (constant per lookup) while tools look
| polynomial. And tools will possibly fill up the limited LLM
| context too.
| billmalarky wrote:
| Search tool calling is RAG. Maybe we should call it a "RAG
| Agent" to be more en vogue heh. But RAG is not just similarity
| search on embeddings in vector DBs. RAG is any type of
| retrieval + context injection step prior to inference.
|
| Heck, the RAG Agent could run cosign diff on your vector db in
| addition to grep, FTS queries, KB api calls, whatever, to do
| wide recall (candidate generation) then rerank (relevance
| prioritization) all the results.
|
| You are probably correct that for most use cases search tool
| calling makes more practical sense than embeddings similarity
| search to power RAG.
| visarga wrote:
| > could run cosign diff on your vector db
|
| or maybe even "cosine similarity"
| stillpointlab wrote:
| > Embeddings are crucial here, as they efficiently identify and
| integrate vital information--like documents, conversation
| history, and tool definitions--directly into a model's working
| memory.
|
| I feel like I'm falling behind here, but can someone explain this
| to me?
|
| My high-level view of embedding is that I send some text to the
| provider, they tokenize the text and then run it through some NN
| that spits out a vector of numbers of a particular size (looks to
| be variable in this case including 768, 1536 and 3072). I can
| then use those embeddings in places like a vector DB where I
| might want to do some kind of similarity search (e.g. cosine
| difference). I can also use them to do clustering on that
| similarity which can give me some classification capabilities.
|
| But how does this translate to these things being "directly into
| a model's working memory'? My understanding is that with RAG I
| just throw a bunch of the embeddings into a vector DB as keys but
| the ultimate text I send in the context to the LLM is the source
| text that the keys represent. I don't actually send the
| embeddings themselves to the LLM.
|
| So what is this marketing stuff about "directly into a model's
| working memory"? Is my mental view wrong?
| NicholasD43 wrote:
| You're right on this. "Advanced" RAG techniques are all
| complete marketing BS; in the end all you're doing is passing
| the text into the model's context window.
| letitgo12345 wrote:
| LLMs can use search engines as a tool. One possibility is
| Google embeds the search query through these embeddings and
| does retrieval using them and then the retrieved result is
| pasted into the model's chain of thought (which, unless they
| have an external memory module in their model, is basically the
| model's only working memory).
| stillpointlab wrote:
| I'm reading the docs and it does not appear Google keeps
| these embeddings at all. I send some text to them, they
| return the embedding for that text at the size I specified.
|
| So the flow is something like:
|
| 1. Have a text doc (or library of docs)
|
| 2. Chunk it into small pieces
|
| 3. Send each chunk to <provider> and get an embedding vector
| of some size back
|
| 4. Use the embedding to:
|
| 4a. Semantic search / RAG: put the embeddings in a vector DB
| and do some similarity search on the embedding. The ultimate
| output is the source _chunk_
|
| 4b. Run a cluster algorithm on the embedding to generate some
| kind of graph representation of my data
|
| 4c. Run a classifier algorithm on the embedding to allow me
| to classify new data
|
| 5. The output of all steps in 4 is crucially text
|
| 6. Send that text to an LLM
|
| At no point is the embedding directly in the models memory.
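|
| To make that concrete, a minimal in-memory sketch of steps 2-6
| (numpy only; embed() is a stand-in for whichever provider you
| use, assumed to return unit-length vectors):
|
|     import numpy as np
|
|     def embed(text: str) -> np.ndarray:
|         """Stand-in: call your embedding provider here."""
|         raise NotImplementedError
|
|     chunks = ["chunk one ...", "chunk two ..."]        # step 2
|     vectors = np.stack([embed(c) for c in chunks])     # step 3
|
|     def retrieve(query: str, k: int = 3) -> list[str]:
|         scores = vectors @ embed(query)  # cosine, unit vectors
|         top = np.argsort(scores)[::-1][:k]
|         return [chunks[i] for i in top]  # step 4a: output is text
|
|     # step 6: the retrieved chunk *text*, not the vectors, goes
|     # into the prompt sent to the LLM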
| Voloskaya wrote:
| > So what is this marketing stuff about "directly into a
| model's working memory"? Is my mental view wrong?
|
| Context is sometimes called working memory. But no, your
| understanding is right: find the right document through cosine
| similarity (and thus through embeddings), then add the content
| of those docs to the context
| greymalik wrote:
| One of the things I find confusing about this article is that
| the author positions RAG as being unrelated to both context
| engineering and vector search.
| yazaddaruvala wrote:
| At least in theory. If the model is the same, the embeddings
| can be reused by the model rather than recomputing them.
|
| I believe this is what they mean.
|
| In practice, how fast will the model change (including
| tokenizer)? How fast will the vector db be fully backfilled to
| match the model version?
|
| That would be the "cache hit rate" of sorts and how much it
| helps likely depends on some of those variables for your
| specific corpus and query volumes.
| stillpointlab wrote:
| > the embeddings can be reused by the model
|
| I can't find any evidence that this is possible with Gemini
| or any other LLM provider.
| yazaddaruvala wrote:
| Yeah, given that what you're saying is true and continues to
| be, it seems the embeddings would just be useful as a "nice
| corpus search" mechanism for some regular RAG.
| fine_tune wrote:
| RAG is taking a bunch of docs, chunking them into text blocks
| of a certain length (how best to do this is up for debate),
| and creating a search API that takes a query (like a google
| search) and compares it to the document chunks (very much how
| you're describing). Take the returned chunks, ignore the score
| from vector search, feed those chunks into a re-ranker with
| the original query (this step is important; vector search
| mostly sucks), filter the re-ranked results down to the top 1
| or 2, and then format a prompt like:
|
| The user asked 'long query', we fetched some docs (see below),
| answer the query based on the docs (reference the docs if you
| feel like it)
|
| Doc1.pdf - Chunk N Eat cheese
|
| Doc2.pdf - Chunk Y Don't eat cheese
|
| You then expose the search API as a "tool" for the LLM to call,
| slightly reformatting the prompt above into a multi turn convo,
| and suddenly you're in ze money.
|
| But once your users are happy with those results, they'll want
| something dumb like the latest football scores; then you need a
| web tool - and then it never ends.
|
| To be fair though, it's pretty powerful once you've got it in
| place.
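|
| In code, the post-retrieval part looks roughly like this (a
| sketch; rerank() is a placeholder for whatever cross-encoder or
| rerank API you use, and the prompt wording follows the example
| above):
|
|     def rerank(query: str, chunks: list[str]) -> list[str]:
|         """Placeholder for a cross-encoder or hosted rerank API."""
|         raise NotImplementedError
|
|     def build_prompt(query: str, candidates: list[str]) -> str:
|         top = rerank(query, candidates)[:2]  # keep the top 1-2
|         docs = "\n\n".join(
|             f"Doc{i + 1} - {chunk}" for i, chunk in enumerate(top)
|         )
|         return (
|             f"The user asked: {query!r}\n"
|             "We fetched some docs (see below); answer the query "
|             "based on the docs and reference them where useful.\n\n"
|             + docs
|         )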
| criddell wrote:
| Is RAG how I would process my 20+ year old bug list for a
| piece of software I work on?
|
| I've been thinking about this because it would be nice to
| have a fuzzier search.
| fine_tune wrote:
| Yes and no. For human search it's kinda neat - you might
| find some duplicates, or some nearby-neighbour bugs that
| help you solve a whole class of issues.
|
| But the cool kids? They'd do something worse;
|
| They'd define some complicated agentic setup that clones
| your code base into containers firewalled off from the
| world, and give it prompts like:
|
| You're an expert software dev in MY_FAVE_LANG; here's a bug
| description 'LONG BUG DESCRIPTION'. Explore the code and
| write a solution. Here's some tools (read_file, write_file,
| ETC)
|
| You'd then spawn as many of these as you can, per task, and
| have them all generate pull requests for the tasks. Review
| them with an LLM, then manually, and accept the PRs you
| want. Now you're in the ultra money.
|
| You'd use RAG to guide an untuned LLM on your code base for
| styles and how to write code. You'd write docs like "how to
| write an API, how to write a DB migration, ETC" and give
| that as a tool to the agents writing the code.
|
| With time and effort, you can write agents to be specific
| to your code base through fine tuning, but who's got that
| kind of money?
| CartwheelLinux wrote:
| You'd be surprised how many people are actually doing
| this exact kind of solutioning.
|
| It's also not that costly to do if you think about the
| problem correctly
|
| If you continue down the brute forcing route you can do
| mischievous things like sign up for thousands and
| thousands of free accounts across numerous network
| connections to LLM APIs and plug away
| visarga wrote:
| Oh, what you don't understand is that LLMs also use embeddings
| inside; it's how they represent tokens. It's just that you
| don't get to see the embeddings - they are inner workings.
| tcdent wrote:
| Your mental model is correct.
|
| They're listing applications of that by third parties to
| demonstrate the use-case, but this is just a model for
| generating those vectors.
| rao-v wrote:
| The directly into working memory bit is nonsense of course, but
| it does point to a problem that is probably worth solving.
|
| What would it take to make the KV cache more portable and
| cut/paste vs. highly specific to the query?
|
| In theory today, I should be able to process <long quote from
| document> <specific query> and just stop after the long
| document and save the KV cache right? The next time around, I
| can just load it in, and continue from <new query>?
|
| To keep going, you should be able to train the model to operate
| so that you can have discontinous KV cache segments that are
| unrelated, so you can drop in <cached KV from doc 1> <cached KV
| from doc 2> with <query related to both> and have it just work
| ... but I don't think you can do that today.
|
| I seem to remember seeing some papers that tried to "unRoPE" the
| KV and then "re-RoPE" it, so it can be reused ... but I have
| not seen the latest. Anybody know what the current state is?
|
| Seems crazy to have to re-process the same context multiple
| times just to ask it a new query.
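|
| The single-prefix case does work with local models today - a
| rough sketch with Hugging Face transformers (the model name is
| arbitrary; cache the document prefix once, then reuse a copy of
| it per query):
|
|     import copy
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "Qwen/Qwen2.5-0.5B-Instruct"  # any causal LM works
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     doc_ids = tok("<long quote from document>",
|                   return_tensors="pt").input_ids
|     with torch.no_grad():
|         prefix = model(doc_ids, use_cache=True).past_key_values
|
|     q_ids = tok("<specific query>", return_tensors="pt").input_ids
|     with torch.no_grad():
|         # deep-copy so the cached document prefix isn't mutated
|         # and can be reused for the next query
|         out = model(q_ids, past_key_values=copy.deepcopy(prefix),
|                     use_cache=True)
|     # out.logits continues from the end of the cached document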
| mijoharas wrote:
| What open embedding models would people recommend? Still Nomic?
| christina97 wrote:
| The Qwen3 embedding models were released recently and do very
| well on benchmarks.
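|
| They're also easy to run locally - a minimal sketch with
| sentence-transformers (assuming a recent version that supports
| the Qwen3 embedding checkpoints, using the smallest one):
|
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
|     docs = ["Embeddings map text to vectors.",
|             "The weather is nice today."]
|     doc_vecs = model.encode(docs, normalize_embeddings=True)
|     q_vec = model.encode("what are embeddings?",
|                          normalize_embeddings=True)
|     print(doc_vecs @ q_vec)  # cosine similarities (unit vectors)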
| dmezzetti wrote:
| It's always worth checking out the MTEB leaderboard:
| https://huggingface.co/spaces/mteb/leaderboard
|
| There are some good open models there that have longer context
| limits and fewer dimensions.
|
| The benchmarks are just a guide. It's best to build a test
| dataset with your own data. This is a good example of that:
| https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...
|
| Another benefit of having your own test dataset is that it can
| grow as your data grows. And you can quickly test new models to
| see how they perform with YOUR data.
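|
| A tiny sketch of what such a test can look like, assuming you
| have (query, relevant doc id) pairs from your own data and an
| embed() function for the model under test:
|
|     import numpy as np
|
|     def recall_at_k(queries, relevant_ids, doc_vecs, embed, k=5):
|         """Fraction of queries whose relevant doc is in the top-k."""
|         hits = 0
|         for query, rel in zip(queries, relevant_ids):
|             scores = doc_vecs @ embed(query)  # unit vectors -> cosine
|             topk = np.argsort(scores)[::-1][:k]
|             hits += int(rel in topk)
|         return hits / len(queries)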
| miohtama wrote:
| > Everlaw, a platform providing verifiable RAG to help legal
| professionals analyze large volumes of discovery documents,
| requires precise semantic matching across millions of specialized
| texts. Through internal benchmarks, Everlaw found gemini-
| embedding-001 to be the best, achieving 87% accuracy in surfacing
| relevant answers from 1.4 million documents filled with industry-
| specific and complex legal terms, surpassing Voyage (84%) and
| OpenAI (73%) models. Furthermore, Gemini Embedding's Matryoshka
| property enables Everlaw to use compact representations, focusing
| essential information in fewer dimensions. This leads to minimal
| performance loss, reduced storage costs, and more efficient
| retrieval and search.
|
| This will make a lot of junior lawyers or their work obsolete.
|
| Here is a good podcast on how AI will affect the legal
| industry:
|
| https://open.spotify.com/episode/4IAHG68BeGZzr9uHXYvu5z?si=q...
| dlojudice wrote:
| It's really cool to see Odd Lots being mentioned here on HN.
| It's one of my favorite podcasts. However, I think the guest
| for this particular episode wasn't up to the task of answering
| questions and exploring the possibilities of using AI in the
| legal world.
| jcims wrote:
| I'm short on vocabulary here but it seems that using content
| embedding similarity to find relevant (chunks of) content to feed
| an LLM is orthogonal to the use of LLMs to take automatically
| curated content chunks and use them to enrich a context.
|
| Is that correct?
|
| I'm just curious why this type of content selection seems to have
| been popularized and in many ways become the de facto standard
| for RAG, and (as far as I know, but I haven't looked at 'search'
| in a long time) not generally used for general-purpose search?
| djoldman wrote:
| It may be worth pointing out that a few open weights models score
| higher than gemini-embedding-001 on MTEB:
|
| https://huggingface.co/spaces/mteb/leaderboard
|
| Particularly Qwen3-Embedding-8B and Qwen3-Embedding-4B:
|
| https://huggingface.co/Qwen/Qwen3-Embedding-8B
___________________________________________________________________
(page generated 2025-07-31 23:00 UTC)