[HN Gopher] RAG is more than just embedding search
___________________________________________________________________
RAG is more than just embedding search
Author : jxnlco
Score : 104 points
Date : 2023-09-21 16:18 UTC (6 hours ago)
(HTM) web link (jxnl.github.io)
(TXT) w3m dump (jxnl.github.io)
| dsr_ wrote:
| So instead of asking google or wikipedia, you ask a natural
| language tokenizer to break your query into several possible
| queries, then feed that to an LLM in order to get an essay that
| might answer your question.
|
| Do I have that basically correct?
|
| edit, 43 minutes later: the first three responders say yes. So,
| it's a way of increasing the verbosity and reducing the
| reliability of responses to search queries. Yay! Who would not
| want such a thing?
|
| (me. And probably you.)
| fnordpiglet wrote:
| At its most basic perhaps. But the LLM has an enormous semantic
| corpus embedded in its model that augments the retrieved
| document. The retrieved document in a way cements the context
| better to help prevent wandering into hallucinations. So the
| LLM would indeed be able to summarize the retrieved document,
| but also synthesize it with other "knowledge" embedded in its
| model.
|
| But the more important thing is you can interrogate the LLM to
| ask it the specific questions you have based on what it has
| said and your goals. Contrast this to information-retrieval-
| based methods, where you read the article hoping your questions
| are answered, and when they aren't you are stuck digging
| through less and less relevant results or refining a search
| string hoping to find the right incantation that tweaks the
| index in the right way, sifting through documents that may
| contain the kernel of information somewhere if it wasn't SEO'ed
| out of existence. This is a really unnatural way of discovering
| information - the natural way, say with a teacher, is to be
| told background, ask questions, and iterate to understanding.
| This is how chat based LLMs work.
|
| However, with RAG you can ground them more concretely: their
| model is a massive mishmash of everything that may or may not
| embed the information sought, mixed in with everything else it
| was trained on. You can bring factual information into context
| that may not even have been in the training data. However, the
| facts are a small aspect of knowledge - the overall semantics
| of the total corpus support the facts in adjacent areas.
| throwaway4aday wrote:
| You could also introduce a classifier step that takes the
| result of the query and asks the LLM if the results truly are
| relevant or not before passing them on to the summarization
| step. You can even add more steps (with possibly diminishing
| returns) such as taking the more relevant results and
| crafting a new query that is a very condensed summary,
| embedding it and then finding more results that are
| semantically similar to it.
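|
| Concretely, the classifier step could be as little as this (the
| chat() wrapper, prompt wording, and the pre-1.0 openai client
| are just what I'd sketch with, not anything prescribed):
|
|     import openai  # assumes OPENAI_API_KEY is set
|
|     def chat(prompt):
|         resp = openai.ChatCompletion.create(
|             model="gpt-3.5-turbo",
|             messages=[{"role": "user", "content": prompt}],
|         )
|         return resp["choices"][0]["message"]["content"]
|
|     def filter_relevant(question, results):
|         keep = []
|         for r in results:
|             verdict = chat(
|                 f"Question: {question}\nResult: {r}\n"
|                 "Is this result truly relevant? Answer yes or no."
|             )
|             if verdict.strip().lower().startswith("yes"):
|                 keep.append(r)
|         return keep  # only these reach the summarization step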
| fnordpiglet wrote:
| Yep. But the idea that a RAG-backed LLM is merely an
| efficient summarizer misses the real power, which is that it
| can summarize and then be interrogated iteratively to refine,
| in a semantic sense, the actual questions you have, or explore
| adjacent spaces. It's not just a search engine that can
| summarize, it's a search engine that you can interrogate in
| natural language and it responds directly to your
| questions, as opposed to throwing a bunch of documents at
| you that have a probability of being related to your query.
| simcop2387 wrote:
| Not quite; the advantage is that you can give it any documents
| you want to search, even ones that aren't available to Google or
| Wikipedia. But I think otherwise that is essentially what is
| proposed here. The nice part is that since you know which
| documents were looked up when formulating the answer, you can
| also provide those as part of the output to the user, so they
| can then go check the source data to confirm what was stated by
| the LLM.
| lukev wrote:
| Yes, a bit, though an important feature here is that it's still
| searching the underlying data sources (e.g. Google, Wikipedia,
| or others) and then using an LLM to summarize the results.
|
| The "natural language tokenizer" itself is often an LLM (they
| do a pretty good job of this).
|
| A further extension this article doesn't talk about is to have
| an LLM with a different prompt analyze the answer before
| returning it to the user, and do more queries if it doesn't
| believe the question has been well answered (imagine clicking
| "next page" of Google search results under the hood).
|
| The potential complexity of this scales all the way up to a
| full "research assistant" LLM "agent" that calls itself
| recursively.
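|
| That analyze-and-requery loop is easy to sketch; here search()
| and chat() are placeholders for whatever retriever and LLM you
| use, and the judge prompt is illustrative:
|
|     def answer_with_requery(question, search, chat, max_rounds=3):
|         query, answer = question, ""
|         for _ in range(max_rounds):
|             results = search(query)
|             answer = chat(f"Answer {question!r} using: {results}")
|             verdict = chat(
|                 f"Does {answer!r} fully answer {question!r}? "
|                 "Reply DONE, or reply with a better search query."
|             )
|             if verdict.strip().upper() == "DONE":
|                 break
|             query = verdict  # the "next page" move: search again
|         return answer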
| pplonski86 wrote:
| Before learning about RAG I thought that it was a recursive
| LLM agent that traverses over documents. After some study I
| must say that VectorDBs are boring.
| throwaway4aday wrote:
| It can be as simple or as complicated as you want. The
| article starts off by saying that the naive approach of just
| embedding the query and looking for similar documents is a
| bad one, and that what you actually want to embed and
| compare is something similar to the expected result. They
| don't go into detail on this, but using their example of
| "what is the capital of France" you would conceivably
| transform that into "list of European capital cities" or
| "list of cities in France" using an LLM, embed that, find
| the similar documents, feed those documents into an LLM
| along with the query and some system instructions about how
| to format the response, and then return that. Keep in mind
| this is an absurdly simplified example query, and none of
| this process is needed to answer the actual question which
| the LLM would know from its training data, but you would
| want this process in place to ensure accurate results for
| more complex or specialized queries.
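|
| A rough sketch of that flow (chat, embed, and the store's
| nearest() are illustrative placeholders, not a specific
| library):
|
|     def rag_answer(question, chat, embed, store, k=5):
|         # 1. rewrite the question to look like the documents
|         rewritten = chat(
|             "Rewrite as a document-style search phrase: " + question
|         )
|         # 2. embed the rewrite and pull similar documents
|         docs = store.nearest(embed(rewritten), k)  # illustrative
|         # 3. answer the original question from that context
|         context = "\n".join(docs)
|         return chat(
|             f"Context:\n{context}\n\n"
|             f"Using only the context, answer: {question}"
|         )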
| lukev wrote:
| > none of this process is needed to answer the actual
| question which the LLM would know from its training data.
|
| I think this isn't true; even if the model has the answer
| stored implicitly in its weights, it has no way of
| "citing it's source" or demonstrating that the answer is
| correct.
| throwaway4aday wrote:
| if your model can't predict the completion of "the
| capital of France is _" then it's going to really suck
| for other completions
| lukev wrote:
| This is a great example of something GPT-4 gets
| confidently wrong, today. I just ran this query:
|
| Prompt: "The year is 894 AD. The capital of France is:
| Response: "In 894 AD, the capital of France was Paris."
|
| This is incorrect. According to Wikipedia, "In the 10th
| century Paris was a provincial cathedral city of little
| political or economic significance..."
|
| The problem is that there's no good way to tell from this
| interaction whether it's true or false, because the
| mechanism that GPT-4 uses to return an answer is the same
| whether it's correct or incorrect.
|
| Unless you already know the answer, the _only_ way to be
| confident that an LLM is answering correctly is to use RAG
| to find a citation.
| simonw wrote:
| "Query-Document Mismatch: This model assumes that query embedding
| and the content embedding are similar in the embedding space,
| which is not always true based on the text you're trying to
| search over."
|
| There are embeddings models that take this into account, which
| are pretty fascinating.
|
| I've been exploring https://huggingface.co/intfloat/e5-large-v2
| which lets you calculate two different types of embeddings in the
| same space. Example from their README:
|
|     passage: As a general guideline, the CDC's average
|     requirement of protein for women ages 19 to 70 is 46
|     grams per day
|
|     query: how much protein should a female eat
|
| You can then build your embedding database out of "passage: "
| embeddings, then run "query: " embeddings against it to try and
| find passages that can answer the question.
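|
| A minimal sketch with sentence-transformers - the prefixes are
| the important part, the rest is plain cosine similarity:
|
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("intfloat/e5-large-v2")
|     passages = [
|         "passage: As a general guideline, the CDC's average "
|         "requirement of protein for women ages 19 to 70 is "
|         "46 grams per day",
|     ]
|     p_emb = model.encode(passages, normalize_embeddings=True)
|     q_emb = model.encode(
|         "query: how much protein should a female eat",
|         normalize_embeddings=True,
|     )
|     print(p_emb @ q_emb)  # dot product = cosine (normalized)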
|
| I've had pretty great initial results trying that out against
| paragraphs from my blog:
| https://til.simonwillison.net/llms/embed-paragraphs#user-con...
|
| This won't help address other challenges mentioned in that post,
| like "what problems did we fix last week?" - but it's still a
| useful starting point.
| bthomas wrote:
| Neat! Do you happen to have the analogous similarity queries
| with a default embedding? Curious to see them side by side.
|
| (I know I can reproduce myself and I appreciate all the code
| you posted there - thought I'd ask first!)
| simonw wrote:
| No, I haven't been disciplined enough to have good examples
| for that yet.
|
| One of my goals right now is to put together a solid RAG
| system built on top of LLM and Datasette that makes it really
| easy to compare different embedding models, chunking
| strategies and prompts to figure out what works best - but
| that's still just an idea in my head at the moment.
| [deleted]
| eshack94 wrote:
| https://archive.ph/lEynt
| binarymax wrote:
| I agree with the premise of the article, but I'm not sure about
| the proposed solution.
|
| Search relevance tuning is a thing. Learn how to use a search
| engine and combine multiple features into ranking signals with
| relevance judgement data.
|
| I recommend the books "Relevant Search" and "AI Powered Search"
| (I'm a contributing author of the latter).
|
| You'll find that having a well tuned retriever is the backbone
| for most complex text AI. Learn the best practices from people
| who have been in the field for years, instead of trying to
| reinvent the wheel.
| majorbadass wrote:
| Agree with your sentiment, though the article explicitly
| mentions precision/recall, suggesting at least some level of
| tuning. Query understanding via structured attributes is SOTA
| and used at top companies. Rewriting the query as a method is
| weird, and yeah I'm not so convinced.
|
| One recurring problem - the hacker ethos doesn't scale with AI
| products. "Mess around until it works" is ok to prototype. This
| is effectively using the dev's intuition on the 10 examples
| they look at as the offline eval function.
|
| But many (most?) new-wave AI products don't have consistent
| offline metrics they optimize for. I think this quickly stops
| working when you've absorbed the obvious gains.
| natsucks wrote:
| Do you know of a good example demonstrating RAG with query
| understanding via structured attributes?
| ivalm wrote:
| A bit of a plug but
|
| https://auxhealth.io/try
|
| Does its generation with RAG, using a mix of structured
| attributes + semantic retrieval.
| sroussey wrote:
| I went to buy it, but apparently I already have an account, so
| I did a password reset, and then it wants my previous password
| to activate the account, and well, I can't buy it.
| binarymax wrote:
| Hi! Send me an email (it's in my profile) and maybe we can
| figure it out for you!
| ramoz wrote:
| It also seems costly to deploy such a robust search backend
| (e.g. Elastic cluster, vector DB, reranking ensemble, LLM for
| complex parsing... these are not cheap technologies)
| [deleted]
| softwaredoug wrote:
| I actually wonder why people dump gobs of user input to the
| vector db, or try to tokenize it into something smart, instead
| of being smarter and asking for queries to be generated. Such
| as:
|
| --
|
| Given a Jira issue database, I want to give you additional
| context to answer a question about a project called FooBar. The
| Jira project id is FOOBAR. Please generate JQL that you would
| like to use to answer this question
|
| My question is: what are the major areas of technical debt in
| project FOOBAR?
|
| --
|
| Given a search engine for the wiki for project foobar, generate
| queries that help you answer this question:
|
| What's the current status of project foobar?
|
| ---
|
| Or somesuch...
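|
| Wrapped in code it might be as little as this (pre-1.0 openai
| client, prompt as above - a sketch, not a recommendation):
|
|     import openai  # assumes OPENAI_API_KEY is set
|
|     PROMPT = """Given a Jira issue database, I want to give you
|     additional context to answer a question about a project
|     called FooBar. The Jira project id is FOOBAR. Reply with
|     only the JQL you would like to run to answer this question.
|
|     My question is: what are the major areas of technical debt
|     in project FOOBAR?"""
|
|     resp = openai.ChatCompletion.create(
|         model="gpt-4",
|         messages=[{"role": "user", "content": PROMPT}],
|     )
|     jql = resp["choices"][0]["message"]["content"]
|     # run the JQL against Jira, then feed the returned issues
|     # back in as context for the actual answer
|     print(jql)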
|
| (and hi Max, thanks for plugging our book :-p )
| binarymax wrote:
| :waves: Hi Doug! (he's co-author of Relevant Search and
| contributing author of AI Powered Search too)
|
| That's definitely a thing. But alarms go off in my head when
| I think about query latency and cost. Can't imagine running
| 1k qps while sending every single one to GPT or Llama - that's
| the stuff of production nightmares for me!
|
| If you've got less demand and have a couple queries a second,
| then maybe it's OK - but you're probably adding a good second
| on top of your query latency.
| natsucks wrote:
| So in your opinion what are some examples of highly effective
| RAG systems/implementations?
| binarymax wrote:
| Any good search you used before all this LLM stuff started
| happening is a perfect candidate for RAG. How do you know if
| a search was good? If you weren't pulling your hair out and
| actually got decent results for your queries (search is a
| thankless job like that - everyone expects it to work and
| complains when it doesn't).
|
| The reason good search is best for RAG is because the prompt
| is seeded by the top results for the query. The only thing
| RAG adds is summarizing things for you and giving you answers
| instead of a list of documents.
|
| And now I gotta confess something, after making RAG systems
| for clients and having to use them with all the web search
| engines these days - I kinda miss the list of documents, and
| find myself just skipping the summary at the top half the
| time and going back to reading the 10 blue links.
| natsucks wrote:
| Interesting. Do you think that points to the current
| limitations of RAG or a mismatch in what a user truly wants
| from search?
| binarymax wrote:
| I think it works well when it's not a blob of text. One
| issue is that most of them are really long-winded. For
| example, if the answer can be nouns, just give me the
| list of nouns instead of a full sentence or paragraph.
|
| Take for example this search:
| https://search.brave.com/search?q=what+are+the+captain+ameri...
|
| Why the paragraph? Just give me a bulleted list! It's
| hard to read and kinda annoying.
|
| Another issue for me is trust. Web search is oft polluted
| with web spam (this is not new). Mentally, one can see a
| URL and skip a site that doesn't have strong authority.
| So now in RAG, I either need to trust the answer, or I
| need to look at the embedded citation and find the
| document and then see if it's trustworthy. This adds
| friction.
|
| This is also not unique to web search. Private search can
| also have poor relevance - do I know the LLM is being
| given the best context? Or is it getting bad context and
| hallucinating? I need to look at the results to be sure
| anyway.
|
| I think when used in appropriate ways it can be good. But
| the experience of "summarize these 10 results for me"
| might not be the best for every query.
| DebtDeflation wrote:
| You're referring to what in the NLP subfield of Question
| Answering Systems would be known as a "factoid question".
| Historically, things like knowledge graphs and RDF triple
| stores would be used for answering these types of
| questions. I'm still not sold on the idea that an LLM is
| the answer to all QA/Chat problems and this is one
| example.
| darkteflon wrote:
| To someone not familiar with the space, search seems like an
| incredibly complex and difficult thing to get right. In your
| view, is it reasonable for the average developer prepared to
| read both of those books to expect to come out the other side
| and construct something ready for production? Thanks!
| cuuupid wrote:
| I think these have common solutions that don't require building
| more systems, running more queries, or jacking up GPU bills -
| which is the direction we should be moving in at this point.
|
| e.g. asymmetric embeddings, instruct-based embeddings, and
| retrieval-rerank all address parts of the problems the author is
| presenting, all while keeping things generally light on infra.
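|
| Retrieval-rerank in particular is cheap to bolt on: a small
| cross-encoder over your existing retriever's top-k, something
| like this (the model name is just a common public checkpoint):
|
|     from sentence_transformers import CrossEncoder
|
|     query = "how much protein should a female eat"
|     candidates = [  # top-k from whatever retriever you run
|         "The CDC recommends 46 grams of protein per day",
|         "Protein shakes are a popular post-workout drink",
|     ]
|     reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
|     scores = reranker.predict([(query, c) for c in candidates])
|     ranked = sorted(zip(scores, candidates), reverse=True)
|     print([c for _, c in ranked])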
| jxnlco wrote:
| > asymmetric embeddings, instruct-based embeddings, and
| retrieval-rerank
|
| How would you handle a relative time range? What if you're
| provided a search client that does not support embeddings?
|
| (maybe, say, the Google Calendar API)
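|
| i.e. I'd rather extract a structured query and let the backend
| handle it (a pydantic sketch; field names are illustrative):
|
|     from datetime import date
|     from typing import Optional
|     from pydantic import BaseModel
|
|     class CalendarSearch(BaseModel):
|         query: str
|         start_date: Optional[date] = None  # "last week", resolved
|         end_date: Optional[date] = None    # against today's date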
| darkteflon wrote:
| Could you elaborate on "asymmetric embeddings"? That's the
| first time I've heard that term used in this context.
| rolisz wrote:
| It's embeddings that are generated differently for queries
| and for documents. The idea is that queries are usually
| short, while documents are longer, so if you embed them in
| the same way, the most relevant docs will be far from the
| query. Instructor from HKU is an example of such asymmetric
| embeddings.
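|
| A sketch of how the asymmetry looks with their model (the
| instruction strings are as I remember them from the model
| card, so treat them as approximate):
|
|     from InstructorEmbedding import INSTRUCTOR
|
|     model = INSTRUCTOR("hkunlp/instructor-large")
|     # a different instruction per side is the asymmetry
|     q = model.encode([[
|         "Represent the question for retrieving supporting documents:",
|         "how much protein should a female eat",
|     ]])
|     d = model.encode([[
|         "Represent the document for retrieval:",
|         "The CDC recommends 46 grams of protein per day",
|     ]])
|     print((q @ d.T)[0][0])  # similarity of query to document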
| fudged71 wrote:
| I've heard of the concept of applying a linear transformation
| to an embedding model's output - is that a similar idea?
| rolisz wrote:
| That's a way of adapting an embedding to a certain
| task/domain, but I wouldn't call it asymmetric
| embeddings.
| hexterPOP wrote:
| Interesting article. I did this thing using LangChain; it has a
| MultiQueryRetriever which does the same thing. Check it out:
| https://python.langchain.com/docs/modules/data_connection/re...
| natsucks wrote:
| Author has a good point, but no mention of hybrid search?
|
| ...not that hybrid search solves everything.
| jxnlco wrote:
| I'd just see that as modeling
|
| {query: str, keywords: List[str]}
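|
| e.g. a pydantic sketch (the keywords feed the lexical/BM25
| side, the query gets embedded):
|
|     from typing import List
|     from pydantic import BaseModel
|
|     class HybridQuery(BaseModel):
|         query: str           # embedded for the semantic side
|         keywords: List[str]  # exact terms for BM25 / filtering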
| nobodyminus wrote:
| > Query-Document Mismatch: This model assumes that query
| embedding and the content embedding are similar in the embedding
| space, which is not always true based on the text you're trying
| to search over. Only using queries that are semantically similar
| to the content is a huge limitation!
|
| It seems like fine-tuning for joint embeddings between your
| queries and content is a far more elegant way to solve this
| problem.
| tunesmith wrote:
| The pattern of "one request yields multiple kinds of responses"
| is challenging. You're basically looking at either having the
| client ask for the results and get them back, and then send the
| results to the backend to get back the summary, OR, you're setting up
| some sort of sockets/server-sent-events thing where the frontend
| request establishes a connection and subscribes, while the
| backend sends back different sorts of "response events" as they
| become available.
| jxnlco wrote:
| why not just asyncio.gather?
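|
| e.g. (toy coroutines standing in for real retrievers and the
| LLM call):
|
|     import asyncio
|
|     async def search_docs(q):
|         await asyncio.sleep(0.1)  # stand-in for a real search
|         return [f"doc hit for {q!r}"]
|
|     async def summarize(q, hits):
|         await asyncio.sleep(0.1)  # stand-in for the LLM call
|         return f"summary of {len(hits)} hits for {q!r}"
|
|     async def answer(q):
|         # fan the retrievers out concurrently, await both
|         docs, more = await asyncio.gather(
|             search_docs(q), search_docs(q + " status")
|         )
|         return await summarize(q, docs + more)
|
|     print(asyncio.run(answer("project foobar")))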
___________________________________________________________________
(page generated 2023-09-21 23:01 UTC)