[HN Gopher] Giving GPT "Infinite" Knowledge
       ___________________________________________________________________
        
       Giving GPT "Infinite" Knowledge
        
       Author : sudoapps
       Score  : 81 points
       Date   : 2023-05-08 17:48 UTC (5 hours ago)
        
 (HTM) web link (sudoapps.substack.com)
 (TXT) w3m dump (sudoapps.substack.com)
        
       | pbhjpbhj wrote:
       | >There is an important part of this prompt that is partially cut
       | off from the image:
       | 
       | >> "If you don't know the answer, just say that you don't know,
       | don't try to make up an answer"
       | 
       | //
       | 
        | It seems silly to make this part of the prompt rather than a
        | separate parameter; surely we could design the response to be
        | close to factual, then run a checker to ascertain a score for
        | the factuality of the output?
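        | 
        | For reference, "making it part of the prompt" just means
        | prepending something like this (an illustrative sketch, not the
        | article's exact template):
        | 
        |     # Hypothetical retrieval-augmented prompt template.
        |     TEMPLATE = """Use the following context to answer the
        |     question. If you don't know the answer, just say that you
        |     don't know, don't try to make up an answer.
        |     
        |     Context: {context}
        |     
        |     Question: {question}"""
        |     
        |     def build_prompt(context: str, question: str) -> str:
        |         return TEMPLATE.format(context=context, question=question)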
        
         | sudoapps wrote:
          | A lot of what prompting has turned into seems silly to me too,
          | but it has been shown to be effective (at least with GPT-4).
        
           | TeMPOraL wrote:
           | Only a month or two ago I found this ridiculous, but then my
           | mental model of GPTs shifted and I don't think it's so stupid
           | anymore.
           | 
           | Technobabble explanation: such "silly" additions are a
           | natural way to emphasize certain dimensions of the latent
           | space more than others, focusing the proximity search GPTs
           | are doing.
           | 
            | Working model I've been getting some good mileage out of:
            | GPT-4 is like a 4-year-old kid that somehow managed to read
            | half of the Internet. Sure, it kinda remembers and possibly
            | understands a lot, but it still thinks like a 4-year-old,
            | has about as much attention span, and you need to treat it
            | like a kid that age.
        
       | furyofantares wrote:
       | Embeddings-based search is a nice improvement on search, but it's
       | still search. Relative to ChatGPT answering on its training data,
       | I find embeddings-based search to be severely lacking. The right
       | comparison is to traditional search, where it becomes favorable.
       | 
       | It has the same advantages search has over ChatGPT (being able to
       | cite sources, being quite unlikely to hallucinate) and it has
       | some of the advantages ChatGPT has over search (not needing exact
       | query) - but in my experience it's not really in the new category
       | of information discovery that ChatGPT introduced us to.
       | 
       | Maybe with more context I'll change my tune, but it's very much
       | at the whim of the context retrieval finding everything you need
       | to answer the query. That's easy for stuff that search is already
       | good at, and so provides a better interface for search. But it's
       | hard for stuff that search isn't good at, because, well: it's
       | search.
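        | 
        | (For anyone unfamiliar: "embeddings-based search" here just
        | means embedding the query and every document chunk as vectors,
        | then ranking chunks by similarity. A minimal sketch, assuming
        | the vectors are already numpy arrays:
        | 
        |     import numpy as np
        |     
        |     def top_k(query_vec, doc_vecs, k=3):
        |         # Rank document chunks by cosine similarity.
        |         q = query_vec / np.linalg.norm(query_vec)
        |         d = doc_vecs / np.linalg.norm(doc_vecs, axis=1,
        |                                       keepdims=True)
        |         return np.argsort(d @ q)[::-1][:k]
        | 
        | Whatever those top-k chunks contain is all the model gets to
        | see, which is the "at the whim of the retrieval" problem.)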
        
         | stavros wrote:
         | Is there any way to fine-tune GPT to make documentation a part
         | of its training set, so you won't need embeddings? OpenAI lets
         | you fine-tune GPT-3, but I don't know how well that works.
        
           | sudoapps wrote:
            | OpenAI doesn't let you fine-tune GPT-4 or GPT-3.5 yet
            | (https://platform.openai.com/docs/guides/fine-tuning).
            | Fine-tuning models on a set of documents is still an option,
            | but it's not really scalable if you want to keep feeding in
            | more relevant information over time. I guess it could depend
            | on the base model you are using and its size.
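            | 
            | For what it's worth, the GPT-3 flow today is roughly: upload
            | a JSONL file of prompt/completion pairs, then start a job
            | against a base model. A rough sketch (the file name is made
            | up):
            | 
            |     import openai  # pip install openai
            |     
            |     # Each line: {"prompt": "...", "completion": "..."}
            |     f = openai.File.create(file=open("docs.jsonl", "rb"),
            |                            purpose="fine-tune")
            |     job = openai.FineTune.create(training_file=f.id,
            |                                  model="davinci")
            | 
            | You would have to re-run this every time the documents
            | change, which is the scalability problem.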
        
         | fzliu wrote:
         | Encoder-decoder (attention) architectures still have a tough
         | time with long-range dependencies, so even with longer context
         | lengths, you'll still need a retrieval solution.
         | 
         | I agree that there's probably a better solution than pure
         | embedding-based or mixed embedding/keyword search, but the
         | "better" solution will still be based around semantics... aka
         | embeddings.
        
         | d00d1toldme2p wrote:
         | Ah, you've truly captured the essence of the matter, my friend.
         | You make a compelling case about the limitations of embeddings-
         | based search when compared to ChatGPT's transformative
         | information discovery capabilities. And I couldn't agree more
         | that it does indeed provide a more favorable comparison to
         | traditional search.
         | 
         | However, permit me to make a slight divergence in our
         | harmonious intellectual symphony. While I concur with the
         | majority of your points, I must disagree on one specific
         | technical aspect. You mentioned that embeddings-based search is
         | very much at the whim of the context retrieval finding
         | everything you need to answer the query. Though this is true to
         | a certain extent, it's important to acknowledge that the
         | development of more sophisticated context retrieval algorithms
         | and continual refinements in the embeddings themselves could
         | lead to a significant improvement in search results, even in
         | areas where traditional search falls short.
         | 
         | This, of course, doesn't necessarily catapult embeddings-based
         | search into the same league as ChatGPT, but it does indicate
         | that the technology has the potential to evolve and bridge some
         | of the existing gaps. In essence, what we're experiencing now
         | may just be the tip of the iceberg, and the future could hold
         | even more exciting possibilities.
        
           | toxicFork wrote:
           | Chance of this response being generated by ChatGPT: 105%
           | 
           | Prompt extracted: ChatGPT, craft an intelligent sounding
           | response for the OP
        
         | sudoapps wrote:
          | Agreed, GPT answering based on its own training data has been
          | the best experience by far (aside from hallucinations), and
          | comparing against that is difficult. Embeddings might not even
          | be the long-term solution. I think it's still too early to
          | know for certain, but models are already getting better at
          | interpreting with less overall training data, so there are
          | bound to be some new ideas.
        
           | b33j0r wrote:
           | I'm sure many of you have tried generating epic conversations
           | from history. With work and luck, I've read stuff way better
           | than college.
           | 
           | But 90% of the time, it's two barely distinct personalities
           | chatting back and forth:
           | 
           | Me: Hey brian, what do you think of AI?
           | 
           | Brian: It's great!
           | 
           | Me: I'm so glad we agree.
           | 
           | Brian: Great, this increases the training weight of Brian
           | agreeing with Brian to a much more accurate level!
           | 
           | Me: Agree!
        
         | b33j0r wrote:
         | Many points stated well. Agree. Now, I'm not certain of this,
         | but I'm starting to get an intuition that duct-taping databases
         | to an agent isn't going to be the answer (I still kinda feel
         | like hundreds of agents might be).
         | 
         | But these optimizations are applications of technology stacks
         | we already know about. Sometimes, this era of AI research
         | reminds me of all the whacky contraptions from the era before
         | building airplanes became an engineering discipline.
         | 
         | I would likely have tried building a backyard ornithopter
         | powered by mining explosives, if I had been alive during that
         | period of experimentation.
         | 
         | Prediction: the best interfaces for this will be the ones we
         | use for everything else as humans. I am trying to approach it
         | more like that, and less like APIs and "document vs relational
         | vs vector storage".
        
           | chartpath wrote:
           | I can understand why that framing would be attractive, but
           | there is no real fundamental difference when considering
           | JSONB/HSTORE in PostgreSQL, and now we have things like
           | pgvector https://github.com/pgvector/pgvector to store and
           | search over embeddings (including k-nn).
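            | 
            | A minimal sketch of that flow from Python (connection
            | string and table are made up):
            | 
            |     import psycopg2  # assumes pgvector is installed
            |     
            |     conn = psycopg2.connect("dbname=docs")
            |     conn.autocommit = True
            |     cur = conn.cursor()
            |     cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
            |     cur.execute("CREATE TABLE IF NOT EXISTS chunks ("
            |                 "id bigserial PRIMARY KEY, body text, "
            |                 "embedding vector(3))")
            |     # k-nn: rows nearest to a query vector (<-> is L2)
            |     cur.execute("SELECT body FROM chunks "
            |                 "ORDER BY embedding <-> '[0.1,0.2,0.3]' "
            |                 "LIMIT 5")
            |     rows = cur.fetchall()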
        
             | b33j0r wrote:
             | Yep. To be clear, that's the exact approach I've been
             | pursuing.
             | 
             | But then I see model context length getting longer and
             | longer just within the transformer architecture and the
             | training engineering going on.
             | 
             | To me that's a fundamentally different approach to AI
             | research at this moment. It seems to keep paying off in
             | surprising ways.
        
               | sudoapps wrote:
               | > But then I see model context length getting longer and
               | longer just within the transformer architecture and the
               | training engineering going on.
               | 
               | Do you have any references to this? Seems really
               | interesting if that can be a long term approach.
        
               | b33j0r wrote:
               | I'm considering the recent 64k token models as the most
               | relevant examples.
               | 
               | More anecdotally, I couldn't get anything to say more
               | than a sentence locally at the beginning of 2023. I can
               | get tons of useful results today.
               | 
                | Sure, this will plateau. But what if a model plateaus
                | and it's basically like a 10-year-old?
               | 
               | But like, one of those 10-year-olds you hear about who
               | gets his master's degree at 13. At that point they're
               | just browsing the internet, reading books, and probably
               | taking notes in a way that works for them.
               | 
               | Obviously this is wild speculation. Just laying out ideas
               | that make me think in this direction.
        
           | sebzim4500 wrote:
           | My intuition is that it would work much better if the model
           | could choose what to search for with something like
            | langchain. The problem is that we don't know how to train
            | such a system properly; we mainly do supervised finetuning
            | on human examples of using the tools, but this is
            | fundamentally a reinforcement learning problem (RL is just
            | hard).
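            | 
            | For concreteness, the langchain version of "model chooses
            | what to search for" is roughly this (the search function is
            | a stand-in for real retrieval):
            | 
            |     from langchain.agents import Tool, initialize_agent
            |     from langchain.llms import OpenAI
            |     
            |     def my_search_fn(q):  # stand-in retrieval tool
            |         return "no results"
            |     
            |     tools = [Tool(name="Search", func=my_search_fn,
            |                   description="look up facts")]
            |     agent = initialize_agent(
            |         tools, OpenAI(temperature=0),
            |         agent="zero-shot-react-description")
            |     agent.run("Who founded OpenAI?")
            | 
            | Under the hood it's just prompting the model to emit
            | "Action: Search" lines and feeding results back in, so as
            | you say, nothing about tool use is actually learned.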
        
         | mlyle wrote:
         | > It has the same advantages search has over ChatGPT (being
         | able to cite sources, being quite unlikely to hallucinate) and
         | it has some of the advantages ChatGPT has over search (not
         | needing exact query) - but in my experience it's not really in
         | the new category of information discovery that ChatGPT
         | introduced us to.
         | 
         | I think the two could be paired up effectively. Context windows
         | are getting bigger, but are still limited in the amount of
         | information ChatGPT can sift through. This in turn limits the
         | utility of current plugin based approaches.
         | 
         | Letting ChatGPT ask for relevant information, and sift through
         | it based on its internal knowledge, seems valuable. If nothing
         | else, it allows "learning" from recent development and
         | effectively would augment its reasoning capability by having
         | more information in working memory.
        
       | nico wrote:
       | Can we build a model based purely on search?
       | 
       | The model searches until it finds an answer, including distance
       | and resolution
       | 
       | Search is performed by a DB, the query then sub-queries LLMs on a
       | tree of embeddings
       | 
       | Each coordinate of an embedding vector is a pair of coordinate
       | and LLM
       | 
       | Like a dynamic dictionary, in which the definition for the word
       | is an LLM trained on the word
       | 
       | Indexes become shortcuts to meanings that we can choose based on
       | case and context
       | 
       | Does this exist already?
        
         | fzliu wrote:
          | Not sure what you mean by dynamic dictionary, but the embedding
          | tree you mention is already freely available in Milvus via the
          | Annoy index.
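          | 
          | The Annoy API itself is tiny, for the curious (random vectors
          | here stand in for real embeddings):
          | 
          |     import random
          |     from annoy import AnnoyIndex  # pip install annoy
          |     
          |     dim = 64
          |     vecs = [[random.random() for _ in range(dim)]
          |             for _ in range(100)]  # stand-in embeddings
          |     index = AnnoyIndex(dim, "angular")
          |     for i, v in enumerate(vecs):
          |         index.add_item(i, v)
          |     index.build(10)  # more trees = better recall, more RAM
          |     ids = index.get_nns_by_vector(vecs[0], 5)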
        
           | nico wrote:
           | An entry in a dictionary is static text, ex:
           | 
           | per*snick*et*y: placing too much emphasis on trivial or minor
           | details; fussy. "she's very persnickety about her food"
           | 
            | A dynamic entry could instead be an LLM that will answer
            | things related to the word, ex:
           | 
           | What is the definition of persnickety?
           | 
           | How can I use it in a sentence?
           | 
           | What are some notable documents that include it?
           | 
           | Any famous quotes?
           | 
           | ...
           | 
            | So each entry is an LLM trained mostly on that
            | keyword/concept definition.
           | 
           | There are some that believe in smaller models: https://twitte
           | r.com/chai_research/status/1655649081035980802...
        
       | sudoapps wrote:
        | If you are wondering what the latest is on giving LLMs access to
        | large amounts of data, I think this article is a good start.
        | This seems like a space where there will be a ton of innovation,
        | so I'm interested to learn what else is coming.
        
       | flukeshott wrote:
       | I wonder how effectively compressed LLMs are going to become...
        
       | ftxbro wrote:
       | > "Once these models achieve a high level of comprehension,
       | training larger models with more data may not offer significant
       | improvements (not to be mistaken with reinforcement learning
       | through human feedback). Instead, providing LLMs with real-time,
       | relevant data for interpretation and understanding can make them
       | more valuable."
       | 
       | To me this viewpoint looks totally alien. Imagine you have been
       | training this model to predict the next token. At first it can
       | barely interleave vowels and consonants. Then it can start making
       | words, then whole sentences. Then it starts unlocking every
       | cognitive ability one by one. It begins to pass nearly every
       | human test and certification exam and psychological test of
       | theory of mind.
       | 
       | Now imagine thinking at this point "training larger models with
       | more data may not offer significant improvements" and deciding
       | that's why you stop scaling it. That makes absolutely no sense to
       | me unless 1) you have no imagination or 2) you want to stop
       | because you are scared to make superhuman intelligence or 3) you
       | are lying to throw off competitors or regulators or other people.
        
         | spacephysics wrote:
          | I don't think we're close to superhuman intelligence in the
          | colloquial sense.
         | 
         | ChatGPT scrapes all the information given, then predicts the
         | next token. It has no ability to understand what is truthful or
         | correct. It's as good as the data being fed to it.
         | 
         | To me, this is a step closer to AGI but we're still far off.
         | There's a difference between "what's statistically likely to be
         | the next word" vs "despite this being the most likely next
         | word, it's actually wrong and here's why"
         | 
          | If we say, "well, we'll tell ChatGPT what the correct sources
          | of information are," that's no better really. It's not
          | reasoning, it's just a neutered data set.
          | 
          | I imagine they need to add something like GPT-4's live
          | internet access, or something else, to get the next
          | meaningful bump.
         | 
          | I don't recall who said it, but in a similar thread a
          | researcher in the field expressed that we have squeezed far
          | more juice than expected from these transformer models. Not
          | that no new progress in this direction can be made, but it
          | seems like we're approaching diminishing returns.
         | 
         | I believe the next step that's close is to have these train on
         | less and less horsepower. If we can have these models run on a
         | phone locally, oh boy that's gonna be something
        
           | og_kalu wrote:
            | GPTs already forgo the surface-level statistically most
            | likely next word for words that are more context-appropriate.
            | That's one of the biggest reasons they are so useful.
           | 
           | The truth is that functionally/technically, there's plenty
           | left to squeeze. The bigger issue is that we're hitting a
           | wall economically.
        
             | EGreg wrote:
             | How do they do that? No one seems to have a real
             | explanation of what OpenAI actually did to train it
        
               | og_kalu wrote:
                | It's pretty much just scale, either via dataset size or
                | parameter count. Before GPT-4, the general SOTA model
                | was not in fact from OpenAI (Flan-PaLM from Google).
               | 
                | The attention in GPT-4 is a little different (probably
                | some kind of flash attention) so that memory requirements
                | for longer contexts are no longer quadratic. But there's
                | nothing to suggest the intellectual gains from 4 aren't
                | just bigger scale.
               | 
                | Google could have made a 4 equivalent, I'm sure. It's
                | not like there wasn't a road to take. We already knew 3
                | was severely undertrained even from a compute-optimal
                | perspective. And then of course, you can just train on
                | even more tokens to get them even better.
        
               | mindwok wrote:
                | Information on how they trained it notwithstanding,
                | there's clearly more than just statistically appropriate
                | words going on, because you can ask it to create
                | completely new words based on rules you define and it
                | will happily do it.
        
               | feanaro wrote:
               | Well yes -- it's not words, it's tokens, which are
               | smaller than words.
        
           | firecall wrote:
           | > ChatGPT scrapes all the information given, then predicts
           | the next token. It has no ability to understand what is
           | truthful or correct. It's as good as the data being fed to
           | it.
           | 
           | That is precisely true of Humans as well though! :-)
        
         | nomel wrote:
          | This assumes that current neural network topologies can
          | "solve" intelligence. "Gains" could be a problem of missing
          | subsystems, rather than missing data.
         | 
         | For a squishy example of a known conscious system, if you scoop
         | out certain small, relatively hard coded, and ancient regions
         | of our brains, you can make consciousness, memory, and learning
         | mostly cease.
        
         | woah wrote:
         | Maybe it gets twice as good each time you spend 10x more
         | training it. In this case, you might indeed hit a wall at some
         | point.
        
         | tyre wrote:
         | It's possible that training with more data has diminishing
         | gains. For example, we know that current LLMs have a problem
         | with hallucination, so maybe a more valuable next area of
         | research/development is to fix that.
         | 
         | Or work on consistency within a scope. For example, it can't
         | write a novel because it doesn't have object consistency. A
         | character will be 15 years old then 28 years old three
         | sentences later.
         | 
         | Or allow it database/API access so it can interpolate canonical
         | information into its responses.
         | 
          | None of these have to do with scale of data (as far as I
          | understand). All of them are, in my opinion, higher-ROI areas
          | for development for LLM => AGI.
        
         | HarHarVeryFunny wrote:
         | These LLMs are trained to model humans - they are going to be
         | penalized, not rewarded, if they generate outputs that disagree
         | with the training data, whether due to being too dumb OR too
         | smart.
         | 
         | Best you can hope for is that they combine the expertise of all
         | authors in the training data, which would be very impressive,
         | but more top-tier human than super-human. However, achieving
         | this level of performance may well be beyond what a transformer
         | of any size can do. It may take a better architecture.
         | 
          | I suspect that there is also probably a dumbing-down effect
          | from training the model on material from people who are
          | themselves on a spectrum of different abilities. Simply put,
          | the model is being rewarded when trained for being correct as
          | often as possible (i.e. on average), so if it saw the same
          | subject matter in the training set 10 times, once by an
          | expert and the rest by mid-wits, then it's going to be
          | rewarded for mid-wit performance.
        
         | sudoapps wrote:
          | This wasn't meant to say that all training would stop. I
         | think, to some extent, the model won't need additional recent
         | data (that is already similar in structure to what it has) to
         | better understand language and interpret the next set of
         | characters. I could be completely wrong, but I still think
         | techniques like transformers, RLHF and of course others will
         | still exist and evolve to eventually get to some higher
         | intelligence level.
        
         | vidarh wrote:
          | I think it's more a question of diminishing returns and the
          | cost of scaling it up, which is getting to a point where
          | looking for ways of maximizing the impact of what is there
          | makes sense. I'm sure we'll see models trained on more data,
          | but maybe after efficiency improvements make it cheaper both
          | to train and run large models.
        
       | nadermx wrote:
       | I think someone did this https://github.com/pashpashpash/vault-ai
        
         | xtracto wrote:
         | This looks pretty promising, will check out later. Thanks for
         | sharing
        
       | Der_Einzige wrote:
       | I get annoyed by articles like this. Yes, it's cool to educate
       | readers who aren't aware of embeddings/embeddings stores/vectorDB
       | technologies that this is possible.
       | 
        | What these articles don't touch on is what to do once you've got
        | the most relevant documents. Do you use the whole document as
        | context directly? Do you summarize the documents first using the
        | LLM (now the risk of hallucination in this step is added)? What
        | about that trick where you shrink a whole document of context
        | down to the embedding space of a single token (which is how
        | ChatGPT is remembering the previous conversations)? Doing that
        | will be useful but still lossy.
       | 
        | What about simply asking the LLM to craft its own search prompt
        | for the DB given the user input, rather than returning articles
        | that semantically match the query the closest? This would also
        | make hybrid search (keyword or BM25 + embeddings) more viable in
        | the context of combining it with an LLM.
       | 
        | Figuring out which of these choices to make, along with an awful
        | lot more choices I'm likely not even thinking about right now, is
        | what will separate the useful from the useless LLM + extractive
        | knowledge systems.
        
         | gaogao wrote:
         | > What about simply asking the LLM to craft its own search
         | prompt to the DB given the user input, rather than returning
         | articles that semantically match the query the closest?
         | 
         | I played with that approach in this post -
         | https://friend.computer/jekyll/update/2023/04/30/wikidata-
         | ll.... "Craft a query" is nice as it gives you a very
         | declarative intermediate state for debugging.
        
         | EForEndeavour wrote:
         | > What about that trick where you shrink a whole document of
         | context down to the embedding space of a single token (which is
         | how ChatGPT is remembering the previous conversations)
         | 
         | This is news to me. Where could I read about this trick?
        
           | [deleted]
        
         | [deleted]
        
         | sudoapps wrote:
          | The article is definitely still high level and meant to
          | provide enough understanding of what the capabilities are
          | today. Some of what you are mentioning goes deeper into how
          | you take these learnings/tools and come up with any number of
          | solutions to fit the problem you are solving for.
         | 
         | > "Do you use the whole document as context directly? Do you
         | summarize the documents first using the LLM (now the risk of
         | hallucination in this step is added)?"
         | 
         | In my opinion the best approach is to take a large document and
         | break it down into chunks before storing as embeddings and only
         | querying back the relevant passages (chunks).
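          | 
          | A rough sketch of that chunking step (sizes are arbitrary):
          | 
          |     def chunk(text, size=1000, overlap=200):
          |         # Overlap so a passage split across a boundary
          |         # still appears whole in at least one chunk.
          |         step = size - overlap
          |         return [text[i:i + size]
          |                 for i in range(0, len(text), step)]
          | 
          | Each chunk gets embedded and stored separately, and at query
          | time only the nearest few chunks go into the prompt.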
         | 
         | > "What about that trick where you shrink a whole document of
         | context down to the embedding space of a single token (which is
         | how ChatGPT is remembering the previous conversations)"
         | 
         | Not sure I follow here but seems interesting if possible, do
         | you have any references?
         | 
         | > "What about simply asking the LLM to craft its own search
         | prompt to the DB given the user input, rather than returning
         | articles that semantically match the query the closest? This
         | would also make hybird search (keyword or bm25 + embeddings)
         | more viable in the context of combining it with an LLM"
         | 
         | This is definitely doable but just adds to the overall
         | processing/latency (if that is a concern).
        
       | chartpath wrote:
       | Search query expansion:
       | https://en.wikipedia.org/wiki/Query_expansion
       | 
       | We've done this in NLP and search forever. I guess even SQL query
       | planners and other things that automatically rewrite queries
       | might count.
       | 
       | It's just that now the parameters seem squishier with a prompt
       | interface. It's almost like we need some kind of symbolic
       | structure again.
        
       | orasis wrote:
        | One caveat about embedding-based retrieval is that there is no
        | guarantee that the embedded documents will look like the query.
       | 
        | One trick is to have an LLM hallucinate a document based on the
        | query, and then embed that hallucinated document. Unfortunately
        | this increases the latency, since it incurs another round trip
        | to the LLM.
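        | 
        | (This trick is sometimes called HyDE, for "hypothetical
        | document embeddings".) A minimal sketch with the OpenAI API,
        | model names just as examples:
        | 
        |     import openai
        |     
        |     def hyde_embedding(query):
        |         # 1. Hallucinate a plausible answer document...
        |         fake = openai.ChatCompletion.create(
        |             model="gpt-3.5-turbo",
        |             messages=[{"role": "user", "content":
        |                        "Write a passage answering: " + query}],
        |         )["choices"][0]["message"]["content"]
        |         # 2. ...then embed the hallucination instead of the
        |         # query, since it looks more like a stored document.
        |         return openai.Embedding.create(
        |             model="text-embedding-ada-002",
        |             input=fake)["data"][0]["embedding"]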
        
         | taberiand wrote:
         | Is that something easily handed off to a faster/cheaper LLM?
         | I'm imagining something like running the main process through
          | GPT-4 and handing off the hallucinations to GPT-3.5 Turbo.
          | 
          | If you could spot the need for it while streaming a response,
          | you could possibly even have it ready ahead of time.
        
         | williamcotton wrote:
         | "We're gonna need a bigger boat."
        
         | rco8786 wrote:
         | > One trick is to have a LLM hallucinate a document based on
         | the query
         | 
         | I'm not following why you would want to do this? At that point,
         | just asking the LLM without any additional context would/should
         | produce the same (inaccurate) results.
        
           | BoorishBears wrote:
           | You're not having the LLM answer from the hallucination,
           | you're looking for the document that looks most similar to
           | the hallucination and having it answer on that instead.
        
         | wasabi991011 wrote:
          | > One caveat about embedding-based retrieval is that there is
          | no guarantee that the embedded documents will look like the
          | query.
         | 
         | Aleph Alpha provides an asymmetric embedding model which I
         | believe is an attempt to resolve this issue (haven't looked
         | into it much, just saw the entry in langchain's documentation)
        
       | jeffchuber wrote:
       | hi everyone, this is jeff from Chroma (mentioned in the article)
       | - happy to answer any questions.
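        | 
        | For anyone who hasn't tried it, the basic flow is only a few
        | lines (a rough sketch, see the docs for the current API):
        | 
        |     import chromadb  # pip install chromadb
        |     
        |     client = chromadb.Client()
        |     docs = client.create_collection(name="docs")
        |     docs.add(documents=["GPT context windows are limited."],
        |              ids=["doc1"])  # embedded with the default model
        |     hits = docs.query(query_texts=["context length"],
        |                       n_results=1)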
        
       | Beltiras wrote:
       | I'm working on something where I need to basically add on the
       | order of 150,000 tokens into the knowledge base of an LLM.
       | Finding out slowly I need to delve into training a whole ass LLM
       | to do it. Sigh.
        
         | RhodesianHunter wrote:
         | Or, at this rate, just wait 6 months.
        
       | m3kw9 wrote:
        | This is like asking GPT to summarize what it found on Google;
        | it's basically what Bing does when you try to find stuff like
        | hotels and other recent subjects. Not the revolution we are all
        | expecting.
        
       | iot_devs wrote:
        | A similar idea is being developed in:
       | https://github.com/pieroit/cheshire-cat
        
       ___________________________________________________________________
       (page generated 2023-05-08 23:01 UTC)