[HN Gopher] Collapse of self-trained language models
       ___________________________________________________________________
        
       Collapse of self-trained language models
        
       Author : PaulHoule
       Score  : 65 points
       Date   : 2024-04-17 18:05 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | esafak wrote:
       | By the inventor of word2vec. For those of us old enough to
       | remember :)
        
         | williamtrask wrote:
         | I miss the word2vec days...a simpler time.
        
           | cangeroo wrote:
           | What's different?
           | 
           | I imagine that it's increasingly hard to contribute or even
           | participate in the ongoing research, which is dominated by
           | major companies?
           | 
           | I work in a different field, but the way of working now is so
           | different from when I started. It feels like factory work /
           | hamsterwheeling. It used to feel more like the work of an
           | architect, a mixture of art and engineering.
        
             | xanderlewis wrote:
             | Probably the fact that even ten years ago one could hope to
             | do something impressive (and maybe even state of the art)
             | with a custom-built model running on a laptop... these days
             | anything you might try is likely to be outdone by a more
             | general model.
             | 
             | Like the fact that all sorts of fun but elementary language
             | processing tasks that previously required some clever
             | engineering or at least careful data collection and a
             | custom-built model can just be done (and done much better)
             | by something like GPT-x. There's far less incentive to play
             | around with basic ML stuff as an amateur now.
        
       | changoplatanero wrote:
       | > We explore the potential of self-training models on their own
       | outputs, akin to how humans learn and build on their previous
       | thoughts and actions.
       | 
       | I feel like humans for the most part don't just blindly update
       | their beliefs based on whatever thought or action they just did.
       | Humans will often take some sort of action based on their
       | thoughts and then observe how successful their action was. Then
       | they can update based on the feedback signal.
        
         | juancn wrote:
         | That, and we're constantly exposed to new information and
         | situations.
         | 
          | The way to mimic it would be to self-train and add new crap
          | into the training set continuously to simulate world
          | experience; otherwise things get a bit too deterministic and
          | you end up just running an expensive minimization problem.
        
         | bee_rider wrote:
          | People don't learn based on the action they just did; they
          | learn based on the universe's response to that action. It seems
          | like an expected result that if the model is just learning by
          | inserting its outputs into the training data somehow, it will
          | fail, in the same way that a person who never receives feedback
          | will go in really weird directions.
         | 
         | But the paper is nice to have in case anyone is suggesting such
         | a scheme. Is anyone? Hopefully not...
        
         | Sparkyte wrote:
          | Right, it requires intimate past knowledge to serve as the
          | foundation for future knowledge and weighted biases. There
          | isn't enough memory or disk today to accomplish this; it would
          | be no more helpful than training a goldfish to react to feeding
          | time. That's why we instead substitute an existing, queryable
          | compendium of knowledge for memory.
        
         | mitthrowaway2 wrote:
         | When I'm preparing for a difficult conversation -- for example,
         | confronting my boss about a major problem -- I do simulate the
         | entire conversation in my head a few times. I think about how
         | my boss will likely respond to my message, and where
         | misunderstandings are likely to occur, and then adjust my
         | delivery pre-emptively. It's part of communication with
         | empathy, but very similar to self-play.
        
         | tensor wrote:
          | I strongly suspect that humans ingesting only the output of
          | other humans in a vacuum, e.g. social media without any other
          | feedback mechanisms, suffer exactly the same degradation.
         | 
          | Outside of social-media-like bubbles, human learning in fact
          | has a sort of oracle providing correct examples: the world
          | around us. If you provide a world for an AI to interact with,
          | such as a game like Go, then it can in fact learn simply by
          | interacting with versions of itself and the world.
         | 
         | Put a little more simply, only talking will not lead to new
         | knowledge if there is no testing of ideas or conclusions
         | involved. If anything, extended talking without feedback often
         | leads to damaging rumination and/or possible misinformation.
         | 
         | If this notion is correct, actively limiting and directing the
         | sorts of inputs we give ourselves could help us improve our
         | knowledge and abilities.
        
       | gojomo wrote:
       | This is useful as a "tiny paper" showing that one naive kind of
       | self-training, in a smaller/older model (GPT-2), causes
       | degradation.
       | 
       | But it's likely those on a motivated search for evidence of LLM
       | limitations ('cope') are going to tout & over-rely on this result
       | - and miss the extensive evidence in larger models, with
       | slightly-more selective generations, that self-training often
       | elicits improved performance on domains of interest.
       | 
        | For example, both the 'Unnatural Instructions' and
        | 'Self-Instruct' papers (of December 2022) showed that if -
        | rather than self-training on arbitrary generations - you ask a
        | model to generate _good_ examples of a certain kind of prompt
        | and helpful response, then train on _those_ generations, the
        | resulting model tends to get better on many _similar_
        | challenges.
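        | 
        | A minimal sketch of that kind of loop (purely illustrative:
        | "gpt2" stands in for whatever model is being tuned, and
        | keep_example() is a placeholder quality filter, not something
        | from either paper):
        | 
        |     from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |     tok = AutoTokenizer.from_pretrained("gpt2")
        |     model = AutoModelForCausalLM.from_pretrained("gpt2")
        | 
        |     def keep_example(text):
        |         # stand-in quality filter: keep only longer generations
        |         return len(text.split()) > 20
        | 
        |     prompt = "Write an instruction and a helpful answer:\n"
        |     ids = tok(prompt, return_tensors="pt").input_ids
        |     kept = []
        |     for _ in range(16):
        |         out = model.generate(ids, do_sample=True,
        |                              max_new_tokens=128,
        |                              pad_token_id=tok.eos_token_id)
        |         text = tok.decode(out[0], skip_special_tokens=True)
        |         if keep_example(text):
        |             kept.append(text)
        |     # `kept` then becomes ordinary fine-tuning data.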
       | 
       | It's almost like, as with human practice/exercises, there's a
       | spillover effect on _related_ competencies - eliciting from the
       | model a potential that was too fuzzy to exploit at first, but
       | gets honed by effortful practice (even _without_ additional
       | authoritative-instructor corrections).
       | 
       | To me, it's eerily similar to human self-help routines - "give
       | yourself a pep talk, visualize desired results, pick tiny
       | positive steps doable once & keep doing repeatedly, imagine
       | success, affirm all progress".
       | 
       | Or, say, the "Inner Game of Tennis" style of gently reinforcing
       | some key skill into subconscious comfort, with broader effects:
       | 
       | https://youtu.be/HzR8x5MgvDw?t=25
        
         | bee_rider wrote:
          | Anyway, it seems like a pretty ineffective cope, right? Even if
          | there is a limit to the degree to which the LLM's output can be
          | fed back to it (and I think there must be, just as a person
          | whose only feedback is self-help mantras will eventually see
          | limited improvement), humans could still be reduced to just
          | thumbs-upping or thumbs-downing which LLM outputs get fed back,
          | which is not really a fate any of us wants.
        
         | swatcoder wrote:
         | > cope
         | 
         | This would be a much better comment without the provocations.
         | Are you trying to provide context and perspective or pick a
         | fight?
         | 
         | It's rarely possible to do both.
        
         | barfbagginus wrote:
         | Don't worry about the copers. Just be ready for people who
         | 
         | 1. Think this research is pointless because of course
         | stochastic parrots would exhibit knowledge collapse
         | 
         | 2. Use this research to argue more advanced models will always
         | have knowledge collapse, even if grounded in empirical data
         | 
         | These are not bad signs. They are screening signals that let
         | you tell when someone has no genuine interest in uncovering the
         | truth of the matter. That is very useful to know, since
         | otherwise you will waste time trying to get useful
         | contributions out of them, which is beyond frustrating.
        
       | kmeisthax wrote:
       | One of the tricks OpenAI uses when fine-tuning models is to use
       | the unadjusted foundation model as a coherency model _alongside_
       | the data or reward model to be fine-tuned on. Loss is calculated
       | as the sum[0] of both the fine-tuning and coherency loss so that
       | the training process is anchored to _something_.
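        | 
        | A minimal sketch of what such an anchored objective could look
        | like (my illustration, not OpenAI's code; beta is an arbitrary
        | weighting):
        | 
        |     import torch.nn.functional as F
        | 
        |     def anchored_loss(ft_logits, base_logits, labels, beta=0.1):
        |         # ordinary next-token cross-entropy on the fine-tuning data
        |         ft_loss = F.cross_entropy(
        |             ft_logits.view(-1, ft_logits.size(-1)), labels.view(-1))
        |         # "coherency" term: KL divergence of the tuned model from
        |         # the frozen foundation model, keeping training anchored
        |         kl = F.kl_div(
        |             F.log_softmax(base_logits, dim=-1),  # input: log q (base)
        |             F.log_softmax(ft_logits, dim=-1),    # target: log p (tuned)
        |             log_target=True, reduction="batchmean")
        |         return ft_loss + beta * kl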
       | 
        | The reason unanchored training fails is fairly simple.
        | "Training" is a misnomer; we're really copying and
        | compressing[1]. When you train a model on itself, you're making a
        | lossy copy of the original, which isn't a very good truth anchor.
       | 
        | There are probably other ways to anchor a self-training process,
        | though. ChatGPT and other text-to-text transformer models are
        | operated as autoregressive processes: the model spits out a
        | probability distribution, you sample it to get a token to append
        | to the input, and you repeat until the model says stop. You'll
        | notice that if you squint a little, this looks like the policy
        | function of AlphaGo, but run stochastically instead of being
        | min-maxed. Which raises the question: why can't we train GPT like
        | we train chess AI, with self-play followed by fine-tuning on the
        | result, as scored by some kind of reward model?
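        | 
        | One rough shape that idea could take - closer to rejection
        | sampling (best-of-n) than to AlphaGo's tree search; reward_fn
        | here is just a placeholder, not a real reward model:
        | 
        |     from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |     tok = AutoTokenizer.from_pretrained("gpt2")
        |     model = AutoModelForCausalLM.from_pretrained("gpt2")
        | 
        |     def reward_fn(text):
        |         # placeholder reward: crude lexical-diversity score
        |         return len(set(text.split()))
        | 
        |     prompt = "Q: Why does the sky look blue?\nA:"
        |     ids = tok(prompt, return_tensors="pt").input_ids
        |     samples = [tok.decode(model.generate(
        |                    ids, do_sample=True, max_new_tokens=64,
        |                    pad_token_id=tok.eos_token_id)[0],
        |                skip_special_tokens=True)
        |                for _ in range(8)]
        |     best = max(samples, key=reward_fn)
        |     # fine-tune on `best` (or on everything above a threshold)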
       | 
       | Granted, you'd have to specify a reward model, as well as what
       | behavior you're trying to 'reward'. One other idea that's been
       | bouncing around my head for self-training is training a model to
        | remember details of prior conversations that have since fallen
        | off the end of the context window. The biological analogy is
        | "long-term memory", in contrast to the "short-term memory" of the
        | context window. So perhaps your reward model is the model plus
       | the current context window, and your loss is calculated on the
       | same model but without the parts of the context window you want
       | to free up.
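        | 
        | A hedged sketch of what that "fold the old context into the
        | weights" loss might look like (roughly context distillation; the
        | names are mine, this isn't a known procedure):
        | 
        |     import torch
        |     import torch.nn.functional as F
        | 
        |     def memory_distill_loss(model, full_ids, trimmed_ids):
        |         # full_ids: prompt including the old conversation;
        |         # trimmed_ids: the same tail with that history cut off
        |         tail = trimmed_ids.size(1)
        |         with torch.no_grad():       # "teacher": model + full context
        |             teacher = model(full_ids).logits[:, -tail:, :]
        |         student = model(trimmed_ids).logits   # model without it
        |         # pull the no-context predictions toward the with-context ones
        |         return F.kl_div(F.log_softmax(student, dim=-1),
        |                         F.log_softmax(teacher, dim=-1),
        |                         log_target=True, reduction="batchmean")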
       | 
       | No clue if this has already been done, but if it has please reply
       | with the name of the thing I'm not aware of.
       | 
       | [0] Or difference, I forget. If you get the signs wrong you get a
       | hilariously horny version of ChatGPT.
       | 
       | [1] And, thanks to induction heads, compressing the knowledge of
       | how and what to copy.
        
         | gwern wrote:
         | > One of the tricks OpenAI uses when fine-tuning models is to
         | use the unadjusted foundation model as a coherency model
         | alongside the data or reward model to be fine-tuned on. Loss is
         | calculated as the sum[0] of both the fine-tuning and coherency
         | loss so that the training process is anchored to something.
         | 
         | You mean the K-L constraint?
        
           | tempusalaria wrote:
           | In InstructGPT there is a direct pretraining gradient
           | addition as well as KL penalty
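            | 
            | For reference, the combined InstructGPT objective is roughly
            | (in that paper's notation):
            | 
            |     \mathrm{objective}(\phi) =
            |       \mathbb{E}_{(x,y) \sim D_{\pi_\phi^{RL}}}\Big[
            |         r_\theta(x,y)
            |         - \beta \log \frac{\pi_\phi^{RL}(y \mid x)}
            |                           {\pi^{SFT}(y \mid x)} \Big]
            |       + \gamma\, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
            |           \big[ \log \pi_\phi^{RL}(x) \big]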
        
         | xanderlewis wrote:
         | > ...and then repeat until the model says stop.
         | 
         | How does this actually work? At a high level, the
         | autoregressive token generation process makes sense to me, but
         | how does it know when a sensible time to stop is rather than
         | just going on forever or abruptly stopping after n tokens? Is
         | it trained on text that has special 'stop' tokens inserted at
         | the end of paragraphs, etc. and when it chucks out one of these
         | the model halts?
        
           | epr wrote:
           | One of the tokens represents stopping. If you sample stop
           | from the probability distribution instead of a normal text
           | token, then you stop autoregressive sampling.
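            | 
            | A minimal hand-rolled decoding loop showing that stop
            | condition (illustrative; model.generate() normally handles
            | this for you):
            | 
            |     import torch
            |     from transformers import AutoModelForCausalLM, AutoTokenizer
            | 
            |     tok = AutoTokenizer.from_pretrained("gpt2")
            |     model = AutoModelForCausalLM.from_pretrained("gpt2")
            | 
            |     ids = tok("The quick brown fox", return_tensors="pt").input_ids
            |     for _ in range(200):                     # hard cap as a fallback
            |         logits = model(ids).logits[:, -1, :] # next-token scores
            |         probs = torch.softmax(logits, dim=-1)
            |         next_id = torch.multinomial(probs, num_samples=1)
            |         if next_id.item() == tok.eos_token_id:
            |             break                            # the model "said stop"
            |         ids = torch.cat([ids, next_id], dim=1)
            |     print(tok.decode(ids[0]))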
        
           | astrange wrote:
           | Open source models have an explicit endoftext token, yes.
           | 
           | You can set llama.cpp to keep generating after that and it
            | often instantly goes off topic.
        
       | xg15 wrote:
       | > _Specifically, we explore the potential of self-training models
       | on their own outputs, akin to how humans learn and build on their
       | previous thoughts and actions._
       | 
       | I'd argue this misrepresents even how humans learn. We learn from
       | our previous thoughts and actions _along with the reaction of the
        | environment_ to them. That's more like reinforcement learning.
       | 
       | There are situations where someone can purely learn from their
        | own thoughts - e.g. an author gaining more insight into their own
       | characters as they imagine the story, or a mathematician building
       | a proof in their head. But even those need real-world inputs from
       | time to time: The author will be influenced by other stories they
       | know and/or experiences they had in the past, the mathematician
       | will write things down at some point and may gain new insight by
        | actually looking at the formulas/graphs/etc. instead of just
       | imagining them.
       | 
       | So I'm very sceptical that even humans could basically create
       | infinite new knowledge by continuously "learning from their own
       | thoughts".
        
       ___________________________________________________________________
       (page generated 2024-04-17 23:00 UTC)