[HN Gopher] Collapse of self-trained language models
___________________________________________________________________
Collapse of self-trained language models
Author : PaulHoule
Score : 65 points
Date : 2024-04-17 18:05 UTC (4 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| esafak wrote:
| By the inventor of word2vec. For those of us old enough to
| remember :)
| williamtrask wrote:
| I miss the word2vec days...a simpler time.
| cangeroo wrote:
| What's different?
|
| I imagine that it's increasingly hard to contribute or even
| participate in the ongoing research, which is dominated by
| major companies?
|
| I work in a different field, but the way of working now is so
| different from when I started. It feels like factory work /
| hamsterwheeling. It used to feel more like the work of an
| architect, a mixture of art and engineering.
| xanderlewis wrote:
| Probably the fact that even ten years ago one could hope to
| do something impressive (and maybe even state of the art)
| with a custom-built model running on a laptop... these days
| anything you might try is likely to be outdone by a more
| general model.
|
| Like the fact that all sorts of fun but elementary language
| processing tasks that previously required some clever
| engineering or at least careful data collection and a
| custom-built model can just be done (and done much better)
| by something like GPT-x. There's far less incentive to play
| around with basic ML stuff as an amateur now.
| changoplatanero wrote:
| > We explore the potential of self-training models on their own
| outputs, akin to how humans learn and build on their previous
| thoughts and actions.
|
| I feel like humans for the most part don't just blindly update
| their beliefs based on whatever thought or action they just did.
| Humans will often take some sort of action based on their
| thoughts and then observe how successful their action was. Then
| they can update based on the feedback signal.
| juancn wrote:
| That, and we're constantly exposed to new information and
| situations.
|
| The way to mimic it would be to self-train while continuously
| adding new crap to the training data to simulate world
| experience; otherwise things get a bit too deterministic and you
| end up just running an expensive minimization problem.
| bee_rider wrote:
| People don't learn based on the action they just did; they
| learn based on the universe's response to that action. It seems
| like an expected result that if the model is just learning by
| inserting its outputs into the training data somehow, it will
| fail, in the same way that a person who never receives feedback
| will go in really weird directions.
|
| But the paper is nice to have in case anyone is suggesting such
| a scheme. Is anyone? Hopefully not...
| Sparkyte wrote:
| Right, it requires intimate past knowledge to serve as the
| foundation for future knowledge and weighted biases. There isn't
| enough memory or disk today to accomplish this. It would be no
| more helpful than training a goldfish to react to feeding time.
| That is why, in place of memory, we instead query an existing
| compendium of knowledge.
| mitthrowaway2 wrote:
| When I'm preparing for a difficult conversation -- for example,
| confronting my boss about a major problem -- I do simulate the
| entire conversation in my head a few times. I think about how
| my boss will likely respond to my message, and where
| misunderstandings are likely to occur, and then adjust my
| delivery pre-emptively. It's part of communication with
| empathy, but very similar to self-play.
| tensor wrote:
| I strongly suspect that humans ingesting only output of other
| humans in a vacuum, e.g. social media without any other
| feedback mechanisms, suffer exactly the same degradation.
|
| Outside of social-media-like bubbles, human learning in fact
| has a sort of oracle providing correct examples: the world
| around us. Likewise, if you provide a world for an AI to
| interact with, such as the game of Go, then it can in fact
| learn simply by interacting with versions of itself and that
| world.
|
| Put a little more simply, only talking will not lead to new
| knowledge if there is no testing of ideas or conclusions
| involved. If anything, extended talking without feedback often
| leads to damaging rumination and/or possible misinformation.
|
| If this notion is correct, actively limiting and directing the
| sorts of inputs we give ourselves could help us improve our
| knowledge and abilities.
| gojomo wrote:
| This is useful as a "tiny paper" showing that one naive kind of
| self-training, in a smaller/older model (GPT-2), causes
| degradation.
|
| But it's likely those on a motivated search for evidence of LLM
| limitations ('cope') are going to tout & over-rely on this result
| - and miss the extensive evidence in larger models, with
| slightly-more selective generations, that self-training often
| elicits improved performance on domains of interest.
|
| For example, both the 'Unnatural Instructions' and
| 'Self-Instruct' papers (of December 2022) showed that if - rather
| than self-training on
| arbitrary generations - you ask a model to generate _good_
| examples of a certain kind of prompt and helpfulness, then train
| on _those_ generations, the resulting model tends to get better
| on many _similar_ challenges.
|
| It's almost like, as with human practice/exercises, there's a
| spillover effect on _related_ competencies - eliciting from the
| model a potential that was too fuzzy to exploit at first, but
| gets honed by effortful practice (even _without_ additional
| authoritative-instructor corrections).
|
| To me, it's eerily similar to human self-help routines - "give
| yourself a pep talk, visualize desired results, pick tiny
| positive steps doable once & keep doing repeatedly, imagine
| success, affirm all progress".
|
| Or, say, the "Inner Game of Tennis" style of gently reinforcing
| some key skill into subconscious comfort, with broader effects:
|
| https://youtu.be/HzR8x5MgvDw?t=25
| bee_rider wrote:
| Anyway, it seems like a pretty ineffective cope, right? Even if
| there is a limit to the degree to which the LLM's output can be
| fed back to it (and I think there must be, just like a person
| whose only feedback is self-help mantras will eventually have
| limited improvement), humans could still be reduced to just
| thumbs-upping or thumbs-downing which LLM outputs get fed back,
| which is not really a fate any of us want.
| swatcoder wrote:
| > cope
|
| This would be a much better comment without the provocations.
| Are you trying to provide context and perspective or pick a
| fight?
|
| It's rarely possible to do both.
| barfbagginus wrote:
| Don't worry about the copers. Just be ready for people who
|
| 1. Think this research is pointless because of course
| stochastic parrots would exhibit knowledge collapse
|
| 2. Use this research to argue more advanced models will always
| have knowledge collapse, even if grounded in empirical data
|
| These are not bad signs. They are screening signals that let
| you tell when someone has no genuine interest in uncovering the
| truth of the matter. That is very useful to know, since
| otherwise you will waste time trying to get useful
| contributions out of them, which is beyond frustrating.
| kmeisthax wrote:
| One of the tricks OpenAI uses when fine-tuning models is to use
| the unadjusted foundation model as a coherency model _alongside_
| the data or reward model to be fine-tuned on. Loss is calculated
| as the sum[0] of both the fine-tuning and coherency loss so that
| the training process is anchored to _something_.
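|
| A minimal sketch of that combined loss in a PyTorch /
| transformers setup (the gpt2 checkpoints, beta, and the KL
| direction are my guesses at the idea, not OpenAI's actual
| recipe):
|
|     import torch
|     import torch.nn.functional as F
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     base = AutoModelForCausalLM.from_pretrained("gpt2").eval()
|     ft = AutoModelForCausalLM.from_pretrained("gpt2")
|     beta = 0.1   # weight on the coherency term (made up)
|
|     batch = tok("some fine-tuning text", return_tensors="pt")
|     out = ft(**batch, labels=batch.input_ids)
|     with torch.no_grad():
|         base_logits = base(**batch).logits
|
|     # fine-tuning loss: ordinary next-token cross-entropy
|     ce = out.loss
|     # coherency loss: KL between the tuned and frozen models'
|     # next-token distributions, anchoring the tuned model
|     kl = F.kl_div(F.log_softmax(base_logits, dim=-1),
|                   F.log_softmax(out.logits, dim=-1),
|                   log_target=True, reduction="batchmean")
|     (ce + beta * kl).backward()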
|
| The reason why unanchored training fails is fairly simple.
| "Training" is a misnomer, we're really copying and
| compressing[1]. When you train a model on itself, you're making a
| lossy copy of the original, which isn't a very good truth anchor.
|
| There's probably other ways to anchor a self-training process,
| though. ChatGPT and other text-to-text transformer models are
| operated as autoregressive processes, where the model spits out a
| probability distribution, which you then sample to get a token to
| add to the input, and then repeat until the model says stop.
| You'll notice that if you squint a little, this looks like the
| policy function of AlphaGo, but being run stochastically instead
| of being min-maxed. Which raises the question: why can't we train
| GPT like we train chess AI, with self-play followed by fine-
| tuning on the result, as scored by some kind of reward model?
|
| Granted, you'd have to specify a reward model, as well as what
| behavior you're trying to 'reward'. One other idea that's been
| bouncing around my head for self-training is training a model to
| remember details of prior conversations that have since fallen off
| the end of the context window. The biological analogy being
| "long-term memory", in contrast to the "short-term memory" of the
| context window. So perhaps your reward model is the model plus
| the current context window, and your loss is calculated on the
| same model but without the parts of the context window you want
| to free up.
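|
| A rough sketch of one way to wire that up - treat the model
| given the full context as a frozen teacher and the same model
| on the truncated context as the student (all names and the
| split point below are placeholders):
|
|     import copy, torch
|     import torch.nn.functional as F
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     student = AutoModelForCausalLM.from_pretrained("gpt2")
|     teacher = copy.deepcopy(student).eval()   # frozen snapshot
|
|     full = tok("old turns ... latest turn:",
|                return_tensors="pt").input_ids
|     short = full[:, -4:]   # context with old turns dropped
|
|     with torch.no_grad():
|         t = teacher(full).logits[:, -1, :]    # "with memory"
|     s = student(short).logits[:, -1, :]       # "without memory"
|
|     # push the dropped context's influence into the weights
|     loss = F.kl_div(F.log_softmax(s, dim=-1),
|                     F.log_softmax(t, dim=-1),
|                     log_target=True, reduction="batchmean")
|     loss.backward()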
|
| No clue if this has already been done, but if it has please reply
| with the name of the thing I'm not aware of.
|
| [0] Or difference, I forget. If you get the signs wrong you get a
| hilariously horny version of ChatGPT.
|
| [1] And, thanks to induction heads, compressing the knowledge of
| how and what to copy.
| gwern wrote:
| > One of the tricks OpenAI uses when fine-tuning models is to
| use the unadjusted foundation model as a coherency model
| alongside the data or reward model to be fine-tuned on. Loss is
| calculated as the sum[0] of both the fine-tuning and coherency
| loss so that the training process is anchored to something.
|
| You mean the K-L constraint?
| tempusalaria wrote:
| In InstructGPT there is a direct pretraining gradient
| addition as well as a KL penalty.
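|
| Schematically (my reading of the paper's combined objective,
| per sample; beta and gamma are the paper's coefficients, left
| unspecified here):
|
|     def ppo_ptx_objective(reward, logp_rl, logp_sft,
|                           logp_pretrain, beta, gamma):
|         # reward model score, minus a KL-style penalty that
|         # keeps the RL policy near the supervised (SFT) model
|         rl_term = reward - beta * (logp_rl - logp_sft)
|         # ...plus the extra pretraining-likelihood term
|         return rl_term + gamma * logp_pretrain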
| xanderlewis wrote:
| > ...and then repeat until the model says stop.
|
| How does this actually work? At a high level, the
| autoregressive token generation process makes sense to me, but
| how does it know when a sensible time to stop is rather than
| just going on forever or abruptly stopping after n tokens? Is
| it trained on text that has special 'stop' tokens inserted at
| the end of paragraphs, etc. and when it chucks out one of these
| the model halts?
| epr wrote:
| One of the tokens represents stopping. If you sample the stop
| token from the probability distribution instead of a normal
| text token, then you stop autoregressive sampling.
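|
| A toy version of the loop with the stop check made explicit
| (in practice transformers' generate() handles this for you via
| eos_token_id):
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
|
|     ids = tok("The quick brown fox",
|               return_tensors="pt").input_ids
|     for _ in range(200):   # hard cap so it always halts
|         with torch.no_grad():
|             logits = model(ids).logits[:, -1, :]
|         probs = torch.softmax(logits, dim=-1)
|         next_id = torch.multinomial(probs, num_samples=1)
|         if next_id.item() == tok.eos_token_id:   # the stop token
|             break
|         ids = torch.cat([ids, next_id], dim=-1)
|
|     print(tok.decode(ids[0]))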
| astrange wrote:
| Open source models have an explicit endoftext token, yes.
|
| You can set llama.cpp to keep generating after that and it
| often instantly goes off-topic.
| xg15 wrote:
| > _Specifically, we explore the potential of self-training models
| on their own outputs, akin to how humans learn and build on their
| previous thoughts and actions._
|
| I'd argue this misrepresents even how humans learn. We learn from
| our previous thoughts and actions _along with the reaction of the
| environment_ to them. That's more like reinforcement learning.
|
| There are situations where someone can learn purely from their
| own thoughts - e.g. an author gaining more insight into their own
| characters as they imagine the story, or a mathematician building
| a proof in their head. But even those need real-world inputs from
| time to time: the author will be influenced by other stories they
| know and/or experiences they had in the past; the mathematician
| will write things down at some point and may gain new insight by
| actually looking at the formulas/graphs/etc. instead of just
| imagining them.
|
| So I'm very sceptical that even humans could basically create
| infinite new knowledge by continuously "learning from their own
| thoughts".
___________________________________________________________________
(page generated 2024-04-17 23:00 UTC)