[HN Gopher] Can LLMs learn from a single example?
       ___________________________________________________________________
        
       Can LLMs learn from a single example?
        
       Author : jdkee
       Score  : 395 points
       Date   : 2023-09-06 00:40 UTC (19 hours ago)
        
 (HTM) web link (www.fast.ai)
 (TXT) w3m dump (www.fast.ai)
        
       | spit2wind wrote:
       | What are the axis labels on the graphs?
        
         | jph00 wrote:
         | Cross entropy loss vs batch number
        
       | whimsicalism wrote:
        | Was this not sort of the clear implication of the fact that most
        | LLMs are currently only being trained for one epoch?
        | 
        | i.e. if they are only being trained for one epoch, there are
        | clear overfitting concerns from doing even a second pass over
        | the data.
       | 
       | It does seem somewhat contrary to the findings of this paper [0]
       | that found that old data was as good as new for at least 4
       | epochs.
       | 
       | [0]: https://arxiv.org/abs/2305.16264
        
         | computerex wrote:
          | They are not being trained on only 1 epoch. They are trained
          | for multiple epochs on high-quality data. Also, the Meta team
          | showed with llama that simply training on more tokens
          | continues to reduce loss.
        
           | danielmarkbruce wrote:
           | I could be wrong, but I thought the llama 2 paper explicitly
           | called out 1 epoch and that more than that caused over-
           | fitting in their other experiments.
        
             | whimsicalism wrote:
             | You're not at all wrong :) I think a lot of people confuse
             | the pre-training and fine-tuning runs because these are all
             | novel concepts.
        
           | whimsicalism wrote:
           | If you divide the number of sentences trained on by the total
           | number of sentences in its corpora, the number for most of
           | the top LLMs will be far closer to ~1 than any other integer.
           | 
           | > Also Meta team with llama show that simply training more,
           | more tokens, continues to reduce loss.
           | 
            | Can you source the specific claim you are talking about? To
            | me, "more tokens" generally means _new tokens_ unless you
            | specify otherwise.
           | 
           | from the paper "We train for one epoch over the training
           | data. In earlier experiments, we found that training longer
           | can lead to over-fitting"
        
             | danielmarkbruce wrote:
             | Yes. Surely "more tokens" doesn't mean "more epochs".
        
         | jph00 wrote:
         | <deleted>
        
           | whimsicalism wrote:
           | that's the exact paper i link in my comment :)
        
         | fpgaminer wrote:
         | > Was this not sort of the clear implication of the fact that
         | most LLMs are currently only being trained with one epoch?
         | 
         | Slight nit: Many public LLMs are trained for at least slightly
         | over one epoch, and usually several epochs on particular
         | subsets of the data (like wikipedia).
        
           | whimsicalism wrote:
           | Source? Maybe several epochs on some very small subsets, but
           | my strong impression was that it was 1 epoch in the pre-
           | training run for pretty much all of the top LLMs.
        
             | fpgaminer wrote:
             | Llama off the top of my head:
             | https://arxiv.org/pdf/2302.13971.pdf
        
         | danielmarkbruce wrote:
         | Yup, thought a similar thing.
        
       | Buttons840 wrote:
       | Does anyone know if LLMs have been used to augment their own
       | training data?
       | 
        | I wonder what would happen if you trained an LLM on a little
        | input but then had it generate a lot of synthetic input that is
        | added back to the training data. I think of it as "dreaming".
        | This seems like it would just add noise, but LLMs are able to
        | improve their output by augmenting their own context (by
        | "thinking out loud"), so maybe they can do the same with their
        | own training data?
        
         | jph00 wrote:
         | Yes, a lot of recent research uses LLM outputs as training
         | data, and it's been an extremely successful line of work.
        
         | muxator wrote:
         | It's interesting that this conclusion is the exact opposite of
         | a sibling comment, which proposes that a small, human-curated
         | corpus may be more effective than big, synthetic datasets.
        
           | Buttons840 wrote:
           | I have no "conclusion". I'm just wondering.
        
         | rsrsrs86 wrote:
          | You can find the answer by trying the following: generate
          | random data according to a model, fit a linear regression (or
          | any other model), sample from the fitted distribution, and add
          | the samples to the training set.
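          | 
          | A minimal sketch of that experiment (scikit-learn; every
          | number here is arbitrary/illustrative):
          | 
          |   import numpy as np
          |   from sklearn.linear_model import LinearRegression
          | 
          |   rng = np.random.default_rng(0)
          | 
          |   # "true" process: y = 3x + noise
          |   X = rng.normal(size=(200, 1))
          |   y = 3 * X[:, 0] + rng.normal(size=200)
          | 
          |   # fit on the real data
          |   m1 = LinearRegression().fit(X, y)
          |   resid_std = np.std(y - m1.predict(X))
          | 
          |   # sample synthetic data from the fitted model
          |   Xs = rng.normal(size=(2000, 1))
          |   ys = m1.predict(Xs)
          |   ys += rng.normal(scale=resid_std, size=2000)
          | 
          |   # refit on real + synthetic, compare coefficients
          |   X2 = np.vstack([X, Xs])
          |   y2 = np.concatenate([y, ys])
          |   m2 = LinearRegression().fit(X2, y2)
          |   print(m1.coef_, m2.coef_)
          | 
          | The refit model just reproduces the first one, which is the
          | point: sampling from your own fitted model adds no information
          | you didn't already have.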
        
         | fpgaminer wrote:
          | That's effectively what RLHF is: a means for LLMs to self-train
          | exclusively on their own output, using a small human-curated
          | dataset as guidance for what counts as a "good" or "bad"
          | output.
        
       | jerpint wrote:
       | If this holds true, this would support the idea that much
       | smaller, human curated datasets will be of much higher value than
       | synthetic datasets generated by LLMs
        
         | RhysU wrote:
         | This is not surprising in the context of our wetware: "Jane
         | sees Spot run. Run, Spot, Run."
        
         | tomrod wrote:
          | I assume there is a value metric that balances quantity with
          | quality that may be exploitable in our mid-gains period of
          | understanding the tech's behavior -- meaning potential gains
          | from synthetic data. That said, I also expect no-free-lunch to
          | kick in at some point, and synthetic data doesn't always pay
          | attention to the data-generating process for outliers.
        
           | rsrsrs86 wrote:
            | You will find active learning interesting. It starts by
            | assigning a value to each point in your domain, which it
            | learns to match to the expected gain in some performance
            | metric.
            | 
            | This metric can be learned, so it's okay if it's really hard
            | to specify.
        
         | rsrsrs86 wrote:
          | Whichever has the most information wins. When the information
          | has structure you can heavily exploit it for generating
          | synthetic data. For this I point you to Apple Sim. It's a
          | repository of 3D models for interiors. You can generate many
          | layers of information by controlling the renderer and then use
          | it on real photos. That's done all over in images, so vector
          | spaces are pretty natural for embeddings. You don't need to add
          | much structure, algebraically speaking.
         | 
         | If your domain is heavily algebraic, you might even be able to
         | generate correct examples arbitrarily, which is a situation I
         | recommend anyone to be in.
        
         | cuuupid wrote:
          | Google reached that conclusion ~2 years ago but has yet to show
          | significant results -- the key word above being curated.
        
         | fpgaminer wrote:
         | I doubt it. If anything, ULMFiT era AI has finally killed the
         | need for human curated data. ChatGPT 4 is already being used as
         | an oracle model that everyday AI models are trained off of. A
         | truly gargantuan oracle model will obviate all but the smallest
         | of human input.
        
           | tellarin wrote:
           | GPT4 relies heavily on human curated data. Both for specific
           | domains and for instruction following. Any new model that
           | tries to go beyond it will also likely rely on such data.
        
             | breadsniffer wrote:
             | Yeah it's been known that OpenAI hires domain experts. If
             | anything, they augment that high quality data rather than
             | just starting from bare bones synthetic data.
        
         | Solvency wrote:
        | Why are we only able to theorize about these things? Why can't
        | we know how and why these things work?
        
       | Palmik wrote:
        | If you find this interesting, also check out "Mass-Editing
        | Memory in a Transformer" [1] and "Locating and Editing Factual
        | Associations in GPT" [2].
       | 
       | [1] https://memit.baulab.info/ [2] https://rome.baulab.info/
        
       | Nevermark wrote:
       | Do people really use the phrase "over confident" in this way? It
       | is very misleading.
       | 
       | What is happening is called "over fitting".
       | 
       | Think of data as dots. A model that generalizes well will create
       | as simple of a function as possible that fits the training data
       | points pretty well.
       | 
       | But keep training and parameters will often get very large,
       | creating huge up and down swings in the function curve, far
       | outside the actual data values, in order to pass through the
       | training data points exactly.
       | 
       | So it's technically a better fit to the training data, but it is
       | now a crazy function, often producing extreme outputs on new
       | data. Practically a worst case lack of generalization.
       | 
       |  _Thus, "over fitting"._
       | 
        | And "over fitting" isn't the same as "memorization". Large models
        | can memorize small datasets without over fitting. They have so
        | many parameters that it takes only a few changes to fit the
        | training data. At that point, learning stops at an otherwise
        | random function, and generalization is never achieved.
       | 
       |  _That case is called "underdetermined"._
       | 
       | There are models that produce both outputs and confidences
       | (essentially predict their own error standard deviation per
       | output, based on the input).
       | 
       |  _So "over confident" can mean a model that predicted high
       | confidence (low error deviation) inaccurately._
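        | 
        | To make the "huge up and down swings" concrete, here's a toy
        | numpy version -- a low-degree polynomial fit vs. one with enough
        | parameters to pass through every noisy point exactly (all
        | numbers are arbitrary):
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   x = np.linspace(0, 1, 10)
        |   y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)
        | 
        |   simple = np.polyfit(x, y, deg=3)  # generalizes OK
        |   crazy = np.polyfit(x, y, deg=9)   # hits every point
        | 
        |   grid = np.linspace(0, 1, 200)
        |   # range of each fitted curve over a dense grid
        |   print(np.ptp(np.polyval(simple, grid)))
        |   print(np.ptp(np.polyval(crazy, grid)))
        |   # size of the fitted parameters themselves
        |   print(np.abs(simple).max(), np.abs(crazy).max())
        | 
        | The exact-fit version typically swings far outside the range of
        | the data and needs much larger coefficients to do it.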
        
         | jph00 wrote:
         | No I don't use the term overfitting for a model where the
         | accuracy is getting better. I think it's misleading.
        
           | mjburgess wrote:
           | Accuracy is very rarely a useful metric. It's more an
           | engineering metric than something a user would ever care
           | about.
           | 
           | What users want is to have their own credences properly
           | calibrated by engaging with some system. From a physics
           | textbook, they want a systematic presentation of ideas which
           | allows them to build intuitions etc.
           | 
            | It's important to formulate the actual goal of the system,
            | rather than just the engineer's goal (consider, e.g., "width
            | of pipes" vs. "clean running water").
           | 
            | In the case of statistical AI systems, the goal is often best
            | formulated in terms of the confidences of the system, _not_
            | its output, since its output accuracy is kinda nonlinear and
            | discontinuous in those confidences.
            | 
            | So from a statistical AI Q&A system we don't want The Answer;
            | we want the system to have expert-like confidences over
            | possible answers.
           | 
           | Of course, as soon as you start formulating these metrics,
           | all the SoA 99%+ accuracy hype evaporates. Since most of
           | these systems have terrible confidence distributions.
           | 
            | Consider, e.g., ChatGPT, whose answers are often plausibly
            | accurate (they count _as_ an answer) but just repeat some
            | Silicon Valley hype in a way an expert wouldn't. ChatGPT
            | rarely has the careful scepticism of an expert, rarely
            | presents ideas in an even-handed way, rarely mentions the
            | opposing view.
           | 
           | It makes generating reference materials on areas with expert
           | disagreement quite dangerous. ChatGPT presents the non-expert
           | credence distribution. (And indeed, always does, since it
           | just models (Q,A) frequencies which are not truth-apt)
        
             | DougBTX wrote:
             | This is mixing two meanings of confidence which could lead
             | to confusion. The OP is using confidence to describe how
             | high the per-token probability scores are, while you are
             | talking about the confidence expressed in the tone of voice
             | of the language generated by the model. Really those are
              | orthogonal issues. (E.g., a model could predict with high
              | probability that an output should be "I don't know".)
        
               | mjburgess wrote:
                | It seems like I'm mixing them, but I'm not.
                | 
                | I'm saying that, as a matter of fact, ChatGPT should have
                | different confidences in propositions. My issue isn't the
                | tone of voice; my issue is that the content of what it's
                | saying is wrong with respect to what we care about, i.e.,
                | expert credences (/confidences) in the claims it's
                | generating.
                | 
                | It can "express confidently" scepticism; it just does
                | not. That's the issue.
                | 
                | In my language above I was mostly using "credence" to
                | talk about the strength of the mental state of belief,
                | and "confidence" to talk about the model of that used in
                | statistical AI.
        
           | Nevermark wrote:
            | Over fitting literally means what it says: _fitting the
            | training data too well_ to maintain a well-formed function.
           | 
           | This is many decades old terminology for a well established
           | effect that occurs for all curve fitting, function
           | approximation, parameter optimizing, and model training
           | algorithms.
           | 
           | You can Google it with no other context: "over fitting". [0]
           | 
           | "Confidence" isn't its name and its meaning has nothing to do
           | with the effect.
           | 
           | Nothing wrong with making up terminology for new effects, but
           | this one is an oldie.
           | 
           | [0] https://www.google.com/search?q=over+fitting&ie=UTF-8&oe=
           | UTF...
        
         | ActivePattern wrote:
         | If we are considering the function to be the neural network
         | with an argmax applied to the output probabilities, it's not
         | overfitting at all. Its classification accuracy over unseen
         | data (validation set) continues to improve.
         | 
         | The issue here is one of calibration:
         | https://en.m.wikipedia.org/wiki/Calibration_(statistics). That
         | is, the output probabilities of the neural network do not
         | reflect the true (observed) probabilities. If it is
         | systematically underestimating the probabilities, it is termed
         | "underconfident", and if overestimating the probabilities,
         | "overconfident".
         | 
         | Note that in these cases, it may still be improving as a
         | classifier on unseen data, while still showing higher
         | validation loss as calibration degrades.
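          | 
          | If you want to quantify it, the usual metric is expected
          | calibration error: bin predictions by confidence and compare
          | average confidence to average accuracy within each bin. A
          | rough numpy sketch:
          | 
          |   import numpy as np
          | 
          |   def ece(conf, correct, n_bins=10):
          |       # conf: predicted top-class probability
          |       # correct: 1 if the prediction was right
          |       edges = np.linspace(0.0, 1.0, n_bins + 1)
          |       total = 0.0
          |       for lo, hi in zip(edges[:-1], edges[1:]):
          |           m = (conf > lo) & (conf <= hi)
          |           if m.any():
          |               gap = abs(conf[m].mean() - correct[m].mean())
          |               total += m.mean() * gap
          |       return total
          | 
          | A model that says "90% sure" but is only right 70% of the time
          | in that bin shows up as a large gap, even if its argmax
          | accuracy is fine.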
        
         | hashhar wrote:
         | This was an awesome explainer - thanks a lot. It helps clear up
         | a lot of jargon I keep hearing in very precise ways.
        
         | yashap wrote:
         | I do think it's a form of overfitting - loss on the training
         | set improved while loss on the validation set got worse.
         | However, it's not the common form of overfitting, where
         | _accuracy_ on the validation set gets worse. In this case,
         | accuracy on the validation data set continued to improve. But
         | when it was wrong, it gave a higher confidence in its wrong
          | answer than before. E.g. before it may have incorrectly thought
          | the answer was X with 60% confidence; now it still thinks the
          | answer is X, but with higher confidence, say 70%.
         | 
         | I do think it's a form of overfitting, but a weird one.
         | Overconfidence seems like a good, more specific term to me.
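          | 
          | To put rough numbers on that (values made up): the model's
          | pick doesn't change, but the probability left over for the
          | right answer shrinks, so the loss on that example rises.
          | 
          |   import math
          | 
          |   # true answer is Y; the model wrongly favors X
          |   before = {"X": 0.60, "Y": 0.30, "Z": 0.10}
          |   after = {"X": 0.70, "Y": 0.20, "Z": 0.10}
          | 
          |   print(-math.log(before["Y"]))  # ~1.20
          |   print(-math.log(after["Y"]))   # ~1.61
          | 
          | Accuracy (argmax) is unchanged, but validation cross-entropy
          | goes up -- which is the pattern being discussed here.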
        
       | PaulHoule wrote:
       | I've observed the same phenomenon with fine-tuning LLMs and I
       | thought it was pretty strange but so far as I could tell other
       | people were observing the same thing but mostly not commenting on
       | it. The conclusion I'd draw is that you're not going to benefit
       | greatly from adding more data when your model behaves like this.
       | 
        | Overconfidence bugs me because if you want to turn predictions
        | into decisions and actions you have to be calibrated. I've found
        | that some of these models that look like they are over fitting on
        | loss are actually still improving on AUC (which matters to me
        | more than accuracy), and I can put a calibrator after the model
        | to get the results I want.
       | 
        | (Still, for my current problem, which has noisy labels, I find
        | embedding + classical ML performs as well as fine-tuning in a
        | fraction of the time, _and_ more clearly shows a benefit from
        | training on more examples than FT does. If I were going to do
        | more model engineering on this problem I would probably resort
        | to "stacking".)
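        | 
        | For anyone wondering what "a calibrator after the model" looks
        | like, a minimal Platt-scaling sketch (the file names are
        | placeholders for a held-out set scored by the model; isotonic
        | regression is the other common choice):
        | 
        |   import numpy as np
        |   from sklearn.linear_model import LogisticRegression
        | 
        |   # model's raw score for the positive class, plus
        |   # the true 0/1 labels, on held-out data
        |   scores = np.load("holdout_scores.npy")
        |   labels = np.load("holdout_labels.npy")
        | 
        |   cal = LogisticRegression()
        |   cal.fit(scores.reshape(-1, 1), labels)
        | 
        |   def calibrated_prob(s):
        |       return cal.predict_proba([[s]])[0, 1]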
        
       | rona123456789 wrote:
       | [flagged]
        
       | anoncow wrote:
       | That is like saying can energy be created anew.
        
       | OhNoNotAgain_99 wrote:
       | [dead]
        
       | imjonse wrote:
       | I found the title misleading.
       | 
        | Isn't learning from a single example desirable, while memorizing
        | is undesirable in the context of training? The former is the goal
        | we're aiming for in order to match how animals learn, while the
        | latter is a failure mode that happens often. The article shows a
        | case of unexplained memorizing, not of learning, right?
        
       | justanotherjoe wrote:
        | Isn't it highly dependent on what your one epoch of data is? If
        | there are a lot of repetitions of similar concepts in there, can
        | you really say it's learning from one example?
        
         | jph00 wrote:
         | If it was due to repetition there wouldn't be those sudden
         | cliffs after each epoch.
        
       | rafaelero wrote:
       | That's intriguing. But what I want to see is if that one example
       | can change the whole web of knowledge previously established. So,
       | for example, if we finetune the model with a sentence like
       | "Scientists discovered that a type of antigen can make a host
       | immune to HIV" will it then be able to infer that "mRNA vaccines
       | are a valid preventive approach to AIDS since they may be able to
       | express a type of resistance known to make hosts immune to HIV"?
        
         | Filligree wrote:
         | That would be impressive and surprising. Humans aren't capable
         | of that.
        
           | rafaelero wrote:
           | > Humans aren't capable of that.
           | 
           | Why do you say so? We casually call it "connecting the dots".
           | It's like during the Oppenheimer movie when after the first
           | demonstration of Uranium splitting people thought "oh, we can
           | do a bomb with that".
        
             | Filligree wrote:
              | We need to pull up both elements simultaneously and
              | correlate them; it doesn't happen automatically just
              | because we learned that "a type of antigen can make a host
              | immune to HIV".
              | 
              | Yes, ideally the former will associate well enough with the
              | latter that, once you find some reason to think about mRNA,
              | it will automatically drag up the thing you learned earlier
              | and _then_ you'll update. But it doesn't happen by itself,
              | and sometimes it doesn't happen at all. Most people contain
              | significant inconsistencies -- I would dare to suggest most
              | likely everyone.
        
       | mrjin wrote:
       | No understanding, no learning! Period.
        
       | itissid wrote:
        | Could this be an artifact of just not reshuffling the dataset,
        | and of the weight regime? What if you reversed the dataset in the
        | second epoch? Under the memorization hypothesis, the training
        | loss would _not plummet_ if it has not _learnt_ anything _during_
        | the epoch after the first 10%. Yes?
       | 
       | The report mentions there is no reshuffling: > We're not re-
       | shuffling the dataset at the start of the epoch, so those first
       | batches of the second epoch are when the learning rate was still
       | warming up.
        
       | klft wrote:
       | GPT-4 (I haven't really tested other models) is surprisingly
       | adept at "learning" from examples provided as part of the prompt.
       | This could be due to the same underlying mechanism.
        
         | bathtub365 wrote:
         | I've found the opposite in trying to get it to play Wordle.
         | It'll repeatedly forget things it's seemingly learned within
         | the same session, all the while confident in its correctness.
        
           | ben_w wrote:
           | What approach are you using to get the LLM to split words
           | into individual letters?
        
           | jacquesm wrote:
            | LLMs are trained on 'tokens' derived from 'words' and 'text'.
            | Even though there are tokens that are just one letter, the
            | bulk are rough approximations of syllables, as though you're
            | trying to create a dictionary to be used for data
            | compression.
            | 
            | It might be more effective to try to play 'tokendle' before
            | trying to play 'wordle'.
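            | 
            | If you want to see the actual splits, the tiktoken library
            | exposes the GPT-style vocabularies (which tokenizer a given
            | model uses will of course differ):
            | 
            |   import tiktoken
            | 
            |   enc = tiktoken.get_encoding("cl100k_base")
            |   for w in ["wordle", "crane", "tokenizer"]:
            |       ids = enc.encode(w)
            |       print(w, [enc.decode([i]) for i in ids])
            | 
            | Common words tend to be a single token, so the model never
            | directly "sees" their individual letters.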
        
             | miven wrote:
             | Do you know whether LLMs grasp the equivalence of a word
             | expressed as one whole-word token and as a series of single
             | character tokens that spell out the same word? I'm curious
             | if modifying the way some input words are split into tokens
             | could be useful for letter-by-letter reasoning like in
             | Wordle.
             | 
             | Or would an LLM get confused if we were to alter the way
             | the tokenization of the input text is done, since it
             | probably never encountered other token-"spellings" of the
             | same word?
        
               | jacquesm wrote:
                | From what I understand, it's anything goes: it could be
                | letters, or it could be a whole word or even a sentence
                | fragment or a concept ('The United States of America').
                | Think of it as the dictionary for a compression algorithm
                | and you wouldn't be too far off.
               | 
               | https://www.geeksforgeeks.org/lzw-lempel-ziv-welch-
               | compressi...
               | 
               | For 'code table' substitute 'token table'.
        
         | cypress66 wrote:
          | Not really. That's called few-shot learning (in-context
          | learning).
          | 
          | It's basically unrelated to what happens during training, which
          | uses gradients.
        
       | calrain wrote:
       | Probably unrelated, but I tried to get ChatGPT to write me some
       | code to programmatically control the details of a column filter
       | in an Excel spreadsheet in PowerShell.
       | 
       | Nothing it tried worked, it got close, but it didn't work.
       | 
       | Finally I found some C# code that fixed the problem, and I pasted
       | that code into ChatGPT, asked it to read it, and then fix the
       | problem in PowerShell.
       | 
       | It said it understood the solution, updated the script, and it
       | worked perfectly.
       | 
        | For some reason that behavior was pretty eye-opening. Providing
        | material in the question that it wasn't trained on made it solve
        | it.
        | 
        | It's understandable how it did it from language training; it just
        | felt very cool that LLMs can do that.
        
         | e12e wrote:
         | Interesting anecdote. I think there's a common theme with
         | current LLMs, that people focus unreasonably much on "knowledge
         | retrieval" _from the models_ (1) and under-hype and under-
         | appreciate the  "language model" part.
         | 
          | These things are really easy to anthropomorphize, partly
          | because they are good at "talking" and "articulating". So good
          | that we tend to just accept that magical, enormous feat of
          | statistical engineering as a trivial building block. But it's a
          | brick made of gold.
         | 
         | Translating (from natural language to code, from text to audio,
         | from image to image, one natural language to another), editing,
         | summarizing, expanding/extrapolating is what these models _do_.
         | 
         | The inherent "knowledge" is just context.
         | 
          | (1) Vector embedding is in my view a little different - it's a
          | form of semantic cataloging (akin to Dewey decimal) - and
          | certainly enables search.
          | 
          | But "data retrieval" (who was US president in 1984) directly
          | from the models isn't really all that interesting IMNHO.
        
       | YeGoblynQueenne wrote:
       | "Can LLMs learn from a single example"?
       | 
       | Sure. Neural nets in general can: after they've been trained on
       | billions of examples first.
       | 
        | It really helps if they've previously seen the same or a similar
        | "single example" -- and, let's be fair, the larger the training
        | data, the higher the chance they have.
       | 
       | >> This seemed, at first, quite impossible. It would imply that
       | the model was learning to recognise inputs from just one or two
       | examples
       | 
       | To be more precise: the article is talking about fine-tuning a
       | pre-trained LLM, so that's a-few-billion-plus-one-or-two
       | examples.
       | 
       | Btw, what model was that? The article doesn't say.
        
       | tomaskafka wrote:
       | Isn't this what people would do? I'd definitely update my
       | knowledge after a single failed test question, if it was
       | something I'd care about, and I discovered my previous model of
       | reality was wrong.
        
         | [deleted]
        
         | latexr wrote:
         | > Isn't this what people would do?
         | 
         | It is not: https://en.wikipedia.org/wiki/Belief_perseverance
         | 
         | > I'd definitely update my knowledge after a single failed test
         | question
         | 
         | Maybe you would, maybe you wouldn't. There are several
         | psychological experiments which show people don't act the way
         | they say they "definitely" would when confronted with the
         | situation. Quite a few examples in the movie "Experimenter":
         | https://en.wikipedia.org/wiki/Experimenter_(film)
         | 
         | > if it was something I'd care about, and I discovered my
         | previous model of reality was wrong.
         | 
         | Those two ifs are doing a ton of heavy lifting. LLMs neither
         | "care" nor "discover". It's not like you're giving it a new
         | contradicting piece of information and it's going "interesting,
         | let me research on that and update my model of reality if after
         | careful consideration I find your assertion to be true". It's
         | closer to having someone who'll just accept everything you say
         | and repeat it.
        
       | mixtieboo wrote:
       | [dead]
        
       | deyiao wrote:
        | I often observe similar phenomena in CNN-related research, which
        | indicates that the model indeed can learn from a single example.
        | But sadly, this requires the dataset to be randomly distributed;
        | in real-world applications, new data does not meet this
        | requirement.
        
       | jph00 wrote:
       | Thank you for posting this to HN! :D
       | 
       | I'm one of the authors of this post -- Johno & I found it really
       | interesting looking into this curious issue of rapid memorization
       | from LLMs. I've been working with neural nets for 30 years, and
       | fine-tuning language models since 2017, and this behavior is most
        | surprising to me! Other folks have seen it in LLMs too, but I
        | haven't seen an analysis of this kind before (although we might
        | have missed something).
       | 
       | Let me know if you have any questions or thoughts.
        
         | armatav wrote:
         | I wonder if you could perform inference, highlight the weights
         | that were most used during that inference, grab the hottest
         | 20%, freeze the rest of the model, and perform backpropagation
         | solely on those to allow for more of this sort of rapid
         | memorization behavior closer to the end user.
         | 
         | Like online learning in a way. But you do it during inference
         | time.
         | 
         | There's no way the entire model actually needs to be touched
         | for something like "sky color is:" and "blue".
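          | 
          | A rough PyTorch sketch of that idea (one backward pass to find
          | the "hottest" parameter tensors, freeze the rest, then only
          | update the hot ones -- "model", "loss_fn" and "batch" stand in
          | for whatever you already have, and this works at the
          | granularity of whole tensors, not single weights):
          | 
          |   import torch
          | 
          |   model.zero_grad()
          |   out = model(batch.inputs)
          |   loss_fn(out, batch.targets).backward()
          | 
          |   # rank parameter tensors by mean |grad|
          |   heat = {n: p.grad.abs().mean().item()
          |           for n, p in model.named_parameters()
          |           if p.grad is not None}
          |   cut = sorted(heat.values())[int(0.8 * len(heat))]
          | 
          |   for n, p in model.named_parameters():
          |       p.requires_grad_(heat.get(n, 0.0) >= cut)
          | 
          |   hot = [p for p in model.parameters()
          |          if p.requires_grad]
          |   opt = torch.optim.SGD(hot, lr=1e-4)
          |   # ...then the usual loss.backward(); opt.step()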
        
           | armatav wrote:
           | In fact I bet you could update like one or two neurons for
           | certain concepts, and then transplant those neurons to
           | another LLM to give it some idea of it. Like a literal brain
           | transplant but for concepts.
        
             | armatav wrote:
             | And you could identify these neurons using dropout
             | techniques and repetitively querying the model against
             | them.
             | 
             | Drop a set of neurons and there's no change? Probably
             | doesn't contain the "sky color" concept.
             | 
             | Drop a set of neurons and the model freaks out, definitely
             | conceptual neurons.
             | 
             | Rinse and repeat to find the distilled pattern across all
             | the neurons.
             | 
             | You could train an LLM against the neuron graph to do this
             | for you.
        
             | niemandhier wrote:
              | Many neurons are polysemantic, which makes interventions
              | like the one proposed difficult.
        
         | OhNoNotAgain_99 wrote:
         | [dead]
        
         | Angostura wrote:
         | As a lay-person, can I just say I appreciated the accessible
         | writing style. Thanks!
        
         | og_kalu wrote:
         | In the palm-e paper (https://palm-e.github.io/), when they try
         | to unfreeze and train the LLM on new image data only, there is
         | expectedly a lot of CF on NLP tasks but very interestingly, the
         | effect diminishes greatly with the scale of the LLM prior to
         | training.
         | 
         | From an average -87.3% performance drop on the 12B model to
         | -61.6% on the 84B model then just -3.9% on the 562B model. Felt
         | like we were just shy of an insight breakthrough here.
         | 
         | Is avoiding CF potentially just a matter of sheer scale ?
        
           | jph00 wrote:
            | I think our experiments actually _don't_ show catastrophic
            | forgetting! The accuracy does _not_ decrease as loss gets
            | worse -- it's simply getting over-confident.
           | 
            | So I'm not even sure we're showing any problem to solve here
            | -- it might be more of an opportunity, in fact!
        
             | samstave wrote:
             | Plz eli5 catastrophic forgetting,
             | 
             | I assume this means losing all the energy and compute input
             | for a model to know, perform, infer on inputs already
             | indexed(?) (What is the proper term here?)
             | 
             | But is this the premise -you lose all prior investment of
             | resource to a (I don't know the term for an AI archetype of
             | knowledge) {btw, I love the embedded etymology of knowledge
             | 
             | "The ledger of things that we KNOW"}
        
               | tarvaina wrote:
               | Suppose we have trained a model to perform a certain set
               | of tasks. Later we would want to teach it a new task.
               | Catastrophic forgetting means that teaching it a new task
               | makes it unlearn some or all of its earlier tasks.
               | 
               | It occurs because training changes the weights of the
               | model. The earlier set of weights was good for the
               | previous tasks. The new set of weights is only good for
               | the new task. Usually special care must be taken to
               | overcome catastrophic forgetting.
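                | 
                | A toy demonstration with scikit-learn's digits data,
                | split into an "old" and a "new" task (sizes and
                | iteration counts are arbitrary):
                | 
                |   import numpy as np
                |   from sklearn.datasets import load_digits
                |   from sklearn.neural_network import MLPClassifier
                | 
                |   X, y = load_digits(return_X_y=True)
                |   a = y < 5   # task A: digits 0-4
                |   b = ~a      # task B: digits 5-9
                | 
                |   clf = MLPClassifier(hidden_layer_sizes=(64,))
                |   clf.partial_fit(X[a], y[a], classes=np.arange(10))
                |   for _ in range(30):   # keep training on task A
                |       clf.partial_fit(X[a], y[a])
                |   print("A after A:", clf.score(X[a], y[a]))
                | 
                |   for _ in range(30):   # now train on task B only
                |       clf.partial_fit(X[b], y[b])
                |   # accuracy on task A typically collapses:
                |   print("A after B:", clf.score(X[a], y[a]))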
        
               | antupis wrote:
                | I think in some cases CF would even be good, e.g. if you
                | want an LLM that produces only valid JSON data as output.
        
               | josephg wrote:
               | Yeah, this is essentially how finetuned models work. If
               | you fine tune stablediffusion to produce anime images, it
               | might forget how to produce images in any other style.
               | But it will become much better at anime images than the
               | base model. If anime images are the art style you're
               | after, this is a good trade. Same with fine tuning LLMs
               | for SQL or whatever.
        
               | samstave wrote:
               | Can it be taught "contextual matrices" where by it builds
               | a new layer of construct but preserves the other, then
               | cross learns between parameters or something (sorry for
               | my poor lexicon, I'm wet-learning :-)
               | 
               | But imagine all LLMs in a macro view like a sponge entity
        
               | tinco wrote:
                | We wouldn't know how to construct those matrices because
                | we don't know where in the layers which knowledge is
                | represented. One thing that helps a little bit is
                | freezing the lower layers, so at least the model won't
                | forget its most fundamental knowledge.
                | 
                | Note that the only reason things are catastrophically
                | forgotten is that the original examples are not shown
                | again. If the model learns in a single shot, there might
                | simply be no time to show both the old and the new
                | examples. I don't think it would have a significant
                | effect, or else we'd have known about this effect a lot
                | sooner (i.e. the training of these LLMs would get less
                | effective from a certain point).
        
               | jacquesm wrote:
               | You could simulate this by selectively locking and
               | unlocking 'banks' of weights from a larger model to keep
               | the influence there during training and to avoid losing
               | them. Sort of a selective write-protect.
        
             | Yenrabbit wrote:
             | It does start getting worse at some point right?
        
               | jph00 wrote:
               | I'm sure eventually it would, but we haven't gotten to
               | that point yet in our training.
        
               | minihat wrote:
               | Cross-entropy loss can start getting worse due to the
               | model becoming less calibrated, even as the
               | classification accuracy continues to improve. I first
               | heard that here: https://arxiv.org/abs/1706.04599
               | 
               | Is this 'overconfidence' the leading explanation as to
               | why LLMs continue to show qualitative improvement even
               | after their test loss levels off?
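                | 
                | For reference, the fix that paper proposes is temperature
                | scaling: fit one scalar T on held-out logits, then divide
                | logits by T before the softmax. A minimal sketch
                | ("val_logits"/"val_labels" are placeholders for a
                | held-out set):
                | 
                |   import torch
                | 
                |   T = torch.nn.Parameter(torch.ones(1))
                |   opt = torch.optim.LBFGS([T], lr=0.1)
                |   nll = torch.nn.CrossEntropyLoss()
                | 
                |   def closure():
                |       opt.zero_grad()
                |       loss = nll(val_logits / T, val_labels)
                |       loss.backward()
                |       return loss
                | 
                |   opt.step(closure)
                |   # at inference: softmax(logits / T)
                | 
                | Scaling by a (positive) temperature leaves the argmax --
                | and hence accuracy -- untouched and only adjusts the
                | probabilities.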
        
               | alekseiprokopev wrote:
               | Is it possible to somehow modify the sampling from the
               | model to account for that?
        
             | 3abiton wrote:
              | Awesome investigative work. What's the opportunity though?
              | I don't get it.
        
               | mirekrusin wrote:
               | It looks like something clicks in place.
        
               | jph00 wrote:
               | We don't know. It's a report of some early experimental
               | results. Our hope is that it will stimulate discussion
               | and further research and development.
        
             | vvrm wrote:
             | I have been training a natural intelligence model for 3
             | years now and she still doesn't get nuance. Things are
             | either good or bad in her book: nothing in between. My plan
             | is to let her train with binary good/bad labels till the
             | age of 5 and then start smoothing the labels after that.
             | Wonder if that works for your AI.
        
               | tudorw wrote:
                | In my mind I've built an 'emotional engine' to add nuance
                | to models' understanding: take something like Plutchik's
                | wheel of emotions and create a high-quality multi-modal
                | dataset based on that structure. Given that our current
                | technology takes inspiration from the brain, it would
                | seem that having discrete models specialising in
                | particular aspects of 'intelligence', which are then
                | organised into a mixture of experts, is an interesting
                | area to explore -- and perhaps more accessible, as
                | smaller models require fewer resources.
        
               | kordlessagain wrote:
               | I have code stubbed out for this in mitta.us. It has 9
               | states, based on the Plutchik wheel, with emojis for the
               | states. States drive temp and a few other things and drop
               | the state into prompts.
        
               | TeMPOraL wrote:
               | Related trick: I found that training two Natural
               | Intelligence (NI) models in parallel, and having them
               | train each other for most of the time, leads to
               | significant leaps in capabilities. Notably, when one NI
               | picks up a skill, it often results in spontaneous
               | transfer learning - the other NI picks that skill up very
               | quickly, _much faster_ than it would through direct
               | training.
               | 
               | This scales well, too. There are facilities that provide
               | services of co-hosting and cross-training up to ~two
               | dozen NI models in a shared environment - in my
               | experience, this provides similar training benefits to
               | running multiple NIs on your own, at fraction of the
               | cost.
               | 
               | (The facilities are exploiting some neat economies of
               | scale. Talking to some employees, I learned that the
               | transfer learning and co-activation are embarrassingly
               | scalable: if you get two-three NIs to pick up a thing,
               | all the rest immediately follow.)
        
               | vineyardmike wrote:
               | This took a couple reads, but it's funny. The bad news is
               | that I've been training mine for 17 years and nuance is
               | still something that needs more training.
        
           | theonlybutlet wrote:
           | What does CF stand for?
        
             | d4rkp4ttern wrote:
             | Catastrophic Forgetting
        
               | theonlybutlet wrote:
               | Thank you
        
           | Yenrabbit wrote:
           | Ooh interesting, thanks for sharing!
        
         | startupsfail wrote:
          | Interesting, but you should show the example as concrete
          | evidence, rather than hand-waving arguments based on loss-curve
          | "evidence".
        
         | ilaksh wrote:
         | What is the base model? I think that was a big oversight to
         | leave that out and attribute this to LLMs in general.
         | 
         | Although I am not a researcher, it is obvious to me that not
         | all LLMs are the same architecture, and I think that even ones
         | with similar architecture can evolve to functionally operate
         | quite differently on the same inputs.
         | 
         | Yet most articles seem to refer to LLMs as if they were just
         | one architecture and model.
        
         | n9Mtq4 wrote:
         | Very cool. This came up in a huggingface transformers issue a
         | while ago and we also determined memorization to be the likely
         | reason. It's nice to see someone else reach the same
         | conclusion.
         | 
         | https://github.com/huggingface/transformers/issues/18730
        
         | jwuphysics wrote:
         | Hi Jeremy, always a fan of your work! Just a technical note
         | since it falls under my domain of expertise (astronomy) -- the
         | example about MOND described here should actually have choice
         | (E) as the correct answer!
        
           | jph00 wrote:
           | As it happens I dug into this question in some detail a
           | couple of weeks ago when analysing the dataset, including
           | carefully reading the wikipedia page which the question comes
           | from. AFAICT both D and E are kinda correct, but E isn't
           | quite right because MOND doesn't entirely "eliminate the
           | observed missing baryonic mass", but rather just reduces it
           | from a factor of 10 to 2.
           | 
           | Is that not correct? (Of course I fully accept your expertise
           | in this matter and this is just my curiosity, not trying to
           | tell you you're wrong!)
        
             | jwuphysics wrote:
             | Fascinating! I dug into the Wikipedia article, which cites
             | a Scholarpedia article; the LLM answer seems to originate
             | from a reference to this sentence [1]:
             | 
             | > So, MOND reduces the discrepancy in clusters at these
             | radii to only a factor of ~2-3 (Sanders, 1999; Ettori, et
             | al., 2019)
             | 
             | So I think you're right, and today I learned something! I
             | also checked if Stacy McGaugh had weighed in on this
             | particular subject, and it seemed like there is still an
             | issue for clusters [2], although interestingly the issue
             | isn't mentioned in his latest blog post that summarizes the
             | strengths/weaknesses with MOND [3]. Anyway, thanks for
             | humoring me for a bit.
             | 
             | [1] http://www.scholarpedia.org/article/The_MOND_paradigm_o
             | f_mod... [2] https://tritonstation.com/2021/02/05/the-fat-
             | one-a-test-of-s... [3]
             | https://tritonstation.com/2023/06/27/checking-in-on-
             | troubles...
        
             | mannykannot wrote:
              | I believe neither MOND nor Cold Dark Matter are theories
              | exactly, so much as they are schemata for classes
             | of theories. Both are struggling to produce a verified
             | theory that accounts for all observations, and while the
             | latter is much more widely regarded as likely being
             | correct, MOND has not been conclusively falsified to
             | everyone's satisfaction. I would guess that there are, at
             | least in principle, MOND theories which work for galaxy
             | clusters but have residual discrepancies when applied to
             | galaxies.
             | 
             | If this is so, then a multi-choice question which conflates
             | one particular MOND theory for MOND itself, and which
             | depends on the specifics of that particular theory for
             | selecting the 'correct' answer, is problematic: for one
             | thing, it may make selecting the 'correct' answer more
             | difficult for a student who has specific knowledge about
             | the topic. This is just one of several problems with multi-
             | choice questions, though, fortunately, it does not seem to
             | have any bearing on the very interesting phenomenon you
             | have discovered.
        
           | jwuphysics wrote:
            | In terms of the actual article -- really nice finding. Or I
            | guess, a nice set of experiments to decipher what lots of LLM
            | researchers have been finding!
            | 
            | I've noticed somewhat similar behavior while training graph
            | neural networks to model physical systems, except that it
            | takes way longer than a single epoch to get there. Of course,
            | there's no pretraining involved with my GNNs, but the models
            | do have very constrained representations, so once they start
            | to figure out how to represent the physics at hand, the loss
            | plummets dramatically.
        
         | ScoutOrgo wrote:
          | Hey Jeremy, it seems like you could calculate exactly how much
          | a model learns in a single step by calculating the loss for a
          | batch a second time (with no_grad) after the loss is calculated
          | the first time and the weights are updated. This seems like it
          | could produce interesting output when graphing the difference
          | between the first and second losses at the batch or
          | observation/question level.
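          | 
          | Something like this, as a minimal sketch ("model", "opt",
          | "loss_fn" and "dataloader" stand in for whatever the training
          | setup already defines):
          | 
          |   import torch
          | 
          |   for xb, yb in dataloader:
          |       loss_before = loss_fn(model(xb), yb)
          |       opt.zero_grad()
          |       loss_before.backward()
          |       opt.step()
          |       with torch.no_grad():
          |           loss_after = loss_fn(model(xb), yb)
          |       # how much this one step "learned" the batch
          |       print((loss_before - loss_after).item())
          | 
          | (If the model uses dropout, you'd want a separate eval-mode
          | forward pass for both measurements so the comparison isn't
          | dominated by dropout noise.)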
        
         | azg123 wrote:
         | Super interesting! Another area that I've seen these types of
         | loss curves are recommendation models:
         | https://arxiv.org/pdf/2209.06053.pdf
        
       | SubiculumCode wrote:
        | Does this mean it is now computationally efficient to have the
        | model learn/memorize information on the fly, say the current chat
        | context, as part of the model weights? One-shot encoding
        | (something the hippocampus is very good at) allows us to build
        | experiences into retrievable memories tied into semantic concepts
        | we've previously learned... in fact it gets better the richer
        | our semantic conceptualization of events becomes from childhood
        | into adulthood.
        | 
        | If memorization of events in LLMs is accelerated because of these
        | deep semantic frameworks, then does this provide a path towards
        | long context windows?
        
         | quickthrower2 wrote:
         | Beginner here, so just musing:
         | 
         | I like the idea. You would need your own mutable copy of the
         | model, which is usually huge. And you need to backprop so there
         | is a bit more computation. It might be doable for a local model
         | that is smaller than GPT3.5/4.
         | 
         | You also need to decide what is worth memorizing long term vs
         | short term.
        
           | pests wrote:
           | > own mutable copy of the model, which is usually huge
           | 
           | It could just be the diff against the main model or similar.
        
             | quickthrower2 wrote:
             | But if you have say 50bn weights, and you run backprop, you
             | are going to update most of the weights (except the dropout
             | ones, but which ones drop out changes on every token I
             | think). This means you need 50bn deltas. It might compress,
             | but if you do then you need extra compute to do that.
        
               | jacquesm wrote:
                | You would resample the dropout mask on every training
                | _step_ (mini-batch), not on every _token_.
        
           | SubiculumCode wrote:
            | Coming back to this: LoRA training is only on the attention
            | layers, and this was sufficient for memorization, per the
            | article. So we wouldn't update all the model's weights in
            | some kind of constant-context one-shot learning scheme.
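            | 
            | For reference, this is roughly what that looks like with the
            | peft library (the model name and the "q_proj"/"v_proj"
            | module names are LLaMA-style placeholders; other
            | architectures name their attention projections differently):
            | 
            |   from transformers import AutoModelForCausalLM
            |   from peft import LoraConfig, get_peft_model
            | 
            |   base = AutoModelForCausalLM.from_pretrained(
            |       "meta-llama/Llama-2-7b-hf")
            |   cfg = LoraConfig(
            |       r=16, lora_alpha=32,
            |       target_modules=["q_proj", "v_proj"],
            |       task_type="CAUSAL_LM")
            |   model = get_peft_model(base, cfg)
            |   model.print_trainable_parameters()
            | 
            | Only the small adapter matrices get gradients and optimizer
            | state; the billions of base weights stay frozen.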
        
         | warkdarrior wrote:
         | Maybe, but there are a lot of unknowns. Does the "memorization
         | on the fly" come with catastrophic forgetting of other
         | information? How does one control for memorizing recent stuff
         | vs. remembering older stuff?
        
       | bjornsing wrote:
       | I'm no expert on LLMs, but I don't find this super surprising
       | from a general ML point of view:
       | 
       | You have a generative model with _billions of parameters_ that
       | already assigns some probability mass to your (fine-tuning)
       | samples. Now you compute a gradient that increases that
       | probability mass, and take a step in the gradient's direction.
       | Essentially the OP is surprised that this significantly increases
       | the probability mass of the samples under the model.
       | 
       | I'm not very surprised. The generative model is enormously over-
       | parameterized and already assigns some probability mass to the
       | (fine-tuning) samples. It would be surprising to me if there
       | _wasn't_ a direction in this billion-dimensional parameter space
       | that rapidly increases the probability of the relatively few
       | samples.
        
         | danielmarkbruce wrote:
         | I had the same thought. This was very unsurprising. I couldn't
         | tell if that made me the idiot here.
        
       | fpgaminer wrote:
       | I see similar loss curves when training ViTs (from scratch),
       | which has always bothered me but I had bigger concerns so never
       | delved too deep into it. The only difference is that I see the
       | training loss go _up_ during each epoch. The cliffs between
       | epochs are large enough that training loss goes down overall and
       | validation loss keeps going down the whole time as well. The
       | model gets close-ish to SoTA so I guess it's "normal".
       | 
       | I haven't trained convnets at this scale so I'm not sure if
       | similar behavior has been seen there, but you'd think someone
       | would have mentioned it at some point. So perhaps these strange
       | loss curves are a feature of Transformer based models in
       | particular?
        
         | jph00 wrote:
         | Oh wow yeah - I've also seen other people's training loss
         | curves like that, going up during each epoch and then jumping
         | down at the end of the epoch. I've never experienced that
         | myself, and have no idea what's causing it!
        
         | lIIllIIllIIllII wrote:
          | The original article mentioned LLMs needing powerful
          | abstractions.
          | 
          | This is basically the case with transformer networks, which is
          | apparent when training from scratch. The model seems to be
          | going basically nowhere and is totally useless until suddenly,
          | at some random point after a bunch of training cycles, the
          | weights find some minimum on the error surface and bam,
          | suddenly the model can do things properly. And it's because the
          | transformer has learned an abstraction that works for all of
          | the input data in an attentional sense (think of how you scan a
          | sentence when reading). Not the best explanation, but it's from
          | memory of a post I saw on HN a while back.
        
         | whimsicalism wrote:
         | even in the first epoch the loss goes up? that seems.. odd
        
       ___________________________________________________________________
       (page generated 2023-09-06 20:02 UTC)