[HN Gopher] Can LLMs learn from a single example?
___________________________________________________________________
Can LLMs learn from a single example?
Author : jdkee
Score : 395 points
Date : 2023-09-06 00:40 UTC (19 hours ago)
(HTM) web link (www.fast.ai)
(TXT) w3m dump (www.fast.ai)
| spit2wind wrote:
| What are the axis labels on the graphs?
| jph00 wrote:
| Cross entropy loss vs batch number
| whimsicalism wrote:
| Was this not sort of the clear implication of the fact that most
| LLMs are currently only being trained with one epoch?
|
| ie. if they are only being trained from one epoch, there is clear
| overfitting concerns just by doing even a second pass in the
| data.
|
| It does seem somewhat contrary to the findings of this paper [0]
| that found that old data was as good as new for at least 4
| epochs.
|
| [0]: https://arxiv.org/abs/2305.16264
| computerex wrote:
| They are not being trained on only 1 epoch. They are trained for
| multiple epochs on high-quality data. Also, the Meta team showed
| with LLaMA that simply training on more tokens continues to
| reduce loss.
| danielmarkbruce wrote:
| I could be wrong, but I thought the llama 2 paper explicitly
| called out 1 epoch and that more than that caused over-
| fitting in their other experiments.
| whimsicalism wrote:
| You're not at all wrong :) I think a lot of people confuse
| the pre-training and fine-tuning runs because these are all
| novel concepts.
| whimsicalism wrote:
| If you divide the number of sentences trained on by the total
| number of sentences in its corpora, the number for most of
| the top LLMs will be far closer to ~1 than any other integer.
|
| > Also, the Meta team showed with LLaMA that simply training on
| more tokens continues to reduce loss.
|
| Can you source the specific claim you are talking about? More
| tokens to me generally means _new tokens_ unless you specify
| otherwise.
|
| From the paper: "We train for one epoch over the training
| data. In earlier experiments, we found that training longer
| can lead to over-fitting"
| danielmarkbruce wrote:
| Yes. Surely "more tokens" doesn't mean "more epochs".
| jph00 wrote:
| <deleted>
| whimsicalism wrote:
| That's the exact paper I linked in my comment :)
| fpgaminer wrote:
| > Was this not sort of the clear implication of the fact that
| most LLMs are currently only being trained with one epoch?
|
| Slight nit: Many public LLMs are trained for at least slightly
| over one epoch, and usually several epochs on particular
| subsets of the data (like wikipedia).
| whimsicalism wrote:
| Source? Maybe several epochs on some very small subsets, but
| my strong impression was that it was 1 epoch in the pre-
| training run for pretty much all of the top LLMs.
| fpgaminer wrote:
| Llama off the top of my head:
| https://arxiv.org/pdf/2302.13971.pdf
| danielmarkbruce wrote:
| Yup, thought a similar thing.
| Buttons840 wrote:
| Does anyone know if LLMs have been used to augment their own
| training data?
|
| I wonder what would happen if you trained an LLM on a little
| input but then had it generate a lot of synthetic input that was
| added back to the training data. I think of it as "dreaming".
| This seems like it would just add noise, but LLMs are able to
| improve their output by augmenting their own context (by
| "thinking out loud"), so maybe they can do the same with their
| own training data?
| jph00 wrote:
| Yes, a lot of recent research uses LLM outputs as training
| data, and it's been an extremely successful line of work.
| muxator wrote:
| It's interesting that this conclusion is the exact opposite of
| a sibling comment, which proposes that a small, human-curated
| corpus may be more effective than big, synthetic datasets.
| Buttons840 wrote:
| I have no "conclusion". I'm just wondering.
| rsrsrs86 wrote:
| You can find the answer by trying the following: generate
| random data according to a model, fit a linear regression (or
| any other model), sample from the fitted model, and add the
| samples to the training set.
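| A minimal sketch of that experiment with a linear model (the
| noise level and sample sizes here are arbitrary choices, not
| anything from the comment):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|
|     # True data-generating process: y = 2x + 1 + noise
|     x = rng.uniform(-1, 1, size=200)
|     y = 2 * x + 1 + rng.normal(scale=0.3, size=x.shape)
|
|     # Fit a line, then "dream" synthetic data from the fitted model
|     slope, intercept = np.polyfit(x, y, deg=1)
|     resid_std = np.std(y - (slope * x + intercept))
|     x_syn = rng.uniform(-1, 1, size=2000)
|     y_syn = slope * x_syn + intercept \
|             + rng.normal(scale=resid_std, size=x_syn.shape)
|
|     # Refit on real + synthetic data and compare the estimates
|     slope2, intercept2 = np.polyfit(np.concatenate([x, x_syn]),
|                                     np.concatenate([y, y_syn]),
|                                     deg=1)
|     print("original fit:", slope, intercept)
|     print("after adding synthetic data:", slope2, intercept2)
|
| The refit essentially reproduces the original estimates: sampling
| from your own fitted model adds no new information about the true
| process, it only reinforces what the model already believes.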
| fpgaminer wrote:
| That's effectively what RLHF is: a means for LLMs to self-train
| exclusively on their own output, using a small human-curated
| dataset as guidance as to what a "good" and "bad" output is.
| jerpint wrote:
| If this holds true, it would support the idea that much
| smaller, human-curated datasets will be of much higher value
| than synthetic datasets generated by LLMs.
| RhysU wrote:
| This is not surprising in the context of our wetware: "Jane
| sees Spot run. Run, Spot, Run."
| tomrod wrote:
| I assume there is a value metric that balances quantity with
| quality that may be exploitable in our mid-gains period of
| understanding the tech's behavior -- meaning there are potential
| gains from synthetic data. That said, I also expect no-free-lunch
| to kick in at some point, and synthetic data doesn't always pay
| attention to the data-generating process for outliers.
| rsrsrs86 wrote:
| You will find active learning interesting. It starts by
| assigning a value to each point in your domain, which it
| learns so as to match the expected gain in some performance
| metric.
|
| This metric can be learned, so it's okay if it's really hard
| to specify.
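| A common concrete instance is uncertainty sampling, where the
| "value" of an unlabeled point is approximated by how unsure the
| current model is about it. A rough sketch (the synthetic dataset
| and logistic model are placeholders):
|
|     import numpy as np
|     from sklearn.linear_model import LogisticRegression
|
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(1000, 5))
|     y = (X[:, 0] + X[:, 1] > 0).astype(int)
|
|     # Seed the labeled pool with a few examples of each class
|     pos = list(np.where(y == 1)[0][:5])
|     neg = list(np.where(y == 0)[0][:5])
|     labeled = pos + neg
|     unlabeled = [i for i in range(len(X)) if i not in set(labeled)]
|
|     model = LogisticRegression()
|     for _ in range(20):  # request 20 more labels, one at a time
|         model.fit(X[labeled], y[labeled])
|         probs = model.predict_proba(X[unlabeled])[:, 1]
|         # Pick the point the model is least sure about
|         pick = int(np.argmin(np.abs(probs - 0.5)))
|         labeled.append(unlabeled.pop(pick))
|
|     print("accuracy after active labeling:", model.score(X, y))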
| rsrsrs86 wrote:
| Whichever has the most information wins. When the information
| has structure you can heavily exploit it for generating
| synthetic data. For this I point you to Apple Sim, a
| repository of 3D models for interiors. You can generate many
| layers of information by controlling the renderer and then use
| it on real photos. That's done all over in images, so vector
| spaces are pretty natural for embeddings. You don't need to add
| much structure, algebraically speaking.
|
| If your domain is heavily algebraic, you might even be able to
| generate correct examples arbitrarily, which is a situation I
| recommend anyone to be in.
| cuuupid wrote:
| Google reached that conclusion ~2 years ago but has yet to show
| significant results; the key word above being "curated".
| fpgaminer wrote:
| I doubt it. If anything, ULMFiT era AI has finally killed the
| need for human curated data. ChatGPT 4 is already being used as
| an oracle model that everyday AI models are trained off of. A
| truly gargantuan oracle model will obviate all but the smallest
| of human input.
| tellarin wrote:
| GPT4 relies heavily on human curated data. Both for specific
| domains and for instruction following. Any new model that
| tries to go beyond it will also likely rely on such data.
| breadsniffer wrote:
| Yeah it's been known that OpenAI hires domain experts. If
| anything, they augment that high quality data rather than
| just starting from bare bones synthetic data.
| Solvency wrote:
| Why are we only able to theorize about these things? Why can't
| we know how and why they work?
| Palmik wrote:
| If you find this interesting, also check out "Mass Editing
| Memory in a Transformer" [1] and "Locating and Editing Factual
| Associations in GPT" [2].
|
| [1] https://memit.baulab.info/ [2] https://rome.baulab.info/
| Nevermark wrote:
| Do people really use the phrase "over confident" in this way? It
| is very misleading.
|
| What is happening is called "over fitting".
|
| Think of data as dots. A model that generalizes well will create
| as simple a function as possible that fits the training data
| points pretty well.
|
| But keep training, and the parameters will often get very large,
| creating huge up-and-down swings in the function curve, far
| outside the actual data values, in order to pass through the
| training data points exactly.
|
| So it's technically a better fit to the training data, but it is
| now a crazy function, often producing extreme outputs on new
| data. Practically a worst case lack of generalization.
|
| _Thus, "over fitting"._
|
| And "over fitting" isn't the same as "memorization". Large models
| can memorize small datasets without over fitting. They have so
| many parameters, it takes few changes to fit the training data.
| At which time, learning stops at an otherwise random function,
| and generalization is never achieved.
|
| _That case is called "underdetermined"._
|
| There are models that produce both outputs and confidences
| (essentially predict their own error standard deviation per
| output, based on the input).
|
| _So "over confident" can mean a model that predicted high
| confidence (low error deviation) inaccurately._
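| A toy version of that picture, fitting polynomials to a few noisy
| points (the degrees and noise level are arbitrary):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     x_train = np.linspace(0, 1, 15)
|     y_train = np.sin(2 * np.pi * x_train) \
|               + rng.normal(scale=0.2, size=x_train.shape)
|     x_test = np.linspace(0, 1, 200)
|     y_test = np.sin(2 * np.pi * x_test)
|
|     for degree in (3, 14):
|         coeffs = np.polyfit(x_train, y_train, degree)
|         train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
|         test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
|         print(f"degree {degree}: train {train_mse:.4f}, "
|               f"test {test_mse:.4f}")
|
| The degree-14 fit passes almost exactly through the 15 training
| points but swings wildly between them: lower training error, much
| worse test error. That is the "over fitting" picture above.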
| jph00 wrote:
| No, I don't use the term overfitting for a model where the
| accuracy is getting better. I think it's misleading.
| mjburgess wrote:
| Accuracy is very rarely a useful metric. It's more an
| engineering metric than something a user would ever care
| about.
|
| What users want is to have their own credences properly
| calibrated by engaging with some system. From a physics
| textbook, they want a systematic presentation of ideas which
| allows them to build intuitions etc.
|
| It's important to formulate the actual goal of the system,
| rather than just the engineer's goal (consider eg., "width of
| pipes" vs., "clean running water").
|
| In the case of statistical AI systems, the goal is often best
| formulated in terms of the confidences of the system, _not_
| its output, since its output accuracy is kinda nonlinear and
| discontinuous in those confidences.
|
| So from a statistical AI Q&A system we don't want The Answer;
| we want the system to have expert-like confidences over
| possible answers.
|
| Of course, as soon as you start formulating these metrics,
| all the SotA 99%+ accuracy hype evaporates, since most of
| these systems have terrible confidence distributions.
|
| Consider, e.g., ChatGPT, whose answers are often plausibly
| accurate (they count _as_ an answer) but just repeat some
| Silicon Valley hype in a way an expert wouldn't. ChatGPT
| rarely has the careful scepticism of an expert, rarely
| presents ideas in an even-handed way, rarely mentions the
| opposite.
|
| It makes generating reference materials on areas with expert
| disagreement quite dangerous. ChatGPT presents the non-expert
| credence distribution. (And indeed, it always does, since it
| just models (Q,A) frequencies, which are not truth-apt.)
| DougBTX wrote:
| This is mixing two meanings of confidence, which could lead
| to confusion. The OP is using confidence to describe how
| high the per-token probability scores are, while you are
| talking about the confidence expressed in the tone of voice
| of the language generated by the model. Really those are
| orthogonal issues. (E.g., a model could predict with high
| probability that an output should be "I don't know".)
| mjburgess wrote:
| It seems like I'm mixing them, but I'm not.
|
| I'm saying that, as a matter of fact, ChatGPT should have
| different confidences in propositions. My issue isn't the
| tone of voice; my issue is that the content of what it's
| saying is wrong with respect to what we care about, i.e.,
| expert credences (/confidences) in the claims it's generating.
|
| It can "express confidently" scepticism; it does not.
| That's the issue.
|
| In my language above I was mostly using "credence" to talk
| about the strength of the mental state of belief, and
| "confidence" to talk about the model of that used in
| statistical AI.
| Nevermark wrote:
| Over fitting literally means what it says: _fitting the
| training data too well_ to maintain a well-formed function.
|
| This is many decades old terminology for a well established
| effect that occurs for all curve fitting, function
| approximation, parameter optimizing, and model training
| algorithms.
|
| You can Google it with no other context: "over fitting". [0]
|
| "Confidence" isn't its name and its meaning has nothing to do
| with the effect.
|
| Nothing wrong with making up terminology for new effects, but
| this one is an oldie.
|
| [0] https://www.google.com/search?q=over+fitting&ie=UTF-8&oe=
| UTF...
| ActivePattern wrote:
| If we are considering the function to be the neural network
| with an argmax applied to the output probabilities, it's not
| overfitting at all. Its classification accuracy over unseen
| data (validation set) continues to improve.
|
| The issue here is one of calibration:
| https://en.m.wikipedia.org/wiki/Calibration_(statistics). That
| is, the output probabilities of the neural network do not
| reflect the true (observed) probabilities. If it is
| systematically underestimating the probabilities, it is termed
| "underconfident", and if overestimating the probabilities,
| "overconfident".
|
| Note that in these cases, it may still be improving as a
| classifier on unseen data, while still showing higher
| validation loss as calibration degrades.
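| For concreteness, one standard way to quantify that is expected
| calibration error: bucket predictions by confidence and compare
| average confidence with observed accuracy in each bucket. A rough
| sketch:
|
|     import numpy as np
|
|     def expected_calibration_error(confidence, correct, n_bins=10):
|         """confidence: predicted probability of the chosen class
|            correct:    1 if the argmax prediction was right, else 0"""
|         confidence = np.asarray(confidence)
|         correct = np.asarray(correct, dtype=float)
|         edges = np.linspace(0.0, 1.0, n_bins + 1)
|         ece = 0.0
|         for lo, hi in zip(edges[:-1], edges[1:]):
|             in_bin = (confidence > lo) & (confidence <= hi)
|             if in_bin.any():
|                 gap = abs(confidence[in_bin].mean()
|                           - correct[in_bin].mean())
|                 ece += in_bin.mean() * gap
|         return ece
|
|     # An overconfident model: right 70% of the time, ~95% "sure"
|     conf = np.full(1000, 0.95)
|     hits = np.random.default_rng(0).random(1000) < 0.7
|     print(expected_calibration_error(conf, hits))  # roughly 0.25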
| hashhar wrote:
| This was an awesome explainer - thanks a lot. It helps clear up
| a lot of jargon I keep hearing in very precise ways.
| yashap wrote:
| I do think it's a form of overfitting - loss on the training
| set improved while loss on the validation set got worse.
| However, it's not the common form of overfitting, where
| _accuracy_ on the validation set gets worse. In this case,
| accuracy on the validation data set continued to improve. But
| when it was wrong, it gave a higher confidence in its wrong
| answer than before. e.g. before it may have incorrectly thought
| the answer was X, with 60% confidence, now it still thinks the
| answer is X, but with higher confidence, say 70%.
|
| I do think it's a form of overfitting, but a weird one.
| Overconfidence seems like a good, more specific term to me.
| PaulHoule wrote:
| I've observed the same phenomenon when fine-tuning LLMs and I
| thought it was pretty strange, but so far as I could tell other
| people were observing the same thing but mostly not commenting on
| it. The conclusion I'd draw is that you're not going to benefit
| greatly from adding more data when your model behaves like this.
|
| Overconfidence bugs me because if you want to turn predictions
| into decisions and actions you have to be calibrated. I've found
| that some of these models that look like they are over fitting on
| loss are actually still improving on AUC (which matters to me
| more than accuracy), and I can put a calibrator after the model
| to get the results I want.
|
| (Still, for my current problem, which has noisy labels, I find
| embedding + classical ML performs as well as fine-tuning, takes a
| fraction of the time, _and_ clearly shows more benefit from being
| trained on more examples than FT does. If I was going to do more
| model engineering on this problem I would probably resort to
| "stacking".)
| rona123456789 wrote:
| [flagged]
| anoncow wrote:
| That is like saying can energy be created anew.
| OhNoNotAgain_99 wrote:
| [dead]
| imjonse wrote:
| I found the title misleading.
|
| Isn't learning from a single example desirable, while memorizing
| is undesirable in the context of training? The former is the goal
| we're aiming for in order to match how animals learn, while the
| latter is a failure mode that happens often. The article shows a
| case of unexplained memorizing, not of learning, right?
| justanotherjoe wrote:
| Isn't it highly dependent on what your one epoch of data is? If
| there are a lot of repetitions of similar concepts in there, can
| you really say it's learning from one example?
| jph00 wrote:
| If it were due to repetition, there wouldn't be those sudden
| cliffs after each epoch.
| rafaelero wrote:
| That's intriguing. But what I want to see is if that one example
| can change the whole web of knowledge previously established. So,
| for example, if we finetune the model with a sentence like
| "Scientists discovered that a type of antigen can make a host
| immune to HIV" will it then be able to infer that "mRNA vaccines
| are a valid preventive approach to AIDS since they may be able to
| express a type of resistance known to make hosts immune to HIV"?
| Filligree wrote:
| That would be impressive and surprising. Humans aren't capable
| of that.
| rafaelero wrote:
| > Humans aren't capable of that.
|
| Why do you say so? We casually call it "connecting the dots".
| It's like during the Oppenheimer movie when after the first
| demonstration of Uranium splitting people thought "oh, we can
| do a bomb with that".
| Filligree wrote:
| We need to pull up both elements simultaneously and
| correlate them; it doesn't happen automatically just because
| we learned that "a type of antigen can make a host immune to
| HIV".
|
| Yes, ideally the former will associate well enough with the
| latter that, once you find some reason to think about mRNA,
| it will automatically drag up the thing you learned earlier
| and _then_ you'll update. But it doesn't happen by itself,
| and sometimes it doesn't happen at all. Most people contain
| significant inconsistencies -- I would dare to suggest most
| likely everyone.
| mrjin wrote:
| No understanding, no learning! Period.
| itissid wrote:
| Could this be an artifact of just not reshuffling the dataset,
| and of the state of the weights at that point? What if you
| reversed the dataset in the second epoch? Under the memorization
| hypothesis, the training loss would _not plummet_ if the model
| has not _learnt_ anything _during_ the epoch after the first
| 10%. Yes?
|
| The report mentions there is no reshuffling: > We're not re-
| shuffling the dataset at the start of the epoch, so those first
| batches of the second epoch are when the learning rate was still
| warming up.
| klft wrote:
| GPT-4 (I haven't really tested other models) is surprisingly
| adept at "learning" from examples provided as part of the prompt.
| This could be due to the same underlying mechanism.
| bathtub365 wrote:
| I've found the opposite in trying to get it to play Wordle.
| It'll repeatedly forget things it's seemingly learned within
| the same session, all the while confident in its correctness.
| ben_w wrote:
| What approach are you using to get the LLM to split words
| into individual letters?
| jacquesm wrote:
| LLMs are trained on 'tokens' derived from 'words' and 'text'.
| Even though there are tokens that are just one letter, the
| bulk are a rough approximation of syllables, as though you're
| trying to create a dictionary to be used for data
| compression.
|
| It might be more effective to try to play 'tokendle' before
| trying to play 'wordle'.
| miven wrote:
| Do you know whether LLMs grasp the equivalence of a word
| expressed as one whole-word token and as a series of single
| character tokens that spell out the same word? I'm curious
| if modifying the way some input words are split into tokens
| could be useful for letter-by-letter reasoning like in
| Wordle.
|
| Or would an LLM get confused if we were to alter the way
| the tokenization of the input text is done, since it
| probably never encountered other token-"spellings" of the
| same word?
| jacquesm wrote:
| From what I understand it's anything goes: it could be
| letters, or it could be a whole word or even a sentence
| fragment or a concept ('The United States of America').
| Think of it as the dictionary for a compression algorithm
| and you wouldn't be too far off.
|
| https://www.geeksforgeeks.org/lzw-lempel-ziv-welch-
| compressi...
|
| For 'code table' substitute 'token table'.
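| For example, with OpenAI's tiktoken library (assuming the
| cl100k_base encoding; other models use different token tables):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     for word in ["crane", "slate", "tokendle"]:
|         ids = enc.encode(word)
|         pieces = [enc.decode([i]) for i in ids]
|         print(word, "->", pieces)
|
| Common words often map to a single token, while rarer strings get
| split into multi-character chunks rather than individual letters,
| which is part of why letter-level games like Wordle are awkward
| for these models.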
| cypress66 wrote:
| Not really. That's called few-shot learning.
|
| It's basically unrelated to what happens during training, which
| uses gradients.
| calrain wrote:
| Probably unrelated, but I tried to get ChatGPT to write me some
| code to programmatically control the details of a column filter
| in an Excel spreadsheet in PowerShell.
|
| Nothing it tried worked, it got close, but it didn't work.
|
| Finally I found some C# code that fixed the problem, and I pasted
| that code into ChatGPT, asked it to read it, and then fix the
| problem in PowerShell.
|
| It said it understood the solution, updated the script, and it
| worked perfectly.
|
| For some reason that behavior was pretty eye opening. Providing
| material in the question that it wasn't trained on made it solve
| it.
|
| It's understandable how it did it from its language training; it
| just felt very cool that LLMs can do that.
| e12e wrote:
| Interesting anecdote. I think there's a common theme with
| current LLMs, that people focus unreasonably much on "knowledge
| retrieval" _from the models_ (1) and under-hype and under-
| appreciate the "language model" part.
|
| These things are really easy to anthropomorphize, partly because
| they are good at "talking" and "articulating". So good that we
| tend to just accept that magical, enormous feat of statistical
| engineering as a trivial building block. But it's a brick made
| of gold.
|
| Translating (from natural language to code, from text to audio,
| from image to image, one natural language to another), editing,
| summarizing, expanding/extrapolating is what these models _do_.
|
| The inherent "knowledge" is just context.
|
| (1) Vector embedding is in my view a little different - it's a
| form of semantic cataloging (akin to Dewey decimal) - and
| certainly enables search.
|
| But "data retrieval" (who was us president in 1984) directly
| from the models isn't really all that interesting IMNHO.
| YeGoblynQueenne wrote:
| "Can LLMs learn from a single example"?
|
| Sure. Neural nets in general can: after they've been trained on
| billions of examples first.
|
| It really helps if they've previously seen the same or a similar
| "single example". And, let's be fair, the larger the training
| data, the higher the chance that they have.
|
| >> This seemed, at first, quite impossible. It would imply that
| the model was learning to recognise inputs from just one or two
| examples
|
| To be more precise: the article is talking about fine-tuning a
| pre-trained LLM, so that's a-few-billion-plus-one-or-two
| examples.
|
| Btw, what model was that? The article doesn't say.
| tomaskafka wrote:
| Isn't this what people would do? I'd definitely update my
| knowledge after a single failed test question, if it was
| something I'd care about, and I discovered my previous model of
| reality was wrong.
| [deleted]
| latexr wrote:
| > Isn't this what people would do?
|
| It is not: https://en.wikipedia.org/wiki/Belief_perseverance
|
| > I'd definitely update my knowledge after a single failed test
| question
|
| Maybe you would, maybe you wouldn't. There are several
| psychological experiments which show people don't act the way
| they say they "definitely" would when confronted with the
| situation. Quite a few examples in the movie "Experimenter":
| https://en.wikipedia.org/wiki/Experimenter_(film)
|
| > if it was something I'd care about, and I discovered my
| previous model of reality was wrong.
|
| Those two ifs are doing a ton of heavy lifting. LLMs neither
| "care" nor "discover". It's not like you're giving it a new
| contradicting piece of information and it's going "interesting,
| let me research on that and update my model of reality if after
| careful consideration I find your assertion to be true". It's
| closer to having someone who'll just accept everything you say
| and repeat it.
| mixtieboo wrote:
| [dead]
| deyiao wrote:
| I often observe similar phenomena in CNN-related research, which
| indicates that a model can indeed learn from a single example.
| Sadly, this requires the dataset to be randomly distributed, and
| in real-world applications new data does not meet this
| requirement.
| jph00 wrote:
| Thank you for posting this to HN! :D
|
| I'm one of the authors of this post -- Johno & I found it really
| interesting looking into this curious issue of rapid memorization
| from LLMs. I've been working with neural nets for 30 years, and
| fine-tuning language models since 2017, and this behavior is most
| surprising to me! Other folks have seen it in LLMs too, although
| I haven't seen an analysis of this kind before (though we might
| have missed something).
|
| Let me know if you have any questions or thoughts.
| armatav wrote:
| I wonder if you could perform inference, highlight the weights
| that were most used during that inference, grab the hottest
| 20%, freeze the rest of the model, and perform backpropagation
| solely on those to allow for more of this sort of rapid
| memorization behavior closer to the end user.
|
| Like online learning in a way. But you do it during inference
| time.
|
| There's no way the entire model actually needs to be touched
| for something like "sky color is:" and "blue".
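| A rough PyTorch sketch of that idea, treating "most used" as
| "largest gradient magnitude" on one pass (the toy model and the
| 20% threshold are placeholders):
|
|     import torch
|     import torch.nn as nn
|
|     model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
|                           nn.Linear(64, 4))
|     x, target = torch.randn(8, 16), torch.randint(0, 4, (8,))
|
|     # One forward/backward pass to see which weights "light up"
|     nn.functional.cross_entropy(model(x), target).backward()
|     grads = torch.cat([p.grad.abs().flatten()
|                        for p in model.parameters()])
|     threshold = torch.quantile(grads, 0.8)  # keep the hottest 20%
|     masks = [(p.grad.abs() >= threshold).float()
|              for p in model.parameters()]
|
|     # Subsequent updates only touch the masked-in weights
|     opt = torch.optim.SGD(model.parameters(), lr=1e-2)
|     for _ in range(10):
|         opt.zero_grad()
|         nn.functional.cross_entropy(model(x), target).backward()
|         for p, m in zip(model.parameters(), masks):
|             p.grad *= m
|         opt.step()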
| armatav wrote:
| In fact I bet you could update like one or two neurons for
| certain concepts, and then transplant those neurons to
| another LLM to give it some idea of it. Like a literal brain
| transplant but for concepts.
| armatav wrote:
| And you could identify these neurons using dropout
| techniques and repetitively querying the model against
| them.
|
| Drop a set of neurons and there's no change? Probably
| doesn't contain the "sky color" concept.
|
| Drop a set of neurons and the model freaks out, definitely
| conceptual neurons.
|
| Rinse and repeat to find the distilled pattern across all
| the neurons.
|
| You could train an LLM against the neuron graph to do this
| for you.
| niemandhier wrote:
| Many neurons are polysemantic, which makes interventions
| like the one proposed difficult.
| OhNoNotAgain_99 wrote:
| [dead]
| Angostura wrote:
| As a lay-person, can I just say I appreciated the accessible
| writing style. Thanks!
| og_kalu wrote:
| In the palm-e paper (https://palm-e.github.io/), when they try
| to unfreeze and train the LLM on new image data only, there is
| expectedly a lot of CF on NLP tasks but very interestingly, the
| effect diminishes greatly with the scale of the LLM prior to
| training.
|
| From an average -87.3% performance drop on the 12B model to
| -61.6% on the 84B model then just -3.9% on the 562B model. Felt
| like we were just shy of an insight breakthrough here.
|
| Is avoiding CF potentially just a matter of sheer scale ?
| jph00 wrote:
| I think our experiments actually _don't_ show catastrophic
| forgetting! The accuracy does _not_ decrease as loss gets
| worse -- it's simply getting over-confident.
|
| So I'm not even sure we're showing any problem to solve here
| -- it might be more of an opportunity, in fact!
| samstave wrote:
| Please ELI5 catastrophic forgetting.
|
| I assume this means losing all the energy and compute invested
| for a model to know, perform, and infer on inputs already
| indexed(?) (What is the proper term here?)
|
| But is this the premise - you lose all prior investment of
| resources into a (I don't know the term for an AI archetype of
| knowledge)? {btw, I love the embedded etymology of knowledge:
|
| "the ledger of things that we KNOW"}
| tarvaina wrote:
| Suppose we have trained a model to perform a certain set
| of tasks. Later we would want to teach it a new task.
| Catastrophic forgetting means that teaching it a new task
| makes it unlearn some or all of its earlier tasks.
|
| It occurs because training changes the weights of the
| model. The earlier set of weights was good for the
| previous tasks. The new set of weights is only good for
| the new task. Usually special care must be taken to
| overcome catastrophic forgetting.
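| A toy demonstration of the effect (the two tasks and the network
| size are arbitrary):
|
|     import torch
|     import torch.nn as nn
|
|     torch.manual_seed(0)
|     model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(),
|                           nn.Linear(32, 2))
|     opt = torch.optim.Adam(model.parameters(), lr=1e-2)
|
|     def make_task(feature):
|         x = torch.randn(512, 2)
|         y = (x[:, feature] > 0).long()  # task A uses feature 0, B uses 1
|         return x, y
|
|     def accuracy(x, y):
|         return (model(x).argmax(dim=1) == y).float().mean().item()
|
|     task_a, task_b = make_task(0), make_task(1)
|     for x, y in (task_a, task_b):  # train on A, then on B
|         for _ in range(300):
|             opt.zero_grad()
|             nn.functional.cross_entropy(model(x), y).backward()
|             opt.step()
|         print("A:", accuracy(*task_a), "B:", accuracy(*task_b))
|
| After the second training loop, accuracy on task A typically
| collapses towards chance: the weights that solved A have been
| overwritten to solve B, with nothing in the loss reminding the
| model about A.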
| antupis wrote:
| I think in some cases CF would even be good, e.g. if you want
| an LLM that produces only valid JSON data as output.
| josephg wrote:
| Yeah, this is essentially how finetuned models work. If
| you fine tune stablediffusion to produce anime images, it
| might forget how to produce images in any other style.
| But it will become much better at anime images than the
| base model. If anime images are the art style you're
| after, this is a good trade. Same with fine tuning LLMs
| for SQL or whatever.
| samstave wrote:
| Can it be taught "contextual matrices", whereby it builds
| a new layer of constructs but preserves the others, then
| cross-learns between parameters or something? (Sorry for
| my poor lexicon, I'm wet-learning :-)
|
| But imagine all LLMs in a macro view, like a sponge entity.
| tinco wrote:
| We wouldn't know how to construct those matrices because
| we don't know where in the layers what knowledge is
| represented. One thing that helps a little bit is
| freezing the lower layers, so at least the model won't
| forget its most fundamental knowledge.
|
| Note that the only reason that things are
| catastrophically forgotten is that the original examples
| are not shown again. If the model learns in a single
| shot, there might simply be no time to show both the old
| and the new examples. I don't think it would have a
| significant effect, or else we'd have known about this
| effect a lot sooner (i.e. the training of these LLMs would
| get less effective from a certain point).
| jacquesm wrote:
| You could simulate this by selectively locking and
| unlocking 'banks' of weights from a larger model to keep
| the influence there during training and to avoid losing
| them. Sort of a selective write-protect.
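| In PyTorch that kind of selective write-protect is just toggling
| requires_grad on whichever parameter groups ("banks") you want to
| lock (a sketch with an arbitrary toy model):
|
|     import torch.nn as nn
|
|     model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
|                           nn.Linear(64, 64), nn.ReLU(),
|                           nn.Linear(64, 4))
|
|     def set_bank(module, trainable):
|         for p in module.parameters():
|             p.requires_grad = trainable  # lock or unlock this bank
|
|     set_bank(model[0], False)  # write-protect the lowest layer
|     set_bank(model[2], False)  # write-protect the middle layer
|     set_bank(model[4], True)   # keep only the head trainable
|
| An optimizer built over just the unlocked parameters (e.g.
| [p for p in model.parameters() if p.requires_grad]) then leaves
| the protected banks untouched during fine-tuning.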
| Yenrabbit wrote:
| It does start getting worse at some point right?
| jph00 wrote:
| I'm sure eventually it would, but we haven't gotten to
| that point yet in our training.
| minihat wrote:
| Cross-entropy loss can start getting worse due to the
| model becoming less calibrated, even as the
| classification accuracy continues to improve. I first
| heard that here: https://arxiv.org/abs/1706.04599
|
| Is this 'overconfidence' the leading explanation as to
| why LLMs continue to show qualitative improvement even
| after their test loss levels off?
| alekseiprokopev wrote:
| Is it possible to somehow modify the sampling from the
| model to account for that?
| 3abiton wrote:
| Awesome investigative work. What's the opportunity though? I
| don't get it.
| mirekrusin wrote:
| It looks like something clicks in place.
| jph00 wrote:
| We don't know. It's a report of some early experimental
| results. Our hope is that it will stimulate discussion
| and further research and development.
| vvrm wrote:
| I have been training a natural intelligence model for 3
| years now and she still doesn't get nuance. Things are
| either good or bad in her book: nothing in between. My plan
| is to let her train with binary good/bad labels till the
| age of 5 and then start smoothing the labels after that.
| Wonder if that works for your AI.
| tudorw wrote:
| In my mind I've built an 'emotional engine' to add nuance
| to models' understanding: take something like Plutchik's
| wheel of emotions and create a high-quality multi-modal
| dataset based on that structure. Given that our current
| technology takes inspiration from the brain, it would
| seem like having discrete models specialising in
| particular aspects of 'intelligence', which are then
| organised into a mixture of experts, is an interesting
| area to explore, and perhaps more accessible as smaller
| models require fewer resources.
| kordlessagain wrote:
| I have code stubbed out for this in mitta.us. It has 9
| states, based on the Plutchik wheel, with emojis for the
| states. States drive temp and a few other things and drop
| the state into prompts.
| TeMPOraL wrote:
| Related trick: I found that training two Natural
| Intelligence (NI) models in parallel, and having them
| train each other for most of the time, leads to
| significant leaps in capabilities. Notably, when one NI
| picks up a skill, it often results in spontaneous
| transfer learning - the other NI picks that skill up very
| quickly, _much faster_ than it would through direct
| training.
|
| This scales well, too. There are facilities that provide
| services of co-hosting and cross-training up to ~two
| dozen NI models in a shared environment - in my
| experience, this provides similar training benefits to
| running multiple NIs on your own, at fraction of the
| cost.
|
| (The facilities are exploiting some neat economies of
| scale. Talking to some employees, I learned that the
| transfer learning and co-activation are embarrassingly
| scalable: if you get two-three NIs to pick up a thing,
| all the rest immediately follow.)
| vineyardmike wrote:
| This took a couple reads, but it's funny. The bad news is
| that I've been training mine for 17 years and nuance is
| still something that needs more training.
| theonlybutlet wrote:
| What does CF stand for?
| d4rkp4ttern wrote:
| Catastrophic Forgetting
| theonlybutlet wrote:
| Thank you
| Yenrabbit wrote:
| Ooh interesting, thanks for sharing!
| startupsfail wrote:
| Interesting, but you should show the example as concrete
| evidence, rather than hand-waving arguments based on loss-curve
| "evidence".
| ilaksh wrote:
| What is the base model? I think that was a big oversight to
| leave that out and attribute this to LLMs in general.
|
| Although I am not a researcher, it is obvious to me that not
| all LLMs are the same architecture, and I think that even ones
| with similar architecture can evolve to functionally operate
| quite differently on the same inputs.
|
| Yet most articles seem to refer to LLMs as if they were just
| one architecture and model.
| n9Mtq4 wrote:
| Very cool. This came up in a huggingface transformers issue a
| while ago and we also determined memorization to be the likely
| reason. It's nice to see someone else reach the same
| conclusion.
|
| https://github.com/huggingface/transformers/issues/18730
| jwuphysics wrote:
| Hi Jeremy, always a fan of your work! Just a technical note
| since it falls under my domain of expertise (astronomy) -- the
| example about MOND described here should actually have choice
| (E) as the correct answer!
| jph00 wrote:
| As it happens I dug into this question in some detail a
| couple of weeks ago when analysing the dataset, including
| carefully reading the wikipedia page which the question comes
| from. AFAICT both D and E are kinda correct, but E isn't
| quite right because MOND doesn't entirely "eliminate the
| observed missing baryonic mass", but rather just reduces it
| from a factor of 10 to 2.
|
| Is that not correct? (Of course I fully accept your expertise
| in this matter and this is just my curiosity, not trying to
| tell you you're wrong!)
| jwuphysics wrote:
| Fascinating! I dug into the Wikipedia article, which cites
| a Scholarpedia article; the LLM answer seems to originate
| from a reference to this sentence [1]:
|
| > So, MOND reduces the discrepancy in clusters at these
| radii to only a factor of ~2-3 (Sanders, 1999; Ettori, et
| al., 2019)
|
| So I think you're right, and today I learned something! I
| also checked if Stacy McGaugh had weighed in on this
| particular subject, and it seemed like there is still an
| issue for clusters [2], although interestingly the issue
| isn't mentioned in his latest blog post that summarizes the
| strengths/weaknesses with MOND [3]. Anyway, thanks for
| humoring me for a bit.
|
| [1] http://www.scholarpedia.org/article/The_MOND_paradigm_o
| f_mod... [2] https://tritonstation.com/2021/02/05/the-fat-
| one-a-test-of-s... [3]
| https://tritonstation.com/2023/06/27/checking-in-on-
| troubles...
| mannykannot wrote:
| I believe neither MOND nor Cold Dark Matter are
| theories exactly, so much as they are schemata for classes
| of theories. Both are struggling to produce a verified
| theory that accounts for all observations, and while the
| latter is much more widely regarded as likely being
| correct, MOND has not been conclusively falsified to
| everyone's satisfaction. I would guess that there are, at
| least in principle, MOND theories which work for galaxy
| clusters but have residual discrepancies when applied to
| galaxies.
|
| If this is so, then a multi-choice question which conflates
| one particular MOND theory for MOND itself, and which
| depends on the specifics of that particular theory for
| selecting the 'correct' answer, is problematic: for one
| thing, it may make selecting the 'correct' answer more
| difficult for a student who has specific knowledge about
| the topic. This is just one of several problems with multi-
| choice questions, though, fortunately, it does not seem to
| have any bearing on the very interesting phenomenon you
| have discovered.
| jwuphysics wrote:
| In terms of the actual article -- really nice finding. Or I
| guess, nice set of experiments to decipher what lots of LLM
| researchers have been finding!
|
| I've noticed somewhat similar behavior while training graph
| neural networks to model physical systems, except that it
| takes way longer than a single epoch to get there. Of course,
| there's no pretraining involved with my GNNs, but the models
| do have very constrained representations, so once they start
| to figure out how to represent the physics at hand, the loss
| plummets dramatically.
| ScoutOrgo wrote:
| Hey Jeremy, it seems like you could calculate exactly how much
| a model learns in a single step by calculating the loss for a
| batch a second time (with no_grad) after the loss is calculated
| the first time and gradients are updated. This seems like it
| could produce interesting outputs when graphing the difference
| of first and second losses at the batch or observation/question
| level.
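| A sketch of that measurement in PyTorch (model, batch, loss_fn
| and optimizer are whatever you happen to be training; nothing
| here is specific to the article's setup):
|
|     import torch
|
|     def step_and_measure(model, batch, loss_fn, optimizer):
|         """Return the loss on a batch before and after one update."""
|         inputs, targets = batch
|
|         optimizer.zero_grad()
|         first = loss_fn(model(inputs), targets)
|         first.backward()
|         optimizer.step()
|
|         with torch.no_grad():  # re-score the very same batch
|             second = loss_fn(model(inputs), targets)
|
|         # The gap measures how much was "learned" from this one batch
|         return first.item(), second.item()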
| azg123 wrote:
| Super interesting! Another area where I've seen these types of
| loss curves is recommendation models:
| https://arxiv.org/pdf/2209.06053.pdf
| SubiculumCode wrote:
| Does this mean it is now computationally efficient to have the
| model learn/memorize information on the fly, say the current chat
| context, as part of the model weights? One-shot encoding
| (something the hippocampus is very good at) allows us to build
| experiences into retrievable memories tied into semantic concepts
| we've previously learned; in fact it gets better the richer
| our semantic conceptualization of events becomes from childhood
| into adulthood.
|
| If memorization of events in an LLM is accelerated because of
| these deep semantic frameworks, then does this provide a path
| towards long context windows?
| quickthrower2 wrote:
| Beginner here, so just musing:
|
| I like the idea. You would need your own mutable copy of the
| model, which is usually huge. And you need to backprop so there
| is a bit more computation. It might be doable for a local model
| that is smaller than GPT3.5/4.
|
| You also need to decide what is worth memorizing long term vs
| short term.
| pests wrote:
| > own mutable copy of the model, which is usually huge
|
| It could just be the diff against the main model or similar.
| quickthrower2 wrote:
| But if you have say 50bn weights, and you run backprop, you
| are going to update most of the weights (except the dropout
| ones, but which ones drop out changes on every token I
| think). This means you need 50bn deltas. It might compress,
| but if you do then you need extra compute to do that.
| jacquesm wrote:
| You would do dropout on every _epoch_ of training, not on
| every _token_.
| SubiculumCode wrote:
| Coming back to this. LoRA training is only on the attention
| layers, and this was sufficient for memorization, per the
| article. So we wouldn't update all the model's weights in
| some kind of constant-context one-shot learning scheme.
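| For reference, the LoRA idea in miniature: freeze the original
| weight and learn a small low-rank update on top of it (the rank
| and initialisation scale here are arbitrary):
|
|     import torch
|     import torch.nn as nn
|
|     class LoRALinear(nn.Module):
|         def __init__(self, base: nn.Linear, rank: int = 8):
|             super().__init__()
|             self.base = base
|             self.base.weight.requires_grad = False  # frozen weight
|             self.A = nn.Parameter(
|                 torch.randn(rank, base.in_features) * 0.01)
|             self.B = nn.Parameter(
|                 torch.zeros(base.out_features, rank))
|
|         def forward(self, x):
|             # Pretrained projection plus the low-rank delta B @ A
|             return self.base(x) + x @ (self.B @ self.A).T
|
| Because B starts at zero, the wrapped layer initially behaves
| exactly like the frozen original, and only the tiny A and B
| matrices receive updates during fine-tuning.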
| warkdarrior wrote:
| Maybe, but there are a lot of unknowns. Does the "memorization
| on the fly" come with catastrophic forgetting of other
| information? How does one control for memorizing recent stuff
| vs. remembering older stuff?
| bjornsing wrote:
| I'm no expert on LLMs, but I don't find this super surprising
| from a general ML point of view:
|
| You have a generative model with _billions of parameters_ that
| already assigns some probability mass to your (fine-tuning)
| samples. Now you compute a gradient that increases that
| probability mass, and take a step in the gradient's direction.
| Essentially the OP is surprised that this significantly increases
| the probability mass of the samples under the model.
|
| I'm not very surprised. The generative model is enormously over-
| parameterized and already assigns some probability mass to the
| (fine-tuning) samples. It would be surprising to me if there
| _wasn't_ a direction in this billion-dimensional parameter space
| that rapidly increases the probability of the relatively few
| samples.
| danielmarkbruce wrote:
| I had the same thought. This was very unsurprising. I couldn't
| tell if that made me the idiot here.
| fpgaminer wrote:
| I see similar loss curves when training ViTs (from scratch),
| which has always bothered me but I had bigger concerns so never
| delved too deep into it. The only difference is that I see the
| training loss go _up_ during each epoch. The cliffs between
| epochs are large enough that training loss goes down overall and
| validation loss keeps going down the whole time as well. The
| model gets close-ish to SoTA so I guess it's "normal".
|
| I haven't trained convnets at this scale so I'm not sure if
| similar behavior has been seen there, but you'd think someone
| would have mentioned it at some point. So perhaps these strange
| loss curves are a feature of Transformer based models in
| particular?
| jph00 wrote:
| Oh wow yeah - I've also seen other people's training loss
| curves like that, going up during each epoch and then jumping
| down at the end of the epoch. I've never experienced that
| myself, and have no idea what's causing it!
| lIIllIIllIIllII wrote:
| The original article mentioned LLMs needing powerful
| abstractions.
|
| This is basically the case with transformer networks, which is
| apparent when training from scratch. The model seems to be
| going basically nowhere and is totally useless until suddenly,
| at some random point after a bunch of learning cycles, the
| weights find some minimum on the error surface and bam,
| suddenly the model can do things properly. And it's because the
| transformer has learned an abstraction that works for all of
| the input data in an attentional sense (think how you scan a
| sentence when reading). Not the best explanation, but it's from
| memory, from a post I saw on HN a while back.
| whimsicalism wrote:
| Even in the first epoch the loss goes up? That seems... odd.
___________________________________________________________________
(page generated 2023-09-06 20:02 UTC)