[HN Gopher] The Curse of Recursion: Training on generated data m...
       ___________________________________________________________________
        
       The Curse of Recursion: Training on generated data makes models
       forget (2023)
        
       Author : surprisetalk
       Score  : 107 points
       Date   : 2024-12-01 05:21 UTC (6 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | patrickhogan1 wrote:
        | This seems more like an argument that the generated data was
        | bad. There are examples where AI has surpassed humans by using
        | simulated data (AlphaZero, which played against itself to
        | become the best at Go).
       | 
        | It also seems to happen mostly on small networks, which makes
        | sense.
       | 
       | Additionally, humans create simulated stories like Dune, Lord of
       | the Rings, or Harry Potter, which introduce fictional concepts,
       | yet these stories still result in trainable data.
        
         | thaumasiotes wrote:
         | > Additionally, humans create simulated stories like Dune, Lord
         | of the Rings, or Harry Potter, which introduce fictional
         | concepts, yet these stories still result in trainable data.
         | 
         | No, they don't, not in any sense where they are "simulated
         | data". Dune is simulated data about what life would be like on
         | Arrakis, and if you train a model to make predictions about
         | that question, your model will be worthless trash. (Doesn't
         | matter whether you train it on Dune or not.) Dune is _real_
         | data about how English is used.
        
           | ninetyninenine wrote:
            | It's also data about science fiction. With broad-spectrum
            | data from both Dune and its surrounding context, most LLMs
            | know that Dune is a fictional novel.
        
         | raincole wrote:
         | > humans create simulated stories like Dune, Lord of the Rings,
         | or Harry Potter
         | 
          | People have _really_ taken anthropomorphizing LLMs full
          | circle, don't they?
        
           | patrickhogan1 wrote:
           | So you are saying that if I generate stories on different
           | worlds as new data, a model cannot learn from that?
           | 
           | This isn't anthropomorphizing - it's generating data.
            | Generating data is not a uniquely human endeavor.
           | 
           | What created Mars? That is data.
           | 
           | What created star systems we cannot see?
        
         | banku_brougham wrote:
         | this is not a serious argument, please forgive me for saying
        
         | dudeinjapan wrote:
         | Thank you for making this comment, because it exposes some
         | logical gaps.
         | 
         | Firstly, Go, Chess, and other games have objective rules and
         | win criteria. (There is no "subjective opinion" as to whether
         | Fischer or Spassky won their match.)
         | 
         | Language, the output of LLMs, does not have an objective
          | function. Consider the following two sentences:
         | 
         | "My lips, two blushing pilgrims, ready stand."
         | 
         | "My red lips are ready to kiss your red lips."
         | 
         | Both are grammatically correct English sentences and both mean
         | basically the same thing, but clearly the former by Shakespeare
         | has a subjective poetic quality which the latter lacks. Even if
          | we make evaluation rules to target (for example, "do not repeat
          | phrases", "use descriptive adjectives", etc.), AI still seems to
          | favor certain words (for example "delve") that are valid but not
          | commonly used in human-originated English. There is then a
         | positive feedback loop where these preferences are used to
         | further train models, hence the next generation of models have
         | no way of knowing whether the now frequent usage of "delve" is
         | a human-originated or AI-originated phenomenon.
         | 
          | Lastly, regarding works of fiction, the concern is less about
          | the content of the stories--though that is also a concern--than
          | about the quality of the language. (Consider the alternate take
          | on Romeo and Juliet above, for example.)
        
           | patrickhogan1 wrote:
           | So you are arguing that the world does not have objective
           | rule criteria, like Physics?
           | 
            | And that an AI could not model the world, then run
            | simulations, have each simulation generate data, and learn
            | from that, similar to AlphaZero?
            | 
            | Here is a possible objective win environment:
            | 
            | Model complex multicellular organisms that become capable of
            | passing the Turing test and that also self-replicate.
        
             | dudeinjapan wrote:
             | My argument here is narrowly scoped to human language and
             | literature. (We already know the objective rule criteria of
             | life is 42.)
             | 
             | It may very well be possible for an AI to read all of
             | literature, and figure out what makes Hemingway, Tolstoy,
             | and Dylan "good writing" vs. "bad writing". That has not
              | yet been achieved. The problem, as the OP implies, is that
              | by polluting the universe of literature with current-gen AI
              | output, we may be making the task of generating "good
              | writing" harder in the future.
             | 
              | Then again, maybe not. Perhaps we have enough pre-AI works
              | that we can train on them versus the mountains of
              | AI-generated schlock, and determine the objective function.
        
               | patrickhogan1 wrote:
                | You seem to restate the argument of the submitted
                | research paper with a narrow interpretation that is much
                | different from its main conclusion of model collapse from
                | synthetic data creation. But I will follow you down this
                | interpretation.
               | 
               | Why does a human have to judge something as good for it
               | to be functionally useful?
               | 
                | Humans never came up with the move that AlphaZero used to
                | win one of the four games (out of five) that it won. Was
                | that a bad move? Did AlphaZero devolve into model collapse
                | because it made that move?
        
               | patrickhogan1 wrote:
               | And what's interesting here is people will get annoyed
               | with this.
               | 
               | Am I pro-human? Absolutely
               | 
                | Can non-human outputs be functionally valuable to humans
                | and not per se harmful? Absolutely. We live in a
                | biosphere that humans didn't create, or if humans did
                | create it, it was humans more evolved than the humans we
                | are aware of.
        
       | alterom wrote:
        | The dignified way to describe the problem at hand is to allude
        | to Brouwer's fixed-point theorem[1], with white noise as the
        | fixed point.
        | 
        | The more practical way is to allude to The Human Centipede[2].
        | 
        | Either way, the feedback loop doesn't result in a good output.
       | 
       | [1] https://en.wikipedia.org/wiki/Brouwer_fixed-point_theorem
       | 
       | [2]
       | https://en.wikipedia.org/wiki/The_Human_Centipede_(First_Seq...
        
         | benchmarkist wrote:
         | Also the data processing inequality:
         | https://en.wikipedia.org/wiki/Data_processing_inequality?use...
        
           | cma wrote:
            | And yet I prefer now to the early big bang era of the
            | universe, though it's technically reversible.
        
             | benchmarkist wrote:
             | The universe is not a Markov chain, in fact, no one knows
             | what it is but locally we do know that entropy increases
             | and the inevitable endpoint in our corner of the universe
             | is complete annihilation. Your preferences are completely
             | irrelevant in the local scheme of things.
        
       | banku_brougham wrote:
        | My intuition is that neither the public, users, nor the industry
        | will take this problem seriously. To me this paper sounds like a
        | thunderclap.
        
         | dale_glass wrote:
         | It's not a real problem, in my understanding. It's a "this
         | kills cancer in a petri dish" sort of thing.
         | 
         | Yes, it makes sense that if your algorithm is at all lossy,
         | passing outputs through it again compounds the loss.
         | 
          | The reality though is that this doesn't happen on a large scale.
         | JPEG isn't destroying all imagery because we're not stupid
         | enough to constantly compound compression losses.
         | 
         | Most AI output is generated for some sort of purpose. If we're
         | making images then we're throwing out the bad ones, retouching
         | blemishes, and also still making new works entirely by hand.
        
           | m11a wrote:
            | Well, if we're using the output of AIs, writing blog posts
            | with it and using AI to post comments on Reddit/X etc. (as
            | some do), and a few years later OpenAI et al refresh their
            | datasets to train a new model, then you're doing exactly
            | that, aren't you? Feeding lossy model outputs back into the
            | function, that is.
        
             | dale_glass wrote:
             | It's harder with LLMs, but we still have metrics for a lot
             | of text.
             | 
             | On Reddit/HN/etc we have comment scores, replies, reposts,
             | external references, etc that we can use to estimate
             | whether a given comment/post was any good.
             | 
             | An entity like Google that indexes the whole web has
             | visibility into when a given piece of content showed up, if
             | it changed, if it got referenced elsewhere later, etc.
             | 
             | LLMs can be used to analyze the page and work out things
             | like "the comments answering this comment are saying it's
             | wrong"
             | 
             | We can also of course pick and choose, Reddit has some
             | completely garbage subreddits and very tightly moderated
             | ones.
             | 
             | It's of course by no means foolproof, but it doesn't have
             | to be. It just has to be good enough to be usable for
             | whatever purpose we need.
             | 
             | Also, perfection isn't a thing and such issues happen even
             | without LLMs. Like all the cases of something wrong being
             | added to Wikipedia, getting repeated on some news site, and
              | then Wikipedia using that as a reference to back up the
             | incorrect claim.
        
       | benchmarkist wrote:
       | This is intuitively obvious. If I give you some data x and you
       | transform it with a non-reversible function f into f(x) then you
       | are losing information. Repeated applications of the function,
       | f(f(f(...f(x)...))), can only make the end result worse. The
        | current implementations inject some random bits, b ~ N(u, s),
        | but this can be thought of as convolving the original function
        | with the distribution g of the injected randomness, giving g*f.
        | Repeated applications, (g*f)((g*f)((g*f)(...(g*f)(x)...))),
        | still reduce the information content of the data you started
        | with, because the transformation remains non-reversible:
        | convolution cannot really change the non-reversible aspect of
        | the original function.
       | 
       | I'm sure there is some calculation using entropy of random
       | variables and channels that fully formalizes this but I don't
        | remember the references off the top of my head. The general
        | reference I remember is called the data processing inequality [1].
       | 
        | [1] https://en.wikipedia.org/wiki/Data_processing_inequality?use...
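        | 
        | A minimal toy sketch of that intuition (Python; my own
        | illustration, not anything from the paper): each pass through a
        | lossy, non-invertible map can only destroy information about the
        | original data, never recover it.
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   x = rng.integers(0, 256, size=200_000)  # ~8 bits per sample
        | 
        |   def f(v):          # a lossy, non-invertible map:
        |       return v // 2  # merges neighbouring symbols
        | 
        |   def entropy_bits(v):
        |       _, c = np.unique(v, return_counts=True)
        |       p = c / c.sum()
        |       return float(-(p * np.log2(p)).sum())
        | 
        |   for i in range(6):
        |       print(f"pass {i}: {entropy_bits(x):.2f} bits/sample,"
        |             f" {len(np.unique(x))} distinct values")
        |       x = f(x)
        | 
        | Injecting noise at each step (the g*f picture) can raise the raw
        | entropy of the output, but it cannot restore any information
        | about the original x, which is exactly what the data processing
        | inequality formalizes.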
        
         | optimalsolver wrote:
         | Relevant XKCD:
         | 
         | https://xkcd.com/1683/
        
           | benchmarkist wrote:
           | Good one but these theorems are useful to have when thinking
           | about information processing systems and whatever promises
           | the hype artists are making about the latest and greatest
           | iteration of neural networks. There is no way to cheat
           | entropy and basic physics so if it sounds too good to be true
           | then it probably is too good to be true.
        
             | HeatrayEnjoyer wrote:
             | If it is entropy and basic physics why are humans immune to
             | the effect?
        
               | raincole wrote:
               | Humans are not immune to the effect. We invented
               | methodologies to mitigate the effect.
               | 
               | Think about science. I mean hard science, like physics.
                | You cannot say a theory is proven[0] if it is purely
                | derived from existing data. You can only say it when you
                | release your theory and successfully _predict_ the
                | results of future experiments.
                | 
                | In other words, you need to do new experiments, gather
                | new information and effectively "inject" the entropy into
                | humanity's scientific consensus.
                | 
                | [0]: Of course when we say some physical theory is
                | proven, it just means the probability that it's violated
                | in certain conditions is negligible, not that it's a
                | universal truth.
        
         | jiggawatts wrote:
         | That's assuming that the same function is applied in the same
         | way at each iteration.
         | 
         | Think about this: The sum total of the human-generated
         | knowledge was derived in a similar manner, with each generation
         | learning from the one before and expanding the pool of
         | knowledge incrementally.
         | 
         | Simply adding a bit of noise and then _selecting_ good outputs
         | after each iteration based on a high-level heuristic such as
         | "utility" and "self consistency" may be sufficient to reproduce
         | the growth of human knowledge in a purely mathematical AI
         | system.
         | 
         | Something that hasn't been tried yet because it's too expensive
         | (for now) is to let a bunch of _different_ AI models act as
         | agents updating a central wikipedia-style database.
         | 
         | These could start off with "simply" reading _every single text
         | book and primary source_ on Earth, updating and correcting the
         | Wikipedia in _every language_. Then cross-translate from every
         | source in some language to every other language.
         | 
         | Then use the collected facts to find errors in the primary
         | sources, then re-check the Wikipedia based on this.
         | 
         | Train a new generation of AIs on the updated content and mutate
         | them slightly to obtain some variations.
         | 
         | Iterate again.
         | 
         | Etc...
         | 
         | This could go on _for quite a while_ before it would run out of
         | steam. Longer than anybody has budget for, at least for now!
        
           | zkry wrote:
           | > The sum total of the human-generated knowledge was derived
           | in a similar manner, with each generation learning from the
           | one before and expanding the pool of knowledge incrementally.
           | 
           | Is human knowledge really derived in a similar manner though?
           | That reduction of biological processes to compression
           | algorithms seems like a huge oversimplification.
           | 
            | It's almost like saying that all of human knowledge
           | derives from Einstein's Field Equations, the Standard Model
           | Lagrangian, and the Second Law of Thermodynamics (what else
           | could human knowledge really derive from?) and all we have to
           | do to create artificial intelligence is just to model these
           | forces to a high enough fidelity and with enough computation.
        
             | ethbr1 wrote:
             | Human knowledge also tends to be tied to an objective,
             | mostly constant reality.
        
               | jiggawatts wrote:
                | The AIs could also learn from and interact with reality,
                | same as humans.
        
               | dartos wrote:
               | Not really.
               | 
               | The models we use nowadays operate on discrete tokens. To
                | oversimplify the process of human learning: we take in a
                | constant stream of realtime information. It never ends
               | and it's never discrete. Nor do we learn in an isolated
               | "learn" stage in which we're not interacting with our
               | environment.
               | 
                | If you try taking reality and breaking it into discrete
                | (ordered, in the case of LLMs) parts, you lose
               | information.
        
             | mannykannot wrote:
             | It's not just any compression algorithm, though, it's a
             | specific sort of algorithm that does not have the purpose
             | of compression, even if compression is necessary for
             | achieving its purpose. It could not be replaced by most
             | other compression algorithms.
             | 
             | Having said that, I think this picture is missing
             | something: when we teach each new generation what we know,
             | part of that process involves recapitulating the steps by
             | which we got to where we are. It is a highly selective
             | (compressed?) history, however, focusing on the things that
             | made a difference and putting aside most of the false
             | starts, dead ends and mistaken notions (except when the
             | topic is history, of course, and often even then.)
             | 
             | I do not know if this view has any significance for AI.
        
           | chongli wrote:
           | _Think about this: The sum total of the human-generated
           | knowledge was derived in a similar manner, with each
           | generation learning from the one before and expanding the
           | pool of knowledge incrementally._
           | 
           | Not true. No amount of such iteration gets you from buffalo
           | cave paintings to particle accelerators.
           | 
           | Humans generate knowledge by acting in the world, not by
           | dwelling on our thoughts. The empiricists won a very long
           | time ago.
        
             | brookst wrote:
             | It's not binary. Humans generate plenty of knowledge from
             | pure abstract thought.
        
               | kerkeslager wrote:
               | Do they?
               | 
               | When I pursued creative writing in my teens and early
               | 20s, it became clear to me that originality is
               | _extremely_ difficult. I am not entirely sure I have
                | _ever_ had an original thought--every idea I've put to
               | paper thinking it was original, I later realized was a
               | recombination of ideas I had come across somewhere else.
               | The only exceptions I've found were places where I had a
               | fairly unusual experience which I was able to interpret
               | and relate, i.e. a unique interaction with the world.
               | 
               | Perhaps more importantly, LLMs do not contain any
               | mechanism which even attempts to perform pure abstract
               | thought, so even if we accept the questionable assumption
                | that humans can generate ideas _ex nihilo_, that doesn't
               | mean that LLMs can.
        
               | jiggawatts wrote:
                | Unless your argument is that all creative writing is
                | _inspired by God_, or by some similar "external" source,
                | then clearly a closed system such as "humanity" alone is
                | capable of generating new creative works.
        
             | jiggawatts wrote:
             | You're right, we obtained the knowledge externally. It was
             | aliens! I knew it!
        
         | scotty79 wrote:
          | If you repeatedly apply one of three simple functions picked at
          | random, you might end up with the Sierpinski triangle.
        
           | eminent101 wrote:
            | This sounds fascinating! I know what a Sierpinski triangle
            | is, but I'm having some trouble seeing the connection from
            | picking functions randomly to the triangle.
           | Is there some graphics or animation somewhere on the web that
           | someone can point me to visualize this better?
        
             | scotty79 wrote:
              | You can read the Chaos Game section here:
             | 
             | https://en.m.wikipedia.org/wiki/Sierpi%C5%84ski_triangle
             | 
              | It's basically using the fact that the fractal is
              | self-similar. So picking one function (that scales the
              | whole triangle down onto one of its three smaller copies)
              | and transforming a single point on the fractal into a new
              | point also gets you a point on the fractal.
             | 
             | If you repeat this process many times you get a lot of
             | points of the fractal.
             | 
             | You can even start the process at any point and it will
             | "get attracted" to the fractal.
             | 
              | That's why such fractals are called attractors (of the
              | iterated function system).
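              | 
              | A tiny Python sketch of the chaos game (my own
              | illustration; it just prints a rough ASCII rendering):
              | 
              |   import random
              | 
              |   V = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]
              |   x, y = 0.3, 0.7               # arbitrary start
              |   pts = []
              |   for i in range(50_000):
              |       vx, vy = random.choice(V)  # pick a corner
              |       x, y = (x + vx) / 2, (y + vy) / 2
              |       if i > 20:                # skip transients
              |           pts.append((x, y))
              | 
              |   W, H = 60, 30
              |   grid = [[" "] * W for _ in range(H)]
              |   for px, py in pts:
              |       r = H - 1 - int(py / 0.866 * (H - 1))
              |       grid[r][int(px * (W - 1))] = "#"
              |   print("\n".join("".join(g) for g in grid))
              | 
              | Every point after the first few lands (arbitrarily close
              | to) the Sierpinski triangle, no matter where you start.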
        
         | ofrzeta wrote:
         | What about something like image improvement algorithms or
         | NeRFs? They seem to increase information even if some of it is
         | made up.
        
           | dartos wrote:
           | Do they gain information, or just have lower loss?
           | 
            | Too much information encoded in a model can lower performance
            | (this is called overfitting).
            | 
            | That's why many NN topologies include dropout layers.
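            | 
            | For example, a minimal PyTorch sketch (illustrative only) of
            | where a dropout layer sits in a network:
            | 
            |   import torch
            |   from torch import nn
            | 
            |   model = nn.Sequential(
            |       nn.Linear(64, 128),
            |       nn.ReLU(),
            |       nn.Dropout(p=0.5),  # randomly zeroes activations,
            |       nn.Linear(128, 1),  # but only in train() mode
            |   )
            |   model.train()
            |   print(model(torch.randn(8, 64)).shape)  # -> [8, 1]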
        
           | antihero wrote:
            | It isn't real information though. This is effectively a game
            | of Chinese whispers.
           | 
           | The only way AI can create information is by doing something
           | in the real world.
        
             | ofrzeta wrote:
             | Maybe information needs to be understood relationally as in
             | "information for a subject x". So if we have an image with
             | a license plate that is unreadable and there's an algorithm
             | that makes it readable to x, there is an information gain
             | for x, although the information might have been in the
             | image all along.
        
               | bippihippi1 wrote:
               | eliminating the noise makes the useful information
               | clearer, but the information describing the noise is lost
        
               | vunderba wrote:
               | Sure, but what if the upscaling algorithm misinterpreted
               | a P as an F? Without manual supervision/tagging, there's
               | an inherent risk that this information will have an
               | adverse effect on future models.
        
               | ablob wrote:
               | If the license plate was not readable, then the
               | additional information is false data. You do not know
               | more about the image than you knew before by definition.
               | Replacing pixels with plausible data does not mean a gain
               | of information. If anything, I'd argue that a loss of
               | information occurs: The fact that x was hardly
               | readable/unreadable before is lost, and any decision
               | later on can not factor this in as "x" is now clearly
               | defined and not fuzzy anymore.
               | 
               | Would you accept a system that "enhances" images to find
               | the license plate numbers of cars and fine their owners?
               | If the plate number is unreadable the only acceptable
               | option is to not use it. Inserting a plausible number and
               | rolling with it even means that instead of a range of
               | suspects, only one culprit can be supposed. Would you
                | like to find yourself in court for crimes/offenses you
                | never committed because some black box decided it was a
                | great idea to pretend it knew it was you?
               | 
               | Edit: I think I misunderstood the premise. Nonetheless my
               | comment shall stay.
        
             | jppittma wrote:
              | It's information taken from many other photos and embedded
              | into a single one of interest, no?
        
             | dragonwriter wrote:
             | > The only way AI can create information is by doing
             | something in the real world.
             | 
             | Everything done is done in the real world, but the only way
             | an AI can gather (not create) information _about some
             | particular thing_ is to interact with that thing. Without
             | interacting with anything external to itself, all
             | information it can gather is the information already
             | gathered to create it.
        
               | tomrod wrote:
               | Is there a formalization of this idea? Would love to read
               | more.
        
             | Lerc wrote:
             | It is real information, it is just information that is not
             | targeted at anything in particular. Random passwords are,
             | well, random. That they are random _and_ information is
             | what makes them useful as passwords.
             | 
              | As said by others, there is nothing terribly insightful
             | about making something estimate the output of another by a
             | non-perfect reproduction mechanism and noticing the output
             | is different. Absent any particular guidance the difference
             | will not be targeted. That is tautologically obvious.
             | 
             | The difference is still information though, and with
             | guidance you can target the difference to perform some
             | goal. This is essentially what the gpt-o1 training was
             | doing. Training on data generated by itself, but only when
             | the generated data produced the correct answer.
        
           | blensor wrote:
            | Once more and more new training images are based off of
            | those upscaled images, the training of those upscaling
            | algorithms will tend to generate even more of the same type
            | of information, drowning out the other information.
        
           | vunderba wrote:
            | If the goal of an image improvement algorithm is effectively
            | "how would this image have looked _IN THE REAL WORLD_ if it
            | had been taken with a better camera", then training on
           | previous "virtual upscaled images" would be training on the
           | wrong fitness function.
        
            | dragonwriter wrote:
            | "Made up" information is noise, not signal. (OTOH, generated
            | images are used productively all the time in training, but
            | the information content added is not in the images themselves
            | but in their selection and relation to captions.)
        
           | raincole wrote:
           | Image improvement algorithms are basically injecting
           | statistical information (collected from other images) into
           | one image.
           | 
           | The above statement applies for non-neural-network algorithms
           | as well.
        
         | cruffle_duffle wrote:
         | So correct me if I'm wrong here but wouldn't another way to
         | look at this be something like re-compressing a JPEG? Each time
         | you compress a compressed jpeg you strip more and more
         | information out of it? Same with any lossy compression, really.
         | 
          | These LLMs are inherently a bit like lossy compression
         | algorithms. They take information and pack it in a way that
         | keeps its essence around (at least that is the plan). But like
         | any lossy compression, you cannot reconstruct the original.
         | Training a lossy compression scheme like an LLM using its own
         | data is just taking that already packed information and
         | degrading it.
         | 
          | I hope I'm right framing it this way, because ultimately that
          | is partly what an LLM is: a lossy compression of "the entire
          | internet". A lossless model that can be queried like an LLM
         | would be massive, slow and probably impossible with today's
         | tech.
         | 
         | I suspect that we will develop new information theory that
         | mathematically proves these things can't escape the box they
         | were trained in, meaning they cannot come up with new
         | information that isn't already represented in the relationships
         | between the various bits of data they were constructed with.
         | They can "only" find new ways to link together the information
         | in their corpus of knowledge. I use "only" in quotes because
         | simply doing that alone is pretty powerful. It's connecting the
         | dots in ways that haven't been done before.
         | 
          | Honestly the whole LLM space is cool as shit when you really
          | think about it. It's both incredibly overhyped yet very
          | underhyped at the same time.
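          | 
          | The JPEG analogy is easy to try yourself. A rough Pillow sketch
          | (my own illustration; "photo.png" is a made-up filename):
          | 
          |   import io
          |   import numpy as np
          |   from PIL import Image
          | 
          |   img = Image.open("photo.png").convert("RGB")
          |   orig = np.asarray(img, dtype=np.float64)
          | 
          |   for gen in range(1, 11):
          |       buf = io.BytesIO()
          |       img.save(buf, format="JPEG", quality=75)  # lossy encode
          |       buf.seek(0)
          |       img = Image.open(buf).convert("RGB")      # decode, repeat
          |       cur = np.asarray(img, dtype=np.float64)
          |       rmse = np.sqrt(np.mean((cur - orig) ** 2))
          |       print(f"generation {gen}: RMSE vs original = {rmse:.2f}")
          | 
          | In practice the drift from the original grows over the first
          | few generations and then levels off; none of the lost detail
          | ever comes back.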
        
           | entangledqubit wrote:
           | Relevant article by a fun author:
           | https://www.newyorker.com/tech/annals-of-
           | technology/chatgpt-...
        
         | wslh wrote:
         | > with a non-reversible function f into f(x) then you are
         | losing information.
         | 
         | A non-reversible function f does not necessarily lose
         | information. Some non-reversible functions, like one-way
         | functions used in cryptography, can be injective or even
         | bijective but are computationally infeasible to invert, which
         | makes them practically irreversible while retaining all
         | information in a mathematical sense. However, there is a subset
         | of non-reversible functions, such as non-injective functions,
         | that lose information both mathematically and computationally.
         | It's important to distinguish these two cases to avoid
         | conflating computational irreversibility with mathematical loss
         | of information.
        
           | meltyness wrote:
            | On the arguments modeling inference as simply some function
            | f: the specific expression the OP used overlooks that each
            | subsequent application would follow some backpropagation,
            | and so implies a new f' at each application, rendering the
            | claim invalid.
            | 
            | At that point, at least chaos theory is at play across the
            | population of natural language, if not some expressed but
            | not yet considered truth.
            | 
            | This invalidates the subsequent claim about the convolved
            | functions as well; I think all the GPUs might have something
            | to say about whether the bits changing the layers are random
            | or correlated.
        
         | habitue wrote:
         | This seems obvious, but you're forgetting the inputs may
         | actually have low entropy to begin with. Lossy compression is
         | non-reversible, but usually the expectation is that we don't
         | care about the parts we lost.
         | 
         | How might this cash out with recursive LLMs? Generalizing is
         | very similar to compression: imagine recovering the Schrodinger
         | equation from lots of noisy physical experiments. You might
         | imagine that an LLM could output a set of somewhat general
         | models from real data, and training it on data generated from
         | those models generalizes further in future passes until maybe
         | it caps out at the lowest entropy model (a theory of
         | everything?)
         | 
         | It doesn't seem like it actually works that way with current
         | models, but it isn't a foregone conclusion at the mathematical
         | level at least.
        
       | mmastrac wrote:
       | All work and no play makes jack a dull boy.
        
       | goose- wrote:
       | My takeaway after scanning the paper -
       | 
       | In an ideal setting, a trained model learns exactly the real
       | world probability distribution, and generates data
       | indistinguishable from those sampled from the real world.
       | Training on them would be fine, but pointless, since the model is
       | already a perfect representation of the real world.
       | 
        | Practically, however, a model is only a lossy approximation of
        | the real-world probability distribution. Repeated self-training
        | simply compounds the loss - over-representing the probable and
        | gradually forgetting the improbable.
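        | 
        | A toy numpy sketch of that compounding (my own illustration, not
        | the paper's experiment): each generation re-estimates a
        | categorical distribution from samples drawn from the previous
        | generation's estimate. Rare outcomes that miss a single
        | generation are gone for good.
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(1)
        |   k, n = 100, 1000
        |   p = 1.0 / np.arange(1, k + 1)   # Zipf-ish "real world"
        |   p /= p.sum()
        | 
        |   for gen in range(31):
        |       if gen % 5 == 0:
        |           alive = int((p > 0).sum())
        |           print(f"gen {gen:2d}: {alive}/{k} outcomes survive")
        |       counts = rng.multinomial(n, p)
        |       p = counts / n   # next "model" = empirical frequencies
        | 
        | The tails vanish first, which is the "forgetting the improbable"
        | part; the Gaussian version in the paper additionally shows the
        | estimated variance drifting and shrinking over generations.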
        
       | tkgally wrote:
       | This paper was first published in May 2023 and discussed on HN
       | the following month:
       | 
       | https://news.ycombinator.com/item?id=36319076
       | 
       | Some research since seems to add nuance to its conclusions:
       | 
       | https://arxiv.org/abs/2404.01413
        
         | CaptainFever wrote:
         | > The proliferation of generative models, combined with
         | pretraining on web-scale data, raises a timely question: what
         | happens when these models are trained on their own generated
         | outputs? Recent investigations into model-data feedback loops
         | proposed that such loops would lead to a phenomenon termed
         | model collapse, under which performance progressively degrades
         | with each model-data feedback iteration until fitted models
         | become useless. However, those studies largely assumed that new
         | data replace old data over time, where an arguably more
         | realistic assumption is that data accumulate over time. In this
         | paper, we ask: what effect does accumulating data have on model
         | collapse? We empirically study this question by pretraining
         | sequences of language models on text corpora. We confirm that
         | replacing the original real data by each generation's synthetic
         | data does indeed tend towards model collapse, then demonstrate
         | that accumulating the successive generations of synthetic data
         | alongside the original real data avoids model collapse; these
         | results hold across a range of model sizes, architectures, and
         | hyperparameters. We obtain similar results for deep generative
         | models on other types of real data: diffusion models for
         | molecule conformation generation and variational autoencoders
         | for image generation. To understand why accumulating data can
         | avoid model collapse, we use an analytically tractable
         | framework introduced by prior work in which a sequence of
         | linear models are fit to the previous models' outputs. Previous
         | work used this framework to show that if data are replaced, the
         | test error increases with the number of model-fitting
         | iterations; we extend this argument to prove that if data
         | instead accumulate, the test error has a finite upper bound
         | independent of the number of iterations, meaning model collapse
         | no longer occurs.
         | 
         | TL;DR: This paper confirms that Model Collapse can happen if
         | the original data is replaced with synthetic data, but if both
         | are used alongside each other, it no longer happens.
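          | 
          | A toy numpy version of that replace-vs-accumulate comparison
          | (my own sketch of the Gaussian setting, not the paper's actual
          | experiments):
          | 
          |   import numpy as np
          | 
          |   rng = np.random.default_rng(0)
          |   real = rng.normal(0.0, 1.0, size=200)
          | 
          |   def run(accumulate, generations=50):
          |       pool = real.copy()
          |       for _ in range(generations):
          |           mu, sigma = pool.mean(), pool.std()
          |           synth = rng.normal(mu, sigma, size=200)
          |           if accumulate:
          |               pool = np.concatenate([pool, synth])
          |           else:
          |               pool = synth
          |       return pool.std()
          | 
          |   print("replace:    final sigma =", round(run(False), 3))
          |   print("accumulate: final sigma =", round(run(True), 3))
          | 
          | With "replace" the fitted sigma does a random walk away from
          | the true value of 1 and tends to shrink; with "accumulate" the
          | original real data keeps anchoring the estimate near 1.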
        
       | aucisson_masque wrote:
       | > the value of data collected about genuine human interactions
       | with systems will be increasingly valuable in the presence of
       | content generated by LLMs in data crawled from the Internet.
       | 
        | Does it mean that data-hungry corporations like Google, Facebook,
        | Amazon, and OpenAI with Microsoft backing, which are already all
        | around the internet and on our phones tracking us, have an
        | incredible advantage over open-source models?
        | 
        | Is that why Google is pushing Gemini so hard on Android even
        | though it's half-ass done? Do they need fresh human data so much
        | to be able to compete and beat the competition?
        
         | piva00 wrote:
          | > Does it mean that data-hungry corporations like Google,
          | Facebook, Amazon, and OpenAI with Microsoft backing, which are
          | already all around the internet and on our phones tracking us,
          | have an incredible advantage over open-source models?
         | 
          | Yes, absolutely. Back around 2017 The Economist had an article
          | declaring that "data is the new oil"; I first heard that from a
          | VC back in 2010.
          | 
          | These companies are sitting on immense reserves of data:
          | Google, Facebook, Amazon, and Bytedance are the Saudi Arabia,
          | UAE, etc. of the information age.
        
           | bgilroy26 wrote:
           | The quality of reddit's data is different from other data I
           | encounter online.
           | 
           | It represents information more closely related to people's
           | lives. People share information there that is closely related
           | to the topic of the subreddit. This may not always be the
           | case, but even though I spend much, much less time on reddit
           | than I did in 2011, many, many people are contributing to
           | this day.
           | 
           | That spigot of connection to the real world through text
           | sounds valuable to AI based on TFA. I feel the oil analogy
           | would be about the quality and the ease of extraction of the
           | stake
        
       | XorNot wrote:
       | While I'm sure the anti-AI people are taking this and running off
       | with hot takes, the conclusion is still much more mundane: we
       | currently do not have the ability to have an LLM learn from
       | another LLM.
       | 
        | A suitably powerful AI _should_ be able to do this though, as
        | evidenced by the fact that humans learn by being taught by other
        | humans (insert nuance of that process here).
       | 
       | So it's an important result, but not a doomsday result because
       | what it tells us is that LLM output fails to capture or stabilize
       | important information from the training corpus and accurately
       | communicate it to a newly trained LLM. So we know we're missing
       | something in how we construct these models, but the ramifications
       | of solving it are also pretty immense: models being able to
       | "teach" new models means the whole cycle of iteration can be sped
       | up considerably.
        
         | kouteiheika wrote:
         | > we currently do not have the ability to have an LLM learn
         | from another LLM
         | 
         | We do. It's called model distillation, and it's relatively
         | straightforward.
         | 
         | In fact, training a smaller model on the outputs of a much
         | bigger model will significantly cut down on your training
         | time/create a higher quality model than just training on raw
         | human data (which is often low quality and noisy).
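          | 
          | A minimal sketch of the idea in PyTorch (illustrative only,
          | with random tensors standing in for real data): the student is
          | trained to match the teacher's softened output distribution.
          | 
          |   import torch
          |   import torch.nn.functional as F
          |   from torch import nn
          | 
          |   teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(),
          |                           nn.Linear(256, 10))
          |   student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
          |                           nn.Linear(64, 10))
          |   opt = torch.optim.Adam(student.parameters(), lr=1e-3)
          |   T = 2.0  # temperature: softens the teacher's outputs
          | 
          |   for step in range(100):
          |       x = torch.randn(64, 32)  # stand-in for real inputs
          |       with torch.no_grad():
          |           t_probs = F.softmax(teacher(x) / T, dim=-1)
          |       s_logp = F.log_softmax(student(x) / T, dim=-1)
          |       loss = F.kl_div(s_logp, t_probs,
          |                       reduction="batchmean") * T * T
          |       opt.zero_grad()
          |       loss.backward()
          |       opt.step()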
        
         | suprjami wrote:
         | It has existed for years:
         | 
         | Self-Instruct: Aligning Language Models with Self-Generated
         | Instructions https://arxiv.org/abs/2212.10560
         | 
         | airoboros: using large language models to fine-tune large
         | language models https://github.com/jondurbin/airoboros
        
       | lowyek wrote:
       | I no longer take limitations seriously regarding the future of
       | AI. If evolution created our brain, then the same law applies to
        | what we are building also. Hence, more or less whatever is
        | written in this paper is some nuanced case which can be solved by
        | some approach.
        
       | Scene_Cast2 wrote:
       | There is a mantra in ML that has been around for a while. It's
       | that when training on synthetic data, your learned model is only
       | as good as your generator model.
        
         | etiam wrote:
         | Catchy! And a really good point.
         | 
         | Seems like there could be room for a couple of special
         | situations with caveats though? With the GAN formulation your
         | generator can be practically as good as your discriminator and
         | your discriminator can probably be better than it would have
         | been without adversarial regularization?
        
       | f3z0 wrote:
        | Given that the top Google results are now generated, I think we
        | already have a massive recursion problem. I think we would
        | benefit from training a model specifically to detect the
        | likelihood of content being generated, and then biasing other
        | models against the higher-likelihood generated content so that we
        | don't end up with LLM echo chambers.
        
         | tempodox wrote:
         | Isn't everybody always gushing about how LLMs are supposed to
         | get better all the time? If that's true then detecting
         | generated fluff will be a moving target and an incessant arms
         | race, just like SEO. There is no escape.
        
           | LegionMammal978 wrote:
           | Yep, that's what I've been thinking since people started
           | talking about it. I hear that AI plagiarism detectors can
           | never work, since LLM output can never be detected with any
           | accuracy. Yet I also hear that LLMs-in-training easily sift
           | out any generated content from their input data, so that
           | recursion is a non-issue. It doesn't make much sense to have
           | it both ways.
        
             | ipython wrote:
             | I wonder if the truth about sifting out synthetic training
             | data is based on signals separate from the content itself.
              | Signals such as the source of the data, reported author,
              | links to/from, etc.
              | 
              | These signals would be unavailable to a plagiarism/AI
              | detector.
        
         | eddyfromtheblok wrote:
          | Right. Google already has a solution:
          | https://deepmind.google/technologies/synthid/
          | 
          | Everyone insists on training theirs to look human-generated,
          | so the horses have left the stable on this.
        
       | tempodox wrote:
       | Indeed, ingesting generated bluster gives them cancer of the
       | perceptron.
        
       | axegon_ wrote:
        | That was very much evident even from back when the first GPTs
        | came out. The moment you started introducing synthetic data, the
        | quality plummeted.
       | 
        | But there is another use case where LLMs can truly help with
        | synthetic data: the more classical classification and regression
        | problems - specifically gathering training data. I had this exact
        | case at work two days ago: a large dataset with a small subset of
        | labeled data. For a binary classifier, there was a huge imbalance
        | in the data - the ratio was roughly 75/25. I did not have the
        | desire to do all this manually, so I used an LLM to get a list
        | that would even out the numbers (and get a 50/50 ratio). And
        | using the data I had, plus the additional synthetic data, the
        | accuracy of my small classifier ended up picture-perfect (given
        | that my actual target was 85-90% accuracy and the actual result
        | was just shy of 99%).
        
         | kerkeslager wrote:
         | I'd argue that the case you give isn't an example of using a
         | computer to generate data, it's a case of a human adding data
         | (the data being the fact that the binary classifier should have
         | a 50/50 balance).
         | 
         | This sort of massaging of data has its drawbacks as well--
         | obviously this only works if the balance of that binary
         | classifier actually is 50/50 in reality: I don't know enough
         | about your case to say you were wrong, but I can imagine a lot
         | of scenarios where a binary classifier should not be
         | represented 50/50 in the data.
        
           | axegon_ wrote:
            | This is a question of definition. It is synthetic in that I
            | just passed a prompt, asking for N examples of X. And I did
            | not go over the entire list I got and blindly trusted it. In
            | this context, I needed an even (or nearly even) distribution
            | of samples in the training data, and it worked way better
            | than I was hoping. Mind you, I have to face a similar issue
            | next week and I'm not sure this approach would cut it - I
            | need way more training data and way more classes to work
            | with. 249 classes if I'm not mistaken.
        
             | kerkeslager wrote:
             | I question what "it worked way better than I was hoping"
             | means in this context. If you're saying that filtering the
             | input data to create a uniform distribution created a
             | uniform distribution in the output, I'm not sure why you'd
             | hope for any less--that's exactly what I'd expect to
             | happen. But that's a poor measure of success, because you
             | don't know what side effects that had: the removed data
             | ostensibly contained other variables besides your binary
             | variable, and you don't know if those variables were
             | sampled in any useful way, so I'd be hesitant to say this
             | worked well without at least an attempt to measure those
             | other variables.
        
         | low_tech_love wrote:
         | Just curious, but did you compute that 99% using purely real
         | test data, or does your test set also include artificial data?
        
           | axegon_ wrote:
           | Apart from using the results from the
           | training/testing/validation sets? Several people manually
           | went over several thousand random samples.
        
       | kerkeslager wrote:
       | Isn't this obvious?
       | 
       | I'm glad this was published to point out the problem, but I'm a
       | bit puzzled why people tried to train models on generated data in
       | the first place. Synthetic data... isn't data.
       | 
       | The one exception I can see is medical data, where synthetic data
       | can be used to avoid violating people's privacy, but even in that
       | case it's clearly not ideal from a technical perspective.
        
         | quantadev wrote:
         | To me it seems intuitive that training on any unseen word
         | patterns should increase intelligence as long as said patterns
         | are consistent with ground truth. That's why it's counter-
         | intuitive (to me) that training can fail, purely based on where
         | the training data came from. The source of the information is
         | something only the universe itself should be able to take into
         | consideration (full causality chain), and not the training
         | process.
        
           | kerkeslager wrote:
           | I am unable to parse what you're saying here.
        
             | quantadev wrote:
             | I was just saying it's counter-intuitive that the "source"
             | of any training data would ever matter as much as the
             | "correctness" of the data; but you're right, that was very
             | sloppy wording on my part, sorry.
             | 
             | Here's a longer, related post, from me (albeit also
             | confusing, haha):
             | 
             | https://news.ycombinator.com/item?id=42352759
        
       | meltyness wrote:
       | My intuition given the rapid, informal developments of agent-type
       | systems is that this is obvious insofar as the initial dataset
       | was formed from a huge hidden "data cleaning" task that was human
        | evolution and society. This isn't really that interesting of a
        | claim, and is it clear that it holds if you simply loop the LLM
        | back onto the data-cleaning task itself as a critic of the new
        | training set? Is this what the author would classify as
        | fine-tuning?
        | 
        | Another question: what is the interpretation of the output of an
        | LLM when it is unprompted? Isn't that always effectively garbage
        | when there's not a deliberate bias in the training set?
        
       | quantadev wrote:
        | As Sam Altman and Dario Amodei both believe is a very real
        | possibility, I think the "intelligence" in LLMs may be
        | far deeper than we know and somehow even related to "Multiverse
       | Theory", where perhaps every Quantum Mechanical collapse (and
       | computation during training), makes "our" universe slightly more
       | likely to lean towards ones where AI is just "magically smart"
       | (from a purely Anthropics Principle Effect) than dumb. The reason
       | this could happen is because in all our futures AI has saved us
       | in some way, so that all other "Multiverse Branches are sort of
       | dead-ends".
       | 
       | So the theory about why training on training data is unexpectedly
       | inefficient could be because LLMs are "using" the full Causality
       | Chain (using some advanced unknown Physics related to time
        | itself) of our universe/timeline, and so if it tries to train on
        | its own output that's a "Short Circuit" kind of effect, cutting
       | off the true Causality Chain (past history of the universe).
       | 
       | For people who want to remind me that LLM Training is fully
       | "deterministic" with no room for any "magic", the response to
       | that counter-argument is that you have to consider even the input
       | data to be part of what's "variable" in the Anthropics Selection
       | Principle, so there's nothing inconsistent about determinism in
       | this speculative, and probably un-falsifiable, conjecture.
        
       ___________________________________________________________________
       (page generated 2024-12-07 23:01 UTC)