[HN Gopher] AI models collapse when trained on recursively gener...
       ___________________________________________________________________
        
       AI models collapse when trained on recursively generated data
        
       Author : rntn
       Score  : 173 points
       Date   : 2024-07-24 15:42 UTC (7 hours ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | tyingq wrote:
       | Which is good background to this story about Reddit locking down
       | robots.txt and trying to get money from the AI teams scraping
       | their content.
       | 
       | https://news.ycombinator.com/item?id=41057033
        
         | roughly wrote:
         | If they're considering Reddit content to be free of generated
         | material, I've got bad news for them. It's not quite the
         | Chernobyl-grade hole that Pinterest has become, but it's hardly
         | "low background".
        
           | tyingq wrote:
           | Sure. I think Reddit is aware though, that time is running
           | out to get paid for whatever human generated content is there
           | that isn't already scraped.
        
           | visarga wrote:
           | I still believe reddit is an amazing source. Any article you
           | read on reddit, chances are the comments are better than the
           | original text. They will debunk the article, present a
           | diversity of reactions, and most importantly, they will be
           | grounded in public opinion unlike the press which caters to
           | money interests.
           | 
           | You just copy-paste a conversation into the LLM and ask for
           | an article. For taste, here is one generated from this very
           | conversation. https://pastebin.com/raw/JFH6PGqg
        
             | mvdtnz wrote:
             | > Any article you read on reddit, chances are the comments
             | are better than the original text.
             | 
             | We're talking about reddit dot com here? Seriously? I find
             | it difficult to find any comments worth reading at all on
             | that website. 99% of the stuff that isn't buried is just
             | the same recycled jokes again and again and again.
        
       | daft_pink wrote:
        | It's like the AI-generated version of index funds.
        
       | roughly wrote:
       | Back when I was getting my econ degree, we were taught about the
       | Ultimatum game, which goes like this: You get two participants
       | who don't know each other and will (ostensibly) never see each
       | other again. You give one of them $100, and they make an offer of
       | some portion of it to the other. If the other accepts, both
       | parties keep their portion - so, if A offers B $20, and B
       | accepts, A keeps $80 and B keeps $20, if B rejects, both parties
       | get nothing. Standard economic theory suggests A can offer $1 and
       | B will accept, because otherwise B gets nothing. Spoiler for
       | those of you who haven't seen how standard economic theory plays
       | out in real life, that's not how the game went - typically,
       | offers below ~$30 or so got rejected, because B was a real
       | feeling person who felt like they were getting screwed and opted
       | to punish A for doing so. The exception to this - the people who
       | would take the $1 offer - were people who had been taught
       | economic theory. It turns out you _could_ screw them over and
       | they'd pat themselves on the backs for being very wise.
       | 
       | The "tragedy of the commons" is another one of those parts of
       | standard economic theory that never actually played out in
       | reality - we've got examples from all over the world of
       | communities implementing practices and often entire belief
       | systems that led them to be responsible stewards of shared
       | resources without requiring unilateral ownership of that resource
       | and singular acquisition of the benefits of that stewardship, and
       | yet first on the lips of every modern capitalist when describing
       | why they're at a disadvantage if they're not the ones polluting
       | the water supply is the tragedy of the commons.
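        | 
        | For anyone who wants to play with the numbers, here is a
        | minimal sketch of the game in Python. The spread of rejection
        | thresholds around the ~$30 figure mentioned above is an
        | illustrative assumption, not data from the actual experiments:
        | 
        |   import random
        | 
        |   def play_ultimatum(offer, threshold=30):
        |       # A offers `offer` out of $100; B accepts only if the
        |       # offer meets B's threshold, else both get nothing.
        |       if offer >= threshold:
        |           return 100 - offer, offer
        |       return 0, 0
        | 
        |   # Compare the "textbook rational" $1 offer with more even
        |   # splits, assuming thresholds spread roughly around $30.
        |   random.seed(0)
        |   thresholds = [random.gauss(30, 10) for _ in range(10000)]
        |   for offer in (1, 20, 30, 40, 50):
        |       avg_a = sum(play_ultimatum(offer, t)[0]
        |                   for t in thresholds) / len(thresholds)
        |       print(f"offer ${offer:>2}: A keeps ${avg_a:.2f} on average")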
        
         | asah wrote:
         | ...in the real world, A tells B that he "sourced" the deal and
         | therefore deserves a bigger cut and in the real world, B agrees
         | up to a point (the $30 mark). Over time and rounds of playing
         | the game, the A's of the world learn where the line is and
         | optimize to stay on the correct side of it, only testing the
         | other side 1-2% of the time to see if rules/behavior has
         | changed.
        
         | splwjs wrote:
         | It's crazy how most political or economic systems would very
         | obviously collapse in the real world almost instantly without
         | some kind of voluntary moral contract (explicit or implied),
         | yet we've got huge clumps of people demonizing one system or
         | another based on the context of what happens when you implement
         | it in a morally dead societal context.
         | 
         | Like there are a ton of people who smirk at your last paragraph
         | and go "nuh uh, hashtag late stage capitalism"
        
           | roughly wrote:
           | A hundred percent. I've said this elsewhere, but a primary
           | problem for at least American society at this point is we
           | don't have a commonly-agreed upon moral system other than the
           | market - things like Martin Shkreli buying drugs people need
           | to live and jacking the price up are Bad, but we don't have a
           | common language for describing why it's immoral, whereas our
           | only real common shared language, the market, is basically
           | fine with it as long as it's legal. A lot of the market logic
           | works fine for society within constraints - optimize your
           | costs, but not at the expense of your workers; increase your
           | prices if you can, but don't be a ghoul about it; lobby for
           | your position, but don't just buy a supreme court judge.
        
         | renewiltord wrote:
         | If you iterate the game, it's obvious. I, as the responder,
         | control the proposer's income. Extend to infinity with
         | knowledge of iteration and you reach symmetry between proposer
         | and responder.
        
           | roughly wrote:
           | > If you iterate the game, it's obvious.
           | 
           | We're shockingly bad at doing this in modern society. Our
           | temporal planning horizon is somewhere between 6 months and 5
           | years, whereas our lifespans are around 75-80.
        
         | RyanAdamas wrote:
         | This reminds me of Lord of the Flies. The real version of the
         | events turned out very differently.
         | 
         | https://www.newsweek.com/real-lord-flies-true-story-boys-isl...
        
           | roughly wrote:
           | Rebecca Solnit wrote a book, "A Paradise Built in Hell", on
           | how people behave during disasters, and found broadly the
           | same thing - contra the prepper myths, most people most of
           | the time faced with disaster come together to work
           | cooperatively to help each other.
           | 
           | We're a fundamentally social species - we've got smaller
           | brains than Neanderthals did, we're not a particularly tough
           | species, but we're very, very good at cooperating with each
           | other.
        
         | snakeyjake wrote:
         | Game theory only applies to sociopaths and economists, but I
         | repeat myself.
        
         | samatman wrote:
          | > _It turns out you _could_ screw them over and they'd_
         | 
         | End up with a dollar in their pocket which they otherwise
         | wouldn't have.
         | 
         | The Ultimatum game is a useful insight into human psychology:
         | for one thing, it tells us who thinks that the defector in this
         | equilibrium is better off than a counterfactual cooperator.
         | 
         | Ah, but they have their pride! Ok. My pride is not affected by
         | someone else having 99 bucks they didn't earn, and myself $1
         | likewise. Maybe that other fellow really needed the money.
        
           | roughly wrote:
           | Indeed. You're very wise!
        
           | imtringued wrote:
           | I don't know what the hell you're talking about. Your
           | argument is incoherent. If you wanted to allocate the money
           | according to the individual's utility of money, then a rule
           | of thumb of $1 is going to be wrong. You should, given no
           | information, assume that both have the same utility of money
           | and that the utility of money is diminishing, favouring an
           | even split.
        
         | partypete wrote:
         | You may be interested in some of the foundational papers
         | exploring game theory models similar to the Ultimatum
         | game[1][2]. These are known as Iterated Prisoner's Dilemmas.
         | 
         | ---
         | 
         | [1] The Evolution of Cooperation (https://ee.stanford.edu/~hell
         | man/Breakthrough/book/pdfs/axel...)
         | 
         | [2] Evolutionary Dynamics of Spatial Games (https://www.science
         | direct.com/science/article/abs/pii/016727...)
        
         | ball_of_lint wrote:
         | Each player can limit the other's income to $0 - the offerer
         | can offer $0 and the receiver can reject any deal.
         | 
         | So then what's optimal? $50 seems obviously fair, but does that
         | mean we ought to reject offers of $49 100% of the time? Not
         | quite, to limit the opponent's expected income for an offer of
         | $49 to $50 instead of the $51 they left for themselves, we can
         | use a mixed strategy that only accepts the offer with
         | probability 50/51. Extending that gives the opponent a benefit
         | curve that is linear as they leave themselves more money up to
         | $50 and then flat at $50 afterwards.
         | 
         | That's good, but we can make it better - if we accept offers
         | for $X<$50 with probability 50/(100-X) - epsilon*(50-X), then
         | their expected benefit curve is smooth and has a peak at $50,
         | which is the most we can expect to make except against a
         | generous opponent.
         | 
         | After all that, playing this game as stated against an unknown
         | opponent there's a lot of uncertainty. Maybe all your opponents
         | are entirely irrational and move at random. Maybe all your
         | opponents have colluded and decided that $66 for the offerer
         | and $34 for the receiver is fair and that's the only deal
         | they'll make. But if you think that random actors in the
         | universe are reasonably intelligent and can discover the
         | equilibrium above with the thought worth putting into this
         | Ultimatum game, the receiver strategy above properly aligns
         | incentives.
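          | 
          | A minimal sketch of that acceptance rule, for anyone who
          | wants to see the payoff curve; the epsilon value and the
          | assumption that offers of $50 or more are always accepted
          | are mine:
          | 
          |   def p_accept(offer, eps=1e-3):
          |       # Mixed strategy from above: accept offers below $50
          |       # with probability 50/(100-offer) - eps*(50-offer).
          |       if offer >= 50:
          |           return 1.0
          |       return 50 / (100 - offer) - eps * (50 - offer)
          | 
          |   def proposer_take(offer, eps=1e-3):
          |       # Expected amount the proposer keeps for a given offer.
          |       return (100 - offer) * p_accept(offer, eps)
          | 
          |   # The expected take rises smoothly toward $50 and peaks
          |   # at an even split, as described above.
          |   for offer in (0, 10, 30, 49, 50, 60):
          |       print(offer, round(proposer_take(offer), 2))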
        
       | betenoire wrote:
       | Seems analogous to the effect of echo chambers on humans
        
         | AnimalMuppet wrote:
         | Or navel-gazing. In fact, that's one of the classically known
         | flaws. (So well known that it has many names: ivory tower,
         | navel gazing, getting stuck in your own head...)
         | 
         | If you don't compare your thoughts to the outside world, it's
         | easy for them to diverge more and more from reality.
        
           | betenoire wrote:
           | you are right, navel-gazing describes it perfectly
        
           | tensor wrote:
           | It's important to note that outside world means the actual
           | world, not the thoughts of other humans. You need a way to
           | establish ground truth, which comes from observing the actual
           | outcome of actions and experiments.
        
         | hprotagonist wrote:
         | I Am Sitting In A Room
         | https://en.wikipedia.org/wiki/I_Am_Sitting_in_a_Room
        
       | padraicmahoney wrote:
       | This seems extremely interesting, but I don't have the time right
       | now to read this in depth (given I would also need to teach
       | myself a bunch of technical concepts too).
       | 
       | Anyone willing to weigh in with a _theoretical intuition_? The
       | one in the paper is just a little inaccessible to me right now.
        
       | jlos wrote:
        | As far as I understand Douglas Hofstadter's Gödel, Escher, Bach -
        | self-referential recursive structures (strange loops) are the
        | foundation of consciousness (among other interesting things).
        | I've been watching to see if LLMs becoming self-referential
        | actually improves them as opposed to degrading them.
        
         | mensetmanusman wrote:
         | The interesting thing about loops is that they can generate
          | fields (think motion of current generating a magnetic field).
         | 
         | Consciousness is more like a field than like a particle (which
         | are also fields), but we haven't determined how conscious
         | fields fit in physics models.
        
       | hiddencost wrote:
       | A lot of these papers are wrong. They do something wrong in their
        | setup and then claim their conclusion shows some general truth.
       | 
        | Publishing in Nature in ML can actually be a red flag, because
       | they're really not well equipped to evaluate a lot of claims.
       | 
        | The latest Llama model got a lot of its data using labels from
        | Llama 2, and every frontier lab is talking about self-training as
       | the future.
        
         | slashdave wrote:
         | Who are "they"? And do you actually believe the practice of
         | publishing unvetted preprints is a good thing in ML research?
        
           | hiddencost wrote:
           | Non sequitur? I never said that.
           | 
           | Good venues include main track NeurIPS, ICML, ACL, e.g.
           | 
           | Nature is notorious for publishing PR pieces that don't
           | reproduce, and their ML theory publishing has been quite
           | poor. They do pretty well on things like AlphaGo, materials
           | science, or weather modeling because it's more in their
           | wheelhouse and the results don't require a deep understanding
           | of info theory or ML practice.
        
             | slashdave wrote:
             | Those venues have huge issues with referees. It comes down
             | to who is reviewing the work.
             | 
             | The irony in your comment is that it is related to the
             | paper we are discussing. There is a big problem with
             | poisoning from group-think and self reinforcement in
             | current ML research.
        
       | anon291 wrote:
       | Is this an artifact of floating point precision or a fundamental
        | mathematical truth?
        
         | slashdave wrote:
         | Floating point precision is not involved (most LLM models still
         | function after floating-point quantization).
         | 
         | I am puzzled that some find this result at all surprising. You
         | simply cannot generate information from nothing.
        
           | anon291 wrote:
           | I'm not surprised you can't use it to make it better, but one
           | might imagine gradients would go to zero as you fed the model
           | its own output.
        
             | slashdave wrote:
             | No, not even close. Gradients don't come to zero in the
             | first place. Training is never perfect.
        
               | anon291 wrote:
               | Let's restate. I'd imagine you end up in local minima
               | that are difficult to escape using model generated data.
               | So sure, non-zero gradients, but if you plot the
               | gradients, I would expect them to orbit at that point.
               | But it seems like they diverge.
        
               | slashdave wrote:
               | Mini-batches and dropout mean that you are constantly
               | jumping out of and into other minima during training of
               | any type (highly-redundant solution space is an important
               | feature of deep learning). This is deliberate and
               | necessary to explore the gigantic parameter space of
               | these huge LLM models.
        
         | skybrian wrote:
         | It's a lossy transformation, so you're losing information each
         | time. It's never going to add information.
         | 
         | However, some information is junk that obscures the good stuff.
         | It's likely that how they train today is very inefficient
         | compared to what's possible, and there will be smarter ways to
         | transform preexisting data so that it's a better dataset to
         | train on, without losing very much.
         | 
         | Papers like this one show what not to do.
        
           | visarga wrote:
           | > there will be smarter ways to transform preexisting data so
           | that it's a better dataset to train on, without losing very
           | much
           | 
           | Like, take for example search. Instead of training on a bunch
           | of scraped texts, you take one prompt, select 10 references,
            | and use them to synthesize an answer. Referencing multiple
           | texts gives you more than training on them directly. The LLM
           | could catch contradictions, observe the distribution of human
           | opinions, note if the topic is controversial. And then output
           | a wikipedia-like article. Do this billions of times, and you
           | got a refined dataset. You can iterate on top, using the
           | articles as source and writing meta articles. Or just silly
           | studies like writing a paper about "Characters named Charlie
           | in literature". You can slice and dice the data in any way,
           | and analyze the cross section.
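            | 
            | A rough sketch of that loop, with stand-in functions where
            | a search index and an LLM call would go (the function names
            | and the grounding check are my own assumptions):
            | 
            |   def retrieve(prompt, corpus, k=10):
            |       # Stand-in retrieval: keep docs sharing a word with
            |       # the prompt; a real system would use a search index.
            |       terms = set(prompt.lower().split())
            |       hits = [d for d in corpus
            |               if terms & set(d.lower().split())]
            |       return hits[:k]
            | 
            |   def synthesize(prompt, refs):
            |       # Stand-in for the LLM call that cross-checks the
            |       # references and writes a wikipedia-like article.
            |       return f"[article on '{prompt}' from {len(refs)} refs]"
            | 
            |   def refine(prompts, corpus):
            |       out = []
            |       for p in prompts:
            |           refs = retrieve(p, corpus)
            |           if refs:  # only keep grounded outputs
            |               out.append(synthesize(p, refs))
            |       return out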
        
       | Kuinox wrote:
       | Meanwhile OpenAI, Anthropics, trains on AI generated data to
       | improve their models, and it works.
       | 
       | https://openai.com/index/prover-verifier-games-improve-legib...
       | 
       | https://www.anthropic.com/research/claude-character
        
         | echelon wrote:
         | I'm long on synthetic data.
         | 
         | If you think about evolution and hill climbing, of course it
         | works.
         | 
         | You have a pool of information and you accumulate new
         | rearrangements of that information. Fitness selects for the
         | best features within the new pool of data (For primates,
         | opposable thumbs. For AI art, hands that aren't deformed.) It
         | will naturally drift to better optima.
         | 
         | RLHF, synthetic data, and enrichment are all we need.
        
           | ainoobler wrote:
           | Are you sure about this? It's well known that cannibalism in
           | animals leads to degenerative disorders.
        
             | autokad wrote:
              | I think the direct action of a person taking their ideas and
              | thoughts and going through them many times (making changes /
              | updates / fixes) fits better than eating something.
              | However, I do think you still need some form of validation
              | data to ensure these are good changes.
              | 
              | However, I do get the spirit of the article: as more of the
              | information generated online is produced by LLMs, the
              | validity and usefulness of the output decreases.
        
               | ainoobler wrote:
               | What exactly is doing the validation?
        
               | autokad wrote:
               | depends on what one was doing. could be as simple as re-
               | writing a sentence and asking someone if it looks better
        
             | photonthug wrote:
             | Not sure why you're downvoted, I think a comparison with
             | prions seems apt and interesting, and bad protein copies
             | that can replicate is essentially an information process.
             | GAN research in recent years showing how you can sabotage a
             | working dog/cat classifier with a one pixel change feels
             | similar to how the tiniest parts of large systems can
             | sometimes undermine the whole completely, albeit with low
             | probability. And finally, since models will bootstrap
             | models that bootstrap models, inevitably there are already
             | subtle issues out there in the wild that may have an
             | incubation period of many years before the downstream
             | effects are completely clear.
        
               | ainoobler wrote:
               | The problem is systemic. People believe that the pursuit
               | of monetary and financial profits by corporations will
               | lead to the creation of benevolent artificial
               | intelligence. I personally think this is essentially a
               | religion because it is obvious that the pursuit of
               | profits can not actually create anything benevolent, let
               | alone intelligence.
        
           | throwup238 wrote:
           | _> If you think about evolution and hill climbing, of course
           | it works._
           | 
           | You don't even need to go that far. How do most children
           | learn? By reading textbooks and listening to lesson plans
           | assembled by their teachers from all the relevant content the
           | teachers have experienced.
           | 
           | Our education systems are built on synthetic data that is
           | created for optimized learning, so that every child doesn't
           | have to prove the universe from scratch to learn some basic
           | maths.
        
             | samatman wrote:
             | That isn't synthetic data in any reasonable or meaningful
             | sense of the term.
             | 
             | You could describe a textbook as a synthesis, sure, in a
             | sense which absolutely does not track with the 'synthetic'
             | in 'synthetic data'.
             | 
             | Unless the textbook is AI-generated, and I expect that in
             | 2024, the number of AI-generated textbooks is not zero.
        
               | throwup238 wrote:
               | It's an analogy. The learning materials teachers create
                | for students are very much like synthetic data, they're
                | just not assembled from algorithmic output.
        
               | lanstin wrote:
                | Kids learn walking, talking, reading, arithmetic and
                | physics by doing things in the physical world. Adults
                | may speak differently to kids than to other adults, but
                | it's a stretch to say it's synthetic. Equivalent to
                | synthetic would be a group of kids that just grew up
                | together and made up a novel language.
        
             | sunnybeetroot wrote:
             | By this reasoning wouldn't all information that you didn't
             | discover yourself be synthetic data?
        
               | throwup238 wrote:
               | Yeah and that's why we call it "standing on the shoulders
               | of giants." Humans went through tons of trial and error
               | in every facet of life to get where we are today. We kept
               | the stuff that worked and taught it.
               | 
               | But before humans can understand enough language to
               | ingest that synthetic data, they do a lot of their own
               | discovery based training where they learn about the world
               | physically and absorb the language people around them
               | use, kind of like throwing random internet data at an
               | LLM.
        
             | nick486 wrote:
             | the equivalent here would be a child learning from a
             | textbook he has written himself.
             | 
             | not sure how effective that would be, if it was his only
             | source of learning.
        
               | throwup238 wrote:
               | Well that's what the TFA is about. If you
               | indiscriminately ingest synthetic data into training -
               | the child learning from their own textbook - the model
               | collapses.
               | 
               | The SOTA is to use a discriminator (often another LLM or
               | ML algo) to select the best output before feeding it into
               | the training data. That's what OpenAI, Anthropic, et al
               | have been doing. One of them just published a paper about
               | it a few weeks ago.
        
           | Salgat wrote:
           | Synthetic data has to work if we hope to have ML models that
           | can improve themselves in a similar fashion as humans when it
           | comes to advancing knowledge.
        
           | randcraw wrote:
           | Data created automatically is not the same as human curated
           | data, though both are synthetic. Auto-created data often
           | suffers from a host of demerits (duplication, bias, error,
           | unnatural distribution, irrelevance to learn the intended
           | domain, etc, etc). Human curated data usually avoids these
           | pitfalls, and thus is far more valuable when training --
           | otherwise all human teachers would be equally good. So auto-
           | vs curated- data are incomparable when training naive
           | neophytes like ML models, or children.
        
           | HPsquared wrote:
           | It's important to climb the right hill.
        
             | psb217 wrote:
              | And to have a very well-tuned sense of up vs down when the
             | hill is almost flat...
        
           | kjkjadksj wrote:
            | This misunderstands fitness. It's not a sure bet that what is
            | most optimal is what you see. "Good enough" given environmental
           | context is what you see. Just like with certain crystal
           | structures in chemistry, you may only be in a localized
           | threshold of fitness stability that is not necessarily
           | optimal, but separated from another optimal configuration by
           | having suboptimal intermediary steps that need more
           | activation energy to overcome before falling into a state
           | with lower entropy (or more optimal fitness).
           | 
           | In other words you can never be sure if synthetic data is any
           | good or if what things gravitate toward are really most
           | optimal.
        
             | echelon wrote:
              | > It's not a sure bet that what is most optimal is what you see.
             | 
              | I wouldn't ever make "most optimal" a criterion. We're
             | looking for measurable improvements, not a jump to god
             | emperor or apex predator.
             | 
             | > you may only be in a localized threshold of fitness
             | stability that is not necessarily optimal, but separated
             | from another optimal configuration by having suboptimal
             | intermediary steps that need more activation energy to
             | overcome before falling into a state with lower entropy (or
             | more optimal fitness).
             | 
             | Optimization is like that. But unlike genetics, where we
             | can't re-route the recurrent laryngeal nerve or change
             | fundamental biochemistry, these are engineered systems
             | where we can set up wildly different experiments at any
             | time. Just to cite one of many different research threads,
              | there's now research going into developing models from
             | small scale training data.
             | 
             | > you can never be sure if synthetic data is any good or if
             | what things gravitate toward are really most optimal.
             | 
             | We can know if the synthetic data is _better_. We have
              | objective measures, a scientific process, and we'll always
             | be striving for improvement.
        
           | mcswell wrote:
           | The paper is not talking about verifiable synthetic data
           | generated by some means other than LLMs.
        
         | nyrikki wrote:
         | Cheese and Chalk.
         | 
         | It is very different to generate synthetic datasets to assist
          | in targeted training, vs ingesting LLM output from web
         | scraping.
        
           | abeppu wrote:
           | Yes -- said another way, if you're an ML researcher and you
           | have human-provided (scraped) data, and an ability to
           | generate synthetic data, then until recently, you had a
           | controllable parameter: how much of your training data for
           | your new model should be synthetic? You can vary this, run
           | multiple experiments, and choose how much synthetic data to
           | use -- and you can vary the specific configs about how that
           | synthetic data is generated.
           | 
           | If synthetic data is mixed into your upstream data sources
           | _in a way you cannot control_ , then your ML team loses a
           | valuable controllable parameter.
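            | 
            | Concretely, the lost knob looks something like this (a
            | minimal sketch; the dataset contents and sizes are made up):
            | 
            |   import random
            | 
            |   def mix(human, synthetic, synth_frac, n, seed=0):
            |       # Build a training set of size n with a chosen share
            |       # of synthetic examples (the controllable parameter).
            |       rng = random.Random(seed)
            |       k = int(n * synth_frac)
            |       batch = (rng.choices(synthetic, k=k) +
            |                rng.choices(human, k=n - k))
            |       rng.shuffle(batch)
            |       return batch
            | 
            |   # Sweep the fraction to run controlled experiments.
            |   human = ["human example"] * 100
            |   synth = ["synthetic example"] * 100
            |   for frac in (0.0, 0.25, 0.5):
            |       b = mix(human, synth, frac, n=1000)
            |       print(frac, sum(x.startswith("synth") for x in b) / 1000)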
        
             | miki123211 wrote:
              | You still have some of that control, but in a much more
             | indirect way.
             | 
             | There are three kinds of data now, synthetic, pre-2022 and
             | current. Everything pre-2022 is definitely written by
             | humans, synthetic data is still synthetic, and post-2022 is
             | a mix of both.
             | 
             | I wouldn't be surprised if "AI detectors" work somewhat for
             | this use case. They're biased, far from accurate and a
             | terrible idea if you need to make important decisions (like
             | whether to expel a student for cheating), but there's quite
             | a large room for errors here.
        
               | uberswe wrote:
               | > Everything pre-2022 is definitely written by humans
               | 
                | I'm not sure if methods like article spinning count as
               | written by humans. This is something you could automate
               | before AI and it would take a human written article and
               | randomly swap words with similar meaning throughout to
               | make it seem original.
        
               | derefr wrote:
               | Don't forget machine-translated texts, where until ~2017
               | the translation was likely done by something much dumber
               | / semantically lossy than an LLM, and after 2017 was
               | basically done by an early form of LLM (the Transformers
               | architecture originating in Google Translate.)
               | 
               | Many historical English-language news reports published
               | on the English-language websites of foreign news media
               | from non-English-speaking countries, from 1998 (Babelfish
               | era) to ~a few months ago, may be unreliable training
               | data for this reason.
        
               | mrbungie wrote:
               | They do work detecting LLM outputs that are sampled
               | "naively" (when the model/user is really not trying to
               | pass it as human output).
               | 
                | I copied a prompt translated from Spanish to English
                | using ChatGPT Plus in a GPT-4o Azure OpenAI Service
                | endpoint. It did work in Spanish but didn't run in
                | English because the default AOS Content Filters detected
               | a jailbreak intent. It was quite weird.
        
           | lawlessone wrote:
           | I think this is it.
           | 
           | Generated data is ok if you're curating it to make sure
           | nothing bad, wrong or insensible comes in.
           | 
           | Basically still needs a human in the loop.
        
             | surfingdino wrote:
             | Then why not remove this crap (LLMs) from the loop
             | altogether? How did we get from "AI will replace you" to
             | "your new job will be an AIs janitor" in the space of about
             | 12 months?
        
               | sieste wrote:
                | There is nothing wrong with being a janitor. You could
                | also call it "AI editor" instead, if you want to insert a
                | job title that sounds more prestigious. Some people find
                | it easier and more enjoyable to edit a first draft
                | generated by a language model based on instructions than
                | writing that first draft themselves.
        
               | the8thbit wrote:
               | I gotta say, Claude is a godsend for building out quick
               | prototypes of ideas, especially when those ideas require
               | domain specific knowledge that you know _a little_ about
                | but aren't specialized in. Which describes most
                | interesting programming projects.
               | 
               | Sure, I could do it myself, but it would take more time,
               | each step would have less momentum, and I'd have to think
               | more while I do it. Which, there's a place for that too,
               | of course.
        
               | visarga wrote:
               | > Sure, I could do it myself, but it would take more
               | time, each step would have less momentum, and I'd have to
               | think more while I do it. Which, there's a place for that
               | too, of course.
               | 
               | You just start faster, but end at the same time. If you
               | really need to understand something there is no LLM
               | shortcut. I spent hours interrogating Claude, in the same
               | time I could have studied from a book and gotten even
               | better grounding.
        
               | lawlessone wrote:
               | >Then why not remove this crap (LLMs) from the loop
               | altogether
               | 
               | Because reading is faster than writing.
               | 
               | Someone could spend a few years or even most of their
                | life writing a book that can be read in a matter of hours,
               | days or weeks.
               | 
               | Humans writing have to proofread their own work. Or
               | occasionally even pay someone else to do it.
        
             | abeppu wrote:
              | No, bad/wrong/nonsense is _not_ the only risk here. You're
             | missing the main point that the authors are making: the
             | shape of the distribution gets changed by this process. A
             | model trained on human data will produce fewer high-
             | perplexity examples than it was trained on (you can see
             | this in Fig 1b, even between generation 0 and 1). In a
             | literal information theory sense, these perplexity values
             | indicate how much information is in each example. Over
             | successive generations models have less actual information
             | to learn from even if they have the same volume of text.
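              | 
              | A toy version of that effect, with a categorical
              | distribution standing in for a language model (the token
              | counts and number of generations are arbitrary choices).
              | It tracks how many token types survive and how much
              | entropy is left in each generation:
              | 
              |   import math, random
              |   from collections import Counter
              | 
              |   def refit_and_sample(data, n):
              |       # "Train" by counting, then sample a new
              |       # generation from the fitted distribution.
              |       c = Counter(data)
              |       toks = list(c)
              |       w = [c[t] for t in toks]
              |       return random.choices(toks, w, k=n)
              | 
              |   def bits(data):
              |       c, n = Counter(data), len(data)
              |       return -sum(v/n * math.log2(v/n) for v in c.values())
              | 
              |   random.seed(0)
              |   # Gen 0: a long-tailed "human" distribution.
              |   weights = [1/(i+1) for i in range(50)]
              |   data = random.choices(range(50), weights, k=2000)
              |   for gen in range(6):
              |       print(gen, len(set(data)), round(bits(data), 2))
              |       data = refit_and_sample(data, 2000)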
        
               | visarga wrote:
                | LLMs are milking us of knowledge and skills, repackaging
                | them and giving them back to us. Models interact with the
               | internet, humans and code execution. They are exploring.
               | Lots of exploring now happens in the chat room, a place
               | where ideas are first tried out. With billions of users,
               | the volume of information LLMs collect from us is huge.
               | We bring references, guidance and feedback right into its
               | mouth, the LLM doesn't even need to do anything like
               | crawling.
               | 
               | Imagine how many things we know, things we accumulated in
               | our life experience, that were never written down
               | anywhere. That information was lost to others. But now we
               | use LLM assistants, so they get to be in the loop and
               | collect tidbits of human life experience that is not
               | written on the internet. And soon they will also work on
               | audio/video and travel with us everywhere, seeing what we
               | show them.
        
               | Pinkthinker wrote:
               | I think that maybe we are too harsh in expecting LLMs to
               | be perfect. If they are based off of human input that is
               | incorrect then we might propagate such errors. But they
               | will still be quicker and much more reliable than most
               | people. Isn't this good enough? After all, we are willing
               | to accept flaws in people, even including the president.
               | I suspect that the way forward will be to progressively
               | clean the LLM input data as each error gets identified.
        
             | visarga wrote:
             | > Basically still needs a human in the loop.
             | 
             | Yes, and big LLM developers have millions of humans in the
             | loop. That's why they provide free access, for human in the
             | loop filtering & guidance.
             | 
             | If I go to chatGPT and solve a coding task, maybe the first
             | 3 ideas don't work and the 4th works. It can do RLHF
             | setting the first 3 with negative and the fourth with
             | positive score. They just used me to test their model and
             | create a datapoint.
             | 
             | Using LLM is useful both ways - for humans, we get
             | assistance, and LLMs get feedback for their outputs. This
             | seems like the new form of "you are the product".
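              | 
              | The data point they get out of such a session might look
              | roughly like this (a sketch; mapping acceptance to a
              | +1/-1 score is my own assumption):
              | 
              |   def label_session(prompt, attempts, accepted_idx):
              |       # Attempts the user pushed back on get a negative
              |       # label, the accepted one a positive label.
              |       return [(prompt, a, 1 if i == accepted_idx else -1)
              |               for i, a in enumerate(attempts)]
              | 
              |   session = label_session(
              |       "write a dedup function preserving order",
              |       ["attempt 1 (buggy)", "attempt 2 (buggy)",
              |        "attempt 3 (buggy)", "attempt 4 (works)"],
              |       accepted_idx=3)
              |   for prompt, answer, score in session:
              |       print(score, answer)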
        
           | mcswell wrote:
           | Yeah, I raised the same issue before reading your post;
           | ninja'd I am.
           | 
           | I like your "cheese and chalk".
        
             | ddingus wrote:
             | I always preferred sugar and shit. Obviously, that is
              | profane. But I consider that profanity should be seen as
              | the part of speech it really is.
        
               | groby_b wrote:
               | Profane is, if you will, fucking fine.
               | 
               | But "cheese and chalk" is a great analogy because both
               | are sources of calcium, but cheese is much better for the
               | human body. It carries useful info.
        
               | progmetaldev wrote:
               | Also drives your point home more efficiently. While it
               | may be profane, there's far more speech available with
               | far less "use" that is intentionally profane to spark a
               | reaction without regard to what that reaction may be.
               | Shock value for attention, rather than to carry home a
               | point.
        
         | HanClinto wrote:
          | Keep in mind that the Prover-Verifier game is not training on
          | AI-generated data (as if to imitate it) -- rather,
         | it's training against a discriminator that verifies for
         | correctness (a calculator) and understandability (a smaller,
         | less-capable language model). You can think of this as a
         | distillation method, but it's not like it's generating large
         | amounts of source data and then retraining on it. This method
         | only works on specific problems where there is an absolute
         | right answer that can be verified with an independent heuristic
         | (in this case, a math calculation).
         | 
         | However, there is a lot of potential in the world of self-play
         | and adversarial-training to improve the quality of our LLMs
         | with true reinforcement learning.
         | 
         | For one recent paper on this topic, also check out SPAG -- I
         | found this one to be fascinating:
         | 
         | https://github.com/Linear95/SPAG
         | 
         | I've been keeping notes on this topic in a WIP paper, and if
         | you'd like to read my (rambling) ravings about it, you can find
         | more info here:
         | 
         | https://github.com/HanClinto/MENTAT
         | 
         | I think that self-play and reinforcement learning are going to
         | absolutely be important for the next level of LLM development.
         | If you use AI-generated data, then you must have an objective
         | metric to verify "goodness". Nothing is free, and simply asking
         | an LLM to rate the quality of its own data is not going to cut
         | it. I think that's the point of the article.
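          | 
          | For the "objective metric" point, here is a minimal sketch of
          | verifier-filtered generation; the arithmetic task and the
          | fake generator's error rate are stand-ins of my own:
          | 
          |   import random
          | 
          |   def generate(a, b):
          |       # Stand-in for a model's answer to "a + b": usually
          |       # right, sometimes off by one; not trusted blindly.
          |       ans = a + b
          |       return ans if random.random() > 0.2 else ans + 1
          | 
          |   def verified(a, b, proposed):
          |       # Independent heuristic with an absolute right answer.
          |       return proposed == a + b
          | 
          |   random.seed(0)
          |   kept = []
          |   for _ in range(1000):
          |       a, b = random.randint(0, 99), random.randint(0, 99)
          |       p = generate(a, b)
          |       if verified(a, b, p):   # only verified outputs survive
          |           kept.append((f"{a}+{b}=", str(p)))
          |   print(f"kept {len(kept)} of 1000 generated examples")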
        
         | visarga wrote:
         | > Meanwhile OpenAI, Anthropics, trains on AI generated data to
         | improve their models, and it works.
         | 
         | They got a secret ace in their pocket - chat logs created with
         | human in the loop. Of course those might still have errors, but
         | much fewer. They can infer from a human response if it was
         | accepted or not.
         | 
         | I think OpenAI generates at least 1B sessions per month and 2
         | Trillion interactive tokens. Those can go into the LLM again
         | for analysis and synthetic content generation, or for RLHF with
         | the whole conversation as guidance. Having access to the
         | following interactions can shed light on previous answers.
         | 
         | Even more, they can correlate chats across days, presumably
         | humans try out LLM ideas in reality and return for iteration.
         | That way LLMs indirectly get real world grounding.
        
           | mrbungie wrote:
           | This is likely one of the main reasons why they're offering
           | ChatGPT for free and running ChatGPT Plus at a loss.
        
       | megaman821 wrote:
       | There are other ways AI can help train other AI that aren't
       | generating data. AI could remove low quality data from a training
       | set. It could assist humans in structuring video, 3D and physics
       | simulation datasets for the best learning results.
        
       | simonw wrote:
       | > We find that indiscriminate use of model-generated content in
       | training causes irreversible defects in the resulting models
       | 
       | The key word there is "indiscriminate". All of the big AI labs
       | have been training on synthetic data for at least a year at this
       | point, but they're doing so deliberately.
       | 
       | I don't think the "model collapse" problem is particularly
       | important these days. The people training models seem to have
       | that well under control.
        
         | jagged-chisel wrote:
         | I find nothing wrong with your statement. I am curious about
         | the paper's use of "indiscriminate." I read this as "just feed
         | the AI more AI output without care" which one can indeed do
         | deliberately.
         | 
         | Seems to me that deliberate _discriminate_ use should yield
          | better results against expectations.
        
         | __jl__ wrote:
         | Came here to say the same. "indiscriminate" doesn't really make
         | sense. It's very deliberate.
         | 
         | However, there is one scenario: Scraping of web data. In that
          | case, AI labs might not know what is model generated.
        
         | mcswell wrote:
         | The question (which I raised in a top-level comment before
         | reading your post) is whether there is any such thing as
         | "discriminate" use of web data. Synthetic data created in the
         | same lab as the LLM is discriminate, but what the authors of
         | the paper are saying (if I read it correctly) is that scraping
         | the web is not currently done in a discriminate way. And it's
         | not at all clear to me that there _is_ a discriminate way to
          | use web scraping, because you can't know for sure what's
         | human-generated and what's LLM-generated.
        
           | simonw wrote:
           | I get the impression that scraping the web isn't nearly as
           | important a source of LLM training data as it used to be.
           | 
           | Everyone is trimming down their training data based on
           | quality - there are plenty of hints about that in the Llama
           | 3.1 paper and Mistral Large 2 announcement.
           | 
           | OpenAI are licensing data from sources like the Associated
           | Press.
           | 
           | Andrej Karpathy said this:
           | https://twitter.com/karpathy/status/1797313173449764933
           | 
           | > Turns out that LLMs learn a lot better and faster from
           | educational content as well. This is partly because the
           | average Common Crawl article (internet pages) is not of very
           | high value and distracts the training, packing in too much
           | irrelevant information. The average webpage on the internet
           | is so random and terrible it's not even clear how prior LLMs
           | learn anything at all.
        
             | jroesch wrote:
             | I think this is roughly correct. My 2c is that folks used
             | the initial web data to cold start and bootstrap the first
             | few models, but so much of the performance increase we have
             | seen at smaller sizes is a shift towards more conscientious
             | data creation/purchase/curation/preparation and more
             | refined evaluation datasets. I think the idea of scraping
             | random text except maybe for the initial language
             | understanding pre-training phase will be diminished over
             | time.
             | 
             | This is understood in the academic literature as well, as
             | people months/years ago were writing papers that a smaller
              | amount of high quality data is worth more than a large
             | amount of low quality data (which tracks with what you can
             | pick up from an ML 101 education/training).
        
             | markwkw wrote:
             | You trim, yes, but AI content surely invades (all?) areas
             | of written material. People are increasingly using AI to
              | assist their writing. Even if it's for slight editing, word
             | choice suggestions.
             | 
              | Even AP doesn't ban the use of LLMs; its standards prohibit
             | direct publishing of AI-generated content. I'm sure its
             | writers leverage LLMs in some ways in their workflow,
             | though. They would probably continue to use these even if
             | AP attempted to ban LLMs (human incentives).
        
               | tensor wrote:
               | If the AI generated content is filtered for quality or is
               | corrected then it will still be good data. The phenomenon
               | of model degradation is only in the case where there is
               | no outside influence in the generated data.
        
               | progmetaldev wrote:
               | I think this is extremely important with AI generated
               | content, but seems to be given less and less thought as
               | people start to "trust" AI as it seeps into the public
                | consciousness more. It needs to be reviewed, filtered, and
               | fixed where appropriate. After that, it isn't any
               | different from reviewing data on your own, and wording it
               | in a way that fits the piece you're writing.
               | Unfortunately, there's so much trust in AI now that
               | people will go ahead and publish content without even
               | reading it for the correct tense!
        
           | smeagull wrote:
           | They needed to deal with degenerate data on the Web anyway.
           | It's always been full of trash and spam.
        
             | progmetaldev wrote:
             | I agree with you when it comes to training, but at the same
             | time, I think that's also the power we get with the web.
             | You can have a voice, even if others don't agree with you.
             | I don't think that should be taken away unless you are
             | inciting violence.
        
           | olejorgenb wrote:
           | At least some of the LLM generated content will be
           | vetted/selected for by a human being though.
        
         | mmazing wrote:
         | They make it clear in the paper that their primary "real-world"
         | concern is that it's difficult to distinguish synthetic data
         | from real human interaction when scraping data from the web.
         | This will only get worse over time with our current way of
         | doing things.
         | 
         | How are they supposed to deliberately train on synthetic data
         | when they don't know whether it is (synthetic) or not?
         | 
         | Also, do you not feel that it is presumptuous to dismiss a body
         | of work in a few sentences with a "seems fine to me"?
        
           | simonw wrote:
           | In this case I wasn't reacting to this specific paper so much
           | as to the widespread idea (at least that I've observed among
           | AI skeptics) that "model collapse" is a huge problem.
        
         | tremon wrote:
         | How do you "discriminate" data gathering at web-scale, though?
         | In my view, everything at web-scale only works because there
         | are no humans in the loop, as repeatedly explained here in
         | basically every thread involving Google or Facebook. Yes, since
         | it's a scientific paper they should have defined their usage of
         | the word, but I see nothing wrong with the basic premise that
          | automation at large-scale implies indiscriminate use of content.
        
         | Lerc wrote:
         | Or to consider the inverse of indiscriminate, selection.
         | 
         | Mutation = bad.
         | 
         | Mutation + selection = good.
         | 
         | (given enough iterations)
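          | 
          | A toy illustration of the difference (minimizing x^2 is just
          | a stand-in objective I picked):
          | 
          |   import random
          | 
          |   def run(generations=200, select=True):
          |       # With selection, mutations that hurt the score are
          |       # discarded; without it, the value just drifts.
          |       x = 1.0
          |       for _ in range(generations):
          |           cand = x + random.gauss(0, 0.1)
          |           if not select or cand ** 2 < x ** 2:
          |               x = cand
          |       return x
          | 
          |   random.seed(0)
          |   print("mutation only:       ", round(run(select=False), 3))
          |   print("mutation + selection:", round(run(select=True), 3))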
        
           | msoad wrote:
           | wow this is such a good point! Evolution is just that!
        
         | mvdtnz wrote:
         | > I don't think the "model collapse" problem is particularly
         | important these days. The people training models seem to have
         | that well under control.
         | 
         | And you base this on what? Vibes?
        
           | simonw wrote:
           | Basically yes. Vibes based on reading between the lines of
           | various papers, blog announcements and tweets from people
           | better informed than I am.
        
       | simonw wrote:
       | The source code that accompanies the paper is available in a zip
       | file here: https://zenodo.org/records/10866595
       | 
       | I copied that into a Gist to make it easier to browse here:
       | https://gist.github.com/simonw/b3ab1588a681dda821da9fb57290d...
        
       | bjourne wrote:
       | The article contains no proof of theorem 3.1 and finding
       | counterexamples seems trivial. Adult male weight can be modeled
       | by N(85, 20). You can recursively "train" the model on data it
       | generates without having it collapse. It will stay stationary as
       | long as the samples are large enough.
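        | 
        | A quick way to check this numerically (a minimal sketch; the
        | sample sizes and generation count are arbitrary choices):
        | 
        |   import random, statistics
        | 
        |   def recursive_fit(mu=85.0, sigma=20.0, n=100, gens=1000):
        |       # Repeatedly sample n points from the current N(mu, sigma)
        |       # and refit both parameters on those samples alone.
        |       rng = random.Random(0)
        |       for _ in range(gens):
        |           xs = [rng.gauss(mu, sigma) for _ in range(n)]
        |           mu = statistics.fmean(xs)
        |           sigma = statistics.pstdev(xs)  # MLE of the std dev
        |       return mu, sigma
        | 
        |   for n in (10, 100, 10000):
        |       mu, sigma = recursive_fit(n=n)
        |       print(n, round(mu, 2), round(sigma, 2))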
        
         | NaiveBayesian wrote:
         | I believe that counterexample only works in the limit where the
         | sample size goes to infinity. Every finite sample will have
          | m ≠ 0 almost surely. (Of course m will still tend to be very
          | close to 0 for large samples, but still slightly off.)
         | 
          | So this means the sequence of mᵢ will perform a kind of
         | random walk that can stray arbitrarily far from 0 and is almost
         | sure to eventually do so.
        
           | bjourne wrote:
           | Fair point about the mean, but I don't see how the random
           | walk causes the standard deviation to shrink towards zero.
        
             | lostmsu wrote:
             | I agree. The authors generate a dataset of a similar size
             | as the original and then train on that continuously (e.g.
             | for multiple epochs). That's not what you need to do in
              | order to get a new model trained on the knowledge of the
             | teacher. You need to ask the teacher to generate new
             | samples every time, otherwise your generated dataset is not
             | very representative of the totality of knowledge of the
             | teacher. Generating samples every time would (in infinite
             | limit) solve the collapse problem.
        
             | NaiveBayesian wrote:
             | Agreed, that's what I struggle to see as well. It's not
             | really clear why the variance couldn't stay the same or go
             | to infinity instead. Perhaps it does follow from some
             | property of the underlying Gamma/Wishart distributions.
        
         | mcguire wrote:
         | Does the Supplementary Information (starting on p. 4, for
         | example) help?
         | 
         | https://static-content.springer.com/esm/art%3A10.1038%2Fs415...
         | 
         | In your counterexample, can you quantify "as long as the
         | samples are large enough"? How many samples do you need to keep
         | the s.d. from shrinking?
        
           | bjourne wrote:
           | Maybe. "Overall, this only shows us how far on average we go
           | from the original distribution, but the process can only
           | 'terminate' if the estimated variance at a certain generation
           | becomes small enough, i.e. we effectively turn into a delta
           | function." Iiuc, variance is modeled as a random walk that
           | will sooner or later reach on zero. I'm not sure I buy that
           | because the variance "walks" orders of magnitudes slower than
           | the mean and is much more robust for large sample sizes.
        
       | FredPret wrote:
       | Data in --> slop out.
       | 
       | Slop in --> yikes
        
       | vzaliva wrote:
       | I call this "LLM inbreeding." It's a vicious loop where new
       | models are trained on AI-generated content, resulting in the
       | quality degenerating with each generation.
        
         | tiborsaas wrote:
         | I like this analogy. With the Cambrian explosion of LLMs, we
         | are getting into safe territory, aren't we? Aren't we?
        
       | dang wrote:
       | Related ongoing thread:
       | 
       |  _The problem of 'model collapse': how a lack of human data
       | limits AI progress_ -
       | https://news.ycombinator.com/item?id=41058867 - July 2024 (6
       | comments)
        
       | swayvil wrote:
       | There's a complexity missing there. It's like the effects of
        | incest upon DNA. Or an echo chamber upon conversation.
        
       | mcswell wrote:
       | I must be missing something. Training on the output of your
       | system as if it were validated input seems like an _obvious_ no-
        | no. I'm not talking about using synthetic data (however that
       | might be created in this situation), but rather using anything
       | and everything found on the web as if it were "real", i.e. as if
       | it were human-generated texts rather than the output of the LLM.
       | 
       | In this case of course there are multiple LLMs that are creating
       | text which finds its way to the web, but to the extent that the
        | outputs of the different LLMs have commonalities, this still seems
       | problematic.
       | 
       | And afaik, there are no metrics or algorithms that reliably
       | distinguish between human-generated and LLM-generated text, at
       | least not for the current generations of LLMs.
       | 
       | What am I missing?
        
         | visarga wrote:
         | > Training on the output of your system as if it were validated
         | input seems like an obvious no-no.
         | 
         | Imagine a scientist inventing theories without testing
         | anything, and then continuing to build on top. Crazy. Not even
         | humans can create absent some kind of feedback or validation
         | from outside. That's why we invented the scientific method.
        
         | meroes wrote:
          | Isn't that how math works in some respects, in that there's
          | only a hierarchy of consistency (no absolute consistency) for
          | most of math? And we just keep building and building. We tried
          | the absolute-consistency route and found it too limiting.
         | 
         | Maybe that this doesn't work for LLMs is a sign they aren't on
         | the path to AGI...
         | 
          | Personally I've found LLMs horrendous at this kind of stuff.
          | I'm basically an RLHF peon by trade, and if I ever need a
          | quick way to fool a model, I go to simple logical problems,
          | where it can't lean on external structures, only itself. I
          | don't mean logical syntax but logical reasoning. I can't share
          | recent stuff, but just a few months ago the models I work with
          | failed to reason that removing 12 cards from a regular deck
          | couldn't remove an entire suit. That kind of stuff. Why would
          | I want to make my prompt longer and more detailed to provide
          | it extra structure (which is logically superfluous) just to
          | ensure it gets the right answer? I'm sure a wordy prompt could
          | get it to the right answer, but I'm interested in its ability
          | to "reason", not prompt engineering.
         | 
          | Given that math is devoid of external structure, I wonder if
          | there's something to this (it's at least interesting to
          | speculate).
        
         | pocketsand wrote:
         | You would think so, but people like Sam Altman have suggested
         | that they can use AI-generated data to train their own models.
         | See here:
         | 
         | https://www.nytimes.com/2024/04/06/technology/tech-giants-ha...
        
           | Voloskaya wrote:
            | Training on AI-generated data isn't a problem, and has been
            | routinely done by everyone for 18+ months.
            | 
            | The issue is training on 'indiscriminate' AI-generated data,
            | which just leads to more and more degenerate results. No one
            | is doing that, however; there is always some kind of
            | filtering to select which generated data to use for
            | training. So the findings of the paper are entirely
            | unsurprising and, frankly, intuitive and already well known.
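            | 
            | As a rough sketch of what that selection step looks like
            | (score here is a hypothetical stand-in for whatever reward
            | model, classifier or human review is actually used):
            | 
            |     def select_for_training(samples, score, keep_fraction=0.1):
            |         # Rank generated samples and keep only the top slice;
            |         # the rest never reaches the next training run.
            |         ranked = sorted(samples, key=score, reverse=True)
            |         keep = max(1, int(len(ranked) * keep_fraction))
            |         return ranked[:keep]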
        
         | empath75 wrote:
          | It's _relatively_ easy, I think, to filter out sites with a
         | large proportion of low quality ai-generated glurge.
         | 
         | Then you're left with a lot of AI generated or assisted content
         | that has quite often been filtered and modified by humans, so
         | that might mitigate some of the problems that cause model
         | collapse because the filtered content _should_ better reflect
         | reality or desirable output?
        
         | mort96 wrote:
         | I mean a fair bit of content on Reddit and Twitter is machine
         | generated now, right? And content on Reddit and Twitter is
         | being used to train new models, right?
        
         | aezart wrote:
         | I think you're right. When I was experimenting with llama 1, I
         | was able to easily observe that with a short prompt and a long
         | response, the response _rapidly_ degraded the longer it went,
         | because it was seeing and amplifying the patterns in its
         | context window so far.
         | 
         | It is intuitively obvious that these problems would get even
         | worse if the garbage output found its way into the training
         | set, and not just into the context window.
        
       | jksmith wrote:
       | Given a time snapshot and enough computing power, isn't recursion
       | inevitable? It's like running out of known universe given time x.
        | So then we're back to creating data without a prior dataset, which
       | is still a human domain.
        
       | ziofill wrote:
       | Very interesting. But wouldn't human preferences still find their
       | way into the datasets of the future?
        
       | asadm wrote:
       | "Breathing in your own exhaust can be fatal"
        
       | throwthrowuknow wrote:
        | So they fine-tuned an existing model on its own completions to
        | produce the training set for the next run, which uses the
        | fine-tuned model as the base. They mention catastrophic
        | forgetting, so they are aware of it. I suppose they wanted to
        | get results as quickly as possible, but this isn't an accurate
        | model of reality (pun not intended). They've only succeeded in
        | demonstrating something that is already well known. If they had
        | made the effort to simulate mitigation of bad data and a growing
        | corpus that included proportionally more synthetic data over
        | time, it would have been more interesting.
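        | 
        | In the Gaussian toy from upthread, that kind of setup might look
        | something like the sketch below (purely illustrative; "real"
        | data here is just fresh draws from the true distribution, and
        | the synthetic share grows each generation):
        | 
        |     import numpy as np
        | 
        |     def refit_mixed(n_real, n_synth0, gens, seed=0):
        |         # Each generation fits to fresh real samples plus
        |         # samples drawn from the previous generation's fit,
        |         # with the synthetic portion growing over time.
        |         rng = np.random.default_rng(seed)
        |         mu, sigma = 0.0, 1.0
        |         for g in range(gens):
        |             real = rng.normal(0.0, 1.0, size=n_real)
        |             synth = rng.normal(mu, sigma, size=n_synth0 * (g + 1))
        |             pool = np.concatenate([real, synth])
        |             mu, sigma = pool.mean(), pool.std()
        |         return mu, sigma
        | 
        |     print(refit_mixed(n_real=1_000, n_synth0=100, gens=200))
        | 
        | In this toy the fresh real data acts as an anchor, so the fit
        | stays near the true distribution even as the synthetic share
        | grows; how far that carries over to LLM-scale training is
        | exactly what an experiment like the one suggested would need to
        | show.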
        
       | nostrademons wrote:
       | This has happened with much simpler models than LLMs, eg. Google
       | Suggest became noticeably worse when everybody started using
       | Google Suggest to input their queries, because it was trained on
       | real query logs and those query logs started to simply reproduce
       | the output of the Suggest model. SEO and Webspam have similar
       | problems within Google Search.
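        | 
        | That loop can be caricatured in a few lines (a toy sketch, not
        | how the real system worked): a "model" that just surfaces the
        | most frequent logged queries, users who sometimes click a
        | suggestion instead of typing what they actually wanted, and
        | logs that feed the next round.
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     V, K = 1000, 10         # vocabulary size, suggestions shown
        |     N, P = 10_000, 0.8      # queries per round, acceptance rate
        |     true_p = 1.0 / np.arange(1, V + 1)   # authentic popularity
        |     true_p /= true_p.sum()
        | 
        |     def entropy(c):
        |         p = c / c.sum()
        |         return -(p[p > 0] * np.log2(p[p > 0])).sum()
        | 
        |     counts = np.ones(V)                  # seed query log
        |     for _ in range(50):
        |         top_k = np.argsort(counts)[-K:]  # most frequent in log
        |         clicked = rng.random(N) < P
        |         log = np.where(clicked,
        |                        rng.choice(top_k, size=N),
        |                        rng.choice(V, size=N, p=true_p))
        |         counts = np.bincount(log, minlength=V).astype(float)
        | 
        |     print(entropy(counts), entropy(true_p))
        | 
        | The logged queries end up far more concentrated than the
        | authentic distribution, and whichever items happened to be
        | suggested early stay entrenched regardless of their real
        | popularity.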
       | 
       | More broadly, this is a reflection of Goodhart's Law: "When a
       | measure becomes a target, it ceases to be a good measure." The
       | issue is that any model's purpose is to capture novel, useful
       | data about real human behavior. Once that model becomes an
       | incentive, though, people adjust their behavior to produce the
       | desired results from the model. Authentic behavior disappears,
       | which means there's no useful information content for the model
       | to capture, and future generations of the model instead just
       | reproduce behaviors of the previous generation they were trained
       | on, including quirks. Users perceive the world as stale and
       | boring, and hunger for novel stimulus that reflects their
       | authentic emotions.
       | 
       | You could look at this as a full-employment theorem for
       | entrepreneurs and artists.
        
         | mcguire wrote:
         | From my reading of the paper, this is a pretty good description
         | of the problem they identify.
        
       | mcguire wrote:
       | _Nature_ published a computer science paper???!
       | 
       | " _Given that training a single moderately large model produces
       | twice the American lifetime's worth of CO2 (ref. 15), we opted to
       | not run such an experiment and instead focus on a more realistic
       | setting for a proof of concept._ "
        
       | igorkraw wrote:
       | It should be noted that
       | 
       | 1. this is nothing that should surprise anyone who has an
       | intuition on control theory and the evolution of unconstrained
       | markov chains
       | 
       | 2. there appear to be relatively easy mitigations
       | https://news.ycombinator.com/item?id=41061085 (made a separate
       | post because it might be of independent interest to discuss)
       | 
        | 3. you still won't get beyond the imitation game boundary
       | without exploration & feedback, i.e. the recursive improvement
       | doomers are, as of now, still wrong
        
         | mvdtnz wrote:
         | > 1. this is nothing that should surprise anyone who has an
         | intuition on control theory and the evolution of unconstrained
         | markov chains
         | 
         | You don't even need to know what a markov chain is. It is
         | intuitively obvious to anyone with two brain cells to rub
         | together that AI can't improve by eating its own vomit.
        
         | lemonwaterlime wrote:
         | I've been telling people this for the past few years. They
         | would like to find out the hard way what control theorists
         | already know.
        
       | nurettin wrote:
       | I don't see how this hurts training unless you hurl all
       | hallucinations back at the model.
       | 
        | AlphaZero used a similar approach where it trained against
        | itself, and that only made it better. I don't think collapse is
        | real.
        
       | zby wrote:
       | If the model collapse means that the text produced by it is not
       | statistically identical to the garbage that fills the Internet -
       | then I guess a collapse is the goal.
        
       | m3kw9 wrote:
        | Of course it will collapse if you don't verify it. I remember
        | OpenAI talking about its research into having a different model
        | verify that data somehow.
        
       ___________________________________________________________________
       (page generated 2024-07-24 23:06 UTC)