[HN Gopher] AI models collapse when trained on recursively gener...
___________________________________________________________________
AI models collapse when trained on recursively generated data
Author : rntn
Score : 173 points
Date : 2024-07-24 15:42 UTC (7 hours ago)
(HTM) web link (www.nature.com)
(TXT) w3m dump (www.nature.com)
| tyingq wrote:
| Which is good background to this story about Reddit locking down
| robots.txt and trying to get money from the AI teams scraping
| their content.
|
| https://news.ycombinator.com/item?id=41057033
| roughly wrote:
| If they're considering Reddit content to be free of generated
| material, I've got bad news for them. It's not quite the
| Chernobyl-grade hole that Pinterest has become, but it's hardly
| "low background".
| tyingq wrote:
| Sure. I think Reddit is aware though, that time is running
| out to get paid for whatever human generated content is there
| that isn't already scraped.
| visarga wrote:
| I still believe reddit is an amazing source. Any article you
| read on reddit, chances are the comments are better than the
| original text. They will debunk the article, present a
| diversity of reactions, and most importantly, they will be
| grounded in public opinion, unlike the press, which caters to
| money interests.
|
| You just copy-paste a conversation into the LLM and ask for
| an article. For taste, here is one generated from this very
| conversation. https://pastebin.com/raw/JFH6PGqg
| mvdtnz wrote:
| > Any article you read on reddit, chances are the comments
| are better than the original text.
|
| We're talking about reddit dot com here? Seriously? I find
| it difficult to find any comments worth reading at all on
| that website. 99% of the stuff that isn't buried is just
| the same recycled jokes again and again and again.
| daft_pink wrote:
| It's like the AI-generated version of index funds.
| roughly wrote:
| Back when I was getting my econ degree, we were taught about the
| Ultimatum game, which goes like this: You get two participants
| who don't know each other and will (ostensibly) never see each
| other again. You give one of them $100, and they make an offer of
| some portion of it to the other. If the other accepts, both
| parties keep their portion - so, if A offers B $20 and B
| accepts, A keeps $80 and B keeps $20; if B rejects, both parties
| get nothing. Standard economic theory suggests A can offer $1 and
| B will accept, because otherwise B gets nothing. Spoiler for
| those of you who haven't seen how standard economic theory plays
| out in real life: that's not how the game went - typically,
| offers below ~$30 or so got rejected, because B was a real
| feeling person who felt like they were getting screwed and opted
| to punish A for doing so. The exception to this - the people who
| would take the $1 offer - were people who had been taught
| economic theory. It turns out you _could_ screw them over and
| they'd pat themselves on the backs for being very wise.
|
| The "tragedy of the commons" is another one of those parts of
| standard economic theory that never actually played out in
| reality - we've got examples from all over the world of
| communities implementing practices and often entire belief
| systems that led them to be responsible stewards of shared
| resources, without requiring unilateral ownership of that
| resource and singular acquisition of the benefits of that
| stewardship. And yet the first thing on the lips of every
| modern capitalist, when describing why they're at a
| disadvantage if they're not the ones polluting the water
| supply, is the tragedy of the commons.
| asah wrote:
| ...in the real world, A tells B that he "sourced" the deal and
| therefore deserves a bigger cut and in the real world, B agrees
| up to a point (the $30 mark). Over time and rounds of playing
| the game, the A's of the world learn where the line is and
| optimize to stay on the correct side of it, only testing the
| other side 1-2% of the time to see if rules/behavior has
| changed.
| splwjs wrote:
| It's crazy how most political or economic systems would very
| obviously collapse in the real world almost instantly without
| some kind of voluntary moral contract (explicit or implied),
| yet we've got huge clumps of people demonizing one system or
| another based on what happens when you implement it in a
| morally dead societal context.
|
| Like there are a ton of people who smirk at your last paragraph
| and go "nuh uh, hashtag late stage capitalism"
| roughly wrote:
| A hundred percent. I've said this elsewhere, but a primary
| problem for at least American society at this point is we
| don't have a commonly-agreed upon moral system other than the
| market - things like Martin Shkreli buying drugs people need
| to live and jacking the price up are Bad, but we don't have a
| common language for describing why it's immoral, whereas our
| only real common shared language, the market, is basically
| fine with it as long as it's legal. A lot of the market logic
| works fine for society within constraints - optimize your
| costs, but not at the expense of your workers; increase your
| prices if you can, but don't be a ghoul about it; lobby for
| your position, but don't just buy a supreme court judge.
| renewiltord wrote:
| If you iterate the game, it's obvious. I, as the responder,
| control the proposer's income. Extend to infinity with
| knowledge of iteration and you reach symmetry between proposer
| and responder.
| roughly wrote:
| > If you iterate the game, it's obvious.
|
| We're shockingly bad at doing this in modern society. Our
| temporal planning horizon is somewhere between 6 months and 5
| years, whereas our lifespans are around 75-80.
| RyanAdamas wrote:
| This reminds me of Lord of the Flies. The real version of the
| events turned out very differently.
|
| https://www.newsweek.com/real-lord-flies-true-story-boys-isl...
| roughly wrote:
| Rebecca Solnit wrote a book, "A Paradise Built in Hell", on
| how people behave during disasters, and found broadly the
| same thing - contra the prepper myths, most people most of
| the time faced with disaster come together to work
| cooperatively to help each other.
|
| We're a fundamentally social species - we've got smaller
| brains than Neanderthals did, we're not a particularly tough
| species, but we're very, very good at cooperating with each
| other.
| snakeyjake wrote:
| Game theory only applies to sociopaths and economists, but I
| repeat myself.
| samatman wrote:
| > _It turns out you _could_ screw them over and they'd_
|
| End up with a dollar in their pocket which they otherwise
| wouldn't have.
|
| The Ultimatum game is a useful insight into human psychology:
| for one thing, it tells us who thinks that the defector in this
| equilibrium is better off than a counterfactual cooperator.
|
| Ah, but they have their pride! Ok. My pride is not affected by
| someone else having 99 bucks they didn't earn, and myself $1
| likewise. Maybe that other fellow really needed the money.
| roughly wrote:
| Indeed. You're very wise!
| imtringued wrote:
| I don't know what the hell you're talking about. Your
| argument is incoherent. If you wanted to allocate the money
| according to the individual's utility of money, then a rule
| of thumb of $1 is going to be wrong. You should, given no
| information, assume that both have the same utility of money
| and that the utility of money is diminishing, favouring an
| even split.
| partypete wrote:
| You may be interested in some of the foundational papers
| exploring game theory models similar to the Ultimatum
| game[1][2]. These are known as Iterated Prisoner's Dilemmas.
|
| ---
|
| [1] The Evolution of Cooperation (https://ee.stanford.edu/~hell
| man/Breakthrough/book/pdfs/axel...)
|
| [2] Evolutionary Dynamics of Spatial Games (https://www.science
| direct.com/science/article/abs/pii/016727...)
| ball_of_lint wrote:
| Each player can limit the other's income to $0 - the offerer
| can offer $0 and the receiver can reject any deal.
|
| So then what's optimal? $50 seems obviously fair, but does that
| mean we ought to reject offers of $49 100% of the time? Not
| quite: to limit the opponent's expected income for an offer of
| $49 to $50 instead of the $51 they left for themselves, we can
| use a mixed strategy that only accepts the offer with
| probability 50/51. Extending that gives the opponent a benefit
| curve that is linear as they leave themselves more money up to
| $50 and then flat at $50 afterwards.
|
| That's good, but we can make it better - if we accept offers
| for $X<$50 with probability 50/(100-X) - epsilon*(50-X), then
| their expected benefit curve is smooth and has a peak at $50,
| which is the most we can expect to make except against a
| generous opponent.
|
| After all that, playing this game as stated against an unknown
| opponent there's a lot of uncertainty. Maybe all your opponents
| are entirely irrational and move at random. Maybe all your
| opponents have colluded and decided that $66 for the offerer
| and $34 for the receiver is fair and that's the only deal
| they'll make. But if you think that random actors in the
| universe are reasonably intelligent and can discover the
| equilibrium above with the thought worth putting into this
| Ultimatum game, the receiver strategy above properly aligns
| incentives.
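|
| A minimal Python sketch of that acceptance rule (the epsilon
| value and the sampled offers are arbitrary choices, just to
| show where the offerer's expected payoff peaks):
|
|     EPS = 0.001  # arbitrary small smoothing term
|
|     def accept_prob(x):
|         # Accept an offer of x (out of $100) with probability
|         # 50/(100-x) - EPS*(50-x) below $50, always at/above $50.
|         if x >= 50:
|             return 1.0
|         return 50.0 / (100.0 - x) - EPS * (50.0 - x)
|
|     for offer in range(0, 101, 10):
|         kept = 100 - offer  # what the offerer keeps
|         expected = kept * accept_prob(offer)
|         print(f"offer {offer:3d} -> offerer expects {expected:.2f}")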
| betenoire wrote:
| Seems analogous to the effect of echo chambers on humans
| AnimalMuppet wrote:
| Or navel-gazing. In fact, that's one of the classically known
| flaws. (So well known that it has many names: ivory tower,
| navel gazing, getting stuck in your own head...)
|
| If you don't compare your thoughts to the outside world, it's
| easy for them to diverge more and more from reality.
| betenoire wrote:
| you are right, navel-gazing describes it perfectly
| tensor wrote:
| It's important to note that outside world means the actual
| world, not the thoughts of other humans. You need a way to
| establish ground truth, which comes from observing the actual
| outcome of actions and experiments.
| hprotagonist wrote:
| I Am Sitting In A Room
| https://en.wikipedia.org/wiki/I_Am_Sitting_in_a_Room
| padraicmahoney wrote:
| This seems extremely interesting, but I don't have the time right
| now to read this in depth (given I would also need to teach
| myself a bunch of technical concepts too).
|
| Anyone willing to weigh in with a _theoretical intuition_? The
| one in the paper is just a little inaccessible to me right now.
| jlos wrote:
| As far as I understand Douglas Hofstadter's Gödel, Escher, Bach -
| self-referential recursive structures (strange loops) are the
| foundation of consciousness (among other interesting things).
| I've been watching to see whether LLMs becoming self-referential
| actually improves them as opposed to degrading them.
| mensetmanusman wrote:
| The interesting thing about loops is that they can generate
| fields (think of a moving current generating a magnetic field).
|
| Consciousness is more like a field than like a particle (which
| are also fields), but we haven't determined how conscious
| fields fit in physics models.
| hiddencost wrote:
| A lot of these papers are wrong. They do something wrong in their
| setup and then claim their conclusion shows some general truth.
|
| Publishing in Nature in ML can actually be a red flag, because
| they're really not well equipped to evaluate a lot of claims.
|
| The latest llama model got a lot of its data using labels from
| llama2, and every frontier lab is talking about self training as
| the future.
| slashdave wrote:
| Who are "they"? And do you actually believe the practice of
| publishing unvetted preprints is a good thing in ML research?
| hiddencost wrote:
| Non sequitur? I never said that.
|
| Good venues include main track NeurIPS, ICML, ACL, e.g.
|
| Nature is notorious for publishing PR pieces that don't
| reproduce, and their ML theory publishing has been quite
| poor. They do pretty well on things like AlphaGo, materials
| science, or weather modeling because it's more in their
| wheelhouse and the results don't require a deep understanding
| of info theory or ML practice.
| slashdave wrote:
| Those venues have huge issues with referees. It comes down
| to who is reviewing the work.
|
| The irony in your comment is that it is related to the
| paper we are discussing. There is a big problem with
| poisoning from group-think and self reinforcement in
| current ML research.
| anon291 wrote:
| Is this an artifact of floating point precision or a fundamental
| mathematical truth?
| slashdave wrote:
| Floating point precision is not involved (most LLMs still
| function after floating-point quantization).
|
| I am puzzled that some find this result at all surprising. You
| simply cannot generate information from nothing.
| anon291 wrote:
| I'm not surprised you can't use it to make it better, but one
| might imagine gradients would go to zero as you fed the model
| its own output.
| slashdave wrote:
| No, not even close. Gradients don't come to zero in the
| first place. Training is never perfect.
| anon291 wrote:
| Let's restate. I'd imagine you end up in local minima
| that are difficult to escape using model generated data.
| So sure, non-zero gradients, but if you plot the
| gradients, I would expect them to orbit around that point.
| But it seems like they diverge.
| slashdave wrote:
| Mini-batches and dropout mean that you are constantly
| jumping out of and into other minima during training of
| any type (highly-redundant solution space is an important
| feature of deep learning). This is deliberate and
| necessary to explore the gigantic parameter space of
| these huge LLMs.
| skybrian wrote:
| It's a lossy transformation, so you're losing information each
| time. It's never going to add information.
|
| However, some information is junk that obscures the good stuff.
| It's likely that how they train today is very inefficient
| compared to what's possible, and there will be smarter ways to
| transform preexisting data so that it's a better dataset to
| train on, without losing very much.
|
| Papers like this one show what not to do.
| visarga wrote:
| > there will be smarter ways to transform preexisting data so
| that it's a better dataset to train on, without losing very
| much
|
| Like, take for example search. Instead of training on a bunch
| of scraped texts, you take one prompt, select 10 references,
| and use it to synthesize an answer. Referencing multiple
| texts gives you more than training on them directly. The LLM
| could catch contradictions, observe the distribution of human
| opinions, note if the topic is controversial. And then output
| a wikipedia-like article. Do this billions of times, and
| you've got a refined dataset. You can iterate on top, using the
| articles as source and writing meta articles. Or just silly
| studies like writing a paper about "Characters named Charlie
| in literature". You can slice and dice the data in any way,
| and analyze the cross section.
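|
| A rough sketch of that loop, where search() and llm() are
| hypothetical stand-ins for whatever retrieval stack and model
| would actually be used:
|
|     def synthesize_article(topic, search, llm, k=10):
|         # search() and llm() are hypothetical placeholders
|         refs = search(topic)[:k]      # select ~10 references
|         context = "\n\n".join(refs)
|         prompt = ("Using only the sources below, write a "
|                   "wikipedia-style article on: " + topic +
|                   ". Note contradictions and the range of "
|                   "opinions.\n\n" + context)
|         return llm(prompt)
|
|     # The resulting articles can themselves become sources
|     # for meta-articles in a later pass.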
| Kuinox wrote:
| Meanwhile OpenAI and Anthropic train on AI-generated data to
| improve their models, and it works.
|
| https://openai.com/index/prover-verifier-games-improve-legib...
|
| https://www.anthropic.com/research/claude-character
| echelon wrote:
| I'm long on synthetic data.
|
| If you think about evolution and hill climbing, of course it
| works.
|
| You have a pool of information and you accumulate new
| rearrangements of that information. Fitness selects for the
| best features within the new pool of data (For primates,
| opposable thumbs. For AI art, hands that aren't deformed.) It
| will naturally drift to better optima.
|
| RLHF, synthetic data, and enrichment are all we need.
| ainoobler wrote:
| Are you sure about this? It's well known that cannibalism in
| animals leads to degenerative disorders.
| autokad wrote:
| I think the direct action of a person taking their ideas and
| thoughts and going through them many times (making changes /
| updates / fixes) fits better than eating something. However,
| I do think you still need some form of validation data to
| ensure these are good changes.
|
| That said, I do get the spirit of the article: as more of the
| information generated online is produced by LLMs, the validity
| and usefulness of the output decreases.
| ainoobler wrote:
| What exactly is doing the validation?
| autokad wrote:
| Depends on what one was doing. It could be as simple as
| rewriting a sentence and asking someone if it looks better.
| photonthug wrote:
| Not sure why you're downvoted; I think a comparison with
| prions seems apt and interesting, and bad protein copies
| that can replicate is essentially an information process.
| GAN research in recent years showing how you can sabotage a
| working dog/cat classifier with a one pixel change feels
| similar to how the tiniest parts of large systems can
| sometimes undermine the whole completely, albeit with low
| probability. And finally, since models will bootstrap
| models that bootstrap models, inevitably there are already
| subtle issues out there in the wild that may have an
| incubation period of many years before the downstream
| effects are completely clear.
| ainoobler wrote:
| The problem is systemic. People believe that the pursuit
| of monetary and financial profits by corporations will
| lead to the creation of benevolent artificial
| intelligence. I personally think this is essentially a
| religion because it is obvious that the pursuit of
| profits can not actually create anything benevolent, let
| alone intelligence.
| throwup238 wrote:
| _> If you think about evolution and hill climbing, of course
| it works._
|
| You don't even need to go that far. How do most children
| learn? By reading textbooks and listening to lesson plans
| assembled by their teachers from all the relevant content the
| teachers have experienced.
|
| Our education systems are built on synthetic data that is
| created for optimized learning, so that every child doesn't
| have to prove the universe from scratch to learn some basic
| maths.
| samatman wrote:
| That isn't synthetic data in any reasonable or meaningful
| sense of the term.
|
| You could describe a textbook as a synthesis, sure, in a
| sense which absolutely does not track with the 'synthetic'
| in 'synthetic data'.
|
| Unless the textbook is AI-generated, and I expect that in
| 2024, the number of AI-generated textbooks is not zero.
| throwup238 wrote:
| It's an analogy. The learning materials teachers create
| for students are very much like synthetic data; they're
| just not assembled from algorithmic output.
| lanstin wrote:
| Kids learn walking, talking, reading, arithmetic, and
| physics by doing things in the physical world. Adults
| may speak differently to kids than to adults, but it's a
| stretch to say it's synthetic. The equivalent of synthetic
| would be a group of kids who just grew up together and
| made up a novel language.
| sunnybeetroot wrote:
| By this reasoning wouldn't all information that you didn't
| discover yourself be synthetic data?
| throwup238 wrote:
| Yeah and that's why we call it "standing on the shoulders
| of giants." Humans went through tons of trial and error
| in every facet of life to get where we are today. We kept
| the stuff that worked and taught it.
|
| But before humans can understand enough language to
| ingest that synthetic data, they do a lot of their own
| discovery-based training where they learn about the world
| physically and absorb the language people around them
| use, kind of like throwing random internet data at an
| LLM.
| nick486 wrote:
| The equivalent here would be a child learning from a
| textbook he has written himself.
|
| Not sure how effective that would be, if it were his only
| source of learning.
| throwup238 wrote:
| Well that's what the TFA is about. If you
| indiscriminately ingest synthetic data into training -
| the child learning from their own textbook - the model
| collapses.
|
| The SOTA is to use a discriminator (often another LLM or
| ML algo) to select the best output before feeding it into
| the training data. That's what OpenAI, Anthropic, et al
| have been doing. One of them just published a paper about
| it a few weeks ago.
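|
| In rough Python terms, the difference looks something like
| this, with generate() and quality_score() as hypothetical
| stand-ins for a generator model and a separate discriminator,
| and the sample count and threshold as arbitrary choices:
|
|     def build_synthetic_corpus(prompts, generate, quality_score,
|                                threshold=0.8):
|         corpus = []
|         for p in prompts:
|             # draw a few candidates per prompt (4 is arbitrary)
|             candidates = [generate(p) for _ in range(4)]
|             # indiscriminate use would be: corpus.extend(candidates)
|             best = max(candidates, key=quality_score)
|             if quality_score(best) >= threshold:
|                 corpus.append(best)  # only keep vetted output
|         return corpus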
| Salgat wrote:
| Synthetic data has to work if we hope to have ML models that
| can improve themselves in a similar fashion to humans when it
| comes to advancing knowledge.
| randcraw wrote:
| Data created automatically is not the same as human curated
| data, though both are synthetic. Auto-created data often
| suffers from a host of demerits (duplication, bias, error,
| unnatural distribution, irrelevance to learn the intended
| domain, etc, etc). Human curated data usually avoids these
| pitfalls, and thus is far more valuable when training --
| otherwise all human teachers would be equally good. So auto-
| created vs. curated data are incomparable when training naive
| neophytes like ML models, or children.
| HPsquared wrote:
| It's important to climb the right hill.
| psb217 wrote:
| And to have a very well-tuned sense of up vs down when the
| hill is almost flat...
| kjkjadksj wrote:
| This misunderstands fitness. It's not a sure bet that what is
| most optimal is what you see. "Good enough" given environmental
| context is what you see. Just like with certain crystal
| structures in chemistry, you may only be in a localized
| threshold of fitness stability that is not necessarily
| optimal, but separated from another optimal configuration by
| having suboptimal intermediary steps that need more
| activation energy to overcome before falling into a state
| with lower entropy (or more optimal fitness).
|
| In other words you can never be sure if synthetic data is any
| good or if what things gravitate toward are really most
| optimal.
| echelon wrote:
| > It's not a sure bet that what is most optimal is what you see.
|
| I wouldn't ever make "most optimal" a criterion. We're
| looking for measurable improvements, not a jump to god
| emperor or apex predator.
|
| > you may only be in a localized threshold of fitness
| stability that is not necessarily optimal, but separated
| from another optimal configuration by having suboptimal
| intermediary steps that need more activation energy to
| overcome before falling into a state with lower entropy (or
| more optimal fitness).
|
| Optimization is like that. But unlike genetics, where we
| can't re-route the recurrent laryngeal nerve or change
| fundamental biochemistry, these are engineered systems
| where we can set up wildly different experiments at any
| time. Just to cite one of many different research threads,
| there's now research going into developing models from
| small-scale training data.
|
| > you can never be sure if synthetic data is any good or if
| what things gravitate toward are really most optimal.
|
| We can know if the synthetic data is _better_. We have
| objective measures, a scientific process, and we'll always
| be striving for improvement.
| mcswell wrote:
| The paper is not talking about verifiable synthetic data
| generated by some means other than LLMs.
| nyrikki wrote:
| Cheese and Chalk.
|
| Generating synthetic datasets to assist in targeted training
| is very different from ingesting LLM output from web
| scraping.
| abeppu wrote:
| Yes -- said another way, if you're an ML researcher and you
| have human-provided (scraped) data, and an ability to
| generate synthetic data, then until recently, you had a
| controllable parameter: how much of your training data for
| your new model should be synthetic? You can vary this, run
| multiple experiments, and choose how much synthetic data to
| use -- and you can vary the specific configs about how that
| synthetic data is generated.
|
| If synthetic data is mixed into your upstream data sources
| _in a way you cannot control_ , then your ML team loses a
| valuable controllable parameter.
| miki123211 wrote:
| You still have some of that control, but in a much more
| indirect way.
|
| There are three kinds of data now: synthetic, pre-2022, and
| current. Everything pre-2022 is definitely written by
| humans, synthetic data is still synthetic, and post-2022 is
| a mix of both.
|
| I wouldn't be surprised if "AI detectors" work somewhat for
| this use case. They're biased, far from accurate and a
| terrible idea if you need to make important decisions (like
| whether to expel a student for cheating), but there's quite
| a lot of room for error here.
| uberswe wrote:
| > Everything pre-2022 is definitely written by humans
|
| I'm not sure methods like article spinning count as
| written by humans. This is something you could automate
| before AI: it would take a human-written article and
| randomly swap words with similar meanings throughout to
| make it seem original.
| derefr wrote:
| Don't forget machine-translated texts, where until ~2017
| the translation was likely done by something much dumber
| / semantically lossy than an LLM, and after 2017 was
| basically done by an early form of LLM (the Transformers
| architecture originating in Google Translate.)
|
| Many historical English-language news reports published
| on the English-language websites of foreign news media
| from non-English-speaking countries, from 1998 (Babelfish
| era) to ~a few months ago, may be unreliable training
| data for this reason.
| mrbungie wrote:
| They do work detecting LLM outputs that are sampled
| "naively" (when the model/user is really not trying to
| pass it as human output).
|
| I copied a prompt translated from Spanish to English
| using ChatGPT Plus into a GPT-4o Azure OpenAI Service
| endpoint. It did work in Spanish but didn't run in
| English because the default AOS Content Filters detected
| a jailbreak intent. It was quite weird.
| lawlessone wrote:
| I think this is it.
|
| Generated data is ok if you're curating it to make sure
| nothing bad, wrong or nonsensical comes in.
|
| Basically still needs a human in the loop.
| surfingdino wrote:
| Then why not remove this crap (LLMs) from the loop
| altogether? How did we get from "AI will replace you" to
| "your new job will be an AIs janitor" in the space of about
| 12 months?
| sieste wrote:
| There is nothing wrong with being a janitor. You could
| also call it "AI editor" instead, if you want to insert a
| job title that sounds more prestigious. Some people find it
| easier and more enjoyable to edit a first draft generated
| by a language model based on instructions than to write
| that first draft themselves.
| the8thbit wrote:
| I gotta say, Claude is a godsend for building out quick
| prototypes of ideas, especially when those ideas require
| domain-specific knowledge that you know _a little_ about
| but aren't specialized in. Which describes most interesting
| programming projects.
|
| Sure, I could do it myself, but it would take more time,
| each step would have less momentum, and I'd have to think
| more while I do it. Which, there's a place for that too,
| of course.
| visarga wrote:
| > Sure, I could do it myself, but it would take more
| time, each step would have less momentum, and I'd have to
| think more while I do it. Which, there's a place for that
| too, of course.
|
| You just start faster, but end at the same time. If you
| really need to understand something, there is no LLM
| shortcut. I spent hours interrogating Claude; in the same
| time I could have studied from a book and gotten even
| better grounding.
| lawlessone wrote:
| >Then why not remove this crap (LLMs) from the loop
| altogether
|
| Because reading is faster than writing.
|
| Someone could spend a few years or even most of their
| life writing a book that can be read in a matter of hours,
| days, or weeks.
|
| Humans who write have to proofread their own work, or
| occasionally even pay someone else to do it.
| abeppu wrote:
| No, bad/wrong/nonsense is _not_ the only risk here. You're
| missing the main point that the authors are making: the
| shape of the distribution gets changed by this process. A
| model trained on human data will produce fewer high-
| perplexity examples than it was trained on (you can see
| this in Fig 1b, even between generation 0 and 1). In a
| literal information theory sense, these perplexity values
| indicate how much information is in each example. Over
| successive generations models have less actual information
| to learn from even if they have the same volume of text.
| visarga wrote:
| LLMs are milking us of knowledge and skills, repackaging
| them and giving them back to us. Models interact with the
| internet, humans and code execution. They are exploring.
| Lots of exploring now happens in the chat room, a place
| where ideas are first tried out. With billions of users,
| the volume of information LLMs collect from us is huge.
| We bring references, guidance and feedback right into its
| mouth; the LLM doesn't even need to do anything like
| crawling.
|
| Imagine how many things we know, things we accumulated in
| our life experience, that were never written down
| anywhere. That information was lost to others. But now we
| use LLM assistants, so they get to be in the loop and
| collect tidbits of human life experience that is not
| written on the internet. And soon they will also work on
| audio/video and travel with us everywhere, seeing what we
| show them.
| Pinkthinker wrote:
| I think that maybe we are too harsh in expecting LLMs to
| be perfect. If they are based on human input that is
| incorrect, then we might propagate such errors. But they
| will still be quicker and much more reliable than most
| people. Isn't this good enough? After all, we are willing
| to accept flaws in people, even including the president.
| I suspect that the way forward will be to progressively
| clean the LLM input data as each error gets identified.
| visarga wrote:
| > Basically still needs a human in the loop.
|
| Yes, and big LLM developers have millions of humans in the
| loop. That's why they provide free access, for human in the
| loop filtering & guidance.
|
| If I go to ChatGPT and solve a coding task, maybe the first
| 3 ideas don't work and the 4th works. It can do RLHF,
| scoring the first 3 as negative and the fourth as
| positive. They just used me to test their model and
| create a datapoint.
|
| Using LLMs is useful both ways - for humans, we get
| assistance, and LLMs get feedback for their outputs. This
| seems like the new form of "you are the product".
| mcswell wrote:
| Yeah, I raised the same issue before reading your post;
| ninja'd I am.
|
| I like your "cheese and chalk".
| ddingus wrote:
| I always preferred sugar and shit. Obviously, that is
| profane. But I think profanity should be seen as the part
| of speech it really is.
| groby_b wrote:
| Profane is, if you will, fucking fine.
|
| But "cheese and chalk" is a great analogy because both
| are sources of calcium, but cheese is much better for the
| human body. It carries useful info.
| progmetaldev wrote:
| Also drives your point home more efficiently. While it
| may be profane, there's far more speech available with
| far less "use" that is intentionally profane to spark a
| reaction without regard to what that reaction may be.
| Shock value for attention, rather than to carry home a
| point.
| HanClinto wrote:
| Keep in mind that the Prover-Verifier game is not that it's
| training on AI-generated data (as if to imitate it) -- rather,
| it's training against a discriminator that verifies for
| correctness (a calculator) and understandability (a smaller,
| less-capable language model). You can think of this as a
| distillation method, but it's not like it's generating large
| amounts of source data and then retraining on it. This method
| only works on specific problems where there is an absolute
| right answer that can be verified with an independent heuristic
| (in this case, a math calculation).
|
| However, there is a lot of potential in the world of self-play
| and adversarial-training to improve the quality of our LLMs
| with true reinforcement learning.
|
| For one recent paper on this topic, also check out SPAG -- I
| found this one to be fascinating:
|
| https://github.com/Linear95/SPAG
|
| I've been keeping notes on this topic in a WIP paper, and if
| you'd like to read my (rambling) ravings about it, you can find
| more info here:
|
| https://github.com/HanClinto/MENTAT
|
| I think that self-play and reinforcement learning are going to
| absolutely be important for the next level of LLM development.
| If you use AI-generated data, then you must have an objective
| metric to verify "goodness". Nothing is free, and simply asking
| an LLM to rate the quality of its own data is not going to cut
| it. I think that's the point of the article.
| visarga wrote:
| > Meanwhile OpenAI and Anthropic train on AI-generated data to
| improve their models, and it works.
|
| They got a secret ace in their pocket - chat logs created with
| human in the loop. Of course those might still have errors, but
| much fewer. They can infer from a human response if it was
| accepted or not.
|
| I think OpenAI generates at least 1B sessions per month and 2
| trillion interactive tokens. Those can go into the LLM again
| for analysis and synthetic content generation, or for RLHF with
| the whole conversation as guidance. Having access to the
| following interactions can shed light on previous answers.
|
| Even more, they can correlate chats across days, presumably
| humans try out LLM ideas in reality and return for iteration.
| That way LLMs indirectly get real world grounding.
| mrbungie wrote:
| This is likely one of the main reasons why they're offering
| ChatGPT for free and running ChatGPT Plus at a loss.
| megaman821 wrote:
| There are other ways AI can help train other AI that don't
| involve generating data. AI could remove low-quality data from
| a training set. It could assist humans in structuring video, 3D
| and physics simulation datasets for the best learning results.
| simonw wrote:
| > We find that indiscriminate use of model-generated content in
| training causes irreversible defects in the resulting models
|
| The key word there is "indiscriminate". All of the big AI labs
| have been training on synthetic data for at least a year at this
| point, but they're doing so deliberately.
|
| I don't think the "model collapse" problem is particularly
| important these days. The people training models seem to have
| that well under control.
| jagged-chisel wrote:
| I find nothing wrong with your statement. I am curious about
| the paper's use of "indiscriminate." I read this as "just feed
| the AI more AI output without care" which one can indeed do
| deliberately.
|
| Seems to me that deliberate _discriminate_ use should yield
| results that beat expectations.
| __jl__ wrote:
| Came here to say the same. "indiscriminate" doesn't really make
| sense. It's very deliberate.
|
| However, there is one scenario: scraping of web data. In that
| case, AI labs might not know what is model generated.
| mcswell wrote:
| The question (which I raised in a top-level comment before
| reading your post) is whether there is any such thing as
| "discriminate" use of web data. Synthetic data created in the
| same lab as the LLM is discriminate, but what the authors of
| the paper are saying (if I read it correctly) is that scraping
| the web is not currently done in a discriminate way. And it's
| not at all clear to me that there _is_ a discriminate way to
| use web scraping, because you can't know for sure what's
| human-generated and what's LLM-generated.
| simonw wrote:
| I get the impression that scraping the web isn't nearly as
| important a source of LLM training data as it used to be.
|
| Everyone is trimming down their training data based on
| quality - there are plenty of hints about that in the Llama
| 3.1 paper and Mistral Large 2 announcement.
|
| OpenAI are licensing data from sources like the Associated
| Press.
|
| Andrej Karpathy said this:
| https://twitter.com/karpathy/status/1797313173449764933
|
| > Turns out that LLMs learn a lot better and faster from
| educational content as well. This is partly because the
| average Common Crawl article (internet pages) is not of very
| high value and distracts the training, packing in too much
| irrelevant information. The average webpage on the internet
| is so random and terrible it's not even clear how prior LLMs
| learn anything at all.
| jroesch wrote:
| I think this is roughly correct. My 2c is that folks used
| the initial web data to cold start and bootstrap the first
| few models, but so much of the performance increase we have
| seen at smaller sizes is a shift towards more conscientious
| data creation/purchase/curation/preparation and more
| refined evaluation datasets. I think the idea of scraping
| random text except maybe for the initial language
| understanding pre-training phase will be diminished over
| time.
|
| This is understood in the academic literature as well, as
| people months/years ago were writing papers showing that a
| smaller amount of high-quality data is worth more than a large
| amount of low-quality data (which tracks with what you can
| pick up from an ML 101 education/training).
| markwkw wrote:
| You trim, yes, but AI content surely invades (all?) areas
| of written material. People are increasingly using AI to
| assist their writing, even if it's just for slight editing or
| word choice suggestions.
|
| Even AP doesn't ban the use of LLMs; its standards prohibit
| direct publishing of AI-generated content. I'm sure its
| writers leverage LLMs in some ways in their workflow,
| though. They would probably continue to use these even if
| AP attempted to ban LLMs (human incentives).
| tensor wrote:
| If the AI generated content is filtered for quality or is
| corrected then it will still be good data. The phenomenon
| of model degradation is only in the case where there is
| no outside influence in the generated data.
| progmetaldev wrote:
| I think this is extremely important with AI generated
| content, but seems to be given less and less thought as
| people start to "trust" AI as it seeps into the public
| consciousness more. It needs to be reviewed, filtered, and
| fixed where appropriate. After that, it isn't any
| different from reviewing data on your own, and wording it
| in a way that fits the piece you're writing.
| Unfortunately, there's so much trust in AI now that
| people will go ahead and publish content without even
| reading it for the correct tense!
| smeagull wrote:
| They needed to deal with degenerate data on the Web anyway.
| It's always been full of trash and spam.
| progmetaldev wrote:
| I agree with you when it comes to training, but at the same
| time, I think that's also the power we get with the web.
| You can have a voice, even if others don't agree with you.
| I don't think that should be taken away unless you are
| inciting violence.
| olejorgenb wrote:
| At least some of the LLM generated content will be
| vetted/selected for by a human being though.
| mmazing wrote:
| They make it clear in the paper that their primary "real-world"
| concern is that it's difficult to distinguish synthetic data
| from real human interaction when scraping data from the web.
| This will only get worse over time with our current way of
| doing things.
|
| How are they supposed to deliberately train on synthetic data
| when they don't know whether it is (synthetic) or not?
|
| Also, do you not feel that it is presumptuous to dismiss a body
| of work in a few sentences with a "seems fine to me"?
| simonw wrote:
| In this case I wasn't reacting to this specific paper so much
| as to the widespread idea (at least that I've observed among
| AI skeptics) that "model collapse" is a huge problem.
| tremon wrote:
| How do you "discriminate" data gathering at web-scale, though?
| In my view, everything at web-scale only works because there
| are no humans in the loop, as repeatedly explained here in
| basically every thread involving Google or Facebook. Yes, since
| it's a scientific paper they should have defined their usage of
| the word, but I see nothing wrong with the basic premise that
| automation at large scale implies indiscriminate use of content.
| Lerc wrote:
| Or to consider the inverse of indiscriminate, selection.
|
| Mutation = bad.
|
| Mutation + selection = good.
|
| (given enough iterations)
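|
| A toy illustration of the difference, using an arbitrary
| one-dimensional fitness function:
|
|     import random
|
|     def fitness(x):
|         return -abs(x - 42)  # toy objective: match a target
|
|     x_mut, x_sel = 0.0, 0.0
|     for _ in range(1000):
|         x_mut += random.gauss(0, 1)        # mutation only
|         cand = x_sel + random.gauss(0, 1)  # mutation + selection
|         if fitness(cand) > fitness(x_sel):
|             x_sel = cand                   # keep improvements only
|
|     print("mutation only:       ", round(x_mut, 2))
|     print("mutation + selection:", round(x_sel, 2))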
| msoad wrote:
| wow this is such a good point! Evolution is just that!
| mvdtnz wrote:
| > I don't think the "model collapse" problem is particularly
| important these days. The people training models seem to have
| that well under control.
|
| And you base this on what? Vibes?
| simonw wrote:
| Basically yes. Vibes based on reading between the lines of
| various papers, blog announcements and tweets from people
| better informed than I am.
| simonw wrote:
| The source code that accompanies the paper is available in a zip
| file here: https://zenodo.org/records/10866595
|
| I copied that into a Gist to make it easier to browse here:
| https://gist.github.com/simonw/b3ab1588a681dda821da9fb57290d...
| bjourne wrote:
| The article contains no proof of theorem 3.1 and finding
| counterexamples seems trivial. Adult male weight can be modeled
| by N(85, 20). You can recursively "train" the model on data it
| generates without having it collapse. It will stay stationary as
| long as the samples are large enough.
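|
| This is easy to try empirically; a minimal sketch that re-fits
| the Gaussian to its own samples each generation (the sample
| size and generation count are arbitrary choices):
|
|     import random, statistics
|
|     mu, sigma, n = 85.0, 20.0, 1000  # N(85, 20), n samples
|     for gen in range(20):
|         sample = [random.gauss(mu, sigma) for _ in range(n)]
|         mu = statistics.mean(sample)
|         sigma = statistics.stdev(sample)
|         print(f"gen {gen:2d}: mean {mu:6.2f}, sd {sigma:5.2f}")
|
| Varying n shows how strongly the drift depends on sample size.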
| NaiveBayesian wrote:
| I believe that counterexample only works in the limit where the
| sample size goes to infinity. Every finite sample will have
| m[?]0 almost surely.(Of course m will still tend to be very
| close to 0 for large samples, but still slightly off)
|
| So this means the sequence of m[?] will perform a kind of
| random walk that can stray arbitrarily far from 0 and is almost
| sure to eventually do so.
| bjourne wrote:
| Fair point about the mean, but I don't see how the random
| walk causes the standard deviation to shrink towards zero.
| lostmsu wrote:
| I agree. The authors generate a dataset of a similar size
| as the original and then train on that continuously (e.g.
| for multiple epochs). That's not what you need to do in
| order to get a new model trained on the knowledge of the
| teacher. You need to ask the teacher to generate new
| samples every time, otherwise your generated dataset is not
| very representative of the totality of knowledge of the
| teacher. Generating samples every time would (in infinite
| limit) solve the collapse problem.
| NaiveBayesian wrote:
| Agreed, that's what I struggle to see as well. It's not
| really clear why the variance couldn't stay the same or go
| to infinity instead. Perhaps it does follow from some
| property of the underlying Gamma/Wishart distributions.
| mcguire wrote:
| Does the Supplementary Information (starting on p. 4, for
| example) help?
|
| https://static-content.springer.com/esm/art%3A10.1038%2Fs415...
|
| In your counterexample, can you quantify "as long as the
| samples are large enough"? How many samples do you need to keep
| the s.d. from shrinking?
| bjourne wrote:
| Maybe. "Overall, this only shows us how far on average we go
| from the original distribution, but the process can only
| 'terminate' if the estimated variance at a certain generation
| becomes small enough, i.e. we effectively turn into a delta
| function." Iiuc, variance is modeled as a random walk that
| will sooner or later reach on zero. I'm not sure I buy that
| because the variance "walks" orders of magnitudes slower than
| the mean and is much more robust for large sample sizes.
| FredPret wrote:
| Data in --> slop out.
|
| Slop in --> yikes
| vzaliva wrote:
| I call this "LLM inbreeding." It's a vicious loop where new
| models are trained on AI-generated content, resulting in the
| quality degenerating with each generation.
| tiborsaas wrote:
| I like this analogy. With the Cambrian explosion of LLMs, we
| are getting into safe territory, aren't we? Aren't we?
| dang wrote:
| Related ongoing thread:
|
| _The problem of 'model collapse': how a lack of human data
| limits AI progress_ -
| https://news.ycombinator.com/item?id=41058867 - July 2024 (6
| comments)
| swayvil wrote:
| There's a complexity missing there. It's like the effects of
| incest upon dna. Or an echo chamber upon conversation.
| mcswell wrote:
| I must be missing something. Training on the output of your
| system as if it were validated input seems like an _obvious_ no-
| no. I'm not talking about using synthetic data (however that
| might be created in this situation), but rather using anything
| and everything found on the web as if it were "real", i.e. as if
| it were human-generated texts rather than the output of the LLM.
|
| In this case of course there are multiple LLMs that are creating
| text which finds its way to the web, but to the extent that the
| output of the different LLMs have commonalities, this still seems
| problematic.
|
| And afaik, there are no metrics or algorithms that reliably
| distinguish between human-generated and LLM-generated text, at
| least not for the current generations of LLMs.
|
| What am I missing?
| visarga wrote:
| > Training on the output of your system as if it were validated
| input seems like an obvious no-no.
|
| Imagine a scientist inventing theories without testing
| anything, and then continuing to build on top. Crazy. Not even
| humans can create absent some kind of feedback or validation
| from outside. That's why we invented the scientific method.
| meroes wrote:
| Isn't that how math works in some respects? In that, there's
| only a hierarchy of consistency (no absolute consistency) for
| most of math. And we just keep building and building. We tried
| the absolute consistency route and found it too limiting.
|
| Maybe that this doesn't work for LLMs is a sign they aren't on
| the path to AGI...
|
| Personally I found LLMs horrendous at this kind of stuff. I'm
| basically an RLHF peon by trade, and if I ever need a quick
| way to fool a model, I go to simple logical problems, where it
| can't lean on external structures, only itself. I don't mean
| logical syntax but logical reasoning. I can't share recent
| stuff, but just a few months ago the models I work with failed
| to reason that removing 12 cards from a regular deck couldn't
| remove an entire suit. That kind of stuff. Why would I want to
| make my prompt longer and more detailed to provide it extra
| structure (which is logically superfluous) to ensure it gets
| the right answer? I'm sure a wordy prompt could get it to the
| right answer. I'm interested in its ability to "reason", not
| prompt engineering.
|
| Given that math is devoid of external structure, I wonder if
| there's something to this (it's at least interesting to
| speculate)
| pocketsand wrote:
| You would think so, but people like Sam Altman have suggested
| that they can use AI-generated data to train their own models.
| See here:
|
| https://www.nytimes.com/2024/04/06/technology/tech-giants-ha...
| Voloskaya wrote:
| Training on AI-generated data isn't a problem, and has been
| routinely done by everyone for 18+ months.
|
| The issue is training on 'indiscriminate' AI-generated data.
| This just leads to more and more degenerate results. No one
| is doing this, however; there is always some kind of filtering
| to select which generated data to use for training. So the
| findings of that paper are entirely unsurprising, and,
| frankly, intuitive and already well known.
| empath75 wrote:
| It's _relatively_ easy, I think, to filter out sites with a
| large proportion of low quality ai-generated glurge.
|
| Then you're left with a lot of AI generated or assisted content
| that has quite often been filtered and modified by humans, so
| that might mitigate some of the problems that cause model
| collapse because the filtered content _should_ better reflect
| reality or desirable output?
| mort96 wrote:
| I mean a fair bit of content on Reddit and Twitter is machine
| generated now, right? And content on Reddit and Twitter is
| being used to train new models, right?
| aezart wrote:
| I think you're right. When I was experimenting with llama 1, I
| was able to easily observe that with a short prompt and a long
| response, the response _rapidly_ degraded the longer it went,
| because it was seeing and amplifying the patterns in its
| context window so far.
|
| It is intuitively obvious that these problems would get even
| worse if the garbage output found its way into the training
| set, and not just into the context window.
| jksmith wrote:
| Given a time snapshot and enough computing power, isn't recursion
| inevitable? It's like running out of known universe given time x.
| So then we're back to creating data without a prior dataset,
| which is still a human domain.
| ziofill wrote:
| Very interesting. But wouldn't human preferences still find their
| way into the datasets of the future?
| asadm wrote:
| "Breathing in your own exhaust can be fatal"
| throwthrowuknow wrote:
| So they fine tuned an existing model using its own completions to
| produce the training set for the next run which uses the fine
| tuned model as the base. They mention catastrophic forgetting so
| they are aware of it. I suppose they wanted to get results as
| quickly as possible but this isn't an accurate model of reality
| (pun not intended). They've only succeeded in demonstrating
| something that is well known. If they had made the effort to
| simulate mitigation of bad data and a growing corpus that
| included proportionally more synthetic data over time it would
| have been interesting.
| nostrademons wrote:
| This has happened with much simpler models than LLMs, eg. Google
| Suggest became noticeably worse when everybody started using
| Google Suggest to input their queries, because it was trained on
| real query logs and those query logs started to simply reproduce
| the output of the Suggest model. SEO and Webspam have similar
| problems within Google Search.
|
| More broadly, this is a reflection of Goodhart's Law: "When a
| measure becomes a target, it ceases to be a good measure." The
| issue is that any model's purpose is to capture novel, useful
| data about real human behavior. Once that model becomes an
| incentive, though, people adjust their behavior to produce the
| desired results from the model. Authentic behavior disappears,
| which means there's no useful information content for the model
| to capture, and future generations of the model instead just
| reproduce behaviors of the previous generation they were trained
| on, including quirks. Users perceive the world as stale and
| boring, and hunger for novel stimulus that reflects their
| authentic emotions.
|
| You could look at this as a full-employment theorem for
| entrepreneurs and artists.
| mcguire wrote:
| From my reading of the paper, this is a pretty good description
| of the problem they identify.
| mcguire wrote:
| _Nature_ published a computer science paper???!
|
| " _Given that training a single moderately large model produces
| twice the American lifetime's worth of CO2 (ref. 15), we opted to
| not run such an experiment and instead focus on a more realistic
| setting for a proof of concept._ "
| igorkraw wrote:
| It should be noted that
|
| 1. this is nothing that should surprise anyone who has an
| intuition on control theory and the evolution of unconstrained
| markov chains
|
| 2. there appear to be relatively easy mitigations
| https://news.ycombinator.com/item?id=41061085 (made a separate
| post because it might be of independent interest to discuss)
|
| 3. you still won't get beyond the imitation game boundary
| without exploration & feedback, i.e. the recursive improvement
| doomers are, as of now, still wrong
| mvdtnz wrote:
| > 1. this is nothing that should surprise anyone who has an
| intuition on control theory and the evolution of unconstrained
| markov chains
|
| You don't even need to know what a markov chain is. It is
| intuitively obvious to anyone with two brain cells to rub
| together that AI can't improve by eating its own vomit.
| lemonwaterlime wrote:
| I've been telling people this for the past few years. They
| would like to find out the hard way what control theorists
| already know.
| nurettin wrote:
| I don't see how this hurts training unless you hurl all
| hallucinations back at the model.
|
| AlphaZero used a similar approach where it trained against
| itself, and that only made it better. I don't think collapse is
| real.
| zby wrote:
| If the model collapse means that the text produced by it is not
| statistically identical to the garbage that fills the Internet -
| then I guess a collapse is the goal.
| m3kw9 wrote:
| Of course it will collapse if you don't verify it. I remember
| OpenAI talking about its research into having a different model
| verify that data somehow.
___________________________________________________________________
(page generated 2024-07-24 23:06 UTC)