[HN Gopher] The Curse of Recursion: Training on generated data m...
___________________________________________________________________
The Curse of Recursion: Training on generated data makes models
forget (2023)
Author : surprisetalk
Score : 107 points
Date : 2024-12-01 05:21 UTC (6 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| patrickhogan1 wrote:
| This seems more like an argument that the generated data was bad.
| There are examples where AI has surpassed humans by using
| simulated data (AlphaZero, which played against itself to become
| the best at Go).
|
| It also seems to happen mostly on small networks, which makes
| sense.
|
| Additionally, humans create simulated stories like Dune, Lord of
| the Rings, or Harry Potter, which introduce fictional concepts,
| yet these stories still result in trainable data.
| thaumasiotes wrote:
| > Additionally, humans create simulated stories like Dune, Lord
| of the Rings, or Harry Potter, which introduce fictional
| concepts, yet these stories still result in trainable data.
|
| No, they don't, not in any sense where they are "simulated
| data". Dune is simulated data about what life would be like on
| Arrakis, and if you train a model to make predictions about
| that question, your model will be worthless trash. (Doesn't
| matter whether you train it on Dune or not.) Dune is _real_
| data about how English is used.
| ninetyninenine wrote:
| It's also data about science fiction. With broad-spectrum data
| from both Dune and contextual data, most LLMs know that Dune is
| a fictional novel.
| raincole wrote:
| > humans create simulated stories like Dune, Lord of the Rings,
| or Harry Potter
|
| People _really_ anthropomorphize LLMs to a full circle, don't
| they?
| patrickhogan1 wrote:
| So you are saying that if I generate stories about different
| worlds as new data, a model cannot learn from that?
|
| This isn't anthropomorphizing - it's generating data.
| Generating data is not a uniquely human endeavor.
|
| What created Mars? That is data.
|
| What created star systems we cannot see?
| banku_brougham wrote:
| This is not a serious argument, please forgive me for saying so.
| dudeinjapan wrote:
| Thank you for making this comment, because it exposes some
| logical gaps.
|
| Firstly, Go, Chess, and other games have objective rules and
| win criteria. (There is no "subjective opinion" as to whether
| Fischer or Spassky won their match.)
|
| Language, the output of LLMs, does not have an objective
| function. Consider the following two sentences:
|
| "My lips, two blushing pilgrims, ready stand."
|
| "My red lips are ready to kiss your red lips."
|
| Both are grammatically correct English sentences and both mean
| basically the same thing, but clearly the former by Shakespeare
| has a subjective poetic quality which the latter lacks. Even if
| we make evaluation rules to target (for example, "do not repeat
| phrases", "use descriptive adjectives", etc.), AI still seems to
| favor certain words (for example, "delve") that are valid but
| not commonly used in human-originated English. There is then a
| positive feedback loop where these preferences are used to
| further train models, hence the next generation of models have
| no way of knowing whether the now frequent usage of "delve" is
| a human-originated or AI-originated phenomenon.
|
| Lastly, regarding works of fiction, the concern is less about
| the content of the stories--though that is also a concern--than
| about the quality of the language. (Consider the alternate take
| on Romeo and Juliet above, for example.)
| patrickhogan1 wrote:
| So you are arguing that the world does not have objective
| rule criteria, like Physics?
|
| And that an AI could not model the world, run simulations, have
| each simulation generate data, and learn from that, similar to
| AlphaZero?
|
| Here is a possible objective win environment:
|
| Model complex multicellular organisms that become capable of
| passing the Turing test and that also self-replicate.
| dudeinjapan wrote:
| My argument here is narrowly scoped to human language and
| literature. (We already know the objective rule criteria of
| life is 42.)
|
| It may very well be possible for an AI to read all of
| literature, and figure out what makes Hemingway, Tolstoy,
| and Dylan "good writing" vs. "bad writing". That has not
| yet been achieved. The problem, as the OP implies, is that by
| polluting the universe of literature with current-gen AI
| output, we may be making the task of generating "good writing"
| in the future harder.
|
| Then again, maybe not. Perhaps we have enough pre-AI works
| that we can train on them versus the mountains of AI-generated
| schlock, and determine the objective function.
| patrickhogan1 wrote:
| You seem to restate the argument of the submitted research
| paper in a narrow interpretation that is much different from
| its main conclusion of model collapse from synthetic data
| creation. But I will follow you down this interpretation.
|
| Why does a human have to judge something as good for it
| to be functionally useful?
|
| Humans never came up with the move that AlphaZero used to win
| one of the four games (out of five) that it won. Was that a
| bad move? Did AlphaZero devolve into model collapse because it
| made that move?
| patrickhogan1 wrote:
| And what's interesting here is people will get annoyed
| with this.
|
| Am I pro-human? Absolutely
|
| Can non-human outputs be functionally valuable to humans and
| not per se harmful? Absolutely. We live in a biosphere that
| humans didn't create (or, if humans did create it, they were
| humans more evolved than any we are aware of).
| alterom wrote:
| The dignified way to describe the problem at hand is to allude to
| Brouwer's fixed-point theorem[1], with white noise as the fixed
| point.
|
| The more practical way is to allude to The Human Centipede[2].
|
| Either way, the feedback loop doesn't result in a good output.
|
| [1] https://en.wikipedia.org/wiki/Brouwer_fixed-point_theorem
|
| [2]
| https://en.wikipedia.org/wiki/The_Human_Centipede_(First_Seq...
| benchmarkist wrote:
| Also the data processing inequality:
| https://en.wikipedia.org/wiki/Data_processing_inequality?use...
| cma wrote:
| And yet I prefer now to the early big bang era of the universe,
| though it's technically reversible.
| benchmarkist wrote:
| The universe is not a Markov chain; in fact, no one knows
| what it is. But locally we do know that entropy increases,
| and the inevitable endpoint in our corner of the universe
| is complete annihilation. Your preferences are completely
| irrelevant in the local scheme of things.
| banku_brougham wrote:
| My intuition is that neither the public, nor users, nor the
| industry will take this problem seriously. To me this paper
| sounds like a thunderclap.
| dale_glass wrote:
| It's not a real problem, in my understanding. It's a "this
| kills cancer in a petri dish" sort of thing.
|
| Yes, it makes sense that if your algorithm is at all lossy,
| passing outputs through it again compounds the loss.
|
| The reality, though, is that this doesn't happen on a large
| scale.
| JPEG isn't destroying all imagery because we're not stupid
| enough to constantly compound compression losses.
|
| Most AI output is generated for some sort of purpose. If we're
| making images then we're throwing out the bad ones, retouching
| blemishes, and also still making new works entirely by hand.
| m11a wrote:
| Well, if we're using the output of AIs, writing blog post
| articles with it and using AI to post comments on Reddit/X
| etc (as some do), and a few years later OpenAI et al refresh
| their datasets to train a new model, then you're doing exactly
| that, aren't you? Feeding lossy model outputs back into the
| function again, that is.
| dale_glass wrote:
| It's harder with LLMs, but we still have metrics for a lot
| of text.
|
| On Reddit/HN/etc we have comment scores, replies, reposts,
| external references, etc that we can use to estimate
| whether a given comment/post was any good.
|
| An entity like Google that indexes the whole web has
| visibility into when a given piece of content showed up, if
| it changed, if it got referenced elsewhere later, etc.
|
| LLMs can be used to analyze the page and work out things
| like "the comments answering this comment are saying it's
| wrong"
|
| We can also of course pick and choose, Reddit has some
| completely garbage subreddits and very tightly moderated
| ones.
|
| It's of course by no means foolproof, but it doesn't have
| to be. It just has to be good enough to be usable for
| whatever purpose we need.
|
| Also, perfection isn't a thing and such issues happen even
| without LLMs. Like all the cases of something wrong being
| added to Wikipedia, getting repeated on some news site, and
| then Wikipedia using that as a reference to back up the
| incorrect claim.
| benchmarkist wrote:
| This is intuitively obvious. If I give you some data x and you
| transform it with a non-reversible function f into f(x), then you
| are losing information. Repeated applications of the function,
| f(f(f(...f(x)...))), can only make the end result worse. The
| current implementations inject some random bits, b ~ N(u, s), but
| this can be thought of as convolving with the distribution
| function g of the injected randomness, g*f. After repeated
| applications, (g*f)((g*f)((g*f)(...(g*f)(x)...))), the information
| content of the data you started with still decreases, because
| convolution cannot undo the non-reversible aspect of the original
| function.
|
| I'm sure there is some calculation using the entropy of random
| variables and channels that fully formalizes this, but I don't
| remember the references off the top of my head. The general
| reference I remember is called the data processing inequality.[1]
|
| [1] https://en.wikipedia.org/wiki/Data_processing_inequality?use...
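|
| As a rough illustration of that inequality (a toy sketch; the
| map and the injected noise below are made up for the example,
| not anyone's actual training setup):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|
|     def mutual_information(x, z):
|         # Mutual information (bits) between two discrete integer arrays.
|         joint = np.zeros((x.max() + 1, z.max() + 1))
|         np.add.at(joint, (x, z), 1)
|         joint /= joint.sum()
|         px = joint.sum(axis=1, keepdims=True)
|         pz = joint.sum(axis=0, keepdims=True)
|         nz = joint > 0
|         ratio = joint[nz] / (px @ pz)[nz]
|         return float((joint[nz] * np.log2(ratio)).sum())
|
|     x = rng.integers(0, 256, size=100_000)  # original data, ~8 bits each
|     z = x.copy()
|     for step in range(1, 6):
|         z = z // 2                               # non-reversible map
|         z = z + rng.integers(0, 2, size=z.size)  # injected random bits
|         print(f"step {step}: I(X; Z) = {mutual_information(x, z):.2f} bits")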
| optimalsolver wrote:
| Relevant XKCD:
|
| https://xkcd.com/1683/
| benchmarkist wrote:
| Good one but these theorems are useful to have when thinking
| about information processing systems and whatever promises
| the hype artists are making about the latest and greatest
| iteration of neural networks. There is no way to cheat
| entropy and basic physics so if it sounds too good to be true
| then it probably is too good to be true.
| HeatrayEnjoyer wrote:
| If it is entropy and basic physics why are humans immune to
| the effect?
| raincole wrote:
| Humans are not immune to the effect. We invented
| methodologies to mitigate the effect.
|
| Think about science. I mean hard science, like physics.
| You cannot say a theory is proven[0] if it is purely
| derived from existing data. You can only say it when you
| release your theory and successfully _predict_ future
| experiment results.
|
| In other words, you need to do new experiments, gather
| new information, and effectively "inject" entropy into
| humanity's scientific consensus.
|
| [0]: Of course, when we say some physical theory is
| proven, it just means the probability that it's violated
| under certain conditions is negligible, not that it's a
| universal truth.
| jiggawatts wrote:
| That's assuming that the same function is applied in the same
| way at each iteration.
|
| Think about this: The sum total of the human-generated
| knowledge was derived in a similar manner, with each generation
| learning from the one before and expanding the pool of
| knowledge incrementally.
|
| Simply adding a bit of noise and then _selecting_ good outputs
| after each iteration based on high-level heuristics such as
| "utility" and "self-consistency" may be sufficient to reproduce
| the growth of human knowledge in a purely mathematical AI
| system.
|
| Something that hasn't been tried yet because it's too expensive
| (for now) is to let a bunch of _different_ AI models act as
| agents updating a central wikipedia-style database.
|
| These could start off with "simply" reading _every single text
| book and primary source_ on Earth, updating and correcting the
| Wikipedia in _every language_. Then cross-translate from every
| source in some language to every other language.
|
| Then use the collected facts to find errors in the primary
| sources, then re-check the Wikipedia based on this.
|
| Train a new generation of AIs on the updated content and mutate
| them slightly to obtain some variations.
|
| Iterate again.
|
| Etc...
|
| This could go on _for quite a while_ before it would run out of
| steam. Longer than anybody has budget for, at least for now!
| zkry wrote:
| > The sum total of the human-generated knowledge was derived
| in a similar manner, with each generation learning from the
| one before and expanding the pool of knowledge incrementally.
|
| Is human knowledge really derived in a similar manner though?
| That reduction of biological processes to compression
| algorithms seems like a huge oversimplification.
|
| It's almost like saying that all of human knowledge derives
| from Einstein's field equations, the Standard Model Lagrangian,
| and the Second Law of Thermodynamics (what else could human
| knowledge really derive from?), and that all we have to do to
| create artificial intelligence is to model these forces at a
| high enough fidelity and with enough computation.
| ethbr1 wrote:
| Human knowledge also tends to be tied to an objective,
| mostly constant reality.
| jiggawatts wrote:
| The AIs could also learn from and interact with reality,
| same as humans.
| dartos wrote:
| Not really.
|
| The models we use nowadays operate on discrete tokens. To
| overly reduce the process of human learning: we take in a
| constant stream of realtime information. It never ends and
| it's never discrete. Nor do we learn in an isolated "learn"
| stage in which we're not interacting with our environment.
|
| If you try taking reality and breaking it into discrete
| (ordered, in the case of LLMs) parts, you lose information.
| mannykannot wrote:
| It's not just any compression algorithm, though, it's a
| specific sort of algorithm that does not have the purpose
| of compression, even if compression is necessary for
| achieving its purpose. It could not be replaced by most
| other compression algorithms.
|
| Having said that, I think this picture is missing
| something: when we teach each new generation what we know,
| part of that process involves recapitulating the steps by
| which we got to where we are. It is a highly selective
| (compressed?) history, however, focusing on the things that
| made a difference and putting aside most of the false
| starts, dead ends and mistaken notions (except when the
| topic is history, of course, and often even then.)
|
| I do not know if this view has any significance for AI.
| chongli wrote:
| _Think about this: The sum total of the human-generated
| knowledge was derived in a similar manner, with each
| generation learning from the one before and expanding the
| pool of knowledge incrementally._
|
| Not true. No amount of such iteration gets you from buffalo
| cave paintings to particle accelerators.
|
| Humans generate knowledge by acting in the world, not by
| dwelling on our thoughts. The empiricists won a very long
| time ago.
| brookst wrote:
| It's not binary. Humans generate plenty of knowledge from
| pure abstract thought.
| kerkeslager wrote:
| Do they?
|
| When I pursued creative writing in my teens and early
| 20s, it became clear to me that originality is
| _extremely_ difficult. I am not entirely sure I have _ever_
| had an original thought--every idea I've put to paper
| thinking it was original, I later realized was a
| recombination of ideas I had come across somewhere else.
| The only exceptions I've found were places where I had a
| fairly unusual experience which I was able to interpret
| and relate, i.e. a unique interaction with the world.
|
| Perhaps more importantly, LLMs do not contain any
| mechanism which even attempts to perform pure abstract
| thought, so even if we accept the questionable assumption
| that humans can generate ideas _ex nihilo_ , that doesn't
| mean that LLMs can.
| jiggawatts wrote:
| Unless your argument is that all creative writing is
| _inspired by God_, or some similar "external" source,
| then clearly a closed system such as "humanity" alone is
| capable of generating new creative works.
| jiggawatts wrote:
| You're right, we obtained the knowledge externally. It was
| aliens! I knew it!
| scotty79 wrote:
| If you repeatedly apply one of three simple functions picked at
| random you might end up with Sierpinski triangle.
| eminent101 wrote:
| This sounds fascinating! I know what a Sierpinski triangle is,
| but I'm having some trouble seeing the connection from picking
| functions randomly to the triangle. Is there some graphic or
| animation somewhere on the web that someone can point me to,
| to visualize this better?
| scotty79 wrote:
| You can read the Chaos Game section here:
|
| https://en.m.wikipedia.org/wiki/Sierpi%C5%84ski_triangle
|
| It basically uses the fact that the fractal is self-similar.
| So picking one function (one that scales the whole triangle
| down onto one of its three sub-triangles) and transforming a
| single point on the fractal into a new point also gets you a
| point on the fractal.
|
| If you repeat this process many times you get a lot of
| points of the fractal.
|
| You can even start the process at any point and it will
| "get attracted" to the fractal.
|
| That's why fractals are called strange attractors.
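|
| A small sketch of the chaos game, if you want to see it appear
| (assumes matplotlib; the numbers are just the standard
| construction):
|
|     import random
|     import matplotlib.pyplot as plt
|
|     vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]  # triangle corners
|     x, y = 0.3, 0.3                    # arbitrary starting point
|     points = []
|     for _ in range(50_000):
|         vx, vy = random.choice(vertices)   # pick one of three maps at random
|         x, y = (x + vx) / 2, (y + vy) / 2  # move halfway toward that corner
|         points.append((x, y))
|
|     xs, ys = zip(*points[100:])  # drop early iterates before attraction
|     plt.scatter(xs, ys, s=0.1)
|     plt.gca().set_aspect("equal")
|     plt.show()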
| ofrzeta wrote:
| What about something like image improvement algorithms or
| NeRFs? They seem to increase information even if some of it is
| made up.
| dartos wrote:
| Do they gain information, or just have lower loss?
|
| Too much information encoded in a model can lower performance
| (this is called overfitting).
|
| That's why many NN topologies include dropout layers.
| antihero wrote:
| It isn't real information, though. This is effectively a game
| of Chinese whispers.
|
| The only way AI can create information is by doing something
| in the real world.
| ofrzeta wrote:
| Maybe information needs to be understood relationally as in
| "information for a subject x". So if we have an image with
| a license plate that is unreadable and there's an algorithm
| that makes it readable to x, there is an information gain
| for x, although the information might have been in the
| image all along.
| bippihippi1 wrote:
| eliminating the noise makes the useful information
| clearer, but the information describing the noise is lost
| vunderba wrote:
| Sure, but what if the upscaling algorithm misinterpreted
| a P as an F? Without manual supervision/tagging, there's
| an inherent risk that this information will have an
| adverse effect on future models.
| ablob wrote:
| If the license plate was not readable, then the
| additional information is false data. You do not know
| more about the image than you knew before by definition.
| Replacing pixels with plausible data does not mean a gain
| of information. If anything, I'd argue that a loss of
| information occurs: The fact that x was hardly
| readable/unreadable before is lost, and any decision
| later on can not factor this in as "x" is now clearly
| defined and not fuzzy anymore.
|
| Would you accept a system that "enhances" images to find
| the license plate numbers of cars and fine their owners?
| If the plate number is unreadable the only acceptable
| option is to not use it. Inserting a plausible number and
| rolling with it even means that instead of a range of
| suspects, only one culprit can be supposed. Would you
| like to find yourself in court for crimes/offenses you
| never committed because some black box decided it was a
| great idea to pretend it knew it was you?
|
| Edit: I think I misunderstood the premise. Nonetheless my
| comment shall stay.
| jppittma wrote:
| It's information taken from many other photos and embedded
| into a single one of interest no?
| dragonwriter wrote:
| > The only way AI can create information is by doing
| something in the real world.
|
| Everything done is done in the real world, but the only way
| an AI can gather (not create) information _about some
| particular thing_ is to interact with that thing. Without
| interacting with anything external to itself, all
| information it can gather is the information already
| gathered to create it.
| tomrod wrote:
| Is there a formalization of this idea? Would love to read
| more.
| Lerc wrote:
| It is real information, it is just information that is not
| targeted at anything in particular. Random passwords are,
| well, random. That they are random _and_ information is
| what makes them useful as passwords.
|
| As said by others, there is nothing terribly insightful
| about making something estimate the output of another via a
| non-perfect reproduction mechanism and noticing the output
| is different. Absent any particular guidance, the difference
| will not be targeted. That is tautologically obvious.
|
| The difference is still information though, and with
| guidance you can target the difference to perform some
| goal. This is essentially what the gpt-o1 training was
| doing. Training on data generated by itself, but only when
| the generated data produced the correct answer.
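|
| A rough sketch of that kind of guided self-training (the
| helpers below are hypothetical stand-ins, not any particular
| API):
|
|     def model_generate(question: str, n_samples: int) -> list[str]:
|         """Hypothetical: sample candidate solutions from the model."""
|         raise NotImplementedError
|
|     def check_answer(question: str, solution: str) -> bool:
|         """Hypothetical: verify a solution against the known answer."""
|         raise NotImplementedError
|
|     def build_self_training_set(questions: list[str], n_samples: int = 8):
|         kept = []
|         for q in questions:
|             for candidate in model_generate(q, n_samples):
|                 if check_answer(q, candidate):  # keep only verified outputs
|                     kept.append((q, candidate))
|         return kept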
| blensor wrote:
| Once more and more new training images are based off of those
| upscaled images, the training of those upscaling algorithms
| will tend to generate even more of the same type of
| information, drowning out the other information.
| vunderba wrote:
| If the goal of an image improvement algorithm is effectively
| "how would this image have looked _IN THE REAL WORLD_ if it
| had been taken with a better camera", then training on
| previous "virtual upscaled images" would be training on the
| wrong fitness function.
| dragonwriter wrote:
| "Made up" information is noise, not signal (OTOH, generated
| in images are used productively all the time in training, but
| the information content added is not in the images themselves
| but in their selection and relation to captions.)
| raincole wrote:
| Image improvement algorithms are basically injecting
| statistical information (collected from other images) into
| one image.
|
| The above statement applies for non-neural-network algorithms
| as well.
| cruffle_duffle wrote:
| So correct me if I'm wrong here but wouldn't another way to
| look at this be something like re-compressing a JPEG? Each time
| you compress a compressed jpeg you strip more and more
| information out of it? Same with any lossy compression, really.
|
| These LLM's are inherently a bit like lossy compression
| algorithms. They take information and pack it in a way that
| keeps its essence around (at least that is the plan). But like
| any lossy compression, you cannot reconstruct the original.
| Training a lossy compression scheme like an LLM using its own
| data is just taking that already packed information and
| degrading it.
|
| I hope I'm right framing it this way because ultimately that is
| partly what an LLM is, it's a lossy compression of "the entire
| internet". A lossless model that can be queried like an LLM
| would be massive, slow and probably impossible with today's
| tech.
|
| I suspect that we will develop new information theory that
| mathematically proves these things can't escape the box they
| were trained in, meaning they cannot come up with new
| information that isn't already represented in the relationships
| between the various bits of data they were constructed with.
| They can "only" find new ways to link together the information
| in their corpus of knowledge. I use "only" in quotes because
| simply doing that alone is pretty powerful. It's connecting the
| dots in ways that haven't been done before.
|
| Honestly the whole LLM space is cool as shit when you really
| think about it. It's both incredibly overhyped yet very under
| hyped at the same time.
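|
| A quick sketch of that generation-loss framing (assumes Pillow
| and NumPy; "photo.jpg" is a placeholder path). In practice the
| error tends to plateau once the encoder's output becomes a
| near fixed point, but the loss from the first passes never
| comes back:
|
|     import io
|     import numpy as np
|     from PIL import Image
|
|     original = Image.open("photo.jpg").convert("RGB")
|     current = original
|     for step in range(1, 21):
|         buf = io.BytesIO()
|         current.save(buf, format="JPEG", quality=75)  # lossy re-encode
|         buf.seek(0)
|         current = Image.open(buf).convert("RGB")
|         if step % 5 == 0:
|             diff = np.abs(np.asarray(current, float) -
|                           np.asarray(original, float)).mean()
|             print(f"{step} re-encodes: mean abs pixel error = {diff:.2f}")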
| entangledqubit wrote:
| Relevant article by a fun author:
| https://www.newyorker.com/tech/annals-of-technology/chatgpt-...
| wslh wrote:
| > with a non-reversible function f into f(x) then you are
| losing information.
|
| A non-reversible function f does not necessarily lose
| information. Some non-reversible functions, like one-way
| functions used in cryptography, can be injective or even
| bijective but are computationally infeasible to invert, which
| makes them practically irreversible while retaining all
| information in a mathematical sense. However, there is a subset
| of non-reversible functions, such as non-injective functions,
| that lose information both mathematically and computationally.
| It's important to distinguish these two cases to avoid
| conflating computational irreversibility with mathematical loss
| of information.
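|
| A tiny illustration of the distinction (toy functions, not
| real cryptography): f1 below is injective on the range used,
| so nothing is lost mathematically even though inverting a
| genuine one-way function would be computationally hard, while
| f2 collapses many inputs together.
|
|     p = 1_000_003                     # a prime modulus (toy-sized)
|
|     def f1(x):
|         # Modular exponentiation: injective here because the order
|         # of 5 modulo p is far larger than the range of inputs used.
|         return pow(5, x, p)
|
|     def f2(x):
|         return x % 10                 # non-injective: inputs collapse
|
|     xs = range(1, 10_000)
|     print(len({f1(x) for x in xs}))   # 9999 distinct: nothing lost
|     print(len({f2(x) for x in xs}))   # 10 distinct: most info gone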
| meltyness wrote:
| On the arguments modeling inference as simply some function f:
| the specific expression the OP used discounts that each
| subsequent application would follow some backpropagation, and
| so implies a new f' at each application, rendering the claim
| invalid.
|
| At that point, at least chaos theory is at play across the
| population of natural language, if not some expressed but not
| yet considered truth.
|
| This invalidates the subsequent claim about the functions that
| are convolved as well; I think all the GPUs might have
| something to say about whether the bits changing the layers
| are random or correlated.
| habitue wrote:
| This seems obvious, but you're forgetting the inputs may
| actually have low entropy to begin with. Lossy compression is
| non-reversible, but usually the expectation is that we don't
| care about the parts we lost.
|
| How might this cash out with recursive LLMs? Generalizing is
| very similar to compression: imagine recovering the Schrodinger
| equation from lots of noisy physical experiments. You might
| imagine that an LLM could output a set of somewhat general
| models from real data, and training it on data generated from
| those models generalizes further in future passes until maybe
| it caps out at the lowest entropy model (a theory of
| everything?)
|
| It doesn't seem like it actually works that way with current
| models, but it isn't a foregone conclusion at the mathematical
| level at least.
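|
| As a toy version of that point (made-up numbers; the "model
| class" here is a straight line that happens to match the
| truth): refitting on data generated from the previous fit does
| not have to compound error when the underlying signal really
| is low-entropy.
|
|     import numpy as np
|
|     rng = np.random.default_rng(7)
|     true_w, true_b = 2.0, -1.0
|     x = rng.uniform(-1, 1, size=200)
|     y = true_w * x + true_b + rng.normal(0, 0.5, size=x.size)  # noisy data
|
|     w, b = np.polyfit(x, y, 1)      # generation 0: fit to real data
|     for gen in range(1, 6):
|         x = rng.uniform(-1, 1, size=200)
|         y = w * x + b + rng.normal(0, 0.5, size=x.size)  # model-generated
|         w, b = np.polyfit(x, y, 1)
|         print(f"gen {gen}: w = {w:+.3f}, b = {b:+.3f} "
|               f"(true: {true_w:+.1f}, {true_b:+.1f})")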
| mmastrac wrote:
| All work and no play makes Jack a dull boy.
| goose- wrote:
| My takeaway after scanning the paper -
|
| In an ideal setting, a trained model learns exactly the real
| world probability distribution, and generates data
| indistinguishable from those sampled from the real world.
| Training on them would be fine, but pointless, since the model is
| already a perfect representation of the real world.
|
| Practically, however, a model is only a lossy approximation of
| the real world probability distribution. Repeated self-training
| would simply compound the loss - amplifying both the probable and
| the improbable.
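|
| A toy sketch of that compounding loss (not the paper's
| experiments): each "generation" below is just an empirical
| distribution fit to a finite sample drawn from the previous
| one. Rare values vanish first and the spread of the data
| shrinks.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     n = 1_000
|     data = rng.normal(0.0, 1.0, size=n)   # generation 0: "real world" data
|
|     for gen in range(0, 31, 5):
|         print(f"gen {gen:2d}: std = {data.std():.3f}, "
|               f"distinct = {np.unique(data).size}, "
|               f"max |x| = {np.abs(data).max():.2f}")
|         for _ in range(5):
|             # Sample the next generation from the current "model".
|             data = rng.choice(data, size=n, replace=True)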
| tkgally wrote:
| This paper was first published in May 2023 and discussed on HN
| the following month:
|
| https://news.ycombinator.com/item?id=36319076
|
| Some research since seems to add nuance to its conclusions:
|
| https://arxiv.org/abs/2404.01413
| CaptainFever wrote:
| > The proliferation of generative models, combined with
| pretraining on web-scale data, raises a timely question: what
| happens when these models are trained on their own generated
| outputs? Recent investigations into model-data feedback loops
| proposed that such loops would lead to a phenomenon termed
| model collapse, under which performance progressively degrades
| with each model-data feedback iteration until fitted models
| become useless. However, those studies largely assumed that new
| data replace old data over time, where an arguably more
| realistic assumption is that data accumulate over time. In this
| paper, we ask: what effect does accumulating data have on model
| collapse? We empirically study this question by pretraining
| sequences of language models on text corpora. We confirm that
| replacing the original real data by each generation's synthetic
| data does indeed tend towards model collapse, then demonstrate
| that accumulating the successive generations of synthetic data
| alongside the original real data avoids model collapse; these
| results hold across a range of model sizes, architectures, and
| hyperparameters. We obtain similar results for deep generative
| models on other types of real data: diffusion models for
| molecule conformation generation and variational autoencoders
| for image generation. To understand why accumulating data can
| avoid model collapse, we use an analytically tractable
| framework introduced by prior work in which a sequence of
| linear models are fit to the previous models' outputs. Previous
| work used this framework to show that if data are replaced, the
| test error increases with the number of model-fitting
| iterations; we extend this argument to prove that if data
| instead accumulate, the test error has a finite upper bound
| independent of the number of iterations, meaning model collapse
| no longer occurs.
|
| TL;DR: This paper confirms that Model Collapse can happen if
| the original data is replaced with synthetic data, but if both
| are used alongside each other, it no longer happens.
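|
| As a rough illustration of that contrast (a one-dimensional
| Gaussian stand-in, not the paper's setup): the chain refit only
| on the latest synthetic sample drifts away from the original
| distribution, while the chain that keeps the real data in a
| growing pool stays anchored to it.
|
|     import numpy as np
|
|     rng = np.random.default_rng(1)
|     n = 50
|     real = rng.normal(0.0, 1.0, size=n)
|
|     def fit(d):
|         return d.mean(), d.std()
|
|     mu_r, sd_r = fit(real)      # "replace" chain: latest synthetic only
|     pool = real.copy()          # "accumulate" chain: real + all synthetic
|     mu_a, sd_a = fit(pool)
|
|     for gen in range(1, 61):
|         mu_r, sd_r = fit(rng.normal(mu_r, sd_r, size=n))
|         pool = np.concatenate([pool, rng.normal(mu_a, sd_a, size=n)])
|         mu_a, sd_a = fit(pool)
|         if gen % 15 == 0:
|             print(f"gen {gen:2d}: replace sd = {sd_r:.2f}, "
|                   f"accumulate sd = {sd_a:.2f}")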
| aucisson_masque wrote:
| > the value of data collected about genuine human interactions
| with systems will be increasingly valuable in the presence of
| content generated by LLMs in data crawled from the Internet.
|
| Does it mean that data-hungry corporations like Google, Facebook,
| Amazon, and Microsoft-backed OpenAI, which are already all around
| the internet and on our phones tracking us, have an incredible
| advantage over open-source models?
|
| Is that why Google is pushing Gemini so hard on Android even
| though it's half-assed? Do they need fresh human data that badly
| to be able to compete and beat the competition?
| piva00 wrote:
| > Does it mean that data-hungry corporations like Google,
| Facebook, Amazon, and Microsoft-backed OpenAI, which are already
| all around the internet and on our phones tracking us, have an
| incredible advantage over open-source models?
|
| Yes, absolutely. Back around 2017 The Economist ran an article
| declaring "data is the new oil"; I first heard that from a VC
| back in 2010.
|
| These companies are sitting on immense reserves of data; Google,
| Facebook, Amazon, and ByteDance are the Saudi Arabia, UAE, etc.
| of the information age.
| bgilroy26 wrote:
| The quality of reddit's data is different from other data I
| encounter online.
|
| It represents information more closely related to people's
| lives. People share information there that is closely related
| to the topic of the subreddit. This may not always be the
| case, but even though I spend much, much less time on reddit
| than I did in 2011, many, many people are contributing to
| this day.
|
| That spigot of connection to the real world through text
| sounds valuable to AI, based on TFA. I feel the oil analogy
| would be about the quality and the ease of extraction of the
| stake.
| XorNot wrote:
| While I'm sure the anti-AI people are taking this and running off
| with hot takes, the conclusion is still much more mundane: we
| currently do not have the ability to have an LLM learn from
| another LLM.
|
| A suitably powerful AI _should_ be able to do this though, as
| evidenced by the fact that humans learn by being taught by other
| humans (insert the nuances of that process here).
|
| So it's an important result, but not a doomsday result because
| what it tells us is that LLM output fails to capture or stabilize
| important information from the training corpus and accurately
| communicate it to a newly trained LLM. So we know we're missing
| something in how we construct these models, but the ramifications
| of solving it are also pretty immense: models being able to
| "teach" new models means the whole cycle of iteration can be sped
| up considerably.
| kouteiheika wrote:
| > we currently do not have the ability to have an LLM learn
| from another LLM
|
| We do. It's called model distillation, and it's relatively
| straightforward.
|
| In fact, training a smaller model on the outputs of a much
| bigger model will significantly cut down on your training
| time/create a higher quality model than just training on raw
| human data (which is often low quality and noisy).
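|
| For reference, the core of the idea is small. A generic sketch
| (PyTorch-style soft-label distillation, not any specific
| library's recipe): the student is trained to match the
| teacher's softened output distribution, optionally blended
| with the usual hard-label loss.
|
|     import torch
|     import torch.nn.functional as F
|
|     def distillation_loss(student_logits, teacher_logits, labels,
|                           temperature=2.0, alpha=0.5):
|         # Soft part: KL between temperature-softened teacher and student.
|         soft = F.kl_div(
|             F.log_softmax(student_logits / temperature, dim=-1),
|             F.softmax(teacher_logits / temperature, dim=-1),
|             reduction="batchmean",
|         ) * (temperature ** 2)
|         # Hard part: ordinary cross-entropy against ground-truth labels.
|         hard = F.cross_entropy(student_logits, labels)
|         return alpha * soft + (1.0 - alpha) * hard
|
|     # Dummy usage with random tensors.
|     s = torch.randn(8, 1000, requires_grad=True)
|     t = torch.randn(8, 1000)
|     y = torch.randint(0, 1000, (8,))
|     distillation_loss(s, t, y).backward()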
| suprjami wrote:
| It has existed for years:
|
| Self-Instruct: Aligning Language Models with Self-Generated
| Instructions https://arxiv.org/abs/2212.10560
|
| airoboros: using large language models to fine-tune large
| language models https://github.com/jondurbin/airoboros
| lowyek wrote:
| I no longer take limitations seriously regarding the future of
| AI. If evolution created our brain, then the same law applies to
| what we are building also. Hence, more or less whatever is
| written in this paper is some nuanced case which can be solved
| by some approach.
| Scene_Cast2 wrote:
| There is a mantra in ML that has been around for a while. It's
| that when training on synthetic data, your learned model is only
| as good as your generator model.
| etiam wrote:
| Catchy! And a really good point.
|
| Seems like there could be room for a couple of special
| situations with caveats though? With the GAN formulation your
| generator can be practically as good as your discriminator and
| your discriminator can probably be better than it would have
| been without adversarial regularization?
| f3z0 wrote:
| Given that the top Google results are now generated, I think we
| already have a massive recursion problem. I think we would
| benefit from training a model specifically to detect the
| likelihood of content being generated, and then biasing other
| models against the higher-likelihood generated content so that
| we don't end up with LLM echo chambers.
| tempodox wrote:
| Isn't everybody always gushing about how LLMs are supposed to
| get better all the time? If that's true then detecting
| generated fluff will be a moving target and an incessant arms
| race, just like SEO. There is no escape.
| LegionMammal978 wrote:
| Yep, that's what I've been thinking since people started
| talking about it. I hear that AI plagiarism detectors can
| never work, since LLM output can never be detected with any
| accuracy. Yet I also hear that LLMs-in-training easily sift
| out any generated content from their input data, so that
| recursion is a non-issue. It doesn't make much sense to have
| it both ways.
| ipython wrote:
| I wonder if the truth about sifting out synthetic training
| data is based on signals separate from the content itself.
| Signals such as the source of the data, reported author,
| links to/from etc.
|
| These signals would be unavailable to a plagiarism/ai
| detector
| eddyfromtheblok wrote:
| Right. Google already has a solution:
| https://deepmind.google/technologies/synthid/
| But everyone insists on training theirs to look human-generated,
| so the horses have left the stable on this.
| tempodox wrote:
| Indeed, ingesting generated bluster gives them cancer of the
| perceptron.
| axegon_ wrote:
| That was very much evident even back when the first GPTs came
| out. The moment you started introducing synthetic data, the
| quality plummeted.
|
| But there is another use case where LLMs can truly help with
| synthetic data: the more classical classification and regression
| problems - specifically gathering training data. I had this exact
| case at work two days ago: a large dataset with a small subset of
| labeled data. For a binary classifier, there was a huge imbalance
| in the data - the ratio was roughly 75-25. I did not have the
| desire to do all this manually, so I used an LLM to get a list
| that would even out the numbers (and get a 50-50 ratio). And
| using the data I had, plus the additional synthetic data, the
| accuracy of my small classifier ended up picture-perfect (given
| that my actual target was "85-90%" accuracy and the actual result
| was just shy of 99%).
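|
| Roughly, the workflow looked like the sketch below (ask_llm is
| a hypothetical stand-in for whatever model/API you use; the
| point is only the balancing logic, and the generated examples
| still deserve a manual spot check):
|
|     from collections import Counter
|
|     def ask_llm(prompt: str, n: int) -> list[str]:
|         """Hypothetical: return n generated examples for the prompt."""
|         raise NotImplementedError("plug in your LLM client here")
|
|     def balance_binary_dataset(texts: list[str], labels: list[int]):
|         counts = Counter(labels)               # assumes labels are 0/1
|         minority = min(counts, key=counts.get)
|         deficit = counts[1 - minority] - counts[minority]
|         synthetic = ask_llm(
|             f"Write {deficit} short examples of class {minority}, "
|             f"one per line.", n=deficit)
|         return texts + synthetic, labels + [minority] * deficit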
| kerkeslager wrote:
| I'd argue that the case you give isn't an example of using a
| computer to generate data, it's a case of a human adding data
| (the data being the fact that the binary classifier should have
| a 50/50 balance).
|
| This sort of massaging of data has its drawbacks as well--
| obviously this only works if the balance of that binary
| classifier actually is 50/50 in reality: I don't know enough
| about your case to say you were wrong, but I can imagine a lot
| of scenarios where a binary classifier should not be
| represented 50/50 in the data.
| axegon_ wrote:
| This is a question of definition. It is synthetic in that I
| just passed a prompt asking for N examples of X. And I did not
| go over the entire list I got and blindly trusted it. In this
| context, I needed an even (or nearly even) distribution of
| samples in the training data, and it worked way better than I
| was hoping. Mind you, I have to face a similar issue next week
| and I'm not sure this approach would cut it - I need way more
| training data and way more classes to work with. 249 classes,
| if I'm not mistaken.
| kerkeslager wrote:
| I question what "it worked way better than I was hoping"
| means in this context. If you're saying that filtering the
| input data to create a uniform distribution created a
| uniform distribution in the output, I'm not sure why you'd
| hope for any less--that's exactly what I'd expect to
| happen. But that's a poor measure of success, because you
| don't know what side effects that had: the removed data
| ostensibly contained other variables besides your binary
| variable, and you don't know if those variables were
| sampled in any useful way, so I'd be hesitant to say this
| worked well without at least an attempt to measure those
| other variables.
| low_tech_love wrote:
| Just curious, but did you compute that 99% using purely real
| test data, or does your test set also include artificial data?
| axegon_ wrote:
| Apart from using the results from the
| training/testing/validation sets? Several people manually
| went over several thousand random samples.
| kerkeslager wrote:
| Isn't this obvious?
|
| I'm glad this was published to point out the problem, but I'm a
| bit puzzled why people tried to train models on generated data in
| the first place. Synthetic data... isn't data.
|
| The one exception I can see is medical data, where synthetic data
| can be used to avoid violating people's privacy, but even in that
| case it's clearly not ideal from a technical perspective.
| quantadev wrote:
| To me it seems intuitive that training on any unseen word
| patterns should increase intelligence as long as said patterns
| are consistent with ground truth. That's why it's counter-
| intuitive (to me) that training can fail, purely based on where
| the training data came from. The source of the information is
| something only the universe itself should be able to take into
| consideration (full causality chain), and not the training
| process.
| kerkeslager wrote:
| I am unable to parse what you're saying here.
| quantadev wrote:
| I was just saying it's counter-intuitive that the "source"
| of any training data would ever matter as much as the
| "correctness" of the data; but you're right, that was very
| sloppy wording on my part, sorry.
|
| Here's a longer, related post, from me (albeit also
| confusing, haha):
|
| https://news.ycombinator.com/item?id=42352759
| meltyness wrote:
| My intuition, given the rapid, informal development of agent-type
| systems, is that this is obvious insofar as the initial dataset
| was formed from a huge hidden "data cleaning" task: human
| evolution and society. This isn't really that interesting of a
| claim, and is it clear that it holds if you simply loop the LLM
| back onto the data-cleaning task itself, as a critic of the new
| training set? Is this what the author would classify as fine-
| tuning?
|
| Another question: what is the interpretation of an LLM's output
| when it is unprompted? Isn't that always effectively garbage when
| there's no deliberate bias in the training set?
| quantadev wrote:
| Like Sam Altman and Dario Amodei, who both believe this is a very
| real possibility, I think the "intelligence" in LLMs may be far
| deeper than we know and somehow even related to "Multiverse
| Theory", where perhaps every Quantum Mechanical collapse (and
| computation during training) makes "our" universe slightly more
| likely to lean towards ones where AI is just "magically smart"
| (from a purely Anthropic Principle effect) than dumb. The reason
| this could happen is that in all our futures AI has saved us in
| some way, so that all other "Multiverse Branches" are sort of
| dead ends.
|
| So the theory about why training on self-generated data is
| unexpectedly inefficient could be that LLMs are "using" the full
| Causality Chain (via some advanced, unknown physics related to
| time itself) of our universe/timeline, and so if a model tries to
| train on its own output, that's a "short circuit" kind of effect,
| cutting off the true Causality Chain (the past history of the
| universe).
|
| For people who want to remind me that LLM training is fully
| "deterministic" with no room for any "magic", the response to
| that counter-argument is that you have to consider even the input
| data to be part of what's "variable" under the Anthropic
| Selection Principle, so there's nothing inconsistent about
| determinism in this speculative, and probably unfalsifiable,
| conjecture.
___________________________________________________________________
(page generated 2024-12-07 23:01 UTC)