[HN Gopher] Characterizing Emergent Phenomena in Large Language ...
___________________________________________________________________
Characterizing Emergent Phenomena in Large Language Models
Author : fofoz
Score : 64 points
Date : 2022-12-19 12:07 UTC (10 hours ago)
(HTM) web link (ai.googleblog.com)
(TXT) w3m dump (ai.googleblog.com)
| xpe wrote:
| Wikipedia has a fine definition of what _emergent_ means:
|
| > In philosophy, systems theory, science, and art, emergence
| occurs when an entity is observed to have properties its parts do
| not have on their own, properties or behaviors that emerge only
| when the parts interact in a wider whole.
|
| The linked article uses this definition:
|
| > we discuss the phenomena of emergent abilities, which we define
| as abilities that are not present in small models but are present
| in larger models
|
| The concept in the paper has to do with capabilities / abilities
| that grow non-linearly as a function of model size. This is
| distinctly different from _emergent behavior_ in systems theory.
|
| <opinion>The authors and reviewers could find a better word for
| their concept. There is no need to muddle the concept.</opinion>
|
| Furthermore, the idea that networks of certain sizes are
| necessary for certain kinds of representational abilities is not
| new. Perhaps a term exists already?
| xpe wrote:
| This comment says it more eloquently than I did:
| https://news.ycombinator.com/item?id=34051845
| mjburgess wrote:
| > we discuss the phenomena of emergent abilities, which we define
| as abilities that are not present in small models but are present
| in larger models
|
| Reading anything by major researchers in AI feels like an
| adversarial battle where they're trying to misuse as much
| technical, scientific, and philosophical language as possible,
| and those of us in adjacent fields are trying to hold the line.
|
| In philosophy and esp. the philosophy of science, emergence is a
| relation between a whole and its parts such that a property of
| the whole does not obtain just in virtue of properties of its
| parts taken in isolation. "Emergence" has this prior positive,
| semi-magical, scientific association which confuses the issue in
| this case.
|
| No properties of the LLM obtain from its parts differently as
| parameters scale; the mechanism is the same. The performance
| differs not due to emergence, but due to the "modelling gap"
| present between the statistical structure of free text and that
| of mathematics. With enough examples, the gap closes... indeed,
| you can model the addition function (add(x, y) = x + y) just by
| an infinite sample of its domains.
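|
| (A rough sketch of that last point -- plain numpy, nothing to do
| with the article's setup: fit a linear map to sampled (x, y)
| pairs and it recovers add() almost exactly.)
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     X = rng.uniform(-100, 100, size=(10_000, 2))  # sampled (x, y)
|     t = X.sum(axis=1)                             # targets: x + y
|
|     # least-squares fit of t ~ a*x + b*y
|     coef, *_ = np.linalg.lstsq(X, t, rcond=None)
|     print(coef)           # ~[1.0, 1.0], i.e. add() recovered
|     print(X[:3] @ coef)   # predictions match X[:3].sum(axis=1)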
|
| A better technical term here might be "scale-dependent
| capabilities". For LLM, simple arithmetic is extremely scale
| dependent, whereas basic text generation is less-so. The reason
| for this seems obvious, as given above... so the use of the term
| "emergence" here I interpert as more PRish mystification.
| PartiallyTyped wrote:
| > that a property of the whole does not obtain just in virtue
| of properties of its parts taken in isolation
|
| Thank you. I have been expressing variants of this for a while.
| A paper that comes to mind is OpenAI's hide and seek. They
| claim that cooperation is emergent behaviour, but each agent is
| playing its own version of the prisoner's dilemma, and thus
| learns to cooperate.
| visarga wrote:
| That model was not learning from language; it was learning
| from a simulation. When you can use a simulation to produce
| training data it is possible to have a model discover new
| abilities all on its own, like AlphaGo.
| ot wrote:
| I believe the term is reasonably appropriate here.
|
| The abilities being described here are "emergent" in the sense
| that the model was not specifically trained for them, but they
| show up anyway. Your example is about modeling a specific
| function and having its accuracy increase with model
| complexity, which is the classical ML formulation, but this is not
| what is happening here.
|
| LLMs are trained on a very simple task: given a natural text,
| predict the next word. But as model complexity and training set
| sizes increase, they start exhibiting more sophisticated
| abilities, such as basic forms of reasoning, and contextual
| memory.
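|
| (For concreteness, the entire training signal is roughly the
| sketch below -- a toy PyTorch stand-in, not anything from the
| article; the more sophisticated abilities come from scaling this
| same objective, not from a different loss.)
|
|     import torch
|     import torch.nn.functional as F
|
|     # toy "LM": embedding + linear head, standing in for a
|     # transformer; vocab and sizes are made up
|     vocab, dim = 100, 32
|     emb = torch.nn.Embedding(vocab, dim)
|     head = torch.nn.Linear(dim, vocab)
|
|     tokens = torch.randint(0, vocab, (1, 16))  # a batch of ids
|     logits = head(emb(tokens[:, :-1]))         # score next token
|     loss = F.cross_entropy(logits.reshape(-1, vocab),
|                            tokens[:, 1:].reshape(-1))
|     loss.backward()       # "predict the next word" -- that's it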
|
| In your definition, the parts of the whole are "lots of
| statistics about text" and the emergent property is "semantic
| reasoning".
|
| Scale is inevitably a part of this: somewhere else in the
| thread you mention that "liquidity" is an emergent property of
| H2O, but if you take a handful of H2O molecules they don't
| behave as a liquid.
| YeGoblynQueenne wrote:
| Thanks for bringing some sense into the debate. It's
| scandalising to see how the machine learning research community
| is so ready to jump on to such ... innovative uses of
| terminology.
|
| Take "few shot learning", for instance. OpenAI's preprint
| introducing GPT-3 was titled "Large Language Models are Few
| Shot Learners" [1]. This was promptly adopted by the community,
| even though LLMs need first to be trained on millions of
| examples before they can accept "few", and they don't even
| _learn_ from those few (because no weight updates). So it 's
| not really "few shot" and it's not really "learning", and yet,
| here we are. "Large Language Models are few-shot learners" and
| nobody bats an eyelid anymore.
|
| Which is not to say we have to shut up and take it. I
| personally draw inspiration from the tale of the little child
| who pointed out the King's sartorial negligence.
|
| ________________
|
| [1] That, in itself, is a title designed to claim some
| groundbreaking progress not just in technological capabilities
| but also in scientific understanding. LLMs are few-shot
| learners, man! Few-Shot!
| mgraczyk wrote:
| I don't see the issue here. The exact same definition of "few
| shot learning" has been used for at least 20 years. Nothing
| changed with the GPT-3 paper.
|
| The definition is something like
|
| Given a task, a few shot learner is an algorithm that
| generalizes well with only a small number of training
| examples for that task.
|
| The same definition is what I'm familiar with from undergrad.
| Do you know of a different definition that precedes GPT-3?
| YeGoblynQueenne wrote:
| I reference the GPT-3 preprint because it's the source of
| the latest twist to the meaning of "few-shot learning".
|
| >> Given a task, a few shot learner is an algorithm that
| generalizes well with only a small number of training
| examples for that task.
|
| I don't know where this definition comes from and I'd
| prefer if you had a more solid reference than your
| recollection of your undergraduate years, but it doesn't
| really matter because what you describe is not what LLMs
| do.
|
| The input prompts to GPT-3 and friends are not "training
| examples". Training examples are labelled instances of a
| concept to be learned - that's PAC-Learning. LLM prompts are
| not examples, and they're not used in training. They're
| input sequences that an already-trained model completes
| with a sequence of tokens with maximal probability given
| the input.
|
| That's indeed nothing new, it's how language generation
| with a language model works, and has always worked. But I
| don't remember ever hearing anyone referring to the input
| sequences given to Hidden Markov Models or Probabilistic
| Context-Free Grammars as "training examples", or the
| process of generating their completions referred to as
| "learning", let alone "few-shot learning". And yet, this
| kind of generation is exactly what LLMs do, too. Except of
| course LLMs have much smoother, larger models than any HMM
| or PCFG ever trained. But as the OP is arguing, that's not
| a qualitative difference, only a quantitative one, and
| renaming it is misleading. Doubly so if the renaming walks
| roughshod over long-established terminology, like "few
| shot", "learning" or "examples".
|
| Btw, the OpenAI GPT-3 preprint gives a definition of their
| "few shot" setting but it's informal, long-wided and
| overall a vague mess, so it's really no surprise that so
| much confusion and chaos is generated as a result.
| mgraczyk wrote:
| I disagree with what you're saying here.
|
| > The input prompts to GPT-3 and friends are not
| "training examples". Training examples are labelled
| instances of a concept to be learned - that's PAC-
| Learning. LLM prompts are not examples, and they're not
| used in training.
|
| In the PAC learning setting, these are training examples
| because you use the labels to select a function, in this
| case the conditional output given the few shot examples.
|
| Whether or not you actually update any weights has
| nothing to do with "few shot learning" and never has. In
| the PAC setting there are no weights, just model
| functions that depend on the training data.
|
| EDIT: The reason you didn't hear people refer to the
| inputs of HMMs as "training examples" is because HMMs are
| very poor few-shot learners. That's why GPT-3 is
| interesting, because it is a good few-shot learner.
|
| You could use an HMM as a few shot learner, but computing
| the results is expensive and the results are not good for
| most tasks.
| visarga wrote:
| > Remarkably, conditioning the model on such an "example-
| based specification" effectively enables the model to
| adapt on-the-fly to novel tasks whose distributions of
| inputs vary significantly from the training distribution.
| The idea that simply minimizing the negative log loss of
| a single next-word-prediction objective implies this
| apparent optimization of many more arbitrary tasks -
| amounting to a paradigm of "learning" during inference
| time - is a powerful one, and one that raises many
| questions.
|
| http://ai.stanford.edu/blog/in-context-learning/
| mjburgess wrote:
| The issue is that the aim is to model the one-shot learning
| of animals, not some other target.
|
| Animals are one-shot learners because we're in causally
| direct sensory-motor contact with reality, such that we can
| disambiguate it live.
|
| No train/predict system can disambiguate the causal origin
| of data and so can never be one-shot.
|
| What they're targeting is a triviality within a system of
| trivialities, and they're misdescribing it.
| mgraczyk wrote:
| But then you're saying that the definition has always
| been wrong, which is a very different claim.
|
| I personally think that claiming a definition has always
| been wrong is vacuous. Just substitute the word in your
| head for something else if you don't like it.
| xpe wrote:
| Well said. How did the authors and reviewers miss this?
| mjburgess wrote:
| The transition from useless to useful ML models no doubt
| often seems magical to researchers. But it follows just from
| the distribution of the training data and from the degree of
| its compression by the function approximation algorithm
| they're using.
|
| What's "magical" is not their system, but rather that the
| vast library of text they use for training has useful
| properties which can be approximated.
|
| What researchers are observing is more like the illusion of a
| "phase transition" in the quality of approximations. This
| illusion arises because we have discontinuous standards for
| the approximations.
|
| I.e., when assessing free-text prediction by LLMs, there are
| very many ways for them to generate an acceptable answer. For
| mathematics, there's only one acceptable way.
|
| If we applied the same standard/goal to both, no such
| apparent "quality transition" would occur. LLMs would be
| exposed as equally good, or equally bad, at prediction
| regardless of scale.
| xpe wrote:
| Interesting arguments. They seem plausible and insightful.
| IMO, your analysis here deserves a longer write-up. Is it
| something you are working on?
| mjburgess wrote:
| Any person with a "scientific attitude" in this field
| would find it incredibly easy to observe that the
| training target for natural language is,
|
| f(Q) = {A1..An} -- n being very very large
|
| and the target for mathematics is,
|
| g(Q) = A1
|
| And the model they're using approximates with,
|
| m(Q) = A_guess
|
| So it's incredibly easy to model f with m because A_guess
| just has to be close to one of A1..An; and it's very hard
| to model mathematics because it has to be _only_ A1.
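|
| (A toy simulation makes the asymmetry concrete -- the numbers
| are purely illustrative, not measured from any model:)
|
|     import random
|     random.seed(0)
|
|     answers = list(range(1000))     # space of candidate answers
|
|     def m(q):                       # the model: a noisy guesser
|         return random.choice(answers)
|
|     trials = 10_000
|     # free text: any of ~100 answers is acceptable
|     free = sum(m(q) in random.sample(answers, 100)
|                for q in range(trials)) / trials
|     # mathematics: exactly one answer is acceptable
|     maths = sum(m(q) == 42 for q in range(trials)) / trials
|
|     print(free)    # ~0.10 -- looks "capable"
|     print(maths)   # ~0.001 -- same guesser looks "incapable"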
|
| The reason articles like this are written isn't because
| people don't know this; it's because they just do not have
| a sceptical attitude. And that's a problem I can't fix.
|
| If they'd approached this issue with the goal of "finding
| nothing surprising about this behaviour", i.e., trying to
| make it maximally consistent with existing (basic,
| trivial, widely-taught) theory, they'd reach this
| conclusion in under 5min.
|
| The problem is their goal is always _to find something
| surprising_ so they can release some PR about it. There's
| nothing I can write to fix this problem; it's endemic in
| the whole field.
|
| It makes ML/AI research much more like psychology than
| physics. Alas!
| baandang wrote:
| From Introduction To The Theory Of Complex Systems:
|
| * Complex systems can exhibit a rich phase structure and have
| a huge variety of macrostates that often cannot be inferred
| from the properties of the elements. This is sometimes
| referred to as emergence.
|
| This is the term and as common a term in complex systems as
| there is.
|
| "scale-dependent capabilities" implies inference from
| elements.
|
| I think some people just don't like the very idea, even though
| it is not unlike the concept of a stochastic process. I would
| think the same reasons for not liking the concept of emergence
| apply to stochastic processes. Even Murray Gell-Mann, though,
| couldn't raise complex systems above arguing over whether
| emergence is magical thinking, so it is probably a lost cause.
|
| Such an interesting field that always ends up as this
| conversation.
| naasking wrote:
| > "scale-dependent capabilities" implies inference from
| elements.
|
| Don't confuse the post-hoc explanation with an a priori
| inference. We can post-hoc explain water's emergent
| liquidity property using modern quantum theories, but that
| doesn't mean we could have inferred it if given
| Schrodinger's equation and the atomic structure of hydrogen
| and oxygen.
|
| "Scale-dependent capability" is a post-hoc explanation that
| of course looks obvious _in hindsight_ , just like
| liquidity and pressure looks obvious in hindsight once you
| understand electromagnetism and atomic theory.
| seydor wrote:
| Eh eh. Imagine how neuroscientists feel
| mjburgess wrote:
| Well, of late, neuroscientists have been fond of misusing
| "hallucinate" likewise, which _means_ a non-veridical
| perception. And they're using it to mean a veridical
| _constructed_ perception.
|
| Leading everyone down a path of ever-more mysticism.
|
| It would be nice if neuroscientists spoke out against both
| their own mystical PR and that of AI, but I don't hear it
| much.
| oidar wrote:
| I think the word confabulate would be a better match for
| what ChatGPT does. When people confabulate they are VERY
| confident about their invented retellings. This matches the
| attitude ChatGPT comes across with when it makes stuff up.
| fumeux_fume wrote:
| Is it that big of a deal? The authors explain their definition
| of emergent abilities at the beginning of the paper.
| mjburgess wrote:
| It's a mystification of what's going on -- the term makes it
| harder to understand, not easier. It's prone to popular
| misunderstanding, and it seems even to confuse the
| researchers themselves.
| HarHarVeryFunny wrote:
| I'm OK with the term emergent used here - it seems the best
| word to describe non-trivial capabilities/properties that
| weren't designed in. At least prior to the first LLMs, I think
| most people would just expect them to be capable of doing
| literally what they are trained to do - predict (locally
| plausible) next word(s), and this is certainly all that is
| "designed in" if we're talking about a basic LLM (vs one with
| further RL-based "alignment", etc).
|
| Of course we can also appreciate that to get REALLY REALLY good
| at "predict next word" would require intelligence/understanding
| of what is being generated, but I think the point here is would
| anyone - the model designers in particular - have expected that
| the transformer architecture (+ scale) is all it would take to
| become so good at this task? I don't think "attention is all
| you need" was really anticipating transformers reading API docs
| and generating code to perform requested functions! One might
| have expected it to take a much more elaborate and evolved
| architecture to achieve this level of capability!
|
| So, to me, it seems entirely appropriate to describe these
| models as having emergent capabilities - things they are
| capable of doing that are really so far above and beyond
| "predict next word", that it seems churlish and inaccurate to
| describe them as designed in (or even just confidently
| predicted).
| rafaelero wrote:
| What a waste of time to be worried about how people are using a
| word.
| maria2 wrote:
| Why should we let philosophy define technical terms for ML?
| Many words have many meanings. Welcome to the imprecision of
| human language.
|
| As a slight tangent, I really hate this type of comment that
| inevitably appears on many HN submissions. Personally, I find
| it distracting when the main conversation happening on an
| article is a pedantic discussion on whether or not a word means
| what the author thinks it means.
| naasking wrote:
| I think "scale-dependent capability" is a much more precise
| term for what they're describing, but I'm not sure that that
| term doesn't fall under the general umbrella of emergent
| properties. The opening paragraphs of emergent properties in
| philosophy [1] cites a number of examples that I would argue
| are comparable to an LLM suddenly becoming able to do arithmetic
| past a certain scale.
|
| > In philosophy and esp. the philosophy of science, emergence
| is a relation between a whole and its parts such that a
| property of the whole does not obtain just in virtue of
| properties of its parts taken in isolation.
|
| This is not a settled definition. I think everyone can agree
| that this applies epistemologically, where studying the parts
| in isolation cannot always yield sufficient information to
| predict macroscopic properties, but to claim that all
| properties of the whole do not reduce to the properties of its
| parts is controversial.
|
| For instance, it seems unlikely that we would have predicted
| H2O's dipole moment and the phenomenon of surface tension just
| from studying hydrogen atoms and oxygen atoms in isolation, but
| it would be incorrect to say that surface tension is not the
| result of the properties of hydrogen and oxygen in isolation.
| We simply cannot discover all the relevant properties without
| studying them together.
|
| Edit: to clarify, H2O's dipole moment seems obvious _in
| hindsight_ when we have a good model for what's going on, and
| analogously, LLM's ability to do arithmetic seems obvious as a
| scale-dependent property in hindsight, but that doesn't mean it
| was obvious that this would happen before it was created.
|
| [1] https://plato.stanford.edu/entries/properties-emergent/
| mjburgess wrote:
| Well, its scale dependence is an illusion.
|
| It's uniformly able to model Q->SingleAnswer problems, and
| uniformly able to model Q->ManyAnswer problems.
|
| Basic arithmetic is the former kind, and it becomes _useful
| to people_ at large scales, i.e., accurate on basic
| arithmetic.
|
| This dual condition of "useful-to-people" is the thing
| introducing the illusion, since it changes depending on what
| we're modelling. The system isn't acquiring any new property.
|
| Consider a researcher putting a book on a thin ice-sheet, and
| then putting a car on it. Here, they're concluding the ice
| has different properties in each case -- but it doesn't.
| naasking wrote:
| > This dual condition of "useful-to-people" is the thing
| introducing the illusion, since it changes depending on
| what we're modelling. The system isn't acquiring any new
| property.
|
| I have a couple of possible responses to this, but maybe
| the most obvious is that I'm not sure why "useful to
| people" can't qualify as a new property.
|
| For instance, a system that suddenly becomes useful to
| people can be transformative to society, which can lead to
| new emergent social or economic properties at the societal
| scale. To conclude that "useful to people" is not a
| meaningful property, aren't you basically implying that
| something that suddenly becomes useful cannot even in
| principle lead to new societal scale emergent properties?
| That seems dubious. Edit: or you're implying that emergent
| properties are not reducible to interactions between
| constituent properties, which also seems dubious.
|
| For a concrete example, the internet probably falls into
| this category. It has transformed society and led to new
| emergent properties at societal scales, but computers
| didn't suddenly acquire any new computational properties,
| or new properties to manipulate bits. Only the scale of
| their deployment changed, and that scaling was itself
| useful to people, and this led to new societal properties.
| That arguably can't happen unless "useful to people" is
| itself a meaningful property, no?
| darawk wrote:
| > Consider a researcher putting a book on a thin ice-sheet,
| and then putting a car on it. Here, they're concluding the
| ice has different properties in each case -- but it doesn't.
|
| This is just a linguistic shell game with the meaning of
| the word "property". You could just as easily say the
| difference between the mind of a human and a monkey is a
| matter of degree, and therefore going from one to the other
| does not gain any novel "property".
|
| It should be obvious that the degree of a property can
| fundamentally change its nature, and that there is no hard
| distinction between "properties" and degrees of things. The
| difference between a tickle and a gunshot is a matter of
| "degree", but that fact is of near zero semantic utility.
| mjburgess wrote:
| Emergence is about intrinsic properties, observer-
| independent properties.
|
| If emergence were about observer-relative properties it
| would be a meaningless term. My shoe gets an "emergent
| property" to hold my door open when I put it in a door
| way.
|
| This is mumbojumbo.
|
| Systems acquiring observer-relative "properties" are all
| well and good, but the claim here is a much stronger one.
|
| This gross misuse of language amounts to saying that
| "models with enough parameters to accurately approximate
| a function" have "emergent properties" that "models
| without enough parameters" do not have.
|
| This is a deeply silly way to describe the conditions
| under which one function can approximate another, and the
| rate at which that approximation converges to something
| useful.
| naasking wrote:
| > Emergence is about intrinsic properties, observer-
| independent properties. If emergence were about observer-
| relative properties it would be a meaningless term.
|
| I'm going to address this in case it was also intended to
| reply to my other comment about "useful to people"
| possibly being a property.
|
| "Useful to people" _would be_ an observer-independent
| property, if it's a property at all. An alien species
| analyzing humanity would come to the same conclusions as
| humans about whether some system, like the internet, was
| useful to people. This would be evident by whether the
| use of that system spread.
|
| As for whether it's "intrinsic", I'm not sure how you're
| applying this. As you said in a later comment, "Liquidity
| isn't [a property] of water without a container/pressure
| environment". In other words, liquidity isn't an
| intrinsic property of H2O. Moreover, the only reason we
| identified and created a label for "liquidity" is because
| it's useful to people, which is the very criterion that
| you're claiming should not be applied to describe some
| surprising scaling behaviour of LLMs.
|
| I just don't think you've made the distinction you're
| attempting to make clear, because there is parity of
| reasoning between the allegedly emergent properties you
| describe and those in the article.
| darawk wrote:
| Let's make this concrete. What in your mind is a specific
| example of a concrete system with an emergent property,
| then?
| mistermann wrote:
| > What in your mind is a specific example of a concrete
| system with an emergent property, then?
|
| Not sure if you're joking, but the brain (and
| consciousness/mind as an emergent phenomenon) is the
| classic example. Even better, it is what's causing the
| fundamental problems in this very conversation, due to
| the inconsistent manner in which it translates terms into
| meaning (ie: "emergent", "is"), typically not
| realizing[1] that meaning and reality are similar to the
| speed of light in that their appearances[2] vary
| depending upon the frame of reference of the observer.
|
| I am fairly optimistic that AI is going to "force
| humanity's cultural hands" such that we will eventually
| have to grapple with this long known but rarely discussed
| phenomenon. Though, I anticipate censorship, propaganda,
| _and consciousness_ will play heavy roles in such
| discussions and screw everything up, as is usually the
| case with fundamentally important ideas.
|
| [1] _During realtime cognition_, regardless of whether
| substantial abstract knowledge is in the person's
| possession.
|
| [2] _And maybe even the things themselves_, depending on
| how (or _from where_) you look at it - I am still
| undecided on this.
| mjburgess wrote:
| Liquidity is not a property of H2O molecules, but it is a
| property of water. And liquidity isn't a property of water
| without a container/pressure environment.
|
| The trajectories of particles of air are underdetermined
| by their own intrinsic properties so a tornado cannot be
| reduced to some mere aggregate of them.
|
| Emergence is an ontological relationship between
| properties of objects --- it isn't a mathematical
| property of a data distribution nor of an approximation
| function.
|
| The very use of the term has created all this confusion
| right here.
|
| Would anyone who thought NNs were showing emergence be
| also content to find out that the reason for this
| 'emergence' was just that in the case of so-called
| 'emergence' our expectations of performance were
| changing?
|
| Do we call it 'emergence' when we go from estimating data
| with mean() to estimating it with ax + b?
|
| There's definitely an illusion here, but one quite easy
| to spot if you weren't in the game of peddling illusions.
| aaroninsf wrote:
| Two observations,
|
| There is utility and need for some convention of feature
| description, for the case of the external behavior of the
| system being correct, for some domain, despite lack of
| specific training. It is reasonable for users of such
| systems to say that the internal states and
| representation don't matter, _so long as the behavior is
| correct_; and in cases like these we will benefit from
| _some_ consensus on how to talk about such things. Fine
| with me if some new term is applied.
|
| More of interest to me though is that it is not at all
| clear to me that _genuine_ emergence is not possible
| through scaling (independent of whether it _is_ in any
| given existing LLM). Because the optimal (most compact)
| correct representation for a lot of e.g. language output,
| is exactly that which benefits from abstraction.
|
| What reason is there to believe that the abstractions
| derived at higher levels (of the network generally but
| not necessarily, depends on the architecture) do not
| encode non-linear problem spaces in the world, which are
| "real" emergence?
|
| I.e. if the way some network learns arithmetic is to
| settle on an internal weighting that performs
| computation, rather than "memorizing assertions", me, I
| would call that "emergent."
|
| But I'm happy to use some other term should one, er,
| emerge.
| darawk wrote:
| > Liquidity is not a property of h2o molecules but it is
| of water
|
| The ability to speak English is not a property of
| floating points, but it is of certain, very specific
| large tensors of them. What's the difference?
|
| > Emergence is an ontological relationship between
| properties of objects --- it isn't a mathematical
| property of a data distribution nor of an approximation
| function.
|
| I don't see a hard distinction between ontological
| relationships and data distributions. All information is
| fundamentally statistical. Our access to ontology is
| forever and always mediated by "data distributions".
|
| One could, of course, posit that there are fundamental,
| non-statistical ontological things out there. However,
| the liquidity of water being an ontological relationship
| while the English-speaking of GPT not being so is merely
| a hypothesis, not an objective fact of the universe, at
| least not as far as I can tell.
| seydor wrote:
| The x-axis here is training FLOPs, but what about parameter
| size, and how does it account for the different architectures?
| Comparing apples to shoelaces may not be a fruitful approach, or
| indicative of what to expect from ever-expanding scale. Also, is
| it emergence or overfitting?
| CGamesPlay wrote:
| Do these scale-dependent (I like this adjective better than
| "emergent") properties survive model distillation? It may be that
| our training/optimization processes are inefficient and require
| these scales to achieve them, but the underlying model may not
| actually require the number of parameters that we are giving it.
| I haven't read any of the papers about distillation yet; does
| anyone know if this has been tested?
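|
| (For reference, the standard recipe is roughly the Hinton-style
| sketch below -- generic PyTorch, nothing LLM-specific: train the
| student to match the teacher's softened output distribution.)
|
|     import torch
|     import torch.nn.functional as F
|
|     # stand-in teacher/student: anything mapping inputs -> logits
|     teacher = torch.nn.Linear(10, 5)
|     student = torch.nn.Linear(10, 5)
|     opt = torch.optim.SGD(student.parameters(), lr=0.1)
|
|     x = torch.randn(32, 10)
|     T = 2.0                          # softening temperature
|     with torch.no_grad():
|         soft = F.softmax(teacher(x) / T, dim=-1)
|
|     loss = F.kl_div(F.log_softmax(student(x) / T, dim=-1),
|                     soft, reduction="batchmean") * T * T
|     loss.backward()
|     opt.step()    # student mimics the teacher's outputs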
| visarga wrote:
| Good question. My guess is that you can't distill chain-of-
| thought or zero-shot prompting into small models; they have to
| be 15-20B parameters or larger. Maybe someone has a link to a
| related paper?
| lossolo wrote:
| For all models smaller than 62B, direct prompting outperforms
| CoT. The first model where CoT outperforms direct prompting
| is Flan-cont-PaLM 62B on BBH. For 540B models, there are more
| settings where CoT outperforms direct prompting, but not all
| of them. Also, the number can be smaller than 540B. In Suzgun
| et. al. 2022, the authors show that the 175B InstructGPT and
| 175B Codex also have better CoT performance than direct
| prompting. Combining all the results, we get the two numbers
| 62B and 175B. So yes indeed, to enter the game of scale you
| do need a ticket to larger models than average.
|
| However, there are also other large models like OPT, BLOOM,
| and the first version of GPT-3. They all have 175B parameters,
| yet their CoT performance is significantly worse, and some
| cannot do CoT at all.
|
| source: https://yaofu.notion.site/A-Closer-Look-at-Large-
| Language-Mo...
| evrimoztamur wrote:
| Have there been any efforts in processing calculation prompts,
| where instead of letting the model internally 'compute', it's
| trained to identify equations and process them with an external
| calculator instead (perhaps one which outputs not only the result
| but the individual steps too)?
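|
| (The crude version is easy to sketch: regex the arithmetic out
| of the model's output and hand it to a real evaluator. The code
| below is made up for illustration, not from any existing system.)
|
|     import re, operator
|
|     OPS = {"+": operator.add, "-": operator.sub,
|            "*": operator.mul, "/": operator.truediv}
|
|     def eval_with_calculator(model_output):
|         # replace each "a OP b" the model emitted with the
|         # exact result from a real calculator
|         def calc(m):
|             a, op, b = m.group(1), m.group(2), m.group(3)
|             return str(OPS[op](float(a), float(b)))
|         pattern = r"(-?\d+\.?\d*)\s*([+\-*/])\s*(-?\d+\.?\d*)"
|         return re.sub(pattern, calc, model_output)
|
|     print(eval_with_calculator("The total is 1234 * 5678."))
|     # -> "The total is 7006652.0."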
| vutekst wrote:
| Yes: https://twitter.com/goodside/status/1581805503897735168
| visarga wrote:
| Language models with toys. The calculator, Python REPL, search
| engine, database, simulation, games, and other AIs can easily
| blend with large language models, lifting some weight off their
| shoulders.
|
| For example, for a physics question the LM could write a small
| simulation, run the simulation and interpret results back to
| the user. That's possible when models can do code execution.
| obiefernandez wrote:
| Been wondering the same
| djoldman wrote:
| Paper: https://openreview.net/forum?id=yzkSU5zdwD
| ttctciyf wrote:
| There's a quite accessible IAS presentation[1] from another
| Google researcher on _Solving Quantitative Reasoning Problems
| with Language Models_ which gives some likely related background
| on having language models solve this type of math problem,
| including the "chain of thought" technique mentioned here.
|
| I found it pretty interesting and as something of an ML skeptic
| was a bit surprised at the degree of coherence shown in
| "reasoning" examples similar to the ones in the linked article.
|
| 1: https://www.youtube.com/watch?v=qV4Ku5L4BuMt
___________________________________________________________________
(page generated 2022-12-19 23:01 UTC)