[HN Gopher] Characterizing Emergent Phenomena in Large Language ...
       ___________________________________________________________________
        
       Characterizing Emergent Phenomena in Large Language Models
        
       Author : fofoz
       Score  : 64 points
       Date   : 2022-12-19 12:07 UTC (10 hours ago)
        
 (HTM) web link (ai.googleblog.com)
 (TXT) w3m dump (ai.googleblog.com)
        
       | xpe wrote:
       | Wikipedia has a fine definition of what _emergent_ means:
       | 
       | > In philosophy, systems theory, science, and art, emergence
       | occurs when an entity is observed to have properties its parts do
       | not have on their own, properties or behaviors that emerge only
       | when the parts interact in a wider whole.
       | 
       | The linked article uses this definition:
       | 
       | > we discuss the phenomena of emergent abilities, which we define
       | as abilities that are not present in small models but are present
       | in larger models
       | 
       | The concept in the paper has to do with capabilities / abilities
       | that grow non-linearly as a function of model size. This is
       | distinctly different from _emergent behavior_ in systems theory.
       | 
       | <opinion>The authors and reviewers could find a better word for
       | their concept. There is no need to muddle the concept.</opinion>
       | 
       | Furthermore, the idea that networks of certain sizes are
       | necessary for certain kinds of representational abilities is not
       | new. Perhaps a term exists already?
        
         | xpe wrote:
         | This comment says it more eloquently than I did:
         | https://news.ycombinator.com/item?id=34051845
        
       | mjburgess wrote:
       | > we discuss the phenomena of emergent abilities, which we define
       | as abilities that are not present in small models but are present
       | in larger models
       | 
        | Reading anything by major researchers in AI feels like an
        | adversarial battle where they're trying to misuse as much
        | technical, scientific, and philosophical language as possible
        | while those of us in adjacent fields try to hold the line.
       | 
       | In philosophy and esp. the philosophy of science, emergence is a
       | relation between a whole and its parts such that a property of
       | the whole does not obtain just in virtue of properties of its
       | parts taken in isolation. "Emergence" has this prior positive,
       | semi-magical, scientific association which confuses the issue in
       | this case.
       | 
        | No properties of the LLM obtain from its parts differently as
        | parameters scale; the mechanism is the same. The performance
        | differs not due to emergence, but due to the "modelling gap"
        | between the statistical structure of free text and that of
        | mathematics. With enough examples, the gap closes... indeed,
        | you can model the addition function (add(x, y) = x + y) just by
        | sampling enough of its domain.
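        | 
        | (To make that last point concrete -- a toy, purely illustrative
        | numpy sketch, not anything from the article: a generic least-
        | squares fit, given only sampled input/output pairs, recovers
        | the addition function.)
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   X = rng.integers(0, 100, size=(1000, 2)).astype(float)  # sampled (x, y) inputs
        |   t = X.sum(axis=1)                                        # observed outputs: x + y
        | 
        |   # A generic least-squares fit over the samples recovers weights ~[1, 1],
        |   # i.e. it "learns" add(x, y) = x + y purely from examples of its behaviour.
        |   w, *_ = np.linalg.lstsq(X, t, rcond=None)
        |   print(w)                           # ~[1. 1.]
        |   print(np.array([17.0, 25.0]) @ w)  # ~42.0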
       | 
        | A better technical term here might be "scale-dependent
        | capabilities". For LLMs, simple arithmetic is extremely
        | scale-dependent, whereas basic text generation is less so. The
        | reason for this seems obvious, as given above... so I interpret
        | the use of the term "emergence" here as more PRish mystification.
        
         | PartiallyTyped wrote:
         | > that a property of the whole does not obtain just in virtue
         | of properties of its parts taken in isolation
         | 
         | Thank you. I have been expressing variants of this for a while.
         | A paper that comes to mind is OpenAI's hide and seek. They
          | claim that cooperation is emergent behaviour, but each agent is
          | playing its own version of the prisoner's dilemma, and thus
          | learns to cooperate.
        
           | visarga wrote:
           | That model was not learning from language, it was learning
           | from a simulation. When you can use a simulation to produce
           | training data it is possible to have a model discover new
           | abilities all on its own, like AlphaGo.
        
         | ot wrote:
         | I believe the term is reasonably appropriate here.
         | 
         | The abilities being described here are "emergent" in the sense
         | that the model was not specifically trained for them, but they
         | show up anyway. Your example is about modeling a specific
         | function and having its accuracy increase with model
          | complexity, which is the classical ML formulation, but this is
          | not what is happening here.
         | 
         | LLMs are trained on a very simple task: given a natural text,
         | predict the next word. But as model complexity and training set
         | sizes increase, they start exhibiting more sophisticated
         | abilities, such as basic forms of reasoning, and contextual
         | memory.
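          | 
          | (To make the training objective concrete: a maximally
          | simplified, purely illustrative sketch -- a bigram counter,
          | nothing like a transformer -- of "predict the next word".)
          | 
          |   from collections import Counter, defaultdict
          | 
          |   corpus = "the cat sat on the mat . the dog sat on the rug .".split()
          | 
          |   # "Training" on the simple task: count which word follows which.
          |   counts = defaultdict(Counter)
          |   for word, nxt in zip(corpus, corpus[1:]):
          |       counts[word][nxt] += 1
          | 
          |   def predict_next(word):
          |       # Return the most frequent continuation seen in training.
          |       return counts[word].most_common(1)[0][0]
          | 
          |   print(predict_next("sat"))  # 'on'
          |   print(predict_next("on"))   # 'the'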
         | 
         | In your definition, the parts of the whole are "lots of
         | statistics about text" and the emergent property is "semantic
         | reasoning".
         | 
         | Scale is inevitably a part of this: somewhere else in the
         | thread you mention that "liquidity" is an emergent property of
         | H2O, but if you take a handful of H2O molecules they don't
         | behave as a liquid.
        
         | YeGoblynQueenne wrote:
         | Thanks for bringing some sense into the debate. It's
         | scandalising to see how the machine learning research community
         | is so ready to jump on to such ... innovative uses of
         | terminology.
         | 
         | Take "few shot learning", for instance. OpenAI's preprint
         | introducing GPT-3 was titled "Large Language Models are Few
         | Shot Learners" [1]. This was promptly adopted by the community,
         | even though LLMs need first to be trained on millions of
         | examples before they can accept "few", and they don't even
         | _learn_ from those few (because no weight updates). So it 's
         | not really "few shot" and it's not really "learning", and yet,
         | here we are. "Large Language Models are few-shot learners" and
         | nobody bats an eyelid anymore.
         | 
         | Which is not to say we have to shut up and take it. I
         | personally draw inspiration from the tale of the little child
         | who pointed out the King's sartorial negligence.
         | 
         | ________________
         | 
         | [1] That, in itself, is a title designed to claim some
         | groundbreaking progress not just in technological capabilities
         | but also in scientific understanding. LLMs are few-shot
         | learners, man! Few-Shot!
        
           | mgraczyk wrote:
           | I don't see the issue here. The exact same definition of "few
           | shot learning" has been used for at least 20 years. Nothing
           | changed with the GPT-3 paper.
           | 
           | The definition is something like
           | 
           | Given a task, a few shot learner is an algorithm that
           | generalizes well with only a small number of training
           | examples for that task.
           | 
           | The same definition is what I'm familiar with from undergrad.
           | Do you know of a different definition that precedes GPT-3?
        
             | YeGoblynQueenne wrote:
             | I reference the GPT-3 preprint because it's the source of
             | the latest twist to the meaning of "few-shot learning".
             | 
             | >> Given a task, a few shot learner is an algorithm that
             | generalizes well with only a small number of training
             | examples for that task.
             | 
             | I don't know where this definition comes from and I'd
             | prefer if you had a more solid reference than your
              | recollection of your undergraduate years, but it doesn't
             | really matter because what you describe is not what LLMs
             | do.
             | 
             | The input prompts to GPT-3 and friends are not "training
             | examples". Training examples are labelled instances of a
              | concept to be learned - that's PAC learning. LLM prompts are
             | not examples, and they're not used in training. They're
             | input sequences that an already-trained model completes
             | with a sequence of tokens with maximal probability given
             | the input.
             | 
             | That's indeed nothing new, it's how language generation
             | with a language model works, and has always worked. But I
             | don't remember ever hearing anyone referring to the input
             | sequences given to Hidden Markov Models or Probabilistic
             | Context-Free Grammars as "training examples", or the
             | process of generating their completions referred to as
             | "learning", let alone "few-shot learning". And yet, this
             | kind of generation is exactly what LLMs do, too. Except of
             | course LLMs have much smoother, larger models than any HMM
             | or PCFG ever trained. But as the OP is arguing, that's not
             | a qualitative difference, only a quantitative one, and
             | renaming it is misleading. Doubly so if the renaming walks
             | roughshod over long-established terminology, like "few
             | shot", "learning" or "examples".
             | 
             | Btw, the OpenAI GPT-3 preprint gives a definition of their
             | "few shot" setting but it's informal, long-wided and
             | overall a vague mess, so it's really no surprise that so
             | much confusion and chaos is generated as a result.
        
               | mgraczyk wrote:
               | I disagree with what you're saying here.
               | 
               | > The input prompts to GPT-3 and friends are not
               | "training examples". Training examples are labelled
                | instances of a concept to be learned - that's PAC
                | learning. LLM prompts are not examples, and they're not
               | used in training.
               | 
               | In the PAC learning setting, these are training examples
               | because you use the labels to select a function, in this
               | case the conditional output given the few shot examples.
               | 
               | Whether or not you actually update any weights has
               | nothing to do with "few shot learning" and never has. In
               | the PAC setting there are no weights, just model
               | functions that depend on the training data.
               | 
               | EDIT: The reason you didn't hear people refer to the
               | inputs of HMMs as "training examples" is because HMMs are
               | very poor few-shot learners. That's why GPT-3 is
               | interesting, because it is a good few-shot learner.
               | 
               | You could use an HMM as a few shot learner, but computing
               | the results is expensive and the results are not good for
               | most tasks.
        
               | visarga wrote:
               | > Remarkably, conditioning the model on such an "example-
               | based specification" effectively enables the model to
               | adapt on-the-fly to novel tasks whose distributions of
               | inputs vary significantly from the training distribution.
               | The idea that simply minimizing the negative log loss of
               | a single next-word-prediction objective implies this
               | apparent optimization of many more arbitrary tasks -
               | amounting to a paradigm of "learning" during inference
               | time - is a powerful one, and one that raises many
               | questions.
               | 
               | http://ai.stanford.edu/blog/in-context-learning/
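                | 
                | For concreteness, an "example-based specification" looks
                | roughly like this (illustrative sketch only; the llm()
                | call is a placeholder, not a real API):
                | 
                |   # The k labelled examples live in the prompt; the
                |   # model's weights are never updated.
                |   few_shot_prompt = (
                |       "Translate English to French.\n"
                |       "sea otter => loutre de mer\n"
                |       "cheese => fromage\n"
                |       "plush giraffe => girafe peluche\n"
                |       "mint =>"
                |   )
                |   # completion = llm(few_shot_prompt)  # expected: " menthe"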
        
             | mjburgess wrote:
              | The issue is that the aim is to model the one-shot learning
              | of animals, not some other target.
              | 
              | Animals are one-shot learners because we're in causally
              | direct sensory-motor contact with reality, such that we can
              | disambiguate it live.
              | 
              | No train/predict system can disambiguate the causal origin
              | of data and so can never be one-shot.
              | 
              | What they're targeting is a triviality within a system of
              | trivialities, and misdescribing it.
        
               | mgraczyk wrote:
               | But then you're saying that the definition has always
               | been wrong, which is a very different claim.
               | 
               | I personally think that claiming a definition has always
               | been wrong is vacuous. Just substitute the word in your
               | head for something else if you don't like it.
        
         | xpe wrote:
         | Well said. How did the authors and reviewers miss this?
        
           | mjburgess wrote:
           | The transition from useless to useful ML models no doubt
           | often seems magical to researchers. But it follows just from
           | the distribution of the training data and from the degree of
           | its compression by the function approximation algorithm
           | they're using.
           | 
           | What's "magical" is not their system, but rather that the
           | vast library of text they use for training has useful
           | properties which can be approximated.
           | 
           | What researchers are observing is more like the illusion of a
           | "phase transition" in the quality of approximations. This
           | illusion arises because we have discontinuous standards for
           | the approximations.
           | 
           | Ie., when assessing free text prediction by LLMs there's very
           | very many ways for them to generate an acceptable answer. For
           | mathematics, there's only one acceptable way.
           | 
           | If we applied the same standard/goal to both, no such
           | apparent "quality transition" would occur. LLMs would be
           | exposed as equally good, or equally bad, at prediction
           | regardless of scale.
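            | 
            | A toy way to see this (hypothetical numbers, not
            | measurements of any real model): let per-token quality p
            | improve smoothly with scale, and compare a lenient standard
            | with an exact-match one.
            | 
            |   # Lenient standard: many continuations are acceptable, so
            |   # measured ability tracks p. Exact-match standard: an
            |   # 8-token answer must be entirely right, so it is p**8,
            |   # which looks like a sudden "emergent" jump.
            |   for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
            |       print(f"p={p:<4}  lenient={p:.2f}  exact_match={p**8:.3f}")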
        
             | xpe wrote:
              | Interesting arguments; they seem plausible and insightful.
             | IMO, your analysis here deserves a longer write-up. Is it
             | something you are working on?
        
               | mjburgess wrote:
               | Any person with a "scientific attitude" in this field
                | would find it incredibly easy to observe that the
               | training target for natural language is,
               | 
               | f(Q) = {A1..An} -- n being very very large
               | 
               | and the target for mathematics is,
               | 
               | g(Q) = A1
               | 
               | And the model they're using approximates with,
               | 
               | m(Q) = A_guess
               | 
               | So it's incredibly easy to model f with m because A_guess
                | just has to be close to one of A1..An; and it's very hard
                | to model mathematics because it has to be _only_ A1.
                | 
                | The reason articles like this are written isn't because
                | people don't know this; it's because they just do not have
               | a sceptical attitude. And that's a problem I can't fix.
               | 
               | If they'd approached this issue with the goal of "finding
               | nothing surprising about this behaviour", ie., trying to
               | make it maximally consistent with existing (basic,
               | trivial, widely-taught) theory, they'd reach this
               | conclusion in under 5min.
               | 
                | The problem is their goal is always _to find something
                | surprising_ so they can release some PR about it. There's
                | nothing I can write to fix this problem; it's endemic
                | in the whole field.
               | 
               | It makes ML/AI research much more like psychology than
               | physics. Alas!
        
           | baandang wrote:
           | From Introduction To The Theory Of Complex Systems:
           | 
           | * Complex systems can exhibit a rich phase structure and have
           | a huge variety of macrostates that often cannot be inferred
           | from the properties of the elements. This is sometimes
           | referred to as emergence.
           | 
           | This is the term and as common a term in complex systems as
           | there is.
           | 
           | "scale-dependent capabilities" implies inference from
           | elements.
           | 
            | I think some people just don't like the very idea, even
            | though it is not unlike the concept of a stochastic process.
            | I would think the same reasons for disliking the concept of
            | emergence apply to stochastic processes. Even Murray Gell-Mann
            | couldn't raise complex systems above arguments over whether
            | emergence is magical thinking, so it is probably a lost cause.
           | 
           | Such an interesting field that always ends up as this
           | conversation.
        
             | naasking wrote:
             | > "scale-dependent capabilities" implies inference from
             | elements.
             | 
             | Don't confuse the post-hoc explanation with an a priori
             | inference. We can post-hoc explain water's emergent
             | liquidity property using modern quantum theories, but that
             | doesn't mean we could have inferred it if given
             | Schrodinger's equation and the atomic structure of hydrogen
             | and oxygen.
             | 
             | "Scale-dependent capability" is a post-hoc explanation that
             | of course looks obvious _in hindsight_ , just like
             | liquidity and pressure looks obvious in hindsight once you
             | understand electromagnetism and atomic theory.
        
         | seydor wrote:
          | Eh eh. Imagine how neuroscientists feel.
        
           | mjburgess wrote:
            | Well, of late, neuroscientists have likewise been fond of
            | misusing "hallucinate", which _means_ a non-veridical
            | perception. And they're using it to mean a veridical
            | _constructed_ perception.
           | 
           | Leading everyone down a path of ever-more mysticism.
           | 
           | It would be nice if neuroscientists spoke out against both
           | their own mystical PR and that of AI, but I don't hear it
           | much.
        
             | oidar wrote:
             | I think the word confabulate would be a better match for
             | what ChatGPT does. When people confabulate they are VERY
              | confident about their invented retellings. This matches the
              | attitude ChatGPT projects when it makes up shit.
        
         | fumeux_fume wrote:
          | Is it that big of a deal? The authors explain their definition
         | of emergent abilities at the beginning of the paper.
        
           | mjburgess wrote:
           | It's a mystification of what's going on -- the term makes it
           | harder to understand, not easier. It's prone to popular
           | misunderstanding, and it seems even to confuse the
           | researchers themselves.
        
         | HarHarVeryFunny wrote:
         | I'm OK with the term emergent used here - it seems the best
          | word to describe non-trivial capabilities/properties that
         | weren't designed in. At least prior to the first LLMs, I think
         | most people would just expect them to be capable of doing
         | literally what they are trained to do - predict (locally
         | plausible) next word(s), and this is certainly all that is
         | "designed in" if we're talking about a basic LLM (vs one with
         | further RL-based "alignment", etc).
         | 
         | Of course we can also appreciate that to get REALLY REALLY good
         | at "predict next word" would require intelligence/understanding
         | of what is being generated, but I think the point here is would
         | anyone - the model designers in particular - have expected that
         | the transformer architecture (+ scale) is all it would take to
         | become so good at this task? I don't think "attention is all
         | you need" was really anticipating transformers reading API docs
         | and generating code to perform requested functions! One might
         | have expected it to take a much more elaborate and evolved
         | architecture to achieve this level of capability!
         | 
         | So, to me, it seems entirely appropriate to describe these
         | models as having emergent capabilities - things they are
         | capable of doing that are really so far above and beyond
         | "predict next word", that it seems churlish and inaccurate to
          | describe them as designed in (or even just confidently
         | predicted).
        
         | rafaelero wrote:
         | What a waste of time to be worried about how people are using a
         | word.
        
         | maria2 wrote:
         | Why should we let philosophy define technical terms for ML?
         | Many words have many meanings. Welcome to the imprecision of
         | human language.
         | 
         | As a slight tangent, I really hate this type of comment that
         | inevitably appears on many HN submissions. Personally, I find
         | it distracting when the main conversation happening on an
         | article is a pedantic discussion on whether or not a word means
         | what the author thinks it means.
        
         | naasking wrote:
         | I think "scale-dependent capability" is a much more precise
         | term for what they're describing, but I'm not sure that that
         | term doesn't fall under the general umbrella of emergent
         | properties. The opening paragraphs of emergent properties in
         | philosophy [1] cites a number of examples that I would argue
          | are comparable to an LLM suddenly becoming able to do arithmetic
         | past a certain scale.
         | 
         | > In philosophy and esp. the philosophy of science, emergence
         | is a relation between a whole and its parts such that a
         | property of the whole does not obtain just in virtue of
         | properties of its parts taken in isolation.
         | 
         | This is not a settled definition. I think everyone can agree
         | that this applies epistemologically, where studying the parts
         | in isolation cannot always yield sufficient information to
         | predict macroscopic properties, but to claim that all
         | properties of the whole do not reduce to the properties of its
         | parts is controversial.
         | 
         | For instance, it seems unlikely that we would have predicted
         | H2O's dipole moment and the phenomenon of surface tension just
         | from studying hydrogen atoms and oxygen atoms in isolation, but
         | it would be incorrect to say that surface tension is not the
         | result of the properties of hydrogen and oxygen in isolation.
         | We simply cannot discover all the relevant properties without
         | studying them together.
         | 
         | Edit: to clarify, H2O's dipole moment seems obvious _in
          | hindsight_ when we have a good model for what's going on, and
         | analogously, LLM's ability to do arithmetic seems obvious as a
         | scale-dependent property in hindsight, but that doesn't mean it
         | was obvious that this would happen before it was created.
         | 
         | [1] https://plato.stanford.edu/entries/properties-emergent/
        
           | mjburgess wrote:
            | Well, its scale dependence is an illusion.
            | 
            | It's uniformly able to model Q->SingleAnswer problems, and
            | uniformly able to model Q->ManyAnswer problems.
            | 
            | Basic arithmetic is the former kind, and it becomes _useful
            | to people_ at large scales, i.e., accurate on basic
            | arithmetic.
            | 
            | This dual condition of "useful-to-people" is the thing
            | introducing the illusion, since it changes depending on what
            | we're modelling. The system isn't acquiring any new property.
            | 
            | Consider a researcher putting a book on a thin ice sheet, and
            | then putting a car on it. Here, they're concluding the ice
            | has different properties in each case -- but it doesn't.
        
             | naasking wrote:
             | > This dual condition of "useful-to-people" is the thing
             | introducing the illusion, since it changes depending on
              | what we're modelling. The system isn't acquiring any new
             | property.
             | 
             | I have a couple of possible responses to this, but maybe
             | the most obvious is that I'm not sure why "useful to
             | people" can't qualify as a new property.
             | 
             | For instance, a system that suddenly becomes useful to
             | people can be transformative to society, which can lead to
             | new emergent social or economic properties at the societal
             | scale. To conclude that "useful to people" is not a
             | meaningful property, aren't you basically implying that
             | something that suddenly becomes useful cannot even in
             | principle lead to new societal scale emergent properties?
             | That seems dubious. Edit: or you're implying that emergent
             | properties are not reducible to interactions between
             | constituent properties, which also seems dubious.
             | 
             | For a concrete example, the internet probably falls into
             | this category. It has transformed society and led to new
             | emergent properties at societal scales, but computers
             | didn't suddenly acquire any new computational properties,
             | or new properties to manipulate bits. Only the scale of
             | their deployment changed, and that scaling was itself
             | useful to people, and this led to new societal properties.
             | That arguably can't happen unless "useful to people" is
             | itself a meaningful property, no?
        
             | darawk wrote:
             | > Consider a researcher putting a book on a thin ice-sheet,
             | and then putting a car on it. Here, they're concluding the
             | ice has different properties in each case -- but it doesnt.
             | 
             | This is just a linguistic shell game with the meaning of
             | the word "property". You could just as easily say the
             | difference between the mind of a human and a monkey is a
             | matter of degree, and therefore going from one to the other
             | does not gain any novel "property".
             | 
             | It should be obvious that the degree of a property can
             | fundamentally change its nature, and that there is no hard
             | distinction between "properties" and degrees of things. The
                | difference between a tickle and a gunshot is a matter of
             | "degree", but that fact is of near zero semantic utility.
        
               | mjburgess wrote:
               | Emergence is about intrinsic properties, observer-
               | independent properties.
               | 
               | If emergence were about observer-relative properties it
               | would be a meaningless term. My shoe gets an "emergent
               | property" to hold my door open when I put it in a door
               | way.
               | 
               | This is mumbojumbo.
               | 
               | Systems acquiring observer-relative "properties" are all
               | well and good, but the claim here is a much stronger one.
               | 
               | This gross misuse of language amounts to saying that
               | "models with enough parameters to accurately approximate
               | a function" have "emergent properties" that "models
               | without enough parameters" do not have.
               | 
               | This is a deeply silly way to describe the conditions
               | under which one function can approximate another, and the
                | rate at which that approximation converges to something
               | useful.
        
               | naasking wrote:
               | > Emergence is about intrinsic properties, observer-
               | independent properties. If emergence were about observer-
               | relative properties it would be a meaningless term.
               | 
               | I'm going to address this in case it was also intended to
               | reply to my other comment about "useful to people"
               | possibly being a property.
               | 
               | "Useful to people" _would be_ an observer-independent
                | property, if it's a property at all. An alien species
               | analyzing humanity would come to the same conclusions as
               | humans about whether some system, like the internet, was
               | useful to people. This would be evident by whether the
               | use of that system spread.
               | 
               | As for whether it's "intrinsic", I'm not sure how you're
               | applying this. As you said in a later comment, "Liquidity
               | isn't [a property] of water without a container/pressure
               | environment". In other words, liquidity isn't an
               | intrinsic property of H2O. Moreover, the only reason we
                | identified and created a label for "liquidity" is because
               | it's useful to people, which is the very criterion that
               | you're claiming should not be applied to describe some
               | surprising scaling behaviour of LLMs.
               | 
               | I just don't think you've made the distinction you're
               | attempting to make clear, because there is parity of
               | reasoning between the allegedly emergent properties you
               | describe and those in the article.
        
               | darawk wrote:
               | Let's make this concrete. What in your mind is a specific
               | example of a concrete system with an emergent property,
               | then?
        
               | mistermann wrote:
               | > What in your mind is a specific example of a concrete
               | system with an emergent property, then?
               | 
               | Not sure if you're joking, but the brain (and
               | consciousness/mind as an emergent phenomenon) is the
               | classic example. Even better, it is what's causing the
               | fundamental problems in this very conversation, due to
               | the inconsistent manner in which it translates terms into
               | meaning (ie: "emergent", "is"), typically not
               | realizing[1] that meaning and reality are similar to the
               | speed of light in that their appearances[2] vary
               | depending upon the frame of reference of the observer.
               | 
               | I am fairly optimistic that AI is going to "force
               | humanity's cultural hands" such that we will eventually
               | have to grapple with this long known but rarely discussed
               | phenomenon. Though, I anticipate censorship, propaganda,
               | _and consciousness_ will play heavy roles in such
               | discussions and screw everything up, as is usually the
               | case with fundamentally important ideas.
               | 
                | [1] _During realtime cognition_, regardless of whether
                | substantial abstract knowledge is in the person's
                | possession.
                | 
                | [2] _And maybe even the things themselves_, depending on
                | how (or _from where_) you look at it - I am still
                | undecided on this.
        
               | mjburgess wrote:
                | Liquidity is not a property of H2O molecules, but it is of
                | water. Liquidity isn't a property of water without a
                | container/pressure environment.
               | 
               | The trajectories of particles of air are underdetermined
               | by their own intrinsic properties so a tornado cannot be
               | reduced to some mere aggregate of them.
               | 
               | Emergence is an ontological relationship between
               | properties of objects --- it isn't a mathematical
               | property of a data distribution nor of an approximation
               | function.
               | 
                | The very use of the term has created all this confusion
               | right here.
               | 
               | Would anyone who thought NNs were showing emergence be
               | also content to find out that the reason for this
               | 'emergence' was just that in the case of so-called
               | 'emergence' our expectations of performance were
               | changing?
               | 
                | Do we call it 'emergence' when we go from estimating data
                | using mean() to estimating it with ax+b?
               | 
               | There's definitely an illusion here, but one quite easy
               | to spot if you weren't in the game of peddling illusions.
        
               | aaroninsf wrote:
               | Two observations,
               | 
                | There is utility in, and need for, some convention for
                | describing the case where the external behavior of the
                | system is correct for some domain despite a lack of
                | specific training. It is reasonable for users of such
                | systems to say that the internal states and
                | representation don't matter, _so long as the behavior is
                | correct_; and in cases like these we will benefit from
                | _some_ consensus on how to talk about such things. Fine
                | with me if some new term is applied.
               | 
               | More of interest to me though is that it is not at all
               | clear to me that _genuine_ emergence is not possible
               | through scaling (independent of whether it _is_ in any
               | given existing LLM). Because the optimal (most compact)
               | correct representation for a lot of e.g. language output,
               | is exactly that which benefits from abstraction.
               | 
               | What reason is there to believe that the abstractions
               | derived at higher levels (of the network generally but
               | not necessarily, depends on the architecture) do not
               | encode non-linear problem spaces in the world, which are
               | "real" emergence?
               | 
               | I.e. if the way some network learns arithmetic is to
               | settle on an internal weighting that performs
               | computation, rather than "memorizing assertions", me, I
               | would call that "emergent."
               | 
               | But I'm happy to use some other term should one, er,
               | emerge.
        
               | darawk wrote:
               | > Liquidity is not a property of h2o molecules but it is
               | of water
               | 
               | The ability to speak English is not a property of
               | floating points, but it is of certain, very specific
               | large tensors of them. What's the difference?
               | 
               | > Emergence is an ontological relationship between
               | properties of objects --- it isn't a mathematical
               | property of a data distribution nor of an approximation
               | function.
               | 
               | I don't see a hard distinction between ontological
               | relationships and data distributions. All information is
               | fundamentally statistical. Our access to ontology is
               | forever and always mediated by "data distributions".
               | 
                | One could, of course, posit that there are fundamental,
               | non-statistical ontological things out there. However,
               | the liquidity of water being an ontological relationship
               | while the English-speaking of GPT not being so is merely
               | a hypothesis, not an objective fact of the universe, at
               | least not as far as I can tell.
        
       | seydor wrote:
        | The x-axis here is training FLOPs, but what about parameter
        | size, and how does it account for the different architectures?
        | Comparing apples to shoelaces may not be a fruitful approach or
        | indicative of what to expect from ever-expanding scale. Also, is
        | it emergence or overfitting?
        
       | CGamesPlay wrote:
       | Do these scale-dependent (I like this adjective better than
       | "emergent") properties survive model distillation? It may be that
       | our training/optimization processes are inefficient and require
        | these scales to achieve them, but the underlying model may not
        | actually require the number of parameters that we are giving
        | them. I haven't read any of the papers about distillation yet;
        | does anyone know if this has been tested?
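        | 
        | For reference, "distillation" here means roughly the following
        | (a generic sketch with made-up numbers, not any particular
        | paper's recipe): a small student is trained against the soft
        | output distribution of a large teacher.
        | 
        |   import numpy as np
        | 
        |   def softmax(logits, T=1.0):
        |       z = logits / T - (logits / T).max()
        |       e = np.exp(z)
        |       return e / e.sum()
        | 
        |   # Hypothetical logits over a tiny vocabulary for one example.
        |   teacher_logits = np.array([4.0, 1.5, 0.2, -1.0])  # large "teacher"
        |   student_logits = np.array([2.0, 2.0, 0.0, -0.5])  # small "student"
        | 
        |   T = 2.0  # temperature softens the teacher's distribution
        |   p_t, p_s = softmax(teacher_logits, T), softmax(student_logits, T)
        | 
        |   # Distillation objective: cross-entropy of the student against the
        |   # teacher's soft targets (minimised alongside the usual hard-label loss).
        |   distillation_loss = -(p_t * np.log(p_s)).sum()
        |   print(round(distillation_loss, 3))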
        
         | visarga wrote:
          | Good question. My guess is that you can't distill chain-of-
          | thought or zero-shot prompting into small models; they have to
          | be 15-20B parameters or larger. Maybe someone has a link to a
          | related paper?
        
           | lossolo wrote:
           | For all models smaller than 62B, direct prompting outperforms
           | CoT. The first model where CoT outperforms direct prompting
           | is Flan-cont-PaLM 62B on BBH. For 540B models, there are more
           | settings where CoT outperforms direct prompting, but not all
           | of them. Also, the number can be smaller than 540B. In Suzgun
            | et al. 2022, the authors show that the 175B InstructGPT and
           | 175B Codex also have better CoT performance than direct
           | prompting. Combining all the results, we get the two numbers
           | 62B and 175B. So yes indeed, to enter the game of scale you
           | do need a ticket to larger models than average.
           | 
           | However, there are also other large models like OPT, BLOOM,
            | and the first version of GPT-3. They all have 175B parameters,
            | yet their CoT performance is significantly worse, or they
            | cannot do CoT at all.
           | 
           | source: https://yaofu.notion.site/A-Closer-Look-at-Large-
           | Language-Mo...
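            | 
            | (For readers unfamiliar with the terms: direct and CoT
            | prompting differ only in whether the in-prompt exemplar
            | spells out intermediate steps. Illustrative wording, not
            | quoted from the papers.)
            | 
            |   direct = (
            |       "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?\n"
            |       "A: 11\n"
            |       "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many now?\n"
            |       "A:"
            |   )
            |   chain_of_thought = (
            |       "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?\n"
            |       "A: Roger starts with 5. 2 cans of 3 is 6 more. 5 + 6 = 11. The answer is 11.\n"
            |       "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many now?\n"
            |       "A:"
            |   )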
        
       | evrimoztamur wrote:
        | Have there been any efforts in processing calculation prompts,
       | where instead of letting it internally 'compute', it's trained to
       | identify equations and process them with an external calculator
       | instead (perhaps one which outputs not only the result but the
       | individual steps too)?
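        | 
        | One hypothetical shape for this, as a sketch: have the model
        | emit expressions in a simple markup and post-process them with a
        | real evaluator. The <calc> tag and helper names below are made
        | up for illustration.
        | 
        |   import ast, operator, re
        | 
        |   OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        |          ast.Mult: operator.mul, ast.Div: operator.truediv}
        | 
        |   def safe_eval(expr):
        |       # Evaluate a plain arithmetic expression without using eval().
        |       def walk(node):
        |           if isinstance(node, ast.Expression):
        |               return walk(node.body)
        |           if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        |               return node.value
        |           if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        |               return OPS[type(node.op)](walk(node.left), walk(node.right))
        |           raise ValueError("not plain arithmetic")
        |       return walk(ast.parse(expr, mode="eval"))
        | 
        |   def post_process(model_output):
        |       # Replace anything the model wrapped in <calc>...</calc> with the result.
        |       return re.sub(r"<calc>(.*?)</calc>",
        |                     lambda m: str(safe_eval(m.group(1))), model_output)
        | 
        |   print(post_process("Total cost is <calc>123 * 45 + 6</calc> dollars."))
        |   # Total cost is 5541 dollars.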
        
         | vutekst wrote:
         | Yes: https://twitter.com/goodside/status/1581805503897735168
        
         | visarga wrote:
          | Language models with toys. The calculator, Python REPL, search
          | engine, database, simulations, games, and other AIs can easily
          | blend with large language models, lifting some weight off their
          | shoulders.
         | 
         | For example, for a physics question the LM could write a small
         | simulation, run the simulation and interpret results back to
         | the user. That's possible when models can do code execution.
        
         | obiefernandez wrote:
         | Been wondering the same
        
       | djoldman wrote:
       | Paper: https://openreview.net/forum?id=yzkSU5zdwD
        
       | ttctciyf wrote:
       | There's a quite accessible IAS presentation[1] from another
       | Google researcher on _Solving Quantitative Reasoning Problems
       | with Language Models_ which gives some likely related background
       | on having language models solve this type of math problem,
        | including the "chain of thought" technique mentioned here.
       | 
       | I found it pretty interesting and as something of an ML skeptic
       | was a bit surprised at the degree of coherence shown in
       | "reasoning" examples similar to the ones in the linked article.
       | 
       | 1: https://www.youtube.com/watch?v=qV4Ku5L4BuMt
        
       ___________________________________________________________________
       (page generated 2022-12-19 23:01 UTC)