[HN Gopher] The AI Scaling Hypothesis
___________________________________________________________________
The AI Scaling Hypothesis
Author : andreyk
Score : 92 points
Date : 2022-10-07 16:29 UTC (6 hours ago)
(HTM) web link (lastweekin.ai)
(TXT) w3m dump (lastweekin.ai)
| kelseyfrog wrote:
| There's a deeper more troubling problem being exposed here - deep
| learning systems are at least an order of magnitude less data
| efficient than the systems they hope to replicate.
|
| GPT-3 175B was trained on 499 billion tokens[1]. Let's assume
| token = word for the sake of this argument[2]. The average adult
| reads at a rate of 238 wpm[3]. Then a human who reads 24 hrs/day
| from birth until their 18th birthday would read a total of about
| 2.25 billion words[4], or 0.45% of the words GPT-3 was trained
| on.
|
| Humans simply do much more with much less. So what gives? I
| don't disagree that we still haven't reached the end of what
| scaling can do, but there is a creeping suspicion that we've
| gotten something fundamentally wrong on the way there.
|
| 1. https://lambdalabs.com/blog/demystifying-gpt-3/
|
| 2. GPT-based models use BPE, and while we could dive into the
| actual token dictionary and work out a word-token relationship,
| we can agree that although this isn't a 1-to-1 relationship it
| won't change the conclusion
| https://huggingface.co/docs/transformers/tokenizer_summary
|
| 3. https://psyarxiv.com/xynwg/
|
| 4. 238*60*24*365*18=2,251,670,400
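|
| A quick back-of-envelope version of that arithmetic (a sketch
| in Python; the 238 wpm reading rate and 499B token count come
| from the references above, and token != word per note [2]):
|
|     words_per_minute = 238
|     # Reading 24 hours/day from birth to the 18th birthday
|     words_read = words_per_minute * 60 * 24 * 365 * 18
|     gpt3_tokens = 499e9
|     print(f"{words_read:,}")                  # 2,251,670,400
|     print(f"{words_read / gpt3_tokens:.2%}")  # ~0.45%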
| chaxor wrote:
| You're right about reference [2], which can alter things by ~1
| order of magnitude (words are usually ~3-10 tokens).
| Additionally as others have pointed out, we don't live
| _entirely in the text world_. So, we have the nice benefit of
| understanding objects from visual and proprioceptive inputs,
| which is huge. The paucity of data argument made well-known by
| Noam Chomsky et al is certainly worth discussing in academia;
| however, I am not as moved by these arguments of the stark
| differences in input required between humans and ML as I once
| was. In image processing for example, sending 10k images in
| rapid succession with no other proprioceptive inputs, time
| dependencies, or agent-driven exploration of spaces puts these
| systems at an enormous disadvantage when learning certain phenomena
| (classes of objects or otherwise).
|
| Of course there are differences between the systems, but I'm
| becoming more skeptical of claims that the newer ML systems
| can't learn as much as biological systems given the
| _same input_ (obviously this is where a lot is hidden).
| kelseyfrog wrote:
| Thank you for the tokens-to-words factor! Much appreciated.
|
| I'm definitely in agreement that multi-task models represent
| an ability to learn more than any one specialized model, but
| I think it's a bit of an open question whether multi-task
| learning alone can fully close the digital-biological gap. Of
| course I'd be very happy to be proven wrong on this though by
| empirical evidence in my lifetime :)
| [deleted]
| gamegoblin wrote:
| Humans take in a tremendously high bitrate of data via other
| senses and are able to _connect_ those to the much lower amount
| of language input such that the language can go much further.
|
| GPT-3 is learning everything it knows about the entire universe
| _just from text_.
|
| Imagine we received a 1TB information dump from a civilization
| that lives in an alternate universe with entirely different
| physics. How much could we learn just from this information
| dump?
|
| And from our point of view, it could be absurdly exotic. Maybe
| their universe doesn't have gravity or electromagnetic
| radiation. Maybe the life forms in that universe spontaneously
| merge consciousnesses with other life forms and separate
| randomly, so whatever writing we have received is in a style
| that assumes the reader can effortlessly deduce that the author
| is actually a froth of many consciousnesses. And in the grand
| spectrum of how weird things could get, this "exotic" universe
| I have described is really basically identical to our own,
| because my imagination is limited.
|
| Learning about a whole exotic universe from just an info dump
| is the task of GPT-3. For instance, tons of our writing takes
| for granted that solid objects don't pass through each other. I
| dropped the book. Where is the book? On the floor. Very few
| bits of GPT-3's training set include the statements "a book is
| a solid object", "the floor is a solid object", "solid objects
| don't pass through each other", but it can infer this principle
| and others like it.
|
| From this point of view, its shortcomings make a lot of sense.
| Some things GPT fails at are obvious to us having grown up in
| this universe. I imagine we're going to see an explosion of
| intelligence once researchers figure out how to feed AI systems
| large swaths of YouTube and such, because then they will have a
| much higher bandwidth way to learn about the universe and how
| things interact, connecting language to physical reality.
| saynay wrote:
| One of the more interesting things I have seen recently is
| the combination of different domains in models / datasets.
| The text encoder behind Stable Diffusion (CLIP) combines text-
| based descriptions with image-based descriptions: the model
| learns to represent either text or images in the same
| embedding space; a picture, or a caption for that picture,
| leads to similar embeddings.
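|
| A minimal sketch of the contrastive objective behind that kind
| of shared embedding (CLIP-style; the function and tensor names
| here are illustrative, not the actual Stable Diffusion code):
|
|     import torch
|     import torch.nn.functional as F
|
|     def clip_style_loss(image_emb, text_emb, temperature=0.07):
|         # Normalize so similarity is cosine similarity
|         image_emb = F.normalize(image_emb, dim=-1)
|         text_emb = F.normalize(text_emb, dim=-1)
|         # Row i: similarity of image i to every caption in the batch
|         logits = image_emb @ text_emb.t() / temperature
|         # Matching image/caption pairs sit on the diagonal
|         targets = torch.arange(len(image_emb))
|         loss_i = F.cross_entropy(logits, targets)      # image -> text
|         loss_t = F.cross_entropy(logits.t(), targets)  # text -> image
|         return (loss_i + loss_t) / 2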
|
| Effectively, this can broaden the context the network can
| learn. There are relationships that are readily apparent to
| something that learned images that might not be apparent to
| something trained only on text, or vice versa.
|
| It will be interesting to see where that goes. Will it be
| possible to make a singular multi-domain encoder, that can
| take a wide range of inputs and create an embedding (an
| "mental model" of the input), and have this one model be
| usable as the input for a wide variety of tasks? Can
| something trained on multi-domains learn new concepts faster
| than a network that is single-domain?
| Teever wrote:
| I would love to see a model trained on blueprints or a
| model trained on circuit diagrams.
|
| text2blueprint or wav2schematic could produce some
| interesting things.
| Jensson wrote:
| They haven't even figured out basic math, so not sure
| what you would expect to find there. They aren't smart
| enough to generate structure that doesn't already exist.
| visarga wrote:
| Depends on the method. Evolutionary methods can
| absolutely find structure that we missed, and they often
| go hand in hand with learning. Like AlphaGo move 37.
| cma wrote:
| Google's Imagen was trained on about as many images as a 6
| year old would have seen over their lifetime at 24fps and a
| whole lot more text. It can draw a lot better and probably
| has a better visual vocabulary but is also way outclassed in
| many ways.
|
| Poverty of the stimulus is a real problem and may mean our
| starting-point architecture from genetics has a lot more
| learning built in than just a bunch of uninitialized weights
| randomly connected. A newborn animal can often get up and
| walk right away in many species.
|
| https://www.youtube.com/watch?v=oTNA8vFUMEc
|
| Humans have a giant head and weak muscles at birth, but can
| swim around like little seals pretty quickly after birth.
| visarga wrote:
| > our starting point architecture from genetics has a lot
| of learning built in
|
| I don't doubt that evolution provided us with great priors
| to help us be fast learners, but there are two more things
| to consider.
|
| One is scale - the brain is still 10,000x more complex than
| large language models. We know that smaller models need
| more training data, thus our brain being many orders of
| magnitude larger than GPT-3 naturally learns faster.
|
| The second is social embedding - we are not isolated, our
| environment is made of human beings, similarly an AI would
| need to be trained as part of human society, or even as
| part of an AI society, but not alone.
| gamegoblin wrote:
| Definitely. I do think video is _much_ more important than
| images, because video implicitly encodes physics, which is
| a huge deal.
|
| And, as you say, there are probably some
| structural/architectural improvements to be made in the
| neural network as well. The mammalian brain has had a few
| hundred million years to evolve such a structure.
|
| It also remains unclear how important learning causal
| influence is. These networks are essentially "locked in"
| from inception. They can only take the world in. Whereas
| animals actively probe and influence their world to learn
| causality.
| [deleted]
| akiselev wrote:
| The mammalian brain has had a few hundred million years
| to evolve _neural plasticity_ [1] which is the key
| function missing in AI. The brain's structure isn't set
| in stone but develops over one's lifetime and can even
| carry out major restructuring on a short time scale in
| some cases of massive brain damage.
|
| Neural plasticity is the algorithm running on top of our
| neural networks that optimizes their structure as we
| learn so not only do we get more data, but our brains get
| better tailored to handle that kind of data. This process
| continues from birth to death and physical
| experimentation in youth is a key part of that
| development, as is social experimentation in social
| animals.
|
| I think "it remains unclear" only to the ML field, from
| the perspective of neuroscientists, current neural
| networks aren't even superficially at the complexity of
| axon-dendrite connections with ion channels and threshold
| potentials, let alone the whole system.
|
| A family member's doctoral thesis was on the potentiation
| of signals, and based on my understanding of it, every
| neuron takes part in the process with its own "memory" of
| sorts and the potentiation she studied was just one tiny
| piece of the neural plasticity story. We'd need to turn
| every component in the hidden layers of a neural network
| into its own massive NN with its own memory to even
| begin to approach that kind of complexity.
|
| [1] https://en.m.wikipedia.org/wiki/Neuroplasticity
| alasdair_ wrote:
| This is a fantastically good point. I think things will get
| even more interesting once the ML tools have access to more
| than just text, audio and image/video information. They will
| be able to draw inferences that humans will generally be
| unaware of. For example, maybe something happens in the
| infrared range that humans are generally oblivious to, or
| maybe inferences can be drawn based on how radio waves bounce
| around an object.
|
| "The universe" according to most human experience misses SO
| much information and it will be interesting to see what
| happens once we have agents that can use all this extra stuff
| in realtime and "see" things we cannot.
| visarga wrote:
| The hypothesis that you can't learn some things from text -
| that you need real-life experience - is intuitive, and I used
| to think it was true. But there are interesting results from
| just a few days ago suggesting that text by itself is enough:
|
| > We test a stronger hypothesis: that the conceptual
| representations learned by text only models are functionally
| equivalent (up to a linear transformation) to those learned
| by models trained on vision tasks. Specifically, we show that
| the image representations from vision models can be
| transferred as continuous prompts to frozen LMs by training
| only a single linear projection.
|
| Linearly Mapping from Image to Text Space -
| https://arxiv.org/abs/2209.15162
| ummonk wrote:
| The claim isn't that you can't learn it from text, but
| rather that this is why models require so much text to
| train on - because they're learning the stuff that humans
| learn from video.
| thrown_22 wrote:
| > Humans take in a tremendously high bitrate of data via
| other senses and are able to connect those to the much lower
| amount of language input such that the language can go much
| further.
|
| They don't. Human bitrates are quite low, all things
| considered. The eyes, which by far produce the most
| information, only have a bitrate equivalent to ~2 kbps:
|
| http://www.princeton.edu/~wbialek/our_papers/ruyter+laughlin.
| ..
|
| The rest of the input nerves don't bring us over 20 kbps.
|
| The average image recognition system has access to more data
| and can tell the difference between a cat and a banana. A
| human has somewhat more capability than that.
| andreyk wrote:
| I think comparing to humans is a bit of a distraction, unless
| what you care about is replicating the way human intelligence
| works in AI. The mechanisms by which learning is done (in these
| cases self-supervised and supervised learning) are not at all
| the same as humans have, so it's unsurprising the qualitative
| aspects are different.
|
| It may be argued we need more human-like learning mechanisms.
| Then again, if we need internet-scale data to achieve human-
| level general intelligence, so what? If it works it works. Of
| course, the comparison has some value in terms of knowing what
| can be improved and so on, especially for RL. But I wouldn't
| call this a 'troubling problem'.
| MonkeyMalarky wrote:
| Humans also have millions of years of evolution that have
| effectively pre-trained the structure and learning ability of
| the brain. A baby isn't born knowing a language but is born
| with the ability to efficiently learn them.
| peteradio wrote:
| Indeed, there is a certain hardcoding that can efficiently
| synthesize language. Doesn't that beg the question... what is
| the missing hardcoding for AI that could enable it to
| synthesize via much smaller samples?
| myownpetard wrote:
| There is a great paper, Weight Agnostic Neural Networks
| [0], that explores this topic. They experiment with using a
| single shared weight for a network while using an
| evolutionary algorithm to find architectures that are
| themselves biased towards being effective on specific
| problems.
|
| The upshot is that once you've found an architecture that
| is already biased towards solving a specific problem, then
| the training of the weights is faster and results in better
| performance.
|
| From the abstract, "...In this work, we question to what
| extent neural network architectures alone, without learning
| any weight parameters, can encode solutions for a given
| task.... We demonstrate that our method can find minimal
| neural network architectures that can perform several
| reinforcement learning tasks without weight training. On a
| supervised learning domain, we find network architectures
| that achieve much higher than chance accuracy on MNIST
| using random weights."
|
| [0] https://arxiv.org/abs/1906.04358
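|
| A toy sketch of the shared-weight evaluation idea (not the
| paper's code; the tiny architecture and data below are made
| up for illustration):
|
|     import numpy as np
|
|     def shared_weight_fitness(forward_fn, data, weight_values):
|         # Score an architecture by how well it does when every
|         # connection uses the same single shared weight value.
|         errors = []
|         for w in weight_values:
|             for x, y in data:
|                 errors.append(abs(forward_fn(x, w) - y))
|         return -np.mean(errors)   # higher is better
|
|     def tiny_net(x, w):
|         # A fixed two-unit architecture where all weights equal w
|         h = np.tanh(w * x)
|         return np.tanh(w * h)
|
|     data = [(0.0, 0.0), (0.5, 0.4), (1.0, 0.7)]
|     print(shared_weight_fitness(tiny_net, data, [-2, -1, 1, 2]))
|
| An evolutionary search would then mutate the architecture and
| keep the variants with the best shared-weight score.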
| Der_Einzige wrote:
| This btw is an example of a whole field called "extreme
| learning"
|
| https://en.m.wikipedia.org/wiki/Extreme_learning_machine
| visarga wrote:
| The brain has about 1,000T synapses and GPT-3 has 175B
| parameters, even though a parameter is much simpler than a
| synapse. So the scale of the brain is at least ~5,700x that of
| GPT-3. It seems normal to have to compensate by using 200x more
| training data.
| 6gvONxR4sf7o wrote:
| What's missing is interaction/causation, and the reason is that
| we can scale things more easily without interaction in the data
| gathering loop. Training a model with data gathering in the
| loop requires gathering more data every time the model takes a
| learning step. It's slow and expensive. Training a model on
| pre-existing data is much simpler, and it's unclear whether
| we've reached the limits of that yet.
|
| My prediction is we'll get 'good enough for prod' without
| interactive data, which will let us put interactive systems in
| the real world at scale, at which point the field's focus will
| be able to shift.
|
| One way to look at it is active learning. We all know the game
| where I think of a number between 0 and 100 and you have to
| guess it, and I'll tell you if it's higher or lower. You'll
| start by guessing 50, then maybe 25, and so on, bisecting the
| intervals. If you want to get within +/-1 of the number I'm
| thinking of, you need about six data points. On the other hand,
| if you don't do this interactively, and just gather a bunch of
| data before seeing any answers, then to get within +/-1 you
| need about 50 data points. The interactivity means you can
| refine your
| questions in response to whatever you've learned, saving huge
| amounts of time.
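|
| The guessing-game arithmetic as a sketch (same illustrative
| 0-100 range as above):
|
|     def interactive_queries(secret, lo=0, hi=100):
|         queries = 0
|         while hi - lo > 2:            # stop once within +/-1
|             mid = (lo + hi) // 2
|             queries += 1
|             if secret > mid:
|                 lo = mid
|             else:
|                 hi = mid
|         return queries                # ~6 for a 0-100 range
|
|     def batch_queries(lo=0, hi=100):
|         # With no feedback you must pre-commit to a grid of
|         # questions dense enough to land within +/-1 of any secret
|         return len(range(lo + 1, hi, 2))
|
|     print(interactive_queries(73), batch_queries())   # 6 50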
|
| Another way to look at it is like randomized controlled trials.
| To learn a compact idea (more X means more Y), you can
| randomize X and gather just enough data on Y to be confident
| that the relationship isn't a coincidence. The alternative
| (observational causal inference) is famously harder. You have
| to look at a bunch of X's and Y's, and also all the Z's that
| might affect them, and then get enough data to be confident in
| this entire structure you've put together involving lots of
| variables.
|
| The way ML has progressed is really a function of what's easy.
| If you want a model to learn to speak English, do you want it
| to be embodied in the real world for two years with humans
| teaching it full time how the world and language relate? Or is
| it faster to just show it a terabyte of English?
|
| tl;dr observational learning is much much harder than
| interactive learning, but we can scale observational learning
| in ways we can't scale interactive learning.
| gxt wrote:
| Because the whole industry is wrong. ML is incapable of general
| intelligence, because that's not what intelligence is. ML is
| the essential component with which one interfaces with the
| universe, but it's not intelligence, and never will be.
| Symmetry wrote:
| Humans are using less data but we throw drastically more
| compute at the problem during learning.
| edf13 wrote:
| Simple - humans aren't learning by reading and understanding a
| word at a time (or token)...
|
| They are taking many thousands (millions?) of inputs every
| minute from their surroundings
| edf13 wrote:
| Just reminded myself of Johnny 5 needing input...
|
| https://youtu.be/Y9lwQKv71FY
| idiotsecant wrote:
| >deep learning systems are at least an order of magnitude less
| data efficient than the systems they hope to replicate.
|
| While true on the surface, you have to also consider that there
| is a _vast_ quantity of training data expressed in our DNA. Our
| 'self' is a conscious thought, sure, but it's also unconscious
| action and instinct, all of which is indirect lived experience
| of our forebear organisms. The ones that had a slightly better
| twitch response to the feel of an insect crawling on their arm
| were able to survive the incident, etc. Our 'lizard brains' are
| the result of the largest set of training data we could
| possibly imagine - the evolutionary history of life on earth.
| c3534l wrote:
| Brains do not actually work very similarly to artificial neural
| networks. The connectionist approach is no longer favored, and
| human brains are not arranged in regular grids of fully
| interconnected layers. ANNs were inspired by how people thought
| the brain worked more than 50 years ago. Of course, ANNs are
| meant to work and solve practical problems with the technology
| we have. They're not simulations.
| machina_ex_deus wrote:
| I agree. If you look at animals it's also clear that the
| scaling hypothesis breaks down at some point, as all measures
| of brain size (brain-to-body mass ratio, etc.) fail to capture
| intelligence. And animals have natural neural networks.
|
| If you think about it, neural networks have roamed the earth
| for millions of years - including a genetic algorithm for
| optimizing the hardware. And yet only extremely recently
| something like humans happened. Why?
|
| The amount of training and processing power which happened
| naturally through evolution beats current AI research by
| several orders of magnitude. Yes, evolution isn't intelligent
| design. But the current approach to AI isn't intelligent design
| either.
| godelski wrote:
| As an ML vision researcher, I find these scaling hypothesis
| claims quite ridiculous. I understand that the NLP world has
| made large strides by adding more attention layers, but I'm not
| an NLP person and I suspect there's more to it than just more
| layers. We won't even talk about the human brain; let's just
| address the "scaling is sufficient" hypothesis.
|
| With vision, pointing to Parti and DALL-E as evidence of
| scaling is quite dumb. They perform similarly but are
| DRASTICALLY different in size. Parti has configurations with
| 350M, 750M, 3B, and 20B parameters. DALL-E 2 has 3.5B. Imagen
| uses T5-XXL, which alone has 11B parameters, just in the text
| part.
|
| Not only this, there are major architecture changes. If scaling
| was all you needed then all these networks would still be using
| CNNs. But we shifted to transformers. THEN we have shifted to
| diffusion-based models. Not to mention that Parti, DALL-E, and
| Imagen have different architectures. It isn't just about scale.
| Architecture matters here.
|
| And to address concerns: diffusion (invented decades ago)
| didn't start working because we just scaled it up. It worked
| because of engineering. It was largely ignored previously
| because no one got
| it to work better than GANs. I think this lesson should really
| stand out. That we need to consider the advantages and
| disadvantages of different architectures and learn how to make
| ALL of them work effectively. In that manner we can combine them
| to work in ideal ways. Even LeCun is coming to this point of
| view despite previously being on the scaling side.
|
| Maybe you NLP folks disagree, but the experience in vision is
| far richer than just scaling.
| panabee wrote:
| this is well articulated. another key point: dall-e 2 uses 70%
| _fewer_ parameters than dall-e 1 while offering far higher
| quality.
|
| from wikipedia (https://en.wikipedia.org/wiki/DALL-E):
|
| DALL-E's model is a multimodal implementation of GPT-3 with 12
| billion parameters which "swaps text for pixels", trained on
| text-image pairs from the Internet. DALL-E 2 uses 3.5 billion
| parameters, a smaller number than its predecessor.
| andreyk wrote:
| I agree - personally I think scaling laws and the scaling
| hypothesis are quite distinct. The scaling hypothesis is 'just
| go bigger with
| what we have and we'll get AGI', vs scaling laws are 'for these
| tasks and these models types, these are the empirical trends in
| performance we see'. I think scaling laws are still really
| valuable for vision research, but as you say we should not just
| abandon thinking about things beyond scaling even if we observe
| good scaling trends.
| godelski wrote:
| Yeah, I agree with this position. It is also what I see within
| my own research, where I also see the vast importance of
| architecture search. This may not be what the
| public sees, but I think it is well known to the research
| community or anyone with hands on experience with these types
| of models.
| andreyk wrote:
| Co-author here, happy to answer any questions/chat about stuff we
| did not cover in this overview!
| puttycat wrote:
| Hi! Great post. See my comment below about scaling down.
| andreyk wrote:
| Thanks! I'll take a look, those do look interesting.
| benlivengood wrote:
| It would be great to see more focus on Chinchilla's result that
| most large models were quite undertrained with respect to
| optimal reduction in test loss.
| andreyk wrote:
| agreed, we did not discuss that sufficiently
| [deleted]
| 3vidence wrote:
| Something that has concerned me about scaling to AGI is
| "adversarial examples": small tweaks to the input that cause
| unpredictable behavior in the system. At a high level these are
| caused by unexpected regions of the model's high-dimensional
| decision space that don't align with our intuition. This
| problem in general seems to get worse as models grow.
|
| From a value perspective, a very high-fidelity model with
| extremely unexpected behavior is worth little, since you need a
| human there full time to make sure that the model doesn't go
| haywire that 1-5% of the time.
| mjburgess wrote:
| "Scaling" means increasing the number of parameters. _Parameters_
| are just the database of the system. At 300GB of parameters,
| we're talking models which remember compressed versions of all
| books ever written.
|
| This is not a path to "AGI", this is just building a search
| engine with a little better querying power.
|
| "AI" systems today are little more than superpositions of google
| search results, with their parameters being a compression of
| billions of images/documents.
|
| This isn't even on the road to intelligence, let alone an instance
| of it. "General intelligence" does not solve problems by
| induction over billions of examples of their prior solutions.
|
| And exponential scaling in the amount of such remembering
| required is a fatal trajectory for AI, and likewise an
| indication that it doesn't deserve the term.
|
| No intelligence is exponential in an answer-space; indeed, I'd
| say that's *the whole point* of intelligence!
|
| We already know that if you compress all possible {(Question,
| Answer)} pairs, you can "solve" any problem trivially.
| MathYouF wrote:
| The tone of this betrays a conversation style more
| argumentative than collaborative, and one I may not want to
| engage with further (which I've noticed seems common amongst
| anti-connectionists), but I did find one point interesting for
| discussion.
|
| > Parameters are just the database of the system.
|
| Would any equation's parameters be considered just the database
| then? C in E=MC^2, 2 in a^2+b^2=c^2?
|
| I suppose those numbers are basically a database, but the
| relationships (connections) they have to the other variables
| (inputs) represent a demonstrable truth about the universe.
|
| To some degree every parameter in a nn is also representing
| some truth about the universe. How general and compact that
| representation is currently is not known (likely less than we'd
| like of both traits).
| jsharf wrote:
| I'm not anti-connectionist, but if I were to put myself in
| their shoes, I'd respond by pointing out that in E=MC^2, C is
| a value which directly correlates with empirical results. If
| all of humanity were to suddenly disappear, a future advanced
| civilization would re-discover the same constant, though
| maybe with different units. Their neural networks, on the
| other hand, probably would be meaningfully different.
|
| Also, the C in E=MC^2 has units which define what it means in
| physical terms. How can you define a "unit" for a neural
| network's output?
|
| Now, my thoughts on this are contrary to what I've said so
| far. Even though neural network outputs aren't easily defined
| currently, there are some experimental results showing neurons
| in neural networks demonstrating symbolic-like higher-level
| behavior:
|
| https://openai.com/blog/multimodal-neurons/
|
| Part of the confusion likely comes from how neural networks
| represent information -- often by superimposing multiple
| different representations. A very nice paper from Anthropic
| and Harvard delved into this recently:
|
| https://transformer-circuits.pub/2022/toy_model/index.html
| ctoth wrote:
| Related: Polysemanticity and Capacity in Neural Networks
| https://arxiv.org/abs/2210.01892
| mjburgess wrote:
| There's a very literal sense in which NN parameters are just
| a db. As in, it's fairly trivial to get copyrighted verbatim
| output from a trained NN (e.g., Quake source code from GitHub
| Copilot, etc.).
|
| "Connectionists" always want to reduce everything to formulae
| with no natural semantics and then equivocate this with
| science. Science isnt mathematics. Mathematics is just a
| short hand for a description of the world _made true_ by the
| semantics of that description.
|
| E=mc^2 isn't true because it's a polynomial, and it doesn't
| _mean_ a polynomial, and it doesn't have "polynomial
| properties", because it isn't _about_ mathematics. It's
| _about_ the world.
|
| E stands for energy, m for mass, and c for a geometric
| constant of spacetime. If they were to stand for other
| properties of the world, in general, the formulae would be
| false.
|
| I find this "connectionist supernaturalism" about mathematics
| deeply irritating; it has all the hubris and numerology of
| religions but wandering around in a stolen lab coat. Hence
| the tone.
|
| What can one say or feel in the face of the overtaking of
| science by pseudoscience? It seems plausible to say now,
| today, more pseudoscientific papers are written than
| scientific ones. A generation of researchers are doing little
| more than analysing ink-blot patterns and calling them
| "models".
|
| The insistence, without explanation, that this is a
| reasonable activity pushes one past tolerance on these
| matters. It's exasperating... from psychometrics to AI, the
| whole world of intellectual life has been taken over by a
| pseudoscientific analysis of non-experimental post-hoc
| datasets.
| politician wrote:
| This discussion (the GP and your response) perhaps suggests
| that evaluating the intelligence of an AI may need to involve
| more than the generation of some content; it should also weigh
| citations and supporting work for that content. I guess I'm
| suggesting that the field could benefit from a shift towards
| explainability-first models.
| benlivengood wrote:
| That's why the Chinchilla paper (given a single paragraph in
| the article) is so important; it gives a scaling equation that
| puts a limit on the effect of increasing parameters. Generally
| for the known transformer models the reduction in loss from
| having infinite parameters is significantly less than the
| reduction in loss from training on infinite data. Most large
| models are very undertrained.
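|
| For reference, the Chinchilla paper fits a parametric loss of
| roughly this form (the constants below are approximately the
| published fit; treat them as illustrative):
|
|     def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7,
|                         alpha=0.34, beta=0.28):
|         # N = parameters, D = training tokens
|         return E + A / N**alpha + B / D**beta
|
|     # GPT-3-like allocation vs. a Chinchilla-like one with fewer
|     # parameters but ~20 tokens per parameter:
|     print(chinchilla_loss(175e9, 300e9))
|     print(chinchilla_loss(70e9, 1.4e12))   # lower predicted loss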
| aaroninsf wrote:
| Everything of interest in ML networks is occurring in the
| abstractions that emerge in training in deep multi-layer
| networks.
|
| At the crudest level this immediately provides for more than
| canned lookup as asserted; analogical reasoning is a much-
| documented emergent property.
|
| But analogies are merely the simplest, first-order abstractions,
| which are easy for humans to point at.
|
| Inference and abstraction across multiple levels means the
| behavior of these systems is utterly unlike simple stores. One
| clear demonstration of this is the effective "compression" of
| image gen networks. They don't compress images. For lack of any
| better vocabulary, they understand them well enough to produce
| them.
|
| The hot topic is precisely whether there are boundaries to what
| sorts of implicit reasoning can occur through scale, and what
| other architectures need to be present to effect agency and
| planning of the kind hacked at in traditional symbolic systems
| AI.
|
| It might be worthwhile to read contemporary work to get up to
| speed. Things are already a lot weirder than we have had time
| to internalize.
| AstralStorm wrote:
| Can they be said to understand the images if a style transfer
| model they produce is image dependent with an unstable
| threshold boundary? Or when they make an error similar to
| pareidolia all the time, seeing faces where there are none?
| Or when they cannot render even roughly plausible text?
| swid wrote:
| 300 GB is nothing compared to the vastness of information in
| the universe (hence it fitting on a disk). AI is approximating
| a function, and the function they are now learning to
| approximate is us.
|
| From [1], with my own editing...
|
| Comparing current model performance with human performance:
|
| > ...[humans] can achieve closer to 0.7 bits per character.
| What is in that missing >0.4?
|
| > Well--everything! Everything that the model misses. While
| just babbling random words was good enough at the beginning, at
| the end, it needs to be able to reason its way through the most
| difficult textual scenarios requiring causality or commonsense
| reasoning... every time that it lacks the theory of mind to
| compress novel scenes describing the Machiavellian scheming of
| a dozen individuals at dinner jockeying for power as they
| talk...
|
| > If we trained a model which reached that loss of <0.7, which
| could predict text indistinguishable from a human, whether in a
| dialogue ...how could we say that it doesn't truly understand
| everything?
|
| [1] https://www.gwern.net/Scaling-hypothesis
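|
| For context, a rough sketch of how a per-token loss maps to
| bits per character (the 4 chars/token ratio and the example
| loss value are illustrative assumptions, not measurements):
|
|     import math
|
|     def bits_per_character(loss_nats_per_token, chars_per_token=4.0):
|         bits_per_token = loss_nats_per_token / math.log(2)
|         return bits_per_token / chars_per_token
|
|     print(bits_per_character(2.0))   # ~0.72 bits per character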
| andreyi wrote:
| puttycat wrote:
| Good overview.
|
| At the other extreme, some recent works [1,2] show why it's
| sometimes better to scale down instead of up, especially for some
| humanlike capabilities like generalization:
|
| [1]
| https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00489...
|
| [2] https://arxiv.org/abs/1906.04358
| MathYouF wrote:
| If greater parameterization leads to memorization rather than
| generalization, it's likely a failure in our current
| architectures and loss formulations rather than an inherent
| benefit of "fewer parameters" improving generalization. Other
| animals do not generalize better than humans despite having
| fewer neurons (or their generalizations betray a
| misunderstanding of the number and depth of subcategories there
| are for things, like when a dog barks at everything that passes
| by the window).
___________________________________________________________________
(page generated 2022-10-07 23:00 UTC)