[HN Gopher] Advancing AI theory with first-principles understand...
___________________________________________________________________
Advancing AI theory with first-principles understanding of deep
neural networks
Author : pcaversaccio
Score : 116 points
Date : 2021-06-19 09:07 UTC (13 hours ago)
(HTM) web link (ai.facebook.com)
(TXT) w3m dump (ai.facebook.com)
| cs702 wrote:
| I just added it to my reading list. Thankfully, the authors
| require only knowledge of undergrad math (basic linear
| algebra and multivariable calculus), and prioritize "intuition
| above formality":
|
| > While this book might look a little different from the other
| deep learning books that you've seen before, we assure you that
| it is appropriate for everyone with knowledge of linear algebra,
| multivariable calculus, and informal probability theory, and with
| a healthy interest in neural networks. Practitioner and theorist
| alike, we want all of you to enjoy this book... _we've strived
| for pedagogy in every choice we've made, placing intuition above
| formality._
|
| Kudos to them for taking this approach!
|
| I'm looking forward to diving in!
| exporectomy wrote:
| Incredibly, Sadi Carnot in the 1800s also wanted a theory of
| deep learning [intro quote, p2]
| amelius wrote:
| What do you mean by "p2"? The article is not paginated.
| bserge wrote:
| Paragraph, probably.
| agnosticmantis wrote:
| They meant page 2 of the manuscript linked in the article:
|
| https://deeplearningtheory.com/PDLT.pdf
| cs702 wrote:
| I trust you're joking. Carnot wanted a theory that could
| explain _the steam-powered machines of the industrial age_ --
| just as now we want a theory that could explain the AI-powered
| machines of the information age. The authors mention as much in
| the introduction.
| cinntaile wrote:
| He's not joking though, they really say that in the PDF
| linked below. "Steam navigation brings nearer together the
| most distant nations. ... their theory is very little
| understood, and the attempts to improve them are still
| directed almost by chance. ...We propose now to submit these
| questions to a deliberate examination. -Sadi Carnot,
| commenting on the need for a theory of deep learning"
|
| It's probably a typo; it'll get corrected. It should probably
| say "a theory of thermodynamics" or something similar instead
| of "deep learning".
| typon wrote:
| It's a joke, lighten up.
| cinntaile wrote:
| Are you the author of the PDF, or do you know the author,
| so that you know for sure? If not, then it could be a
| typo. I'm not sure why that's such a controversial
| statement.
| frooxie wrote:
| It looks like dry humor to me.
| dr_dshiv wrote:
| My favorite theoretical description of multilayer networks comes
| from the first multilayer network, the 1986 harmonium [1]. It
| used a free energy minimization model (in the paper it is called
| harmony maximization), which is concise, natural, and
| effective. I find the paper very well written and insightful --
| even today.
|
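| For concreteness, here's a minimal numpy sketch of the free
| energy such models minimize (a standard restricted-Boltzmann-
| machine form; the function and variable names are mine, not
| from the paper):
|
|       import numpy as np
|
|       def free_energy(v, W, b, c):
|           """F(v) = -b.v - sum_j log(1 + exp(c_j + (W v)_j)).
|           Training lowers F(v) on observed data v, i.e., it
|           maximizes harmony."""
|           hidden_input = c + W @ v
|           # np.logaddexp(0, x) is a stable log(1 + exp(x))
|           return -(b @ v) - np.sum(np.logaddexp(0.0, hidden_input))
|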
| I haven't fully read the current paper, but it doesn't mention
| "free energy" -- which seems odd given their emphasis on
| thermodynamics and first principles.
|
| [1]
| https://www.researchgate.net/publication/239571798_Informati...
| plaidfuji wrote:
| This all makes sense, but at the same time feels a bit
| paradoxical to me. We're developing a first-principles theory to
| understand the mechanics of massive empirical models. Isn't that
| kind of ironic?
|
| If the problem we're solving is "faster iteration and design of
| optimal deep learning models for any particular problem", I would
| have thought the ML-style approach would be to solve that...
| through massive empiricism. Develop a generalized dataset
| simulator, a language capable of describing the design space of
| all deep learning models, and build a mapping between dataset
| characteristics <> optimal model architecture. Maybe that's the
| goal and I haven't dug deep enough. Just feels funny that all of
| our raw computing power has led us back to the need for more
| concise theory. Maybe that's a fundamental law of some kind.
| 6gvONxR4sf7o wrote:
| Sufficient understanding will always trump guesswork when the
| understanding exists (kinda tautologically). Hence the push to
| find it.
| trhway wrote:
| We had a manager with a Theoretical Physics education from a
| top school who seriously suggested ultimately solving QA by
| building a system to run the program through all the possible
| combinations of branches, etc.
| oddity wrote:
| I think this is to be expected. The oscillation between first-
| principles and empirical models falls out of the scientific
| method: see a few datapoints, develop a predictive theory, try
| to prove the theory wrong with new datapoints, and iterate
| toward alternative explanations with fewer assumptions, greater
| predictive power, etc.
|
| This happens even in pure mathematics, just at a more abstract
| level: start with conjectures seen on finite examples, prove
| some limited infinite cases, eventually prove or disprove the
| conjecture entirely.
|
| Current DL models are so huge they've outpaced the scale where
| our existing first-principles tools (like linear algebra) can
| efficiently predict the phenomena we see when we use them. The
| space has gotten larger, but human brains haven't, so if we
| still want humans to be productive, we need to develop a more
| efficient theory. Empirical models explaining empirical models
| might work, but not for humans.
| [deleted]
| mark_l_watson wrote:
| Excellent article, and the little bit of the paper I've read so
| far stresses building intuition. Andrew Ng's classes stressed
| building intuition also.
|
| Even though I have largely earned my living doing deep learning
| in the last six years, I believe that hybrid AI will get us to
| AGI. Getting a first principles understanding of DL models is
| complementary to building hybrid AI systems.
|
| EDIT: I am happy to have a work-in-progress PDF of the
| manuscript, but I wish authors would also release ePub and
| Kindle formats. I spread my reading across devices, and with a
| PDF I need to remember where I left off reading and navigate
| there.
| Vlados099 wrote:
| Check out calibre for converting to other formats:
| https://calibre-ebook.com/
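|
| For a PDF, the bundled ebook-convert command-line tool should
| handle the conversion directly (assuming a local copy of the
| manuscript saved as PDLT.pdf):
|
|       ebook-convert PDLT.pdf PDLT.epub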
| thewarrior wrote:
| What is hybrid AI?
| mark_l_watson wrote:
| A combination of at least DL and good old-fashioned symbolic
| AI.
| alok-g wrote:
| I just finished looking through the manuscript
| [https://deeplearningtheory.com/PDLT.pdf]. The mathematics is
| heavy for me, especially for a quick read, but one great thing
| I see is that the authors have reduced dependence on external
| literature by inlining the various derivations and proofs
| instead of just providing references.
|
| ## _The epilogue section (page 387 of the book, 395 in the PDF)
| gives a good overview, presented below per my own
| understanding:_
|
| Networks with a very large number of parameters, many more than
| the number of training examples, should as such overfit. The
| number of parameters is conventionally taken as a measure of
| model complexity. A very large network can perform well on the
| training data by just memorizing it, and then perform poorly on
| unseen data. Yet somehow these very large networks empirically
| still generalize well, i.e., they recognize good patterns from
| the training data.
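|
| As a tiny illustration of the memorization point (a minimal
| numpy sketch of my own, not from the book, with a linear model
| standing in for a network):
|
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       n_train, n_params = 20, 200  # far more parameters than examples
|       X = rng.normal(size=(n_train, n_params))
|       y = rng.normal(size=n_train)  # pure-noise "labels"
|
|       # The minimum-norm interpolating solution drives training
|       # error to ~0 even though there is no pattern to learn,
|       # i.e., the model simply memorizes the noise.
|       w = np.linalg.pinv(X) @ y
|       print(np.abs(X @ w - y).max())  # ~1e-13: perfect fit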
|
| The authors show that model complexity (or the ability to
| generalize well, I would say) for such large networks depends
| on the network's depth-to-width ratio:
|
| * When the network is much wider than it is deep (the ratio
| approaches zero), the neurons in the network don't have as many
| "data-dependent couplings". My understanding is that while the
| large width gives the network power in terms of the number of
| parameters, it has less opportunity for a correspondingly
| large number of feature transformations. While the network can
| still _fit_ the training data well [1, 2], it may not generalize
| well. In the authors' words, when the depth-to-width ratio is
| close to zero (page 394), "such networks are not really deep"
| (even if the depth is much more than two) "and they do not learn
| representations."
|
| * On the opposite end, when the network is very deep (the ratio
| going closer to one or larger), {I'm rephrasing the authors from
| my limited understanding} the network needs a non-Gaussian
| description of the model parameter space, which makes it "not
| tractable" and not practically useful for machine learning (a
| rough numerical illustration of both regimes follows after this
| list).
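|
| As a rough numerical illustration of these two regimes (my own
| Monte Carlo sketch, not from the book): sample many random tanh
| networks at a fixed input and measure how Gaussian a single
| output unit looks across samples. Finite-width corrections are
| expected to grow with the depth-to-width ratio, so the wide-
| shallow case should sit near zero excess kurtosis (Gaussian)
| while the deep-narrow case drifts away from it:
|
|       import numpy as np
|
|       def output_excess_kurtosis(depth, width, n_samples=4000, seed=0):
|           """Excess kurtosis of one output unit over an ensemble
|           of random tanh MLPs; ~0 means Gaussian."""
|           rng = np.random.default_rng(seed)
|           x = np.ones(width) / np.sqrt(width)  # fixed unit-norm input
|           outs = np.empty(n_samples)
|           for s in range(n_samples):
|               h = x
|               for _ in range(depth):
|                   W = rng.normal(0.0, 1.0, size=(width, width))
|                   h = np.tanh(W @ h / np.sqrt(width))
|               outs[s] = h[0]
|           z = (outs - outs.mean()) / outs.std()
|           return (z ** 4).mean() - 3.0
|
|       print(output_excess_kurtosis(depth=3, width=128))  # ratio ~0.02
|       print(output_excess_kurtosis(depth=30, width=8))   # ratio ~3.8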
|
| While it makes intuitive sense that the network's capability to
| find good patterns and representations depends on the depth-to-
| width ratio, the authors have supplied the mathematical
| underpinnings behind this, as briefly summarized above. My
| previous intuition was that having a larger number of layers
| allows for more feature transformations, making learning easier
| for the network. The new understanding via the authors' work is
| that if, for the same number of layers, the width is increased,
| the network now has a _harder_ job learning feature
| transformations commensurate with the now larger number of
| neurons.
|
| ## _My own commentary and understanding (some from before looking
| at the manuscript)_
|
| If the size of the network is very small, the network won't be
| able to _fit_ the training data well. A larger network would
| generally have more 'representation' power, allowing it to
| capture more complex patterns.
|
| The ability to _fit_ the training data is of course different
| from the ability to generalize to unseen data. Merely adding
| more representation power can allow the network to overfit. As
| the network size starts exceeding the size of the training
| data, it could have a tendency to just memorize the training
| data without generalizing, unless something is done to prevent
| that.
|
| So as the size of the network is increased with the intention
| of giving it more representation power, we need something more
| such that the network first learns the most common patterns
| (highest compression, but lossy) and then keeps on learning
| progressively more intricate patterns (less compression, more
| accurate).
|
| My intuition so far was that achieving this was an aspect of
| the training _algorithm_ and of cell-design innovations, and
| also of the depth-to-width ratio. The authors, however, show
| that this depends on the depth-to-width ratio, and in the way
| specified. It is still counter-intuitive to me that algorithmic
| innovation may not play a role in this, or perhaps I am
| misunderstanding the work.
|
| So the 'representation power' of the network and its ability to
| _fit_ the training data would generally increase with the size
| of the network. However, its ability to _learn_ good
| representations and generalize depends on the depth-to-width
| ratio. Loosely speaking then, to increase model accuracy on the
| training data itself, the model size may need to be increased
| while keeping the aspect ratio constant (at least as long as
| the training data size remains the larger of the two), whereas
| to improve generalization and find good representations for a
| given model size, the aspect ratio should be tuned.
|
| Intuitively, I think that in a pathological case where the
| network is so large that merely its width (as opposed to width
| times depth) exceeds the size of the training data, then even
| if the depth-to-width ratio is chosen according to the guidance
| from the authors (page 394 in the book), the model would still
| fail to learn well.
|
| Finally, I wonder what the implications of the work are for
| networks with temporal or spatial weight-sharing like
| convolutional networks, recurrent and recursive networks,
| attention, transformers, etc. For example, for recurrent neural
| networks, the effective depth of the network depends on how
| long the input data sequence is; i.e., the depth-to-width ratio
| could vary simply because the input length varies (see the
| minimal sketch after this paragraph). The learning from the
| authors' work should, I think, directly apply if each time step
| is treated as a training sample on its own, i.e.,
| backpropagation through time is not considered. However, I
| wonder if the authors' work still presents some challenges for
| how long the input sequences can be, as the non-Gaussian aspect
| may start coming into the picture.
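|
| To make the effective-depth point concrete (a minimal sketch of
| my own, not from the book): unrolling a vanilla RNN over a
| length-T sequence applies T weight-tied layers, so the sequence
| length plays the role of depth:
|
|       import numpy as np
|
|       def rnn_unrolled(x_seq, W_h, W_x, h0):
|           """Vanilla RNN: each time step applies one weight-tied
|           layer, so a length-T input behaves like a depth-T
|           network with shared weights."""
|           h = h0
|           for x_t in x_seq:  # T iterations => effective depth T
|               h = np.tanh(W_h @ h + W_x @ x_t)
|           return h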
|
| As time permits, I will read the manuscript in more detail. I'm
| hopeful, however, that other people will get there faster and
| help me understand better. :-)
|
| ## _References:_
|
| [1]
| https://en.wikipedia.org/wiki/Universal_approximation_theore...
|
| [2] http://neuralnetworksanddeeplearning.com/chap4.html
___________________________________________________________________
(page generated 2021-06-19 23:01 UTC)