[HN Gopher] Advancing AI theory with first-principles understand...
       ___________________________________________________________________
        
       Advancing AI theory with first-principles understanding of deep
       neural networks
        
       Author : pcaversaccio
       Score  : 116 points
       Date   : 2021-06-19 09:07 UTC (13 hours ago)
        
 (HTM) web link (ai.facebook.com)
 (TXT) w3m dump (ai.facebook.com)
        
       | cs702 wrote:
       | I just added it to my reading list. Thankfully, the authors
        | require only knowledge of undergrad math (basic linear
       | algebra and multivariate calculus), and prioritize "intuition
       | above formality":
       | 
       | > While this book might look a little different from the other
       | deep learning books that you've seen before, we assure you that
       | it is appropriate for everyone with knowledge of linear algebra,
       | multivariable calculus, and informal probability theory, and with
       | a healthy interest in neural networks. Practitioner and theorist
       | alike, we want all of you to enjoy this book... _we've strived
       | for pedagogy in every choice we've made, placing intuition above
       | formality._
       | 
       | Kudos to them for taking this approach!
       | 
       | I'm looking forward to diving in!
        
       | exporectomy wrote:
        | Incredibly, Sadi Carnot in the 1800s also wanted a theory of
       | deep learning [intro quote, p2]
        
         | amelius wrote:
          | What do you mean by "p2"? The article is not paginated.
        
           | bserge wrote:
           | Paragraph, probably.
        
           | agnosticmantis wrote:
           | They meant page 2 of the manuscript linked in the article:
           | 
           | https://deeplearningtheory.com/PDLT.pdf
        
         | cs702 wrote:
         | I trust you're joking. Carnot wanted a theory that could
         | explain _the steam-powered machines of the industrial age_ --
         | just as now we want a theory that could explain the AI-powered
         | machines of the information age. The authors mention as much in
         | the introduction.
        
           | cinntaile wrote:
           | He's not joking though, they really say that in the PDF
           | linked below. "Steam navigation brings nearer together the
           | most distant nations. ... their theory is very little
            | understood, and the attempts to improve them are still
           | directed almost by chance. ...We propose now to submit these
           | questions to a deliberate examination. -Sadi Carnot,
           | commenting on the need for a theory of deep learning"
           | 
           | It's probably a typo, it'll get corrected. It should probably
           | say "a theory of thermodynamics" or something similar instead
           | of deep learning.
        
             | typon wrote:
              | It's a joke, lighten up.
        
               | cinntaile wrote:
                | Are you the author of the PDF, or do you know the
                | author, so that you know for sure? If not, then it
                | could be a typo. I'm not sure why that's such a
                | controversial statement.
        
             | frooxie wrote:
             | It looks like dry humor to me.
        
       | dr_dshiv wrote:
       | My favorite theoretical description of multilayer networks comes
       | from the first multilayer network, the 1986 harmonium [1]. It
       | used a free energy minimization model (in the paper it is called
        | harmony maximization), which is concise, natural, and
       | effective. I find the paper very well written and insightful --
       | even today.
       | 
       | I haven't fully read the current paper, but it doesn't mention
       | "free energy"-- which seems odd given their emphasis on
       | thermodynamics and first principles.
       | 
       | [1]
       | https://www.researchgate.net/publication/239571798_Informati...
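        | 
        | A minimal sketch of the idea (my own toy construction, not
        | taken from the paper): for a bipartite visible/hidden
        | network, harmony is a bilinear score of the joint state, and
        | the "free energy" of a visible vector marginalizes over the
        | hidden units. The sizes, weights, and biases below are
        | arbitrary illustration values.
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     n_visible, n_hidden = 6, 4
        |     W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # couplings
        |     b_v = np.zeros(n_visible)  # visible biases
        |     b_h = np.zeros(n_hidden)   # hidden biases
        | 
        |     def harmony(v, h):
        |         # Harmony H(v, h) = v.W.h + b_v.v + b_h.h; higher is better
        |         return v @ W @ h + b_v @ v + b_h @ h
        | 
        |     def free_energy(v):
        |         # F(v) = -log sum_h exp(H(v, h)) over all binary hidden states
        |         grids = np.meshgrid(*[[0, 1]] * n_hidden)
        |         hs = np.array(grids).T.reshape(-1, n_hidden)
        |         return -np.log(np.exp([harmony(v, h) for h in hs]).sum())
        | 
        |     v = rng.integers(0, 2, size=n_visible).astype(float)
        |     print(free_energy(v))  # maximizing harmony ~ minimizing this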
        
       | plaidfuji wrote:
       | This all makes sense, but at the same time feels a bit
       | paradoxical to me. We're developing a first-principles theory to
       | understand the mechanics of massive empirical models. Isn't that
       | kind of ironic?
       | 
       | If the problem we're solving is "faster iteration and design of
       | optimal deep learning models for any particular problem", I would
        | have thought the ML-style approach would be to solve that...
       | through massive empiricism. Develop a generalized dataset
       | simulator, a language capable of describing the design space of
       | all deep learning models, and build a mapping between dataset
       | characteristics <> optimal model architecture. Maybe that's the
       | goal and I haven't dug deep enough. Just feels funny that all of
       | our raw computing power has led us back to the need for more
       | concise theory. Maybe that's a fundamental law of some kind.
        
         | 6gvONxR4sf7o wrote:
         | Sufficient understanding will always trump guesswork when the
         | understanding exists (kinda tautologically). Hence the push to
         | find it.
        
         | trhway wrote:
          | We had a manager with a Theoretical Physics education from
          | a top school who seriously suggested ultimately solving QA
          | by building a system to run the program through all the
          | possible combinations of branches/etc.
        
         | oddity wrote:
         | I think this is to be expected. The oscillation between first-
         | principles and empirical models falls out of the scientific
         | method: see a few datapoints, develop a predictive theory, try
         | to prove the theory wrong with new datapoints, and reiterate
         | for alternative explanations with fewer assumptions, greater
         | predictive power, etc...
         | 
         | This happens even in pure mathematics, just at a more abstract
         | level: start with conjectures seen on finite examples, prove
         | some limited infinite cases, eventually prove or disprove the
         | conjecture entirely.
         | 
         | Current DL models are so huge they've outpaced the scale where
         | our existing first-principles tools (like linear algebra) can
         | efficiently predict the phenomena we see when we use them. The
         | space has gotten larger, but human brains haven't, so if we
         | still want humans to be productive, we need to develop a more
         | efficient theory. Empirical models explaining empirical models
         | might work, but not for humans.
        
       | [deleted]
        
       | mark_l_watson wrote:
        | Excellent article, and the little bit of the paper I have
        | read so far stresses building intuition. Andrew Ng's classes
        | stressed building intuition also.
       | 
       | Even though I have largely earned my living doing deep learning
       | in the last six years, I believe that hybrid AI will get us to
       | AGI. Getting a first principles understanding of DL models is
       | complementary to building hybrid AI systems.
       | 
        | EDIT: I am happy to have a work-in-progress PDF of the
        | manuscript, but I wish the authors would also release ePub
        | and Kindle formats. I spread my reading across devices, and
        | with a PDF I need to remember where I left off reading and
        | navigate there.
        
         | Vlados099 wrote:
          | Check out calibre for converting to other formats:
         | https://calibre-ebook.com/
        
         | thewarrior wrote:
         | What is hybrid AI ?
        
           | mark_l_watson wrote:
            | A combination of at least DL and good old-fashioned
            | symbolic AI.
        
       | alok-g wrote:
       | I just finished looking through the manuscript
        | [https://deeplearningtheory.com/PDLT.pdf]. The mathematics is
        | heavy going for me, especially on a quick read, though one
        | great thing I see is that the authors have reduced dependence
        | on external literature by inlining the various derivations
        | and proofs instead of just providing references.
       | 
        | ## _The epilogue section (page 387 of the book, 395 in the
        | PDF) gives a good overview, presented below per my own
        | understanding:_
       | 
        | Networks with a very large number of parameters, much larger
        | than the size of the training data, should as such overfit.
        | The number of parameters is conventionally taken as a measure
        | of model complexity. A very large network can perform well on
        | the training data by simply memorizing it, and then perform
        | poorly on unseen data. Somehow, empirically, these very large
        | networks still generalize well, i.e., they pick up good
        | patterns from the training data.
       | 
        | The authors show that model complexity (or the ability to
        | generalize well, I would say) for such large networks depends
        | on the depth-to-width ratio:
       | 
        | * When the network is much wider than it is deep (the ratio
        | approaches zero), the neurons in the network don't have as
        | many "data-dependent couplings". My understanding is that
        | while the large width gives the network power in terms of
        | number of parameters, it has less opportunity for a
        | correspondingly large number of feature transformations.
        | While the network can still _fit_ the training data well
        | [1, 2], it may not generalize well. In the authors' words,
        | when the depth-to-width ratio is close to zero (page 394),
        | "such networks are not really deep" (even if depth is much
        | more than two) "and they do not learn representations."
       | 
        | * On the opposite end, when the network is very deep (ratio
        | going toward one or larger), {I'm rephrasing the authors from
        | my limited understanding} the network needs a non-Gaussian
        | description of the model parameter space, which makes it
        | "not tractable" and not practically useful for machine
        | learning.
       | 
        | While it makes intuitive sense that the network's capability
        | to find good patterns and representations depends on the
        | depth-to-width ratio, the authors have supplied the
        | mathematical underpinnings behind this, as briefly summarized
        | above. My previous intuition was that having a larger number
        | of layers allows for more feature transformations, making
        | learning easier for the network. The new understanding via
        | the authors' work is that if, for the same number of layers,
        | the width is increased, the network now has a _harder_ job
        | learning feature transformations commensurate with the now
        | larger number of neurons.
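        | 
        | To make the wide-versus-deep distinction concrete, here is a
        | crude toy experiment of my own (not from the book): draw many
        | random initializations of a tanh MLP, push one fixed input
        | through, and check how non-Gaussian the distribution of a
        | single output preactivation looks. The widths, depths, trial
        | counts, and the excess-kurtosis proxy are all arbitrary
        | illustration choices; the book's actual statement is that
        | corrections to the infinite-width Gaussian description are
        | controlled by the depth-to-width ratio.
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        | 
        |     def last_preactivation(width, depth, x):
        |         # One random init of a tanh MLP, weights ~ N(0, 1/width)
        |         h = x
        |         for _ in range(depth):
        |             W = rng.normal(scale=width ** -0.5, size=(width, width))
        |             z = W @ h
        |             h = np.tanh(z)
        |         return z[0]  # a single output preactivation
        | 
        |     def excess_kurtosis(s):
        |         # Zero for a Gaussian; a crude proxy for non-Gaussianity
        |         s = (s - s.mean()) / s.std()
        |         return (s ** 4).mean() - 3.0
        | 
        |     for width, depth in [(64, 2), (64, 16), (64, 48), (256, 48)]:
        |         x = np.ones(width)  # fixed input, weights re-drawn per trial
        |         out = np.array([last_preactivation(width, depth, x)
        |                         for _ in range(500)])
        |         print(width, depth, round(excess_kurtosis(out), 2))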
       | 
       | ## _My own commentary and understanding (some from before looking
       | at the manuscript)_
       | 
        | If the size of the network is very small, the network won't
        | be able to _fit_ the training data well. A larger network
        | would generally have more 'representation' power, allowing it
        | to capture more complex patterns.
       | 
        | The ability to _fit_ the training data is, of course,
        | different from the ability to generalize to unseen data.
        | Merely adding more representation power can allow the network
        | to overfit. As the network size starts exceeding the size of
        | the training data, it could have a tendency to just memorize
        | the training data without generalizing, unless something is
        | done to prevent that.
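        | 
        | A toy illustration of that classical picture (my own made-up
        | example, using polynomial regression rather than a neural
        | network): as the number of fitted coefficients approaches the
        | number of training points, the training error collapses while
        | the error on held-out points need not.
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(1)
        | 
        |     # Small noisy training set from a simple underlying function
        |     x_train = np.linspace(-1, 1, 8)
        |     y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=8)
        |     x_test = np.linspace(-1, 1, 200)
        |     y_test = np.sin(3 * x_test)
        | 
        |     for degree in (1, 3, 7):  # degree 7 = 8 coefficients = 8 points
        |         c = np.polyfit(x_train, y_train, degree)
        |         train_mse = np.mean((np.polyval(c, x_train) - y_train) ** 2)
        |         test_mse = np.mean((np.polyval(c, x_test) - y_test) ** 2)
        |         print(degree, round(train_mse, 4), round(test_mse, 4))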
       | 
        | So as the size of the network is increased with the intention
        | of giving it more representation power, we need something
        | more, such that the network first learns the most common
        | patterns (highest compression, but lossy) and then keeps on
        | learning progressively more intricate patterns (now less
        | compression, more accurate).
       | 
        | My intuition so far was that achieving this was an aspect of
        | the training _algorithm_ and of cell-design innovations, and
        | also of the depth-to-width ratio. The authors, however, show
        | that this depends on the depth-to-width ratio, and in the way
        | specified above. It is still counter-intuitive to me that
        | algorithmic innovation may not play a role in this, or
        | perhaps I am misunderstanding the work.
       | 
        | So now the 'representation power' of the network and its
        | ability to _fit_ the training data would generally increase
        | with the size of the network, whereas its ability to _learn_
        | good representations and generalize depends on the depth-to-
        | width ratio. Loosely speaking, then, to increase model
        | accuracy on the training data itself, model size may need to
        | be increased while keeping the aspect ratio constant (at
        | least while the training data remains the larger of the two),
        | whereas to improve generalization and find good
        | representations for a given model size, the aspect ratio
        | should be tuned.
       | 
        | Intuitively, I think that in the pathological case where the
        | network is so large that its width alone (as opposed to width
        | times depth) exceeds the size of the training data, the model
        | would still fail to learn well even if the depth-to-width
        | ratio is chosen according to the authors' guidance (page 394
        | in the book).
       | 
        | Finally, I wonder what the implications of the work are for
        | networks with temporal or spatial weight-sharing, such as
        | convolutional networks, recurrent and recursive networks,
        | attention, transformers, etc. For example, for recurrent
        | neural networks, the effective depth of the network depends
        | on how long the input data sequence is, i.e., the depth-to-
        | width ratio could vary simply because the input length
        | varies. The lessons from the authors' work should, I think,
        | directly apply if each time step is treated as a training
        | sample on its own, i.e., if backpropagation through time is
        | not considered. However, I wonder whether the authors' work
        | still places limits on how long the input sequences can be,
        | as the non-Gaussian aspect may start coming into the picture.
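        | 
        | A back-of-the-envelope sketch of that last worry (my own
        | assumption, not from the book): if an unrolled RNN with
        | hidden width n is treated as having effective depth roughly
        | equal to the sequence length T, then the controlling ratio
        | grows linearly with T. The widths and lengths below are
        | arbitrary example values.
        | 
        |     # Toy arithmetic only: effective depth-to-width ratio of an
        |     # unrolled RNN under the assumption effective depth ~ T.
        |     for n in (256, 1024):          # hidden width
        |         for T in (10, 100, 1000):  # sequence length
        |             print(f"n={n:5d}  T={T:5d}  T/n={T/n:.3f}")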
       | 
        | As time permits, I will read the manuscript in more detail.
        | I'm hopeful, however, that other people may get there faster
        | and help me understand it better. :-)
       | 
       | ## _References:_
       | 
       | [1]
       | https://en.wikipedia.org/wiki/Universal_approximation_theore...
       | 
       | [2] http://neuralnetworksanddeeplearning.com/chap4.html
        
       ___________________________________________________________________
       (page generated 2021-06-19 23:01 UTC)