[HN Gopher] The Modern Mathematics of Deep Learning
       ___________________________________________________________________
        
       The Modern Mathematics of Deep Learning
        
       Author : tims457
       Score  : 150 points
       Date   : 2021-06-12 16:37 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | joe_the_user wrote:
       | It seems like this can leave the reader with the wrong
       | impression. Calculus really is "the mathematics of Newtonian
       | physics". This is just "some mathematics that might help a bit in
       | your intuitions of deep learning".
       | 
        | I.e., deep learning is fundamentally just about getting
        | mathematically simple but complex, many-layered "neural
        | networks" to do stuff: training them, testing them and deploying
        | them. There are many intuitions about these things but no
        | complete theory - some intuitions involve mathematical analogies
        | and simplifications while others involve "folk knowledge" or
        | large-scale experiments. That's not to say folks doing math
        | about deep learning aren't proving real things. It's just that
        | they aren't characterizing the whole, or even a substantial
        | part, of such systems.
       | 
        | It's not surprising that a complex system like a many-layered
        | ReLU network can't be fully characterized or solved
        | mathematically. You'd
       | expect that of any arbitrarily complex algorithmic construct.
       | Differential equations of many variables and arbitrary functions
       | also can't have their solutions fully characterized.
        
         | fogof wrote:
         | As a PhD student who sort of burned out on this type of
          | research, I agree that the complexity of neural networks as a
          | mathematical construct makes them very difficult to analyze.
          | This might also have to do with deep learning theory being a
          | subset of learning theory, which is subject to "No Free Lunch"
          | theorems [1], meaning you always have to be very careful not
          | to try to prove something that turns out to be impossible.
         | 
          | That being said, research on the kernel regime is one of the
          | very cool ideas, in my opinion, to gain traction in this field
         | in the past few years. To summarize: "If you make a neural
         | network wide enough, it gains the power to control its output
         | on each individual input separately, and will begin to fit its
         | training data perfectly". Of course, the real pleasure is in
         | understanding all the mathematical details of this statement!
         | 
         | [1] : https://en.wikipedia.org/wiki/No_free_lunch_theorem
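The quoted claim can be made concrete with a minimal numpy sketch (an illustration under simplifying assumptions, not taken from the thread or its references): freeze a very wide ReLU layer at its random initialization and fit only the linear readout, a common stand-in for the kernel/"lazy" regime.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny regression problem: n training points with arbitrary targets.
n = 20
X = rng.uniform(-1.0, 1.0, size=(n, 1))
y = rng.normal(size=n)

# "Wide network" stand-in: a huge layer of ReLU features whose weights
# stay frozen at their random initialization; only the linear output
# layer is fit (the "lazy" simplification of the kernel regime).
width = 5000
W = rng.normal(size=(1, width))
b = rng.normal(size=width)
Phi = np.maximum(X @ W + b, 0.0)

# With width >> n the random features (almost surely) span the targets,
# so least squares drives the training error to numerically zero.
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
train_error = float(np.max(np.abs(Phi @ coef - y)))
print(train_error)  # effectively zero: the training data is fit perfectly
```

With 5000 features and only 20 points, the feature matrix generically has full row rank, so the model gains exactly the per-input control over its output that the comment describes.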
        
           | joe_the_user wrote:
           | I got my master's years ago so now I'm a strict amateur. That
           | said, I don't think the "No free lunch theorem" is very
           | "interesting". It's nearly tautological that no approximation
           | method works for "any" function. The set of
           | predictable/interesting/useful/"real-world" functions is
           | going to have measure 0 compared to white noise so "any
           | function" will basically look like white noise and can't be
           | predicted. Approximating functions/sequences with vanishingly
            | low Kolmogorov complexity is more interesting - impossible in
            | general by Gödel's theorem, but what's the case "on average"?
            | (That depends on the choice process and so is ill-defined,
            | but defining it might be interesting.) The kernel regime
            | stuff looks interesting, but I don't know its relation to
            | wide networks.
           | 
           | Neural networks "tend to generalize well in the real world".
            | That's a pretty fuzzy statement, IMO, since "real world" is
            | hardly defined, but it's still what people experience, and
            | it's more useful to provide a more precise model where this
            | works than one where it doesn't.
           | 
            | Also, there's good theory on deep networks as universal
            | approximators, as well as theories of wide/shallow networks
            | [1].
           | 
           | [1]: https://arxiv.org/abs/1901.02220
        
             | roenxi wrote:
             | > Neural networks "tend to generalize well in the real
             | world".
             | 
             | I've always interpreted that as "we've found an algorithm
             | that could, given a foreseeable amount of computing power
             | and maybe some tweaks, simulate human decision making".
             | 
             | It isn't so much that neural networks can approximate the
             | real world as they can approximate human perception of the
             | real world.
        
         | jhrmnn wrote:
          | There are a few works that try to put deep learning on a
          | theoretical basis; I like this one, for example:
         | 
         | https://arxiv.org/abs/1703.00810
         | 
         | This goes beyond mere intuition, but it is also still very far
         | from a "complete theory".
         | 
         | I find it disappointing that so few people in deep learning
         | work on the theoretical foundations.
        
           | quibono wrote:
           | What are some subfields of mathematics that you would say are
           | crucial for gaining a proper understanding of all the things
           | related to deep learning (e.g. let's say the paper you
           | linked)? Even though the theory isn't complete, I'm sure a
           | grounding in certain fields of mathematics will be helpful.
        
             | iNic wrote:
             | This is always difficult to answer, and it will probably be
             | a mixture of many, however I am currently following
             | categorical approaches to machine learning. Category Theory
             | is the area of mathematics that studies composable
             | structures, i.e. like layers in a deep network. It is very
              | abstract and was invented to solve problems in algebraic
              | topology, but has been fruitful in other areas as well.
        
               | convexity123 wrote:
               | Could you give some favourite references, some use of
               | category theory in ML which gives good results compared
               | to standard approaches?
               | 
               | Is there a group doing this in Zurich?
        
             | 317070 wrote:
             | Dynamical systems and chaos theory (especially for neural
             | networks), information theory (especially for the paper
             | linked), probability theory (especially the more
             | foundational and axiomatic work)
        
           | 0-_-0 wrote:
           | Of the many "understanding neural networks" papers this is
           | one of the few valuable ones.
        
           | keithalewis wrote:
           | Agreed. Until we get to the point where there are theorems of
           | the form, for example, "Given a problem satisfying conditions
           | X, the optimal number of layers to minimize expected training
           | time for data satisfying Y is Z", it is just stamp
           | collecting.
        
         | conformist wrote:
         | It seems like it aims at giving somebody who would like to get
         | started doing theoretical research in the field some pointers
         | and basic insights. I don't think it does a particularly bad
         | job at this, in particular given that it will be a book
          | chapter? The target audience is probably people who have had
          | some exposure to functional analysis and the like.
        
       | rohittidke wrote:
        | I believe that the curse of dimensionality doesn't apply here,
        | as we are optimizing the "universal approximator" of the
        | "surface" of the possible real-world function.
        
         | antipaul wrote:
         | Does "possible" in your statement refer to the inherent
         | constraints of the architecture as specified by the researcher,
         | or something else?
        
       | amelius wrote:
       | What are the prerequisites?
        
         | fspeech wrote:
          | Mostly analysis. If you understand the notation in section 1,
          | you are obviously set. But even if you don't, you should still
          | be able to get the ideas with a bit of mental translation. In
          | short, the notation seems unnecessarily heavy for the level of
          | discussion.
        
           | 0-_-0 wrote:
           | Deep learning papers often use math in a way that obscures
           | rather than enlightens. And when you finally understand what
           | they are saying, you realize it's not interesting at all, or
           | they made a mistake in the math.
        
         | thanksok wrote:
         | Looks like a little bit of everything except the likes of
         | abstract algebra, logic, category theory.
         | 
         | These include linear algebra, graph theory, probability,
         | algorithms, mathematical analysis, topology, differential
         | geometry. But the most important prereqs are math maturity and
         | mental toughness/endurance.
        
           | SilurianWenlock wrote:
           | mental toughness/endurance haha!
        
         | keithalewis wrote:
         | Mind reading. They use terminology without defining it or
         | giving a reference.
        
           | beforeolives wrote:
           | Seriously, I'm struggling to understand things that I already
           | know.
        
           | sundarurfriend wrote:
            | Any examples? I haven't come across anything like that yet,
            | but I'm only a short way into the article.
        
             | keithalewis wrote:
             | The terms "measurable" and "tempered" for starters.
        
               | ganzuul wrote:
               | For the latter maybe this?
               | https://en.wikipedia.org/wiki/Parallel_tempering
        
         | cpp_frog wrote:
         | While I can't give the _exact_ prerequisites, I know that all
         | of the things that appear in the paper relate to:
         | 
         | (1) Linear Algebra
         | 
         | (2) Optimization Theory (Convex Analysis, non-convex
         | optimization) [0], [2]
         | 
         | (3) Probability Theory and Statistics (Measure Theory,
         | Multivariate Statistics) [1], [3], [4], [5]
         | 
         | (4) Analysis, to a lesser extent. (2) and (3) are the most
         | important.
         | 
          | I would give more references, but my background is too
          | theoretical (my field is numerical analysis of PDEs). Having
          | taken three or four college classes on each of (1)-(4), a
          | person with a similar background can recognize the tools
          | without much digging. Maybe some folks here can provide
          | insights into books that center on applications, since I'm
          | trying not to diverge into too much theory (e.g., for
          | measures, [4] instead of Folland). There also seems to be good
          | use of analysis techniques in the paper; see Theorem 2.1.
         | 
         | I love that the paper references the Moore-Penrose pseudo-
         | inverse, an object of study in both statistics and optimization
         | for which I had to give a lecture for a course.
         | 
         | [0] https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
         | _Convex Optimization_ , Boyd and Vandenberghe
         | 
         | [1] _An Introduction to Multivariate Statistical Analysis_ ,
         | Anderson
         | 
         | [2] _Convex Analysis and Monotone Operator Theory in Hilbert
         | Spaces_ , Bauschke-Combettes
         | 
         | [3] _Theory of Multivariate Statistics_ , Bilodeau-Brenner
         | 
         | [4] _The Elements of Integration and Lebesgue Measure_ , Bartle
         | 
         | [5] _Probability: Theory and Examples_ , Durrett
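As a quick illustration of the Moore-Penrose pseudo-inverse mentioned above (a generic numpy sketch, not taken from the paper): applied to an overdetermined system with no exact solution, it recovers the least-squares solution.

```python
import numpy as np

# Overdetermined system Ax = b: three equations, two unknowns,
# no exact solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# The Moore-Penrose pseudo-inverse yields the x minimizing ||Ax - b||_2,
# agreeing with a direct least-squares solve.
x_pinv = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_pinv)  # [7/6, 1/2], same as x_lstsq
```

For rank-deficient or underdetermined systems, the pseudo-inverse additionally picks the minimum-norm solution, which is why it shows up in both statistics and optimization.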
        
         | godelski wrote:
         | I skimmed it. Looks like just some basic calc and linear
         | algebra. Nothing that crazy.
        
       | pcbro141 wrote:
       | Tangent, but has anyone taken Fast.ai or similar courses and
       | transitioned into the Deep Learning/ML field without a MS/PhD? To
       | be honest, I don't even know what 'doing ML/DL' looks like in
       | practice, but I'm just curious if a lot of folks get in to the
       | field without graduate degrees.
        
         | mustafa_pasi wrote:
          | You can learn all you need to know in two or three university-
          | level courses - so we are talking less than a year of
          | coursework.
         | 
         | Fast.ai is too high level. I don't like it. You would be better
         | served taking actual university courses. A few days ago people
         | linked to LeCun's university class[1]. This is a solid
         | introduction. Does not cover everything but that is OK. Seems
         | like it is missing Bayesian approaches. Then if you want to
         | specialize in vision or speech or robotics or whatever, you
         | take special classes on that topic and learn all the SOTA
         | techniques. Then you are ready to do research already, or apply
         | your knowledge to build stuff. Of course you still have to
         | learn how to do real machine learning, which involves all the
         | data manipulation stuff, but that is learned by doing.
         | 
         | [1] https://cds.nyu.edu/deep-learning/
        
         | akgoel wrote:
         | I am in a Fintech boot camp, and it's clear that doing ML/DL
         | requires very little math, as the math is all abstracted away.
        
           | catillac wrote:
           | The problem with this view is that once one gets stuck, which
           | is very quick when one is doing the work for real, one
           | doesn't have any tools to debug anything except at the most
           | basic level and most probably doesn't understand anything
           | intuitively enough to even reason about what the underlying
           | problem could be.
           | 
           | I don't do this work myself, but we've hired many interns
           | from bootcamps to do ML, and ones from college with ML
           | projects. The bootcamp grads with no additional background
           | have almost universally hit hard walls once anything gets
           | more complex than using Keras to glue together layers. It's
           | given me the impression, anecdotally, that bootcamps are
            | largely predatory, taking one's money and providing only a
            | veneer of knowledge in the area. This doesn't seem to apply
           | to people with a CS or math background that took an ML
           | bootcamp to add that dimension to their already-mathematical
           | skillset. But people who have, again only anecdotally in my
           | experience with an n of perhaps only 20, taken a bootcamp to
           | reskill from a totally unrelated and perhaps qualitative
            | field have not had success with a bootcamp alone, but have
            | had success doing what the above poster recommended: taking
            | university courses in the area.
           | 
            | Very respectfully, if you're in a boot camp right now, you're
            | unlikely to be deep enough into the day-to-day work of ML to
            | make the assertion you're making.
        
           | maxwells-daemon wrote:
           | I think it depends! If you want to zoom out and take the
           | "systems view" using standard components, then you probably
           | don't need much math. If you want to develop new
           | architectures or algorithms, then you definitely will. The
           | well-trodden paths of ML might have most of their math
           | abstracted away, but in my experience every time you get
           | close to the frontiers, people are using math to understand
           | what's going on or develop new approaches.
        
             | hogFeast wrote:
             | It also doesn't really work if you have to tackle a new
             | problem.
             | 
             | I stopped studying maths well before university. I am not
             | some kind of math super genius. But working on my own
             | stuff, which did involve new problems, I was up the creek
             | fairly quickly without a solid mathematical understanding
             | of the techniques I was trying to use.
             | 
             | I don't think the bar is particularly high here. Solid
             | understanding of stats, ESL...but I have seen people
             | shotgunning models (I did this years ago too), and that
             | isn't going to work very long.
             | 
             | Also, I don't really understand why you wouldn't study some
             | of this stuff. Maths as taught in schools treats you like a
             | meat calculator...that isn't fun. But if you are interested
             | in ML, going through Stats, Linear Algebra...it is pretty
             | interesting because there are so many clear connections
             | with your work.
        
         | maxwells-daemon wrote:
         | Not Fast.ai, but I self-studied ML during undergrad (mostly
         | from books) and am currently working as an ML research
         | scientist.
         | 
         | That being said, I'm also thinking about starting an ML PhD
         | because it does honestly open more doors to top research
         | groups.
        
         | tmabraham wrote:
         | I took the fast.ai course and now I am doing a Ph.D. in
         | Biomedical Engineering focused on applying deep learning to
         | microscopy.
         | 
         | I don't think fast.ai is enough if you want to do theoretical
         | research in deep learning, but it certainly provides enough to
         | work on practical problems with deep learning. That said, many
         | of us in the fastai community are able to delve deep into,
         | understand, and implement recent deep learning papers and even
         | develop novel techniques. So I think with a little extra
          | studying, one could easily transition to core deep learning
          | research.
        
         | TrackerFF wrote:
         | One example I can come up with now - image classification /
         | segmentation / regression problems.
         | 
          | Unfortunately, not all data is available or provided in a
          | "friendly" format - sometimes all you get are image files and
          | the like. Maybe you want to read some value off these images,
         | count objects, or whatever - which traditionally has been done
         | by trained/skilled workers.
         | 
          | With CNNs, it _can_ be a trivial task to implement models that
          | solve the above problems. That's time and money saved for a
          | business.
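A minimal numpy sketch of the convolution operation such CNN models are built from (a toy illustration with a hand-coded edge-detecting kernel, not a full pipeline):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: the core op of a CNN layer."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: left half dark, right half bright.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Sobel-style kernel that responds to vertical edges.
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

response = conv2d(image, sobel_x)
print(response)  # strongest response at the dark/bright boundary
```

In a trained CNN the kernels are learned from data rather than hand-coded, but the mechanics are exactly this sliding weighted sum, which is what makes the models so well suited to pixel-grid inputs.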
        
       ___________________________________________________________________
       (page generated 2021-06-12 23:00 UTC)