[HN Gopher] Mathematical Introduction to Deep Learning: Methods,...
       ___________________________________________________________________
        
       Mathematical Introduction to Deep Learning: Methods,
       Implementations, and Theory
        
       Author : Anon84
       Score  : 180 points
       Date   : 2024-01-01 18:46 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | dachworker wrote:
       | Is anyone using any of this math? My guess is no. At best it
       | provides "moral support" for deep learning researchers who want
       | to feel reassured that what they are attempting to do is not
       | impossible.
       | 
       | Glad to be proven wrong, though.
        
         | nerdponx wrote:
         | Describing it as "moral support" really sells it short.
         | 
         | Imagine computer science without sorting algorithms, search
         | algorithms, etc that have been proven correct and have known
         | proven properties. This math serves the same purpose as CS
         | theory.
         | 
         | So yes, if you're just fitting a model from a library like
         | Keras, you're not really "using" the math. If you're working
         | with data sets below a certain size, problems below a certain
         | level of complexity, and models that have been deployed for
         | many years and have well studied properties, you can do a lot
         | with only a cursory understanding of the math, much like you
         | can write perfectly functional web apps in Python or Java
         | without really understanding how the language runtime works at
         | a deep level.
         | 
         | But if you don't actually know how it works, you're going to
         | get stuck pretty badly if you encounter a situation that isn't
         | already baked into a library.
         | 
         | If you want to see what happens when you don't know the
         | underlying math, look at the current generation of "data
         | science" graduates, who don't know their math or statistics
         | fundamentals. There are plenty of issues on the hiring side of
         | course, but ultimately the reason those kids aren't getting
         | jobs is that they don't actually know what they're doing,
         | because they were never forced to learn this stuff.
        
         | nephanth wrote:
         | According to the abstract it covers different ANN
         | architectures, optimization algorithms, probably
          | backpropagation... so, um, yes? That is stuff anyone in
          | machine learning uses every day?
        
         | danielmarkbruce wrote:
         | Some people like to think and communicate in dense math
         | notation. So, yes.
        
         | godelski wrote:
         | There's something I tell my students. You don't need math to
         | make good models, but you do need to know math to know why your
         | models are wrong.
         | 
         | So yes, math is needed. If you don't have math you're going to
         | hoodwink yourself into thinking you can get to AGI by scale
         | alone. You'll just use transformers everywhere because that's
         | what everyone else does and you'll get confused between
          | activation functions. You'll make models, and models that
          | work, but there's a big difference between having working
          | models and knowing where to expect your models to fail and
          | understanding their limitations.
         | 
         | I feel a lot of people just look at test set results and expect
         | that to mean that the model isn't overfitting. (not to mention
         | tuning hps based on test set results)
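          | 
          | The split discipline above can be sketched with a toy numpy
          | example (a hypothetical polynomial-degree search; the names
          | are mine, and the only point is that the test split is
          | touched exactly once, after tuning):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 90)
y = np.sin(3 * x) + rng.normal(0, 0.1, 90)

# three disjoint splits: fit on train, tune on val, report on test
xtr, xval, xte = x[:30], x[30:60], x[60:]
ytr, yval, yte = y[:30], y[30:60], y[60:]

def mse(deg, xs, ys):
    coef = np.polyfit(xtr, ytr, deg)        # always fit on train only
    return np.mean((np.polyval(coef, xs) - ys) ** 2)

# the hyperparameter (polynomial degree) is chosen on the validation split...
best = min(range(1, 10), key=lambda d: mse(d, xval, yval))
# ...and the test split is evaluated exactly once, at the very end
test_err = mse(best, xte, yte)
```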
        
           | light_hue_1 wrote:
           | Oh sure. I say the same to my students.
           | 
            | But the particular spin on this book makes it look to
            | non-experts like this is the math you need to do something
            | useful with deep learning. And that's just not true.
           | 
           | Certainly you need to understand what you're optimizing, how
           | your optimizer works, what your objective function is doing,
           | etc. But the vast majority of people don't need to know about
           | theoretical approximation results for problems that they will
           | never actually encounter in real life, etc. For example, I
            | have never used anything like "6.1.3 Lyapunov-type
           | stability for GD optimization" in a decade of ML research.
           | I'm sure people do! But not on the kinds of problems I work
           | on.
           | 
            | Just look at the comments here. People are complaining
            | about the lack of context, but this is fine for the
            | audience the book is aimed at. That audience just isn't
            | the average HN reader.
           | 
           | I think it would be better if the authors chose a different
           | title. As it stands, non-experts will be attracted and then
           | be put off, and experts will think the book is likely to be
           | too generic.
        
             | godelski wrote:
              | Yeah, I would have a very hard time recommending this
              | book too. It is absurdly math heavy. I'm not sure I've
              | ever seen another book this math dense before, and I've
              | read some pretty dense review-oriented books. So I'm not
              | even sure what audience this book is aimed at.
              | Citations? And I fully agree that the title doesn't fit
              | whoever that audience is.
        
           | joe_the_user wrote:
            | _If you don't have math you're going to hoodwink yourself
            | into thinking you can get to AGI by scale alone._
           | 
            | There are very smart people who think we can get to AGI by
            | scale alone - they call that "the scaling hypothesis", in
            | fact. I think they're wrong, but I thought they knew a
            | fair amount of math.
           | 
            | What math would you use to describe the limitations of
            | deep learning? My impression is there aren't any exact
            | theorems that describe either its limits or its
            | behavior/possibilities; there are just suggestive theorems
            | and constructions combined with heuristics.
        
         | fastneutron wrote:
         | In the latter part of the book that covers PINNs and other PDE
         | methods, it helps to frame these using the same kind of
         | functional analysis that is used to develop more traditional
         | numerical methods. In this case, it provides a way for
         | practitioners to verify the physical consistency between the
         | various methods.
        
       | reqo wrote:
       | Is it common to publish books directly to ArXiv, especially books
       | that have just been released?
        
         | godelski wrote:
          | It's not too uncommon to see books available online from an
          | official location, at least for math and CS textbooks.
        
           | nerdponx wrote:
           | Normally I just see it on the author's website.
        
       | godelski wrote:
        | First time I've seen one of these books where I wished there
        | were more words and less math. Usually it is quite the
        | opposite. But this book seems written as if they wanted to
        | avoid natural language at all costs.
        
       | axpy906 wrote:
       | This is in Tensorflow. Would rather see a numpy version or
       | something along those lines so that students can better
       | understand what each step looks like in code.
       | 
       | I concur on the comments noting lack of explanation for the
       | notation/lemmas/proof.
        
         | godelski wrote:
          | I second this. Numpy would be the way to go, so students can
          | switch to JAX or PyTorch trivially. Or they could use a mix:
          | start with numpy, build the layer from scratch, then hand
          | over to the abstraction. Pyro would be really good for this
          | too.
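          | 
          | A minimal sketch of that numpy-first progression (class and
          | variable names are mine, not from the book or any library):
          | a dense layer written from scratch, with its backward pass
          | sanity-checked against finite differences before handing
          | over to a framework's autograd.

```python
import numpy as np

class Dense:
    """A fully connected layer from scratch: y = W x + b."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(0, 1 / np.sqrt(n_in), (n_out, n_in))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                            # cache the input for backprop
        return self.W @ x + self.b

    def backward(self, grad_out):
        self.gW = np.outer(grad_out, self.x)  # dL/dW
        self.gb = grad_out                    # dL/db
        return self.W.T @ grad_out            # dL/dx, passed to the layer below

rng = np.random.default_rng(0)
layer = Dense(3, 2, rng)
x = rng.normal(size=3)
y = layer.forward(x)
gx = layer.backward(np.ones(2))  # gradient of sum(y) with respect to x

# sanity check against a central finite-difference estimate
eps = 1e-6
fd = np.array([(layer.forward(x + eps * e).sum()
                - layer.forward(x - eps * e).sum()) / (2 * eps)
               for e in np.eye(3)])
```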
        
         | _giorgio_ wrote:
         | Tensorflow? LOL what is this, the year 2010?
        
         | CamperBob2 wrote:
         | Most of the examples I saw used Pytorch. (Which is still a step
         | or two removed from the actual machinery, of course.)
        
       | HybridCurve wrote:
       | As someone who has a deeper knowledge of programming rather than
       | math, I find the mathematical notation here to be harder to
       | understand than the code (even in a programming language I do not
       | know).
       | 
        | Does anyone with a stronger mathematical background here find
        | the math as written easier to understand than the source code?
        
         | joshuanapoli wrote:
         | Mathematical notation usually has a problem with preferring
         | single-letter names. We usually prefer to avoid highly
         | abbreviated identifier names in software, because they make the
          | program harder to read. But they're common in math, and I
          | think it makes for a lot of work to jump back and forth to
          | remind oneself what each symbol means when trying to make
          | sense of a statement.
        
         | aabajian wrote:
         | All three authors are PhDs or PhD-candidates in mathematics.
         | The notation is extremely dense. I'm curious who their target
         | audience of "students and scientists" are for this book.
        
           | angra_mainyu wrote:
           | I had a bunch of classes in undergrad (physics) that had
           | basically the same notation and style.
        
         | layer8 wrote:
         | Mathematical notation is more concise, which may take some
         | getting used to. One reason is that it is optimized for
         | handwriting. Handwriting program code would be very tedious, so
         | you can see why mathematical notation is the way it is.
         | 
         | Apart from that, there is no "the code" equivalent.
         | Mathematical notation is for stating mathematical facts or
         | propositions. That's different from the purpose of the code you
         | would write to implement deep-learning algorithms.
        
         | conformist wrote:
         | Yes, it's easier for mathematicians, because a lot of
         | background knowledge and intuition is encoded in mathematical
         | conventions (eg "C(R)" for continuous functions on the reals
         | etc...). Note that this is probably a book for mathematicians.
        
         | strangedejavu2 wrote:
         | It's not too difficult to understand, but this introduction
         | isn't written with pedagogy in mind IMO
        
         | andrepd wrote:
         | Obligatory hn comment on any math-related topic: "notation bad"
         | 
         | Please be more original.
        
         | outrun86 wrote:
         | I'm just wrapping up a PhD in ML. The notation here is
         | unnecessarily complex IMO. Notation can make things easier, or
         | it can make things more difficult, depending on a number of
         | factors.
        
           | angra_mainyu wrote:
           | Really? Coming from physics (B.Sc only) the notation is
           | refreshingly familiar and straightforward. My topology and
           | analysis classes were basically like this.
           | 
           | In fact, this pdf is literally the resource I've been
           | searching for as many others are far too ambiguous and
           | handwavey focusing more on libraries and APIs than what's
           | going on behind the scenes.
           | 
           | If only there were a similar one for microeconomics and
           | macroeconomics, I'd have my curiosity satiated.
        
             | youainti wrote:
              | As a PhD econ student, the mathematics just comes down
              | to solving constrained optimization problems. Figuring
              | out what to consider as an optimand and the associated
              | constraints is the real kicker.
        
               | tnecniv wrote:
               | It depends on what you're doing. That is accurate for,
               | say, describing the training of a neural network, but if
               | you want to prove something about generalization, for
               | example (which the book at least touches on from my
                | skimming), you'll need other techniques as well.
        
         | ceh123 wrote:
         | As someone that's in the later stages of a PhD in math, given
         | the title starts with "Mathematical Introduction...", the
         | notation feels pretty reasonable for someone with a background
         | in math.
         | 
         | Sure I might want some slight changes to the notation I found
         | skimming through on my phone, but everything they define and
         | the notation they choose feels pretty familiar and I understand
         | why they did what they did.
         | 
         | Mirroring what someone else said, this is exactly the kind of
         | intro I've been looking for for deep learning.
        
         | WhitneyLand wrote:
          | Use ChatGPT.
         | 
         | Screenshot the math, crop it down to the equation, paste into
         | the chat window.
         | 
         | It can explain everything about it, what each symbol means, and
         | how it applies to the subject.
         | 
         | It's an amazing accelerator for learning math. There's no more
         | getting stuck.
         | 
          | I think it's underrated because people hear "LLMs aren't
          | good at math". They are not good at certain kinds of problem
          | solving (yet), but GPT-4 is a fantastic conversational
          | tutor.
        
         | tnecniv wrote:
          | So this is a book written by applied mathematicians for
          | applied mathematicians (they state in the preface it's for
          | scientists, but some theoretical scientists and engineers
          | are essentially applied mathematicians). As a result, both
          | the topics and the presentation are biased towards those
          | types of people. For example, I've never seen a practitioner
          | worry about the existence and uniqueness conditions for
          | their gradient-based optimization algorithm in deep
          | learning. However, that's the kind of result
         | those people do care about and academic papers are written on
         | the topic. The title does say that this is a book on the
         | theoretical underpinnings of the subject, so I am not surprised
         | that it is written this way. People also don't necessarily read
         | these books cover-to-cover, but drill into the few chapters
         | that use techniques relevant to what they themselves are
         | researching. There was a similarly verbose monograph I used to
         | use in my research, but only about 20-30 pages had the meat I
         | was interested in.
         | 
         | This kind of book is more verbose than my liking both in terms
         | of rigor and content. For example, they include Gronwall's
         | inequality as a lemma and prove it. The version that they use
         | is a bit more general than the one I normally see, but
         | Gronwall's inequality is a very standard tool in analyzing ODEs
         | and I have rigorous control theory books that state it without
         | proof to avoid clutter (they do provide a reference to a
         | proof). A lot of this verbosity comes about when your standard
         | of proof is high and the assumptions you make are small.
        
         | spi wrote:
         | Sharing my experience here. My background is in math (Ph.D. and
         | a couple of postdoc years) before switching to practitioner in
         | deep learning. This year I taught a class at university (as
         | invited prof) in deep learning for students doing a masters in
         | math and statistics (but with some programming knowledge, too).
         | 
          | I tried to present concepts in as mathematically accurate a
          | way as reasonably possible, and in the end I cut out a lot
          | of math, in part to avoid the heavy notation which seems to
          | be present in this book (and in part to make sure students
          | could apply what they learnt in industry). My actual classes
          | had way more code than formulas.
         | 
         | If you want to write everything very accurately, things get
         | messy, quickly. Finding a good notation for new concepts in
         | math is very hard, something that gets sometimes done by bright
         | minds only, even though afterwards everybody recognizes it was
         | "clear" (think about Einstein notation, Feynman diagrams, etc.,
         | or even just matrix notation, which Gauss was unaware of). If
         | you just take domain A and write in notations from domain B,
         | it's hard to get something useful (translating quantum
         | mechanics to math with C* algebras and co. was a big endeavour,
         | still an open research field to some extent).
         | 
          | So I'll disagree with some of the comments below and claim
          | that the effort of writing down this book was huge but
          | probably scarcely useful. Those who can comfortably read
          | these equations probably won't need them (if you know what
          | an affine transformation is, you hardly need to see all its
          | ijkl indices written out explicitly for a 4-dimensional
          | tensor), and the others will just be scared off. There might
          | be a middle ground where it helps some, but at least I
          | haven't encountered such people...
        
       | HighFreqAsuka wrote:
       | I've seen quite a few of these books attempting to explain deep
       | learning from a mathematical perspective and it always surprises
       | me. Deep learning is clearly an empirical science for the time
       | being, and very little theoretical work that has been so
       | impactful that I would think to include it in a book. Of the such
       | books I've seen, this one seems like actively the worst one. A
       | significant amount of space is dedicated to proving lemmas that
       | provide no additional understanding and are only loosely related
       | to deep learning. And a significant chunk of the code I see is
       | just the plotting code, which I don't even understand why you'd
       | include. I'm confident that very few people will ever read
       | significant chunks of this.
       | 
       | I think the best textbooks are still Deep Learning by Goodfellow
       | etal and the more modern Understanding Deep Learning
       | (https://udlbook.github.io/udlbook/).
        
         | blauditore wrote:
          | I think the mathematical background starts making sense once
          | you get a good understanding of the topic, and then people
          | make the wrong assumption that understanding the math will
          | help with learning the overall topic, but that's usually
          | pretty hard.
          | 
          | Rather than trying to form an intuition based on the theory,
          | it's often easier to understand the technicalities after
          | getting an intuition. This is generally true in the exact
          | sciences, especially mathematics. That's why examples are
          | helpful.
        
         | danielmarkbruce wrote:
         | UDL has some dense math notation in it.
         | 
         | Math isn't just about proofs. It's a way to communicate. There
         | are several different ways to communicate how a neural net
         | functions. One is with pictures. One is with some code. One is
         | with words. One is with some quite dense math notation.
        
           | HighFreqAsuka wrote:
           | I agree with that, I think UDL uses the necessary amount of
           | math to communicate the ideas correctly. That is obviously a
           | good thing. What it does not do is pretend to be presenting a
           | mathematical theory of deep learning. Basically UDL is
           | exactly how I think current textbooks should be presented.
        
           | n3ur0n wrote:
           | I would say UDL should be very accessible to any undergrad
           | from a strong program.
           | 
            | I would not call the notation 'dense'; rather, it's
            | 'abused' notation. Once you have seen the abused notation
            | enough times, it just makes sense. Aka "mathematical
            | maturity" in the ML space.
           | 
            | My views on this have changed: as a first-year PhD in ML I
            | got annoyed by the shorthand. Now, as someone with a PhD,
            | I get it -- it's just too cumbersome to write out exactly
            | what you mean, and you write like you're writing for peers
            | +/- a level.
        
         | thehappyfellow wrote:
          | This book is not aimed at practitioners, but I don't think
          | that means it deserves to be called "actively the worst
          | one".
         | 
         | Even though the frontier of deep learning is very much
         | empirical, there's interesting work trying to understand why
         | the techniques work, not only which ones do.
         | 
          | I'm sorry, but saying proofs are not a good method for
          | gaining understanding is ridiculous. Of course it's not
          | great for everyone, but a book titled "Mathematical
          | Introduction to X" is obviously for people with some
          | mathematical training. For that kind of audience, lemmas and
          | their proofs are a natural way of building understanding.
        
           | HighFreqAsuka wrote:
           | Just read the section on ResNets (Section 1.5) and tell me if
           | you think that's the best way to explain ResNets to literally
           | anyone. Tell me if, from that description, you take away that
           | the reason skip connections improve performance is that they
           | improve gradient flow in very deep networks.
        
             | p1esk wrote:
             | _the reason skip connections improve performance is that
             | they improve gradient flow in very deep networks._
             | 
             | Can you prove this statement?
        
               | HighFreqAsuka wrote:
                | Empirically, yes: I can consider a very deep fully-
                | connected network, measure the gradients in each layer
                | with and without skip connections, and compare. I can
                | do this across multiple seeds and run a statistical
                | test on the deltas.
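                | 
                | That experiment can be sketched in a few lines of
                | numpy (a toy tanh network of my own construction, not
                | any model from the book): without the skip
                | connections, the first layer's gradient all but
                | vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 30, 16
Ws = [rng.normal(0, 0.3 / np.sqrt(width), (width, width)) for _ in range(depth)]
x = rng.normal(size=width)

def first_layer_grad_norm(skip):
    # forward pass, caching each layer's input
    h, inputs = x, []
    for W in Ws:
        inputs.append(h)
        z = np.tanh(W @ h)
        h = h + z if skip else z
    # backward pass for the loss L = 0.5 * ||h||^2
    g = h  # dL/dh at the output
    for W, h_in in zip(reversed(Ws), reversed(inputs)):
        gz = g * (1 - np.tanh(W @ h_in) ** 2)  # back through the tanh
        gW = np.outer(gz, h_in)                # dL/dW for this layer
        g = W.T @ gz + (g if skip else 0)      # skip adds an identity path
    return np.linalg.norm(gW)                  # gW now holds the first layer's grad

plain = first_layer_grad_norm(skip=False)
res = first_layer_grad_norm(skip=True)
```

Comparing `plain` and `res` shows the gradient reaching the first layer is orders of magnitude larger when the identity path is present.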
        
       | ottaborra wrote:
       | This makes me wonder. Is deep learning as a field an empirical
       | science purely because everyone is afraid of the math? It has the
        | richness of modern-day physics, but for some reason most of
        | the practitioners seem to want to keep thinking of it as the
        | wild west.
        
         | HighFreqAsuka wrote:
         | No, there are many very mathematically inclined deep learning
         | researchers. It's an empirical science because the mathematical
         | tools we possess are not sufficient to describe the phenomena
         | we observe and make predictions under one unified theory. Being
         | an empirical science does not mean that the field is a "wild
            | west". Deep learning models are amenable to repeatable,
            | controlled experiments, from which you can improve your
         | understanding of what will happen in most cases. Good
         | practitioners know this.
        
           | ottaborra wrote:
           | The main point you're making is fair
           | 
           | The only gripe I have is > Being an empirical science does
           | not mean that the field is a "wild west"
           | 
           | I think what you meant to say is: "Being an empirical science
           | does not <b>necessarily</b> mean that the field is a \"wild
           | west\""
           | 
           | you clearly haven't seen the social sciences
           | 
           | > Good practitioners know this
           | 
           | sure?
           | 
           | Edit: Removed unnecessary portions that wouldn't have
           | continued the conversation in any meaningful way
        
           | trhway wrote:
           | >It's an empirical science because the mathematical tools we
           | possess are not sufficient to describe the phenomena we
           | observe and make predictions under one unified theory.
           | 
            | To me, deep learning is actually itself a tool (one with
            | well established, and quite simple, math underneath -
            | gradient-based optimization, vector space representation
            | and compression) for making good progress toward the
            | mathematical foundations of the empirical science of
            | cognition.
            | 
            | In the '90s there were works showing, for example, that
            | the Gabors in the first layer of the biological visual
            | cortex are optimal for the kind of feature-based image
            | recognition that we have. And as it happens, in visual NNs
            | the convolution kernels in the first layers also converge
            | to Gabor-like filters. I see [signs of] similar
            | convergence in the other layers (and all those
            | semantically meaningful vector operations in the embedding
            | space of LLMs are also very telling). Proving optimality
            | or similar is much harder there, yet to me those
            | "repeatable controlled experiments" (i.e. stable
            | convergence) provide a strong indication that it will be
            | the case: something does drive that convergence, and when
            | there is such a drive in a dynamic system, you naturally
            | end up asymptotically ("attracted") near something either
            | fixed or periodic. That would be a (or even "the")
            | mathematical foundation for an understanding of cognition.
            | (Divergence from real biological cognition, i.e. the
            | emergence of a completely different yet comparable type of
            | cognition, would also be a great result, if not an even
            | greater one.)
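            | 
            | For reference, the Gabor kernels mentioned above are just
            | a Gaussian envelope multiplied by an oriented sinusoid; a
            | minimal numpy construction (the parameter choices here are
            | purely illustrative):

```python
import numpy as np

def gabor_kernel(size=9, sigma=2.0, theta=0.0, wavelength=4.0):
    """2D Gabor filter: Gaussian envelope times an oriented cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates into
    yr = -x * np.sin(theta) + y * np.cos(theta)   # the filter's orientation
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

k = gabor_kernel()
```

Rendering `k` for a few values of `theta` gives the oriented edge detectors that trained first-layer convolution kernels tend to resemble.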
        
         | tnecniv wrote:
         | A little bit of A and B. You can do a lot with very little math
         | beyond linear algebra, calculus, and undergraduate probability,
         | and that knowledge is mainly there to provide intuition and
          | formalize the problem that you're solving a bit. You can
          | also churn out results (including very impressive ones)
          | without doing any math.
         | 
         | A result of the above is that people are empirically
         | demonstrating new problems and solving them very quickly --
         | much more quickly than people can come up with theoretical
         | results explaining why they work. The theory is harder to come
          | by for a few reasons, but many of the successful examples of
          | deep learning don't fit nicely into older frameworks from,
          | e.g., statistics and optimal control, that would explain
          | them well.
        
       | runsWphotons wrote:
       | I like this book and everyone complaining about the math and math
       | notation is a silly goose.
        
       ___________________________________________________________________
       (page generated 2024-01-01 23:00 UTC)