[HN Gopher] Learning Theory from First Principles [pdf]
       ___________________________________________________________________
        
       Learning Theory from First Principles [pdf]
        
       Author : magnio
       Score  : 109 points
       Date   : 2024-03-02 18:10 UTC (4 hours ago)
        
 (HTM) web link (www.di.ens.fr)
 (TXT) w3m dump (www.di.ens.fr)
        
       | scythmic_waves wrote:
       | Interesting! I'll have to look over it when I have more time.
       | 
       | From a quick glance, it looks like it covers much of the same
       | material as this text [1]. I wonder how they compare.
       | 
       | [1]: https://www.cambridge.org/core/books/understanding-
       | machine-l...
        
         | throwaway81523 wrote:
         | A 2014 book on machine learning sounds quaint and historical.
        
       | p1esk wrote:
        | I can't wait until I tell GPT-5 "I have this idea I want to try;
        | read this book and tell me if there's anything relevant in there
        | to make it work better".
        
         | ralphist wrote:
         | I've never heard of an LLM making up a new idea. Shouldn't this
         | only work if your thing has already been tried before?
        
           | p1esk wrote:
           | In my example I already have an idea, but I'm not sure how
           | good or novel it is, or perhaps I tried it and it didn't work
           | but I don't understand why.
        
           | westoncb wrote:
            | How people define "new" varies a lot in this context. I've
            | spent a lot of time talking with ChatGPT, exploring
            | interdisciplinary ideas, and while I think it frequently says
            | things that qualify as new, I've only run into one situation
            | where I was trying to find some way of doing something
            | technical and it invented something non-trivially original to
            | handle it:
            | https://x.com/Westoncb/status/1763733064478335326?s=20
        
         | hef19898 wrote:
          | Your idea would become even better if _you_ read the book
          | _yourself_... You might even learn something new along the way.
        
           | dkarras wrote:
            | We are optimizing for time here, not learning. I'm gonna die
            | anyway and anything I learn will be dust. If I just need the
            | info to make something work, spending months (of which I have
            | a limited number) to see if something has anything useful in
            | it for me vs. spending an afternoon probing it to get most of
            | the benefits is a no-brainer.
        
             | makapuf wrote:
              | Maybe, but don't expect otherwise from a book explicitly
              | named "Learning Theory from First Principles", not "Learn
              | Large Language Models in 21 Days".
        
         | erbdex wrote:
          | I am realising that passing context for $this is the tricky
          | part:
          | 
          | 1. It is very difficult for me to describe my context as a
          | user in a handful of low-dimensional variables.
          | 
          | 2. I do not understand my own situation in the universe well
          | enough to be able to explain it to an AI.
          | 
          | 3. I don't have a shared vocabulary with AI. The internet, I
          | feel, aced this with a shared HTTP protocol for consistently
          | exchanging agreed-upon state. For example, within Uber I am a
          | very narrow request-response universe: POST phone, car,
          | gps(a,b,c,d), now, payment.
          | 
          | But as a student wanting to learn algorithms, how do I convey
          | that I'm $age and $internet-type, from $place, prefer
          | graphical explanations of algorithms, have tried but been
          | scared off by that thick book, have these $milestones (CS50),
          | and know $python up to $proficiency (itself a fractal
          | variable, with research papers on how to define it for
          | learning)?
          | 
          | Similarly, how do I help you understand what stage my startup
          | idea is at beyond "low traction"? I want to say: I have
          | $networks (VC, devs, sales) APIs, and $these successful
          | partnerships with evidence such as $attendance and $sales.
          | Who should I speak to? Could you please write the necessary
          | emails and engage in partnerships with other bots under
          | $budget?
          | 
          | Even in the real world this vocabulary exists only in small
          | pockets, because our contexts are too different.
          | 
          | 4. Learning assumes knowledge exists as a global, forever
          | variable in a universe wider than we understand. A $meteor is
          | a non-maskable interrupt to the power supply, at unicorn
          | temperatures, within a decade. Similarly, there are one-time
          | trends in disposable $companies that $ecosystem uses to
          | learn. If I'm in a desert village with no electricity, those
          | machines may never reach me, and most people in the world may
          | not have even a basic phone with which to share state. Their
          | local power-mafia politics and absent governance might mean
          | the PDF the AI recommends I read may or may not help.
          | 
          | I don't know how this will evolve, but thinking about the
          | possibilities has been so interesting. It's as if computers
          | can talk to us easily now and they're such smart babies on
          | day 1, and "folks, we aren't able to put the right, enough,
          | cheap data in" is perhaps the real bottleneck to how much
          | usefulness we are able to uncover.
        
       | canjobear wrote:
       | Have they figured out what causes double descent yet?
        
         | a_wild_dandan wrote:
          | No. We don't know. My favorite hypothesis: SGD is... well,
          | stochastic. Meaning you're not optimizing w.r.t. the training
          | corpus but a tiny subset of it, so your gradient isn't _quite_
          | right. Over-training lets you bulldoze over local optima and
          | converge toward the true distribution rather than drive around
          | a local over-fitting basin.
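          | 
          | A quick numpy sketch of the minibatch-noise part of that (toy
          | data and sizes are my own choices, and it only illustrates
          | the noisy gradient, not the double-descent claim itself):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     n, d, batch = 10_000, 20, 32
          |     X = rng.normal(size=(n, d))
          |     w_true = rng.normal(size=d)
          |     y = X @ w_true + 0.5 * rng.normal(size=n)
          |     w = np.zeros(d)
          | 
          |     def grad(Xb, yb, w):
          |         # gradient of 0.5 * mean((Xb @ w - yb)**2)
          |         return Xb.T @ (Xb @ w - yb) / len(yb)
          | 
          |     # gradient w.r.t. the whole training corpus...
          |     full = grad(X, y, w)
          |     # ...vs. w.r.t. a tiny random subset of it
          |     idx = rng.choice(n, size=batch, replace=False)
          |     mini = grad(X[idx], y[idx], w)
          | 
          |     cos = full @ mini / (np.linalg.norm(full)
          |                          * np.linalg.norm(mini))
          |     print(f"cosine(full, minibatch) = {cos:.3f}")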
        
           | canjobear wrote:
           | You can get it with full gradient descent though...
           | https://www.nature.com/articles/s41467-020-14663-9
           | 
           | Honestly the fact that there doesn't seem to be a good
           | explanation for this makes me think that we just
           | fundamentally don't understand learning.
        
         | arolihas wrote:
         | There was actually a very recent blog post claiming that
         | statistical mechanics can explain double descent
         | https://calculatedcontent.com/2024/03/01/describing-double-d...
         | 
         | Some more detail here:
         | https://calculatedcontent.com/2019/12/03/towards-a-new-theor...
        
         | iaseiadit wrote:
         | Not an expert, but this paper explores double descent with
         | simple models. The interpretation there: when you extend into
         | the overparameterized regime, that permits optimization towards
         | small-norm weights, which generalize well again. Does that
         | explain DD generally? Does it apply to other models (e.g.
         | DNNs)?
         | 
         | https://arxiv.org/pdf/2303.14151.pdf
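          | 
          | A rough numpy sketch of that interpretation (the random ReLU
          | features, sizes, and noise level are my own choices, not the
          | paper's setup): fit with a growing random-feature dictionary
          | and take the minimum-norm least-squares solution via the
          | pseudoinverse. Test error typically spikes near p == n_train
          | and falls again past it, while the weight norm shrinks.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     n_train, n_test, d = 40, 1000, 10
          |     w_star = rng.normal(size=d)
          |     X_tr = rng.normal(size=(n_train, d))
          |     y_tr = X_tr @ w_star + 0.1 * rng.normal(size=n_train)
          |     X_te = rng.normal(size=(n_test, d))
          |     y_te = X_te @ w_star
          | 
          |     def relu_features(X, V):
          |         # random projections followed by a ReLU
          |         return np.maximum(X @ V, 0.0)
          | 
          |     for p in [5, 10, 20, 35, 40, 45, 60, 100, 200, 400]:
          |         V = rng.normal(size=(d, p)) / np.sqrt(d)
          |         F_tr = relu_features(X_tr, V)
          |         F_te = relu_features(X_te, V)
          |         # minimum-norm least-squares fit
          |         w = np.linalg.pinv(F_tr) @ y_tr
          |         mse = np.mean((F_te @ w - y_te) ** 2)
          |         print(f"p={p:4d}  |w|={np.linalg.norm(w):8.2f}"
          |               f"  test mse={mse:9.3f}")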
        
         | tel wrote:
         | I don't know if it's a generalized result, but the Circuits
         | team at Anthropic has a very compelling thesis: the first phase
         | of descent corresponds to the model memorizing data points, the
         | second phase corresponds to it shifting geometrically toward
         | learning "features".
         | 
         | Here a "feature" might be seen as an abstract, very, very high
         | dimensional vector space. The team is pretty deep in
         | investigating the idea of superposition, where individual
         | neurons encode for multiple concepts. They experiment with a
         | toy model and toy data set where the latent features are
         | represented explicitly and then compressed into a small set of
         | data dimensions. This forces superposition. Then they show how
         | that superposition looks under varying sizes of training data.
         | 
         | It's obviously a toy model, but it's a compelling idea. At
         | least for any model which might suffer from superposition.
         | 
         | https://transformer-circuits.pub/2023/toy-double-descent/ind...
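          | 
          | For the curious, a rough sketch of that kind of toy setup
          | (not the team's actual code; the sparsity, sizes, and tied
          | weights are assumptions on my part): many sparse latent
          | features are squeezed through a smaller hidden layer and
          | reconstructed, which forces superposition.
          | 
          |     import torch
          | 
          |     n_feat, n_hidden, batch, steps = 20, 5, 1024, 5000
          |     sparsity = 0.05  # P(a given feature is active)
          | 
          |     W = torch.nn.Parameter(torch.randn(n_hidden, n_feat) * 0.1)
          |     b = torch.nn.Parameter(torch.zeros(n_feat))
          |     opt = torch.optim.Adam([W, b], lr=1e-3)
          | 
          |     for _ in range(steps):
          |         # sparse features with values in [0, 1]
          |         mask = (torch.rand(batch, n_feat) < sparsity).float()
          |         x = mask * torch.rand(batch, n_feat)
          |         h = x @ W.T                    # compress
          |         x_hat = torch.relu(h @ W + b)  # reconstruct
          |         loss = ((x - x_hat) ** 2).mean()
          |         opt.zero_grad()
          |         loss.backward()
          |         opt.step()
          | 
          |     # nearly (anti-)parallel columns of W are features that
          |     # share a direction, i.e. superposition
          |     print((W.T @ W).detach())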
        
       | da39a3ee wrote:
       | There are so many great mathematical PDFs available for free on
       | the Internet, written by academics/educators/engineers. A problem
       | is that there is a huge amount of overlap. I wonder if an AI
       | model could be developed that would do a really good job of
       | synthesizing an overlapping collection into a coherent single PDF
       | without duplication.
        
         | hef19898 wrote:
         | One could also just pick the books used in the corresponding
         | university courses.
        
       | sampo wrote:
        | > 2.5 No free lunch theorem
        | >
        | > Although it may be tempting to define the optimal learning
        | > algorithm that works optimally for all distributions, this is
        | > impossible. In other words, learning is only possible with
        | > assumptions.
       | 
        | A mention of the no free lunch theorem should come with a
        | disclaimer that the theorem is not relevant in practice. The
        | assumption that your data originates from the real world is
        | enough for the no free lunch theorem not to be a hindrance.
        | 
        | This book doesn't discuss this at all. Maybe mention that "all
        | distributions" means something like all possible bit sequences
        | generated by tossing a coin, generalized to higher-dimensional
        | spaces of discontinuous functions (of which the continuous
        | functions are a tiny subset). So basically, if your data is
        | generated from a uniform random distribution over "all
        | possibilities", you cannot learn to predict the outcome of the
        | next coin tosses, or similar.
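        | 
        | A tiny numpy illustration of that last point (my own toy
        | example, not the book's): if the labels really are
        | independent fair coin flips, a memorizing learner (here
        | 1-nearest-neighbour) fits the training set but predicts
        | unseen points at chance level.
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     n_tr, n_te, d = 500, 2000, 20
        |     X_tr = rng.integers(0, 2, size=(n_tr, d))
        |     y_tr = rng.integers(0, 2, size=n_tr)   # coin tosses
        |     X_te = rng.integers(0, 2, size=(n_te, d))
        |     y_te = rng.integers(0, 2, size=n_te)
        | 
        |     def predict(x):
        |         # label of the nearest training point (Hamming)
        |         i = np.argmin(np.abs(X_tr - x).sum(axis=1))
        |         return y_tr[i]
        | 
        |     acc = np.mean([predict(x) == y
        |                    for x, y in zip(X_te, y_te)])
        |     print(f"test accuracy: {acc:.3f}")   # ~0.5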
        
         | nextos wrote:
          | Yes, most no-free-lunch theorems and results of this kind,
          | which rest on overly general assumptions, tend to be too
          | pessimistic.
          | 
          | For example, many people naively think that static program
          | analysis is infeasible due to the halting problem, Rice's
          | theorem, etc.
        
       | throwaway81523 wrote:
       | This is pretty hard to read. For example, on the first page of
       | chapter 1, it talks about "minimization of quadratic forms" and
       | shows what looks like the formula for linear least squares. Is
       | that right? It doesn't say anything about this. Some more
       | exposition would help.
       | 
       | I do like that there are lots of exercises.
        
         | guimplen wrote:
          | I think the text is geared towards people with some
          | mathematical background who want to understand learning theory.
          | Besides, it is clearly stated that this chapter is a review (so
          | it's assumed that you learned or will learn these things
          | elsewhere).
        
           | throwaway81523 wrote:
            | Well, I have some math background, but that section manages
            | to be brisk and slow at the same time, as it were. For
            | example, it explains how to find inverses of 2x2 matrices.
           | 
           | This is older but is supposed to be good:
           | https://www.deeplearningbook.org/
        
         | nerdponx wrote:
          | The sibling comment is right in that this is clearly not
          | intended for first-timers.
          | 
          | But your instincts are correct here. When you write out the
          | objective function for ordinary least squares, it turns out to
          | be a quadratic form. The choice of the word "quadratic" is not
          | a coincidence: a quadratic form is the generalization of a
          | quadratic function of one variable to several variables, with
          | a matrix in place of the squared coefficient.
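          | 
          | Concretely (my notation, not the book's): with design matrix
          | X, targets y, and weight vector w, the least-squares
          | objective expands to
          | 
          |     ||y - Xw||^2 = w^T (X^T X) w - 2 (X^T y)^T w + y^T y,
          | 
          | which is a quadratic form in w with matrix X^T X. Setting the
          | gradient 2 X^T X w - 2 X^T y to zero gives
          | w = (X^T X)^{-1} X^T y whenever X^T X is invertible, which is
          | presumably the least-squares formula the parent comment
          | spotted.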
        
         | garydevenay wrote:
         | Certainly doesn't seem like first principles...
        
       ___________________________________________________________________
       (page generated 2024-03-02 23:00 UTC)