[HN Gopher] Learning Theory from First Principles [pdf]
___________________________________________________________________
Learning Theory from First Principles [pdf]
Author : magnio
Score : 109 points
Date : 2024-03-02 18:10 UTC (4 hours ago)
(HTM) web link (www.di.ens.fr)
(TXT) w3m dump (www.di.ens.fr)
| scythmic_waves wrote:
| Interesting! I'll have to look over it when I have more time.
|
| From a quick glance, it looks like it covers much of the same
| material as this text [1]. I wonder how they compare.
|
| [1]: https://www.cambridge.org/core/books/understanding-
| machine-l...
| throwaway81523 wrote:
| A 2014 book on machine learning sounds quaint and historical.
| p1esk wrote:
| I can't wait until I can tell GPT-5 "I have this idea I want to
| try, read this book and tell me if there's anything relevant
| there to make it work better".
| ralphist wrote:
| I've never heard of an LLM making up a new idea. Shouldn't this
| only work if your thing has already been tried before?
| p1esk wrote:
| In my example I already have an idea, but I'm not sure how
| good or novel it is, or perhaps I tried it and it didn't work
| but I don't understand why.
| westoncb wrote:
| How people define "new" varies a lot in this context. I've
| spent a lot of time talking with ChatGPT exploring
| interdisciplinary ideas and while I think it frequently says
| things that qualify as new, I've only run into one situation
| where I was trying to find some way of doing something
| technical and it just invented something non-trivially
| original to handle it:
| https://x.com/Westoncb/status/1763733064478335326?s=20
| hef19898 wrote:
| Your idea would become even better if _you_ read the book
| _yourself_... You might even learn something new along the way.
| dkarras wrote:
| We are optimizing for time here, not learning. I'm gonna die
| anyway, and anything I learn will be dust. If I just need the
| info to make something work, spending months (of which I have
| a limited number) to see if something might be useful to me
| vs. spending an afternoon probing it to get most of the
| benefits is a no-brainer.
| makapuf wrote:
| Maybe, but don't expect otherwise from a book explicitly
| named "Learning Theory from First Principles", not "Learn
| Large Language Models in 21 Days".
| erbdex wrote:
| I am realising that passing context for $this is the tricky
| part as-
|
| 1. It is very difficult for me to describe my context as a
| user in low-dimensional variables.
|
| 2. I do not understand my situation in the universe to be able
| to tell AI.
|
| 3. I don't have a shared vocabulary with AI. The internet, I
| feel, aced this with the shared HTTP protocol to consistently
| exchange agreed-upon state. For example, within Uber I am a
| very narrow request-response universe: POST phone, car,
| gps(a,b,c,d), now, payment.
|
| But as a student wanting to learn algorithms, how do I convey
| that I'm $age, $internet-type, from $place, and prefer
| graphical explanations of algorithms; that I have tried but
| gotten scared of that thick book and have these
| $milestones-cs50; that I know $python up to $proficiency
| (which again is a fractal variable, with research papers on
| how to define it for learning)?
|
| Similarly, how do I help you understand what stage my startup
| idea is at, beyond "low traction"? I want you to know I have
| $networks/(VC, devs, sales) APIs and $these successful
| partnerships, with evidence such as $attendance, $sales. Who
| should I speak to? Could you please write the needful in
| mails and engage in partnerships with other bots under
| $budget?
|
| Even in the real world this vocabulary is in smaller pockets as
| our contexts are too different.
|
| 4. Learning assumes knowledge exists as a global forever
| variable in a wider-than-we-understand universe: $meteor
| being a non-maskable interrupt to the power supply at unicorn
| temperatures in a decade, or one-time trends in disposable
| $companies that $ecosystem uses to learn. Me being in a
| desert village with absent electricity might mean those
| machines never reach me, and perhaps most people in the world
| don't have even a basic phone to be able to share state.
| Their local power-mafia politics and absent governance might
| mean the pdf the AI recommends I read might or might not
| help.
|
| I don't know how this will evolve, but thinking about the
| possibilities has been so interesting. It's as if computers
| can talk to us easily and are such smart babies on day 1, and
| "folks, we aren't able to put the right, sufficient, cheap
| data in" is perhaps the real bottleneck to how much
| usefulness we are able to uncover.
| canjobear wrote:
| Have they figured out what causes double descent yet?
| a_wild_dandan wrote:
| No. We don't know. My favorite hypothesis: SGD is...well,
| stochastic. Meaning you're not optimizing w.r.t. the training
| corpus but a tiny subset, so your gradient isn't _quite_
| right. Over-training lets you bulldoze over local optima and
| converge toward the true distribution rather than drive
| around a local over-fitting basin.
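The stochasticity part of that hypothesis is easy to see numerically: a minibatch gradient is an unbiased but noisy estimate of the full-batch gradient. A minimal sketch on made-up least-squares data (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((1000, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(1000)
w = np.zeros(5)

# Full-batch gradient of the mean squared error (1/n)||Xw - y||^2
full_grad = 2 * X.T @ (X @ w - y) / len(y)

# Minibatch gradient over 32 random samples: same expectation, but noisy
idx = rng.choice(len(y), size=32, replace=False)
mini_grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

# The minibatch estimate differs from the true gradient
noise = np.linalg.norm(mini_grad - full_grad)
print(noise > 0)
```

Averaged over many minibatch draws, the estimates center on the full gradient; any single step follows a perturbed direction.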
| canjobear wrote:
| You can get it with full gradient descent though...
| https://www.nature.com/articles/s41467-020-14663-9
|
| Honestly the fact that there doesn't seem to be a good
| explanation for this makes me think that we just
| fundamentally don't understand learning.
| arolihas wrote:
| There was actually a very recent blog post claiming that
| statistical mechanics can explain double descent
| https://calculatedcontent.com/2024/03/01/describing-double-d...
|
| Some more detail here:
| https://calculatedcontent.com/2019/12/03/towards-a-new-theor...
| iaseiadit wrote:
| Not an expert, but this paper explores double descent with
| simple models. The interpretation there: when you extend into
| the overparameterized regime, that permits optimization towards
| small-norm weights, which generalize well again. Does that
| explain DD generally? Does it apply to other models (e.g.
| DNNs)?
|
| https://arxiv.org/pdf/2303.14151.pdf
| tel wrote:
| I don't know if it's a generalized result, but the Circuits
| team at Anthropic has a very compelling thesis: the first phase
| of descent corresponds to the model memorizing data points, the
| second phase corresponds to it shifting geometrically toward
| learning "features".
|
| Here a "feature" might be seen as a direction in an abstract,
| very high-dimensional vector space. The team is pretty deep in
| investigating the idea of superposition, where individual
| neurons encode for multiple concepts. They experiment with a
| toy model and toy data set where the latent features are
| represented explicitly and then compressed into a small set of
| data dimensions. This forces superposition. Then they show how
| that superposition looks under varying sizes of training data.
|
| It's obviously a toy model, but it's a compelling idea. At
| least for any model which might suffer from superposition.
|
| https://transformer-circuits.pub/2023/toy-double-descent/ind...
| da39a3ee wrote:
| There are so many great mathematical PDFs available for free on
| the Internet, written by academics/educators/engineers. A problem
| is that there is a huge amount of overlap. I wonder if an AI
| model could be developed that would do a really good job of
| synthesizing an overlapping collection into a coherent single PDF
| without duplication.
| hef19898 wrote:
| One could also just pick the books used in the corresponding
| university courses.
| sampo wrote:
| > 2.5 No free lunch theorem
| >
| > Although it may be tempting to define the optimal learning
| > algorithm that works optimally for all distributions, this is
| > impossible. In other words, learning is only possible with
| > assumptions.
|
| A mention of the no free lunch theorem should come with a
| disclaimer that the theorem is not relevant in practice. The
| assumption that your data originates from the real world is
| sufficient to ensure that the no free lunch theorem is not a
| hindrance.
|
| This book doesn't discuss this at all. Maybe mention that "all
| distributions" means a generalization, to higher-dimensional
| spaces of discontinuous functions (of which continuous
| functions are a tiny subset), of something similar to all
| possible bit sequences generated by tossing a coin. So
| basically, if your data is generated from a uniform random
| distribution over "all possibilities", you cannot learn to
| predict the outcome of the next coin tosses, or similar.
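The coin-toss case takes only a few lines to simulate: when labels are uniform random bits, nothing a learner extracts from training data helps on future data, so accuracy stays at chance level. A toy illustration (the "learner" here is a made-up majority-vote rule):

```python
import numpy as np

rng = np.random.default_rng(42)
train = rng.integers(0, 2, 100_000)  # past coin tosses
test = rng.integers(0, 2, 100_000)   # future coin tosses

# A "learner": always predict the majority label seen in training
majority = int(train.mean() >= 0.5)
accuracy = (test == majority).mean()
print(accuracy)  # hovers around 0.5: no better than chance
```

Any other rule fares the same in expectation, which is the no free lunch statement for this distribution; assuming real-world structure is what breaks the symmetry.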
| nextos wrote:
| Yes, most no free lunch theorems and results of this kind,
| which make overly general assumptions, tend to be too
| pessimistic.
|
| For example, many people naively think that static program
| analysis is infeasible due to the halting problem, Rice's
| theorem, etc.
| throwaway81523 wrote:
| This is pretty hard to read. For example, on the first page of
| chapter 1, it talks about "minimization of quadratic forms" and
| shows what looks like the formula for linear least squares. Is
| that right? It doesn't say anything about this. Some more
| exposition would help.
|
| I do like that there are lots of exercises.
| guimplen wrote:
| I think the text is geared towards people with some
| mathematical background who want to understand learning theory.
| Besides it is clearly stated that this chapter is a review (so
| its assumed that you learned or will learn these things
| elsewhere).
| throwaway81523 wrote:
| Well, I have some math background, but that section is brisk
| and slow at the same time, as it were. For example, it
| explains how to find the inverse of a 2x2 matrix.
|
| This is older but is supposed to be good:
| https://www.deeplearningbook.org/
| nerdponx wrote:
| The sibling comment is right in that this is clearly not
| intended for first timers.
|
| But your instincts are correct here. When you write out the
| objective function for ordinary least squares, it turns out
| to be a quadratic form. The choice of the word "quadratic"
| here is not a coincidence: a quadratic form is the
| generalization of quadratic functions to vectors and
| matrices.
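Concretely, expanding the OLS objective ||Xw - y||^2 gives w^T(X^T X)w - 2(X^T y)^T w + y^T y, a quadratic form in w. A quick check on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
w = rng.standard_normal(3)

# OLS objective evaluated directly: ||Xw - y||^2
direct = np.sum((X @ w - y) ** 2)

# The same objective expanded as a quadratic form in w:
# w^T A w - 2 b^T w + c, with A = X^T X, b = X^T y, c = y^T y
A = X.T @ X
b = X.T @ y
c = y @ y
quadratic = w @ A @ w - 2 * b @ w + c

print(np.isclose(direct, quadratic))  # the two expressions agree
```

Setting the gradient of the quadratic form to zero recovers the normal equations X^T X w = X^T y, which is the "minimization of quadratic forms" the book alludes to.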
| garydevenay wrote:
| Certainly doesn't seem like first principles...
___________________________________________________________________
(page generated 2024-03-02 23:00 UTC)