[HN Gopher] Where is Noether's principle in machine learning?
       ___________________________________________________________________
        
       Where is Noether's principle in machine learning?
        
       Author : cgadski
       Score  : 282 points
       Date   : 2024-03-01 11:47 UTC (1 days ago)
        
 (HTM) web link (cgad.ski)
 (TXT) w3m dump (cgad.ski)
        
       | empath-nirvana wrote:
       | This is one of those links where just seeing the title sets you
       | off, thinking about the implications.
       | 
       | I'm going to have to spend more time digesting the article, but
       | one thing that jumps out at me, and maybe it's answered in the
       | article and I don't understand it, is the role of time. Generally
       | in physics, you're talking about a quantity being conserved over
       | time, and I'm not sure what plays the role of time when you're
       | talking about conserved quantities in machine learning -- is it
       | conserved over training iterations or over inference layers, or
       | what?
       | 
        | edit: now that i've read it again, I just saw that they describe
        | it in the second paragraph.
       | 
       | I'm now wondering if in something like Sora that can do a kind of
       | physical modeling, if there's some conserved quantity in the
        | neural network that is _directly analogous_ to conserved
       | quantities in physics -- if there is, for example, something that
       | represents momentum, that operates exactly as momentum as it
       | progresses through the layers.
        
         | Raro wrote:
         | Yeah, I've been thinking about similar concepts in a different
         | context. Fascinating.
         | 
         | Regarding the role of time, the idea of a purely conserved
         | quantity is that it is conserved under the conditions of the
         | system (that's why the article frequently references Newton's
         | First Law), so they're generally held "for all time that these
         | symmetries exist in the system".
         | 
         | Specifically on time: the invariant for systems that exhibit
         | continuous time symmetries (i.e. you move a little bit forward
         | or backward in time and the system looks exactly the same) is
         | energy.
        
           | dustingetz wrote:
           | Here's my ELI5 attempt of the time/energy relation:
           | 
           | imagine a spring at rest (not moving)
           | 
           | strike the spring, it's now oscillating
           | 
           | the system now contains energy like a battery
           | 
           | what is energy? it's stored work potential
           | 
           | the battery is storing the energy, which can then be taken
           | out at some future time
           | 
           | the spring is transporting the energy through time
           | 
           | in fact how do we measure time? with clocks. What's a clock?
           | It's an oscillator. The energized spring _is_ the clock. When
            | system energy is zero, what is time even? There's no
           | baseline against which to measure change when nothing is
           | changing
        
             | PaulHoule wrote:
             | Symmetry exists abstractly, apart from time.
             | 
             | There are many machine learning problems which should have
             | symmetries: a picture of a cow rotated 135 degrees is still
             | a picture of a cow, the meaning of spoken words shouldn't
             | change with the audio level, etc. If they were doing
             | machine learning on tracks from the LHC the system ought to
             | take account of relativistic momentum and energy.
             | 
             | Can a model learn a symmetry? Or should a symmetry just be
             | built into the model from the beginning?
        
               | sdenton4 wrote:
               | Equivariant machine learning is a thing that people have
               | tried... Tends to be expensive and slow, though, and
               | imposes invariances that our model (a universal function
               | approximator, recall) should just learn anyway: If you
               | don't have enough pictures of upside down cows, just
               | train a normal model with augmentations.
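                | 
                | Concretely, "train with augmentations" just means the
                | data loader serves transformed copies. A minimal sketch
                | with torchvision (illustrative names, not from the
                | article):
                | 
                |   from torchvision import transforms
                | 
                |   # Illustrative pipeline: the model sees rotated and
                |   # flipped cows at train time instead of having the
                |   # invariance built into the architecture.
                |   augment = transforms.Compose([
                |       transforms.RandomRotation(degrees=180),
                |       transforms.RandomHorizontalFlip(),
                |       transforms.ToTensor(),
                |   ])
                |   # e.g. datasets.ImageFolder("cows/", transform=augment)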
        
         | Raro wrote:
         | Ha, my previous comment was before your new edit mentioning
         | Sora. There is a good reason why the accompanying research
         | report to the Sora demo isn't titled "Awesome Generative
         | Video," but references world models. The interesting feature is
          | how many apparent (approximations to) physical properties
          | emerge (object permanence, linear motion, partially elastic
          | collisions, as well as many elements of the grammar of
          | film), and which do not (notably material properties of solids
          | and fluids, creation of objects from nothing, etc.)
        
         | Communitivity wrote:
         | "I'm now wondering if in something like Sora that can do a kind
         | of physical modeling, if there's some conserved quantity in the
         | neural network that is _directly analogous_ to conserved
         | quantities in physics"
         | 
         | My first thought on reading that was that if there was it would
         | be interesting to see if there was some way it tied into the
         | concept of us living in a simulation, i.e. we're all living in
         | a complex ML network simulation.
        
         | nostrademons wrote:
         | In physics, the conserved quantity isn't always time.
         | Invariance over time translation is specifically conservation
         | of energy. Invariance over spatial translation is conservation
          | of momentum, invariance over spatial rotation is conservation
          | of angular momentum, invariance of
         | electromagnetic field is conservation of current, and
         | invariance of wave function phase is conservation of charge.
         | 
         | I think the analogue in machine learning is _conservation over
         | changes in the training data_. After all, the point of machine
         | learning is to find general models that describe the training
         | data given, and minimize the loss function. Assuming that a
         | useful model can be trained, the whole _point_ is that it
         | generalizes to new, unseen instances with minimal losses, i.e.
         | the model remains invariant under shifts in the instances seen.
         | 
         | The more interesting part to me is what this says about
         | philosophy of physics. Noether's Theorem can be restated as
         | "The laws of physics are invariant under X transformation",
         | where X is the gauge symmetry associated with the conservation
         | law. But _maybe this is simply a consequence of how we do
         | physics_. After all, the point of science is to produce
          | generalized laws from empirical observations. It's trivially
         | easy to find a real-world situation where conservation of
         | energy does _not_ hold (any system with friction, which is
         | basically all of them), but the math gets very messy if you try
         | to actually model the real data, so we rely on approximations
         | that are close enough most of the time. And if many people take
         | empirical measurements at many different points in space, and
         | time, and orientations, you get generalized laws that hold
          | regardless of where/when/who takes the measurement.
         | 
         | Machine learning could be viewed as doing science on
         | empirically measurable social quantities. It won't always be
         | accurate, as individual machine-learning fails show. But it's
         | accurate _enough_ that it can provide useful models for
         | civilization-scale quantities.
        
           | sdenton4 wrote:
           | A nice way to formulate (most) data augmentations is: a
           | family of functions A = {a} such that our optimized neural
           | network f obeys f(x) ~= f(a(x)).
           | 
           | So in this case, we're explicitly defining the set of desired
           | invariances.
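            | 
            | As a toy numerical version of that definition (sketch only;
            | f and a are stand-ins for the trained network and one
            | augmentation):
            | 
            |   import torch
            | 
            |   def invariance_gap(f, a, x):
            |       # relative size of f(x) - f(a(x)) on a batch x;
            |       # near zero means f learned the desired invariance
            |       return (f(x) - f(a(x))).norm() / f(x).norm()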
        
           | Aardwolf wrote:
           | Is there any way to deduce which invariance gives which
           | conservation? I mean for example: how can you tell that time
           | invariance is the one paired with conservation of energy? Why
           | is e.g. time invariance not paired with momentum, current, or
           | anything else, but specifically energy?
           | 
            | I know I can remember that momentum is paired with
            | translation simply because there's both the angular momentum
            | and the non-angular momentum one, and in space you have
            | translation and rotation, so for time energy is the only one
            | that's left over. But I'm not looking for a trick to remember
            | it; I'm looking for the fundamental reason, as well as how to
            | tell what will be paired with some invariance when looking at
            | some other new invariance.
        
             | chunky1994 wrote:
             | The conserved quantity is derived from Noether's theorem
             | itself. One thing that is a bit hairy is that Noether's
             | theorem only applies to a continuous, smooth (physical ->
             | there is some wiggle room here) space.
             | 
             | When deriving the conservation of energy from Noether's
             | theorem you basically say that your Lagrangian (which is
             | just a set of equations that describes a physical system)
             | is invariant over time. When you do that you automatically
             | get that energy is conserved. Each invariant produces a
              | conserved quantity, as explained in the parent comment, when
              | you apply a specific transformation that is supposed to not
              | change the system (i.e. leave it invariant).
             | 
             | Now in doing this you're also invoking the principle of
             | least action (by using Lagrangians to describe the state of
             | a physical system) but that is a separate topic.
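              | 
              | In symbols (the standard textbook form, added here for
              | reference): if the Lagrangian has no explicit time
              | dependence,
              | 
              |   \frac{\partial L}{\partial t} = 0
              |     \;\Longrightarrow\;
              |     E = \sum_i \dot{q}_i \frac{\partial L}{\partial \dot{q}_i} - L
              | 
              | is constant along solutions of the equations of motion.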
        
             | Cleonis wrote:
             | In retrospect: the earliest recognition of a conserved
             | quantity was Kepler's law of areas. Isaac Newton later
             | showed that Kepler's law of areas is a specific instance of
             | a property that obtains for any central force, not just the
             | (inverse square) law of gravity.
             | 
             | About symmetry under change of orientation: for a given
             | (spherically symmetric) source of gravitational interaction
             | the amount of gravitational force is the same in any
             | orientation.
             | 
             | For orbital motion the motion is in a plane, so for the
              | case of orbital motion the relevant symmetry is cylindrical
             | symmetry with respect to the plane of the orbit.
             | 
             | The very first derivation that is presented in Newton's
             | Principia is a derivation that shows that for any central
             | force we have: in equal intervals of time equal amounts of
             | area are swept out.
             | 
             | (The swept out area is proportional to the angular momentum
             | of the orbiting object. That is, the area law anticipated
             | the principle of conservation of angular momentum)
             | 
             | A discussion of Newton's derivation, illustrated with
             | diagrams, is available on my website:
             | http://cleonis.nl/physics/phys256/angular_momentum.php
             | 
             | The thrust of the derivation is that if the force that the
              | motion is subject to is a central force (cylindrical
             | symmetry) then angular momentum is conserved.
             | 
             | So: In retrospect we see that Newton's demonstration of the
             | area law is an instance of symmetry-and-conserved-quantity-
             | relation being used. Symmetry of a force under change of
             | orientation has as corresponding conserved quantity of the
             | resulting (orbiting) motion: conservation of angular
             | momentum.
             | 
             | About conservation laws:
             | 
             | The law of conservation of angular momentum and the law of
             | conservation of momentum are about quantities that are
             | associated with specific spatial characteristics, and the
             | conserved quantity is conserved over _time_.
             | 
             | I'm actually not sure about the reason(s) for
             | classification of conservation of energy. My own view: we
             | have that kinetic energy is not associated with any form of
             | keeping track of orientation; the velocity vector is
             | squared, and that squaring operation discards directional
              | information. More generally, energy is not associated with
              | any spatial characteristic. Arguably energy conservation is
             | categorized as associated with symmetry under time
             | translation because of _absence_ of association with any
             | spatial characteristic.
        
             | calhoun137 wrote:
             | The key point is that energy, momentum, and angular
             | momentum are additive constants of the motion, and this
             | additivity is a very important property that ultimately
             | derives from the geometry of the space-time in which the
             | motion takes place.
             | 
             | > Is there any way to deduce which invariance gives which
             | conservation?
             | 
             | Yes. See Landau vol 1 chapter 2 [1].
             | 
             | > I'm looking for the fundamental reason, as well as how to
             | tell what will be paired with some invariance when looking
             | at some other new invariance
             | 
             | I'm not sure there is such a "fundamental reason", since
             | energy, momentum, and angular momentum are by definition
             | the names we give to the conserved quantities associated
             | with time, translation, and rotation.
             | 
             | You are asking "how to tell what will be paired with some
             | invariance" but this is not at all obvious in the case of
             | conservation of charge, which is related to the fact that
             | the results of measurements do not change when all the
             | wavefunctions are shifted by a global phase factor (which
             | in general can depend on position).
             | 
             | I am not aware of any way to guess or understand which
             | invariance is tied to which conserved quantity other than
             | just calculating it out, at least not in a way that is
             | intuitive to me.
             | 
             | [1] https://ia803206.us.archive.org/4/items/landau-and-
             | lifshitz-...
        
               | Aardwolf wrote:
                | But momentum is also conserved over time; as far as I
                | know, 'conservation' of all of these things always means
                | conservation over time.
               | 
               | "In a closed system (one that does not exchange any
               | matter with its surroundings and is not acted on by
               | external forces) the total momentum remains constant."
               | 
               | That means it's conserved over time, right? So why is
               | energy the one associated with time and not momentum?
        
               | calhoun137 wrote:
                | My understanding is that conservation of momentum does
                | not mean momentum is conserved as time passes. It means
               | if you have a (closed) system in a certain configuration
               | (not in an external field) and compute the total
               | momentum, the result is independent of the configuration
               | of the system.
        
           | jedbrown wrote:
           | > It's trivially easy to find a real-world situation where
           | conservation of energy does not hold (any system with
           | friction, which is basically all of them)
           | 
           | Conservation of energy absolutely still holds, but entropy is
           | not conserved so the process is irreversible. If your model
           | doesn't include heat, then discrete energy won't be conserved
           | in a process that produces heat, but that's your modeling
           | choice, not a statement about physics. It is common to model
           | such processes using a dissipation potential.
        
             | nostrademons wrote:
             | Right, but I'm saying that it's _all_ modeling choices, all
             | the way down. Extend the model to include thermal energy
             | and most of the time it holds again - but then it falls
             | down if you also have static electricity that generates a
             | visible spark (say, a wool sweater on a slide) or magnetic
             | drag (say, regenerative braking on a car). Then you can
              | include models for _those_ too, but you're introducing new
              | concepts with each, and the math gets much hairier. We
              | _call_ the unified model, where we abstract away all the
              | different forms of energy, "conservation of energy", but
             | there are a good many practical systems where making
             | tangible predictions using conservation of energy gives
             | wrong answers.
             | 
             | Basically this is a restatement of Box's Aphorism ("All
             | models are wrong, but some are useful") or the ideas in
             | Thomas Kuhn's "The Structure of Scientific Revolutions".
              | The goal of science is to go from concrete observations to
             | abstract principles which ideally will accurately predict
             | the value of future concrete observations. In many cases,
             | you can do this. But not all. There is always messy data
             | that doesn't fit into neat, simple, general laws. Usually
             | the messy data is just ignored, because it can't be
             | predicted and is assumed to average out or generally be
             | irrelevant in the end. But sometimes the messy outliers
             | bite you, or someone comes up with a new way to handle them
             | elegantly, and then you get a paradigm shift.
             | 
             | And this has implications for understanding what machine
              | learning is or why it's important. Few people would think
              | that a model linking background color to likelihood of
              | clicking on ads is a fundamental physical quantity, but
              | Google had one 15+ years ago, and it was pretty accurate,
              | and made them a bunch of money. Or similarly, most people
              | wouldn't think of a model of the English language as being
              | a fundamental physical quantity, but that's exactly what an
              | LLM is, and they're pretty useful too.
        
               | jcgrillo wrote:
               | It's been a long time since I have cracked a physics
               | book, but your mention of interesting "fundamental
               | physical quantities" triggered the recollection of there
               | being a conservation of information result in quantum
               | mechanics where you can come up with an action whose
               | equations of motion are Schrodinger's equation and the
               | conserved quantity is a probability current. So I wonder
               | to what extent (if any) it might make sense to try to
               | approach these things in terms of the _really
               | fundamental_ quantity of information itself?
        
               | jerf wrote:
                | Approaching physics from a pure information-flow
                | perspective is definitely a current research topic. I
                | suspect we see less popsci treatment of it because almost
                | nobody understands information at all, and trying to
                | apply it to physics, which also almost nobody
                | understands, is probably at least three or four bridges
                | too far for a popsci treatment. But it's a current and
                | active topic.
        
           | chunky1994 wrote:
            | I'm a bit skeptical about giving up conservation of energy in
            | a system with friction. Isn't it more accurate to say that if
            | we were to calculate every specific interaction we'd still
            | end up having conservation of energy? Now whether or not
            | we're dealing with a closed system etc. becomes important,
            | but if we were able to truly model the entire physical system
            | with friction, we'd still adhere to our conservation laws.
           | 
           | So they are not approximations, but are just terribly
           | difficult calculations, no?
           | 
           | Maybe I'm misunderstanding your point, but this should be
           | true regardless of our philosophy of physics correct?
        
             | nyrikki wrote:
              | It is an analogy: dissipative systems do not have a
              | Lagrangian, and Noether's work applies to Lagrangian
              | systems.
              | 
              | Conservation laws in particular concern measurable
              | properties of an isolated physical system that do not
              | change as the system evolves over time.
             | 
             | It is important to remember that Physics is about finding
             | useful models that make useful predictions about a system.
             | So it is important to not confuse the map for the
             | territory.
             | 
             | Gibbs free energy and Helmholtz free energy are not
             | conserved.
             | 
              | As thermodynamics and entropy are difficult
              | topics due to didactic half-truths, here is a paper that
              | shows that the n-body problem becomes invariant and may be
              | undecidable due to what is a similar issue (in a contrived
              | fashion):
             | 
             | http://philsci-archive.pitt.edu/13175/
             | 
              | While Noether's principle often allows you to see things
              | that can be simplified in an equation, it frequently
              | allows you to not just simplify 'terribly difficult
              | calculations' but to actually find computationally
              | tractable calculations.
        
           | empath-nirvana wrote:
           | > In physics, the conserved quantity isn't always time.
           | Invariance over time translation is specifically conservation
           | of energy.
           | 
            | That's not what I meant.
           | 
           | When you talk about "conservation of angular momentum", the
           | symmetry is invariance over rotation, but the angular
           | momentum is conserved _over time_.
        
         | shiandow wrote:
         | A convolutional neural network ought to have translational
         | symmetry, which should lead to a generalized version of
         | momentum. If I understood the article correctly the conserved
         | quantity would be <gx, dx>, where dx is the finite difference
         | gradient of x.
         | 
         | This gives a vector with dimensions equal to however many
         | directions you can translate a layer in and which is conserved
         | over all (convolutional) layers.
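          | 
          | A rough way to probe that numerically (a sketch, not from the
          | article; circular padding makes whole-pixel shifts an exact
          | symmetry, and a one-pixel roll stands in for the finite
          | difference generator dx):
          | 
          |   import torch
          |   import torch.nn as nn
          | 
          |   # toy all-convolutional stack
          |   net = nn.Sequential(
          |       nn.Conv2d(3, 8, 3, padding=1, padding_mode="circular"),
          |       nn.ReLU(),
          |       nn.Conv2d(8, 8, 3, padding=1, padding_mode="circular"),
          |       nn.ReLU(),
          |   )
          | 
          |   x = torch.randn(1, 3, 32, 32, requires_grad=True)
          |   acts, h = [x], x
          |   for layer in net:
          |       h = layer(h)
          |       h.retain_grad()          # keep dloss/dactivation
          |       acts.append(h)
          | 
          |   w = torch.randn_like(h)      # fixed, non-shift-invariant readout
          |   loss = (h * w).sum()
          |   loss.backward()
          | 
          |   for i, a in enumerate(acts):
          |       dx = a.roll(-1, dims=-1) - a         # shift generator
          |       print(i, (a.grad * dx).sum().item()) # <g, dx>: roughly
          |                                            # equal per layer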
        
           | cgadski wrote:
           | Exactly right! In fact, because that symmetry does not
           | include an action on the parameters of the layer, your
           | conserved quantity <gx, dx> should hold whether or not the
            | network is stationary for a loss. This means that it'll
            | hold on every single data point. (In an image
           | classification model, these values are just telling you
           | whether or not the loss would be improved if the input image
           | were translated.)
        
             | empath-nirvana wrote:
              | Everything in the paper is talking about global symmetries;
              | is there also the possibility of gauge symmetries?
        
         | nurple wrote:
         | I think the most profound insight I've come across while
         | studying this particular topic is the insight that information
         | theory ended up being the answer to conserving the 2nd law with
          | respect to Maxwell's demon thought experiment. Not to put too
          | fine a point on it, but essentially the knowledge organized in
          | the mind of the demon, about the particles in its system, was
          | calculated to offset the creation of the energy gradient.
         | 
          | I found the thinking of William Sidis to be a particularly
          | thought-provoking perspective on Noether's landmark work: in
          | his paper The Animate and the Inanimate he posits--at a high
         | level--that life is a "reversal of the second law of
         | thermodynamics"; not that the 2nd law is a physical symmetry,
         | but a mental one in an existence where energy reversibly flows
         | between positive and negative states.
         | 
         | Indeed, when considering machine learning, I think it's quite
         | interesting to consider how the organizing of
         | information/knowledge done during training in some real way
         | mirrors the energy-creating information interred in the mind of
         | Maxwell's demon.
         | 
         | When taking into account the possible transitive benefits of
         | knowledge organized via machine learning, and its attendant
         | oracle through application, it's easy to see a world where this
         | results in a net entropy loss, the creation of a previously
         | non-existent energy gradient.
         | 
         | In my mind this has interesting implications for Fermi's
          | paradox as it seems to imply the inevitability of the
         | organization of information. Taken further into my own personal
         | dogma, I think it's inevitable that we create--what we would
         | consider--a sentient being as I believe this is the cycle of
         | our own origin in the larger evolutionary timeline.
        
           | Jerrrry wrote:
           | >at a high level--that life is a "reversal of the second law
           | of thermodynamics";
           | 
           | Life temporarily displaces entropy, locally.
           | 
           | Life wins battles, chaos wins the war.
           | 
           | >Indeed, when considering machine learning, I think it's
           | quite interesting to consider how the organizing of
           | information/knowledge done during training in some real way
           | mirrors the energy-creating information interred in the mind
           | of Maxwell's demon.
           | 
            | This is our human bias favoring the common myth that ever-
            | expanding complexity is an "inevitable" result of the passage
            | of time; refer to Stephen Jay Gould's "Full House: The Spread
            | of Excellence from Plato to Darwin"[0] for the only palatable
            | rebuttal modern evolutionists can offer.
           | 
           | >When taking into account the possible transitive benefits of
           | knowledge organized via machine learning, and its attendant
           | oracle through application, it's easy to see a world where
           | this results in a net entropy loss, the creation of a
           | previously non-existent energy gradient.
           | 
           | Because it is. Randomness combined with a sieve, like a
           | generator and a discriminator, like the primordial protein
           | soup and our own existence as a selector, like chaos and
           | order themselves, MAY - but DOES NOT have to - lead to
           | temporary, localized areas of complexity, that we call
           | 'life'.
           | 
           | This "energy gradient" you speak of is literally gravity
            | pulling baryonic matter forward through spacetime. All work
           | requires a temperature gradient - Hawking's musings on the
           | second law of thermodynamics and your own intuition can
           | reason why.
           | 
           | >In my mind this has interesting implications for Fermi's
           | paradox as it seems to imply the inevitibility of the
           | organization of information. Taken further into my own
           | personal dogma, I think it's inevitable that we create--what
           | we would consider--a sentient being as I believe this is the
           | cycle of our own origin in the larger evolutionary timeline.
           | 
            | Over cosmological time spans, it is a near-mathematical
            | certainty that we will either reach the universe's Omega
            | point[1] of "our" own accord, or perish by our own hands, by
            | our own creation's, or by our own sons'.
           | 
           | [0]: https://www.amazon.com/Full-House-Spread-Excellence-
           | Darwin/d...
           | 
           | [1]: https://www.youtube.com/watch?v=eOxHRFN4rs0
        
         | jungturk wrote:
         | > now wondering...if there's some conserved quantity in the
         | neural network that is _directly analagous_ to conserved
         | quantities in physics
         | 
         | Isn't the model attempting to conserve information during
         | training? And isn't information a physical quantity?
        
         | rnhmjoj wrote:
         | Time is not special regarding symmetries and conserved
         | quantities. In general you can consider any family of
         | continuous transformations parametrised by some real variable
         | s: be it translations by a distance x, rotations by an angle
          | phi, etc. These are technically one-parameter subgroups of a Lie
         | group.
         | 
         | Then, if your dynamical system is symmetrical under these
         | transformations you can construct a quantity whose derivative
         | wrt s is zero.
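          | 
          | For reference, the standard form of that quantity (textbook
          | statement, not from the comment): if L is invariant under the
          | family q -> q_s, then
          | 
          |   Q = \sum_i \frac{\partial L}{\partial \dot{q}_i}
          |         \left. \frac{d q_{i,s}}{d s} \right|_{s=0}
          | 
          | is conserved along solutions of the equations of motion.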
        
       | platz wrote:
       | how do you direct what the network learns if it all comes from
       | supervised learning training sets?
       | 
       | How do you insert rules that aren't learned into what weights are
       | learned?
        
         | nrub wrote:
          | There are promising methods being developed for physics-informed
          | neural networks. Mathematical models can be integrated into the
         | architecture of neural networks such that the parameters of the
         | designed mathematical models can be learned. Examples include
         | learning the frequency of a swinging pendulum from video,
         | amongst more advanced ideas.
         | 
         | https://en.wikipedia.org/wiki/Physics-informed_neural_networ...
         | https://www.youtube.com/watch?v=JoFW2uSd3Uo
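          | 
          | For a flavor of the pendulum example, a minimal sketch
          | (illustrative only: the observations are synthetic and omega
          | is the learned physical parameter):
          | 
          |   import torch
          |   import torch.nn as nn
          | 
          |   net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
          |                       nn.Linear(32, 32), nn.Tanh(),
          |                       nn.Linear(32, 1))
          |   omega = nn.Parameter(torch.tensor(1.0))
          | 
          |   t_obs = torch.linspace(0, 10, 200).unsqueeze(1)
          |   theta_obs = torch.sin(2.0 * t_obs)  # "tracked" angle, true omega = 2
          | 
          |   opt = torch.optim.Adam(list(net.parameters()) + [omega], lr=1e-3)
          |   for step in range(5000):
          |       t = t_obs.clone().requires_grad_(True)
          |       theta = net(t)
          |       dth = torch.autograd.grad(theta.sum(), t, create_graph=True)[0]
          |       d2th = torch.autograd.grad(dth.sum(), t, create_graph=True)[0]
          |       data = ((theta - theta_obs) ** 2).mean()
          |       physics = ((d2th + omega ** 2 * theta) ** 2).mean()
          |       opt.zero_grad()
          |       (data + physics).backward()
          |       opt.step()
          |   # omega should drift toward the true angular frequency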
        
       | brodolage wrote:
       | How does he create those animations? I'd like to make them as
       | well for myself.
        
         | smokel wrote:
         | They seem to be built with some love by the author. Apparently
         | they have written it in Haxe, judging from the comment in the
         | page source.
        
           | brodolage wrote:
           | Oh that's way out of my league unfortunately. I wonder if
           | there's a library or something that does something like this.
        
             | riemannzeta wrote:
             | Not quite what you're looking for, but worth pointing out
             | that Grant Sanderson of 3Blue1Brown has published the
             | "framework" he uses for his math videos on GitHub.
             | 
             | https://github.com/3b1b/manim
        
           | cgadski wrote:
           | Haha yeah, took some love. I have a scrappy little
           | "framework" that I've been adjusting since I started making
           | interactive posts last year. Writing my interactive widgets
           | feels a bit like doing a game jam now: just copy a template
           | and start compiling+reloading the page, seeing what I can get
           | onto the screen. I've just been using the canvas2d API.
           | 
           | Besides figuring out a good way of dealing with reference
           | frames, the only trick I'd pass on is to use CSS variables to
           | change colors and sizes (line widths, arrow dimensions, etc.)
           | interactively. It definitely helps to tighten the feedback
           | loop on those decisions.
        
         | aeonik wrote:
         | I'd like to know too.
         | 
         | I've been using Emmy from the Clojurescript ecosystem, which
          | works pretty well, but has a few quirks.
         | 
         | https://emmy-viewers.mentat.org/
        
       | r34 wrote:
       | As a complete amateur I was wondering if it could be possible to
        | use that property of light ("to always choose the optimal
       | route") to solve the traveling salesman problem (and the whole
       | class of those problems as a consequence). Maybe not with an
       | algorithmic approach, but rather some smart implementation of the
       | machine itself.
        
         | nkozyra wrote:
          | This sounds a bit like LIDAR implementations; I assume you mean
         | something similar at a smaller scale, where physical obstacles
         | provide a "path" representation of a problem space?
        
           | r34 wrote:
           | Yup, something like that came to my mind first: create a
           | physical representation (like a map) of the graph you want to
           | solve and use physics to determine the shortest path. Once
           | you have it you could easily compute the winning path's
           | length etc.
        
         | shiandow wrote:
         | If somehow you can ensure that light can only reach a point by
         | travelling through all other points then yes.
         | 
         | It's basically the same way you could use light to solve a
         | maze, just flood the exit with light and walk in the direction
         | which is brightest. Works better for mirror mazes.
        
         | pvg wrote:
         | Google up 'soap film steiner tree' for a fun, well-known
         | variant of this.
        
           | jerf wrote:
           | Then follow it up with
           | https://www.scottaaronson.com/papers/npcomplete.pdf . While
           | reality can "solve" these problems to some extent it turns
           | out that people overestimate reality's ability to solve it
           | _optimally_.
        
         | richk449 wrote:
          | Sounds like some of the trendy analog computing approaches,
         | like this one for example:
         | 
         | https://www.microsoft.com/en-us/research/uploads/prod/2023/0...
        
         | samatman wrote:
         | This is pretty likely, it's been done with DNA:
         | https://pubmed.ncbi.nlm.nih.gov/15555757/
         | 
         | Physics contains a lot of 'machinery' for solving for low
         | energy states.
        
       | Scene_Cast2 wrote:
       | People have mentioned the discrete - continuous tradeoff. One way
       | to bridge that gap would be to use
       | https://arxiv.org/abs/1806.07366 - they draw an equivalence
        | between vanilla (FC layer) neural nets of constant width and
       | differential equations, and then use a differential equation
       | solver to "train" a "neural net" (from what I remember - it's
       | been years since that paper...).
       | 
       | Another approach might be to take an information theoretic view
       | with the infinite-width finite-entropy nets.
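        | 
        | For reference, the basic recipe looks roughly like this (a sketch
        | using the torchdiffeq package released with that paper; details
        | simplified): treat the hidden state as following dh/dt = f(h, t)
        | and let a black-box ODE solver do the forward pass.
        | 
        |   import torch
        |   import torch.nn as nn
        |   from torchdiffeq import odeint
        | 
        |   class ODEFunc(nn.Module):
        |       # the "infinitely deep residual block": dh/dt = f(h, t)
        |       def __init__(self, dim):
        |           super().__init__()
        |           self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(),
        |                                    nn.Linear(64, dim))
        |       def forward(self, t, h):
        |           return self.net(h)
        | 
        |   func = ODEFunc(dim=2)
        |   h0 = torch.randn(16, 2)              # batch of initial states
        |   t = torch.tensor([0.0, 1.0])
        |   h1 = odeint(func, h0, t)[-1]         # "output" of the network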
        
         | scarmig wrote:
         | Another angle to look at would be the S4 models, which admit
         | both a continuous time and recurrent discrete representation.
        
       | pmayrgundter wrote:
       | I wonder if an energy and work metric could be derived for
       | gradient descent. This might be useful for a more rigorous
       | approach to hyperparameter development, and maybe for
       | characterizing the data being learned. We say that some datasets
       | are harder to learn, or measure difficulty by the overall compute
       | needed to hit a quality benchmark. Something more essential would
       | be a step forward.
       | 
        | Like in ANN backprop, the gradient descent algorithm can use
        | momentum to overcome getting stuck in local minima. This was
        | heuristically physical when I learned it... perhaps it's been
       | developed since. Maybe only allowing a "real" energy to the
       | momentum would then align it with an ability to do work
       | calculation. Might also help with ensemble/monte carlo methods,
       | to maintain an energy account across the ensemble.
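        | 
        | One way to make the momentum analogy concrete (toy sketch of
        | heavy-ball descent; 0.5*|v|^2 is a candidate "kinetic energy" to
        | account for, not a rigorous work/energy metric):
        | 
        |   import numpy as np
        | 
        |   def grad(w):
        |       return 2 * w                 # gradient of the toy loss |w|^2
        | 
        |   w = np.array([3.0, -2.0])
        |   v = np.zeros_like(w)
        |   lr, beta = 0.1, 0.9
        |   for step in range(100):
        |       v = beta * v - lr * grad(w)  # "velocity" update
        |       w = w + v
        |       kinetic = 0.5 * np.dot(v, v) # track per step / per member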
        
       | irchans wrote:
       | I liked the article and I hope that I can understand it more with
       | some study.
       | 
       | I think the following sentence in the article is wrong "Applying
       | Noether's theorem gives us three conserved quantities--one for
       | each degree of freedom in our group of transformations--which
       | turn out to be horizontal, vertical, and angular momentum."
       | 
       | I think the correct statement is "Applying Noether's theorem
       | gives us three conserved quantities--one for each degree of
       | freedom in our group of transformations--which turn out to be
       | translation, rotation, and time shifting."
       | 
       | I think translation leads to conservation of momentum, rotation
       | leads to conservation of angular momentum, and time shifting
       | leads to conservation of energy (potential+kinetic). It's been a
       | few decades since I saw the proof, so I might be wrong.
        
         | chunky1994 wrote:
         | Right, the rephrasing of the sentence is a tad more accurate.
         | Your three entities are [invariant -> conserved quantity]:
         | (translation -> momentum), (rotation -> angular momentum) and
         | (time -> energy).
        
         | nostrademons wrote:
         | I think your last paragraph is correct, but the statement in
         | the article is referring to the specific 2D 2-body example
         | given, and its original phrasing is also correct. Translation,
         | rotation, and time-shifting are _transformations_ (matrices),
         | not _quantities_. Horizontal, vertical, and angular (2D)
         | momentum are scalars. The article is saying that if you take
          | the action given in the example, there exist scalar
         | quantities (which we call horizontal momentum, vertical
         | momentum, and angular momentum) that remain constant regardless
         | of any horizontal, vertical, or rotational transformation of
         | the coordinate system used to measure the 2-body problem.
        
         | kurthr wrote:
         | The application of Noether's theorem in this case refers only
         | to the energy integral shown (KE = ME - GPE for 2D Kinetic
         | Mechanical and Gravitational Potential Energies) over time.
         | It's really only for that particular 2 body 2 dimensional
         | problem.
         | 
          | More generically, in 3 dimensions a transformation with 3
          | translational, 3 rotational, and 1 time independence would
          | provide conservation of 3 momenta, 3 angular momenta, and 1
          | energy.
        
         | cgadski wrote:
         | Hi, thanks!
         | 
         | In that sentence I was only talking about the translations and
         | rotations of the plane as a group of invariances for the action
         | of the two-body problem. This group is generated by one-
         | parameter subgroups producing vertical translation, horizontal
         | translation, and rotation about a particular point. Those are
         | the "three degrees of freedom" I was counting.
         | 
         | You're right about the correspondence from symmetries to
         | conservation laws in general.
        
         | esafak wrote:
         | I'll be walking tall the day I can leisurely read articles like
         | this! I wish I had studied this stuff; now time is short.
        
       | iskander wrote:
       | I love the simple but elegant formatting of this blog.
       | 
       | cgadski: what did you use to make it?
        
         | iskander wrote:
         | Only clue in the source:
         | 
         | <!-- this blog is proudly generated by, like, GNU make -->
        
         | cgadski wrote:
         | Thank you!
         | 
         | In the beginning, I used kognise's water.css [1], so most of
         | the smart decisions (background/text color, margins, line
         | spacing I think) probably come from there. Since then it's been
         | some amount of little adjustments. The font is by Jean Francois
         | Porchez, called Le Monde Livre Classic [2].
         | 
         | I draft in Obsidian [3] and build the site with a couple python
         | scripts and KaTeX.
         | 
         | [1] https://watercss.kognise.dev/
         | 
         | [2] https://typofonderie.com/fr/fonts/le-monde-livre-classic
         | 
         | [3] https://obsidian.md/
        
       | scarmig wrote:
       | A related paper I just found and am digesting:
       | https://arxiv.org/abs/2012.04728
       | 
       | Softmax gives rise to translation symmetry, batch normalization
       | to scale symmetry, homogeneous activations to rescale symmetry.
        | Each of those induces its own learning invariants through
       | training.
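        | 
        | The softmax case is easy to check numerically (toy sketch):
        | 
        |   import numpy as np
        | 
        |   def softmax(z):
        |       e = np.exp(z - z.max())    # subtracting a constant changes nothing
        |       return e / e.sum()
        | 
        |   z = np.random.randn(5)
        |   print(np.allclose(softmax(z), softmax(z + 3.7)))   # True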
        
         | cgadski wrote:
         | That's also a neat result! I'd just like to highlight that the
         | conservation laws proved in that paper are functions of the
         | parameters that hold over the course of gradient descent,
         | whereas my post is talking about functions of the activations
         | that are conserved from one layer to the next within an
         | optimized network.
         | 
         | By the way, maybe I'm being too much of a math snob, but I'd
         | argue Kunin's result is only superficially similar to Noether's
         | theorem. (In the paper they call it a "striking similarity"!)
         | Geometrically, what they're saying is that, if a loss function
         | is invariant under a non-zero vector field, then the trajectory
         | of gradient descent will be tangent to the codimension-1
         | distribution of vectors perpendicular to the vector field. If
         | that distribution is integrable (in the sense of the Frobenius
         | theorem), then any of its integrals is conserved under gradient
         | descent. That's a very different geometric picture from
         | Noether's theorem. For example, Noether's theorem gives a
         | direct mapping from invariances to conserved quantities,
         | whereas they need a special integrability condition to hold.
         | But yes, it is a nice result, certainly worth keeping in mind
         | when thinking about your gradient flows. :)
         | 
         | By the way, you might be interested in [1], which also studies
         | gradient descent from the point of view of mechanics and seems
         | to really use Noether-like results.
         | 
         | [1] Tanaka, Hidenori, and Daniel Kunin. "Noether's Learning
         | Dynamics: Role of Symmetry Breaking in Neural Networks." In
         | Advances in Neural Information Processing Systems, 34:25646-60.
         | Curran Associates, Inc., 2021. https://papers.nips.cc/paper/202
         | 1/hash/d76d8deea9c19cc9aaf22....
        
           | jonathanyc wrote:
           | Not GP, but thanks for your detailed comment and the paper
           | reference.
        
           | samatman wrote:
           | I wouldn't call drawing a distinction between an isomorphism
           | and an analogy to be maths snobbery. I would call it
           | mathematics. :)
        
       | emmynoether wrote:
       | It has been shown that a finite difference implementation of wave
       | propagation can be expressed as a deep neural network (e.g.,
       | [1]). These networks can have thousands of layers and yet I don't
       | think they suffer from the exploding/vanishing gradient problem,
       | which I imagine is because in the physical system they model
       | there are conservation laws such as conservation of energy.
       | 
       | [1] https://arxiv.org/abs/1801.07232
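        | 
        | A toy illustration of that equivalence (sketch only): one
        | leapfrog step of the 1-D wave equation is a fixed convolution
        | plus skip connections, so unrolling many steps looks like a very
        | deep convolutional network.
        | 
        |   import torch
        |   import torch.nn.functional as F
        | 
        |   c, dt, dx = 1.0, 0.01, 0.02
        |   lap = torch.tensor([[[1.0, -2.0, 1.0]]]) / dx**2  # u_xx stencil
        | 
        |   def step(u_prev, u_curr):
        |       u_xx = F.conv1d(u_curr, lap, padding=1)
        |       return 2 * u_curr - u_prev + (c * dt) ** 2 * u_xx
        | 
        |   u_prev = torch.zeros(1, 1, 200)
        |   u_curr = torch.zeros(1, 1, 200)
        |   u_curr[0, 0, 100] = 1.0              # initial pulse
        |   for _ in range(1000):                # a thousand "layers"
        |       u_prev, u_curr = u_curr, step(u_prev, u_curr)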
        
       | waveBidder wrote:
        | So I think this is a great connection that deserves more
        | thought, as well as an absolutely gorgeous write-up.
        | 
        | The main problem I see with it is that most of the time you
        | _don't_ want the optimum for your objective function, as that
        | frequently results in overfitting. This leads to things like
        | early stopping being typical.
        
         | cgadski wrote:
         | Thanks so much!
         | 
         | And yes, that's quite true. When parameter gradients don't
         | quite vanish, then the equation
         | 
         | <g_x, d x / d eps> = <g_y, d y / d eps>
         | 
         | becomes
         | 
         | <g_x, d x / d eps> = <g_y, d y / d eps> - <g_theta, d theta / d
         | eps>
         | 
         | where g_theta is the gradient with respect to theta.
         | 
         | In defense of my hypothesis that interesting approximate
         | conservation laws exist in practice, I'd argue that maybe
         | parameter gradients at early stopping are small enough that the
         | last term is pretty small compared to the first two.
         | 
         | On the other hand, stepping back, the condition that our
         | network parameters are approximately stationary for a loss
         | function feels pretty... shallow. My impression of deep
         | learning is that an optimized model _cannot_ be understood as
         | just "some solution to an optimization problem," but is more
         | like a sample from a Boltzmann distribution which happens to
         | concentrate a lot of its probability mass around _certain_
         | minimizers of an energy. So, if we can prove something that is
         | true for neural networks simply because they're "near
         | stationary points", we probably aren't saying anything very
         | fundamental about deep learning.
        
           | riemannzeta wrote:
           | Your work here is so beautiful, but perhaps one lesson is
           | that growth and learning result where symmetries are broken.
           | :-D
        
       | raptortech wrote:
       | See also "Noether Networks: Meta-Learning Useful Conserved
       | Quantities" https://arxiv.org/abs/2112.03321 from 2021.
       | 
       | Abstract: Progress in machine learning (ML) stems from a
       | combination of data availability, computational resources, and an
       | appropriate encoding of inductive biases. Useful biases often
       | exploit symmetries in the prediction problem, such as
       | convolutional networks relying on translation equivariance.
       | Automatically discovering these useful symmetries holds the
       | potential to greatly improve the performance of ML systems, but
       | still remains a challenge. In this work, we focus on sequential
       | prediction problems and take inspiration from Noether's theorem
       | to reduce the problem of finding inductive biases to meta-
       | learning useful conserved quantities. We propose Noether
       | Networks: a new type of architecture where a meta-learned
       | conservation loss is optimized inside the prediction function. We
       | show, theoretically and experimentally, that Noether Networks
       | improve prediction quality, providing a general framework for
       | discovering inductive biases in sequential problems.
        
       | klysm wrote:
       | Completely irrelevant but I love the way the color theme on this
       | blog feels like a chalk board
        
       | calhoun137 wrote:
       | Very nice article! I recently had a long chat with chatgpt on
       | this topic, although from a slightly different perspective.
       | 
        | A neural network is a type of machine that solves nonlinear
        | optimization problems, and the principle of least action is also
        | a nonlinear optimization problem that nature solves by some kind
       | of natural law.
       | 
        | This is the one thing that chatgpt mentioned which surprised me
       | the most and which I had not previously considered.
       | 
       | > Eigenvalues of the Hamiltonian in quantum mechanics correspond
       | to energy states. In neural networks, the eigenvalues (principal
       | components) of certain matrices, like the weight matrices in
       | certain layers, can provide information about the dominant
       | features or patterns. The notion of states or dominant features
       | might be loosely analogous between the two domains.
       | 
       | I am skeptical that any conserved quantity besides energy would
       | have a corresponding conserved quantity in ML, and the Reynolds
       | operator will likely be relevant for understanding any
       | correspondence like this.
       | 
        | iirc the Reynolds operator plays an important role in Noether's
       | theorem, and it involves an averaging operation similar to what
       | is described in the linked article.
        
       ___________________________________________________________________
       (page generated 2024-03-02 23:01 UTC)