[HN Gopher] Where is Noether's principle in machine learning?
___________________________________________________________________
Where is Noether's principle in machine learning?
Author : cgadski
Score : 282 points
Date : 2024-03-01 11:47 UTC (1 days ago)
(HTM) web link (cgad.ski)
(TXT) w3m dump (cgad.ski)
| empath-nirvana wrote:
| This is one of those links where just seeing the title sets you
| off, thinking about the implications.
|
| I'm going to have to spend more time digesting the article, but
| one thing that jumps out at me, and maybe it's answered in the
| article and I don't understand it, is the role of time. Generally
| in physics, you're talking about a quantity being conserved over
| time, and I'm not sure what plays the role of time when you're
| talking about conserved quantities in machine learning -- is it
| conserved over training iterations or over inference layers, or
| what?
|
| edit: now that i've read it again, I just saw that they described
| it in the second paragraph.
|
| I'm now wondering if in something like Sora that can do a kind of
| physical modeling, if there's some conserved quantity in the
| neural network that is _directly analogous_ to conserved
| quantities in physics -- if there is, for example, something that
| represents momentum, that operates exactly as momentum as it
| progresses through the layers.
| Raro wrote:
| Yeah, I've been thinking about similar concepts in a different
| context. Fascinating.
|
| Regarding the role of time, the idea of a purely conserved
| quantity is that it is conserved under the conditions of the
| system (that's why the article frequently references Newton's
| First Law), so they're generally held "for all time that these
| symmetries exist in the system".
|
| Specifically on time: the invariant for systems that exhibit
| continuous time symmetries (i.e. you move a little bit forward
| or backward in time and the system looks exactly the same) is
| energy.
| dustingetz wrote:
| Here's my ELI5 attempt of the time/energy relation:
|
| imagine a spring at rest (not moving)
|
| strike the spring, it's now oscillating
|
| the system now contains energy like a battery
|
| what is energy? it's stored work potential
|
| the battery is storing the energy, which can then be taken
| out at some future time
|
| the spring is transporting the energy through time
|
| in fact how do we measure time? with clocks. What's a clock?
| It's an oscillator. The energized spring _is_ the clock. When
| system energy is zero, what is time even? There's no
| baseline against which to measure change when nothing is
| changing
| PaulHoule wrote:
| Symmetry exists abstractly, apart from time.
|
| There are many machine learning problems which should have
| symmetries: a picture of a cow rotated 135 degrees is still
| a picture of a cow, the meaning of spoken words shouldn't
| change with the audio level, etc. If they were doing
| machine learning on tracks from the LHC the system ought to
| take account of relativistic momentum and energy.
|
| Can a model learn a symmetry? Or should a symmetry just be
| built into the model from the beginning?
| sdenton4 wrote:
| Equivariant machine learning is a thing that people have
| tried... Tends to be expensive and slow, though, and
| imposes invariances that our model (a universal function
| approximator, recall) should just learn anyway: If you
| don't have enough pictures of upside down cows, just
| train a normal model with augmentations.
| Raro wrote:
| Ha, my previous comment was before your new edit mentioning
| Sora. There is a good reason why the accompanying research
| report to the Sora demo isn't titled "Awesome Generative
| Video," but references world models. The interesting feature is
| how many apparently (approximations to) physical properties
| emerge (object permanence, linear motion, partially elastic
| collisions, as well as many of the elements of grammar of
| film), and which do not (notably material properties of solid
| and fluids, creation of objects from nothing, etc.)
| Communitivity wrote:
| "I'm now wondering if in something like Sora that can do a kind
| of physical modeling, if there's some conserved quantity in the
| neural network that is _directly analogous_ to conserved
| quantities in physics"
|
| My first thought on reading that was that if there was it would
| be interesting to see if there was some way it tied into the
| concept of us living in a simulation, i.e. we're all living in
| a complex ML network simulation.
| nostrademons wrote:
| In physics, the conserved quantity isn't always time.
| Invariance over time translation is specifically conservation
| of energy. Invariance over spatial translation is conservation
| of momentum, invariance over spatial rotation is conservation
| of angular momentum, invariance of
| electromagnetic field is conservation of current, and
| invariance of wave function phase is conservation of charge.
|
| I think the analogue in machine learning is _conservation over
| changes in the training data_. After all, the point of machine
| learning is to find general models that describe the training
| data given, and minimize the loss function. Assuming that a
| useful model can be trained, the whole _point_ is that it
| generalizes to new, unseen instances with minimal losses, i.e.
| the model remains invariant under shifts in the instances seen.
|
| The more interesting part to me is what this says about
| philosophy of physics. Noether's Theorem can be restated as
| "The laws of physics are invariant under X transformation",
| where X is the gauge symmetry associated with the conservation
| law. But _maybe this is simply a consequence of how we do
| physics_. After all, the point of science is to produce
| generalized laws from empirical observations. It's trivially
| easy to find a real-world situation where conservation of
| energy does _not_ hold (any system with friction, which is
| basically all of them), but the math gets very messy if you try
| to actually model the real data, so we rely on approximations
| that are close enough most of the time. And if many people take
| empirical measurements at many different points in space, and
| time, and orientations, you get generalized laws that hold
| regardless of where/when/who takes the measurement.
|
| Machine learning could be viewed as doing science on
| empirically measurable social quantities. It won't always be
| accurate, as individual machine-learning fails show. But it's
| accurate _enough_ that it can provide useful models for
| civilization-scale quantities.
| sdenton4 wrote:
| A nice way to formulate (most) data augmentations is: a
| family of functions A = {a} such that our optimized neural
| network f obeys f(x) ~= f(a(x)).
|
| So in this case, we're explicitly defining the set of desired
| invariances.
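|
| For concreteness, a toy numpy sketch of that framing (my own
| example, assuming a task where rotations by multiples of 90
| degrees are among the desired invariances; f here is a stand-in
| for a trained model, not a real network):
|
|     import numpy as np
|
|     # augmentation family A: rotations by 0, 90, 180, 270 degrees
|     augmentations = [lambda x, k=k: np.rot90(x, k) for k in range(4)]
|
|     def f(x):
|         # stand-in "model": a function of rotation-invariant image
|         # statistics, so f(x) == f(a(x)) holds exactly for this family
|         return np.array([x.mean(), x.std()])
|
|     x = np.random.rand(32, 32)
|     for a in augmentations:
|         assert np.allclose(f(x), f(a(x)))   # the desired invariance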
| Aardwolf wrote:
| Is there any way to deduce which invariance gives which
| conservation? I mean for example: how can you tell that time
| invariance is the one paired with conservation of energy? Why
| is e.g. time invariance not paired with momentum, current, or
| anything else, but specifically energy?
|
| I know that I can remember momentum is paired with
| translation simply because there's both the angular momentum
| and the non-angular momentum one and in space you have
| translation and rotation, so for time energy is the only one
| that's left over, but I'm not looking for a trick to remember
| it, I'm looking for the fundamental reason, as well as how to
| tell what will be paired with some invariance when looking at
| some other new invariance
| chunky1994 wrote:
| The conserved quantity is derived from Noether's theorem
| itself. One thing that is a bit hairy is that Noether's
| theorem only applies to a continuous, smooth (physical ->
| there is some wiggle room here) space.
|
| When deriving the conservation of energy from Noether's
| theorem you basically say that your Lagrangian (which is
| just a set of equations that describes a physical system)
| is invariant over time. When you do that you automatically
| get that energy is conserved. Each invariant produces a
| conserved quantity, as explained in the parent comment, when you
| apply a specific transformation that is supposed to not
| change the system (i.e. remain invariant).
|
| Now in doing this you're also invoking the principle of
| least action (by using Lagrangians to describe the state of
| a physical system) but that is a separate topic.
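|
| To make the time case concrete (this is the standard textbook
| computation, written in LaTeX, not anything specific to the
| article): if the Lagrangian L(q, q') has no explicit time
| dependence, then along solutions of the Euler-Lagrange equation
|
|     \frac{d}{dt}\left(\dot q\,\frac{\partial L}{\partial \dot q} - L\right)
|       = \dot q\left(\frac{d}{dt}\frac{\partial L}{\partial \dot q}
|         - \frac{\partial L}{\partial q}\right) = 0,
|
| so the quantity in parentheses (the energy) is conserved. The
| same computation for a coordinate that L does not depend on
| gives conservation of the corresponding conjugate momentum.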
| Cleonis wrote:
| In retrospect: the earliest recognition of a conserved
| quantity was Kepler's law of areas. Isaac Newton later
| showed that Kepler's law of areas is a specific instance of
| a property that obtains for any central force, not just the
| (inverse square) law of gravity.
|
| About symmetry under change of orientation: for a given
| (spherically symmetric) source of gravitational interaction
| the amount of gravitational force is the same in any
| orientation.
|
| For orbital motion the motion is in a plane, so for the
| case of orbital motion the relevant symmetry is cylindrical
| symmetry with respect to the plane of the orbit.
|
| The very first derivation that is presented in Newton's
| Principia is a derivation that shows that for any central
| force we have: in equal intervals of time equal amounts of
| area are swept out.
|
| (The swept out area is proportional to the angular momentum
| of the orbiting object. That is, the area law anticipated
| the principle of conservation of angular momentum)
|
| A discussion of Newton's derivation, illustrated with
| diagrams, is available on my website:
| http://cleonis.nl/physics/phys256/angular_momentum.php
|
| The thrust of the derivation is that if the force that the
| motion is subject to is a central force (cylindrical
| symmetry) then angular momentum is conserved.
|
| So: In retrospect we see that Newton's demonstration of the
| area law is an instance of symmetry-and-conserved-quantity-
| relation being used. Symmetry of a force under change of
| orientation has as corresponding conserved quantity of the
| resulting (orbiting) motion: conservation of angular
| momentum.
|
| About conservation laws:
|
| The law of conservation of angular momentum and the law of
| conservation of momentum are about quantities that are
| associated with specific spatial characteristics, and the
| conserved quantity is conserved over _time_.
|
| I'm actually not sure about the reason(s) for
| classification of conservation of energy. My own view: we
| have that kinetic energy is not associated with any form of
| keeping track of orientation; the velocity vector is
| squared, and that squaring operation discards directional
| information. More generally, Energy is not associated with
| any spatial characteristic. Arguably Energy conservation is
| categorized as associated with symmetry under time
| translation because of _absence_ of association with any
| spatial characteristic.
| calhoun137 wrote:
| The key point is that energy, momentum, and angular
| momentum are additive constants of the motion, and this
| additivity is a very important property that ultimately
| derives from the geometry of the space-time in which the
| motion takes place.
|
| > Is there any way to deduce which invariance gives which
| conservation?
|
| Yes. See Landau vol 1 chapter 2 [1].
|
| > I'm looking for the fundamental reason, as well as how to
| tell what will be paired with some invariance when looking
| at some other new invariance
|
| I'm not sure there is such a "fundamental reason", since
| energy, momentum, and angular momentum are by definition
| the names we give to the conserved quantities associated
| with time, translation, and rotation.
|
| You are asking "how to tell what will be paired with some
| invariance" but this is not at all obvious in the case of
| conservation of charge, which is related to the fact that
| the results of measurements do not change when all the
| wavefunctions are shifted by a global phase factor (which
| in general can depend on position).
|
| I am not aware of any way to guess or understand which
| invariance is tied to which conserved quantity other than
| just calculating it out, at least not in a way that is
| intuitive to me.
|
| [1] https://ia803206.us.archive.org/4/items/landau-and-
| lifshitz-...
| Aardwolf wrote:
| But momentum is also conserved over time, as far as I
| know 'conservation' of all of these things always means
| over time.
|
| "In a closed system (one that does not exchange any
| matter with its surroundings and is not acted on by
| external forces) the total momentum remains constant."
|
| That means it's conserved over time, right? So why is
| energy the one associated with time and not momentum?
| calhoun137 wrote:
| my understanding is that conservation of momentum does
| not mean momentum is conserved as time passes. it means
| if you have a (closed) system in a certain configuration
| (not in an external field) and compute the total
| momentum, the result is independent of the configuration
| of the system.
| jedbrown wrote:
| > It's trivially easy to find a real-world situation where
| conservation of energy does not hold (any system with
| friction, which is basically all of them)
|
| Conservation of energy absolutely still holds, but entropy is
| not conserved so the process is irreversible. If your model
| doesn't include heat, then discrete energy won't be conserved
| in a process that produces heat, but that's your modeling
| choice, not a statement about physics. It is common to model
| such processes using a dissipation potential.
| nostrademons wrote:
| Right, but I'm saying that it's _all_ modeling choices, all
| the way down. Extend the model to include thermal energy
| and most of the time it holds again - but then it falls
| down if you also have static electricity that generates a
| visible spark (say, a wool sweater on a slide) or magnetic
| drag (say, regenerative braking on a car). Then you can
| include models for _those_ too, but you're introducing new
| concepts with each, and the math gets much hairier. We
| _call_ the unified model where we abstract away all the
| different forms of energy "conservation of energy", but
| there are a good many practical systems where making
| tangible predictions using conservation of energy gives
| wrong answers.
|
| Basically this is a restatement of Box's Aphorism ("All
| models are wrong, but some are useful") or the ideas in
| Thomas Kuhn's "The Structure of Scientific Revolutions".
| The goal of science is to go from concrete observations to
| abstract principles which ideally will accurately predict
| the value of future concrete observations. In many cases,
| you can do this. But not all. There is always messy data
| that doesn't fit into neat, simple, general laws. Usually
| the messy data is just ignored, because it can't be
| predicted and is assumed to average out or generally be
| irrelevant in the end. But sometimes the messy outliers
| bite you, or someone comes up with a new way to handle them
| elegantly, and then you get a paradigm shift.
|
| And this has implications for understanding what machine
| learning is or why it's important. Few people would think
| that a model linking background color to likelihood of
| clicking on ads is a fundamental physical quantity, but Google
| had one 15+ years ago, and it was pretty accurate, and made
| them a bunch of money. Or similarly, most people wouldn't
| think of a model of the English language as being a
| fundamental physical quantity, but that's exactly what an
| LLM is, and they're pretty useful too.
| jcgrillo wrote:
| It's been a long time since I have cracked a physics
| book, but your mention of interesting "fundamental
| physical quantities" triggered the recollection of there
| being a conservation of information result in quantum
| mechanics where you can come up with an action whose
| equations of motion are Schrodinger's equation and the
| conserved quantity is a probability current. So I wonder
| to what extent (if any) it might make sense to try to
| approach these things in terms of the _really
| fundamental_ quantity of information itself?
| jerf wrote:
| Approaching physics from a pure information flow is
| definitely a current research topic. I suspect we see
| less popsci treatment of it because almost nobody
| understands information at all, and then trying to apply it
| to physics that also almost nobody understands is
| probably at least three or four bridges too far for a
| popsci treatment, but it's a current and active topic.
| chunky1994 wrote:
| I'm a bit skeptical to give up conservation of energy in a
| system with friction. Isn't it more accurate to say that if
| we were to calculate every specific interaction we'd still
| end up having conservation of energy? Whether or not
| we're dealing with a closed system etc. becomes important, but
| if we were able to truly model the entire physical system
| with friction, we'd still adhere to our conservation laws.
|
| So they are not approximations, but are just terribly
| difficult calculations, no?
|
| Maybe I'm misunderstanding your point, but this should be
| true regardless of our philosophy of physics correct?
| nyrikki wrote:
| It is an analogy stating that dissipative systems do not
| have a Lagrangian; Noether's work applies to Lagrangian
| systems.
|
| Conservation laws in particular are measurable properties
| of an isolated physical system that do not change as the
| system evolves over time.
|
| It is important to remember that Physics is about finding
| useful models that make useful predictions about a system.
| So it is important to not confuse the map for the
| territory.
|
| Gibbs free energy and Helmholtz free energy are not
| conserved.
|
| As thermodynamics and entropy are difficult
| topics due to didactic half-truths, here is a paper that
| shows that the n-body problem becomes invariant and may be
| undecidable due to a similar issue (in a contrived
| fashion):
|
| http://philsci-archive.pitt.edu/13175/
|
| While Noether's principle often lets you see what can be
| simplified in an equation, it frequently does more than
| simplify 'terribly difficult calculations': it can turn them
| into computationally feasible ones.
| empath-nirvana wrote:
| > In physics, the conserved quantity isn't always time.
| Invariance over time translation is specifically conservation
| of energy.
|
| That's not what i meant.
|
| When you talk about "conservation of angular momentum", the
| symmetry is invariance over rotation, but the angular
| momentum is conserved _over time_.
| shiandow wrote:
| A convolutional neural network ought to have translational
| symmetry, which should lead to a generalized version of
| momentum. If I understood the article correctly the conserved
| quantity would be <gx, dx>, where dx is the finite difference
| gradient of x.
|
| This gives a vector with dimensions equal to however many
| directions you can translate a layer in and which is conserved
| over all (convolutional) layers.
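|
| A rough way to check this numerically (my own toy sketch, assuming
| stride-1 convolutions with circular padding and no nonlinearities,
| so the translation equivariance, and hence the identity, is exact;
| with nonlinearities and a one-pixel shift it only holds
| approximately):
|
|     import torch
|     import torch.nn as nn
|
|     torch.manual_seed(0)
|     convs = [nn.Conv2d(3, 8, 3, padding=1, padding_mode='circular'),
|              nn.Conv2d(8, 8, 3, padding=1, padding_mode='circular'),
|              nn.Conv2d(8, 4, 3, padding=1, padding_mode='circular')]
|
|     x = torch.randn(1, 3, 16, 16, requires_grad=True)
|     acts, h = [x], x
|     for conv in convs:
|         h = conv(h)
|         h.retain_grad()          # keep grads on intermediate activations
|         acts.append(h)
|
|     h.mean().backward()          # stand-in for a real loss
|
|     def d_translate(a):          # finite-difference shift by one pixel
|         return torch.roll(a, 1, dims=-1) - a
|
|     for i, a in enumerate(acts):
|         q = (a.grad * d_translate(a.detach())).sum().item()
|         print(f"layer {i}: <g, dx> = {q:.6f}")   # equal at every layer,
|                                                  # up to float error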
| cgadski wrote:
| Exactly right! In fact, because that symmetry does not
| include an action on the parameters of the layer, your
| conserved quantity <gx, dx> should hold whether or not the
| network is stationary for a loss. This means that it'll be
| stationary on every single data point. (In an image
| classification model, these values are just telling you
| whether or not the loss would be improved if the input image
| were translated.)
| empath-nirvana wrote:
| Everything in the paper is talking about global symmetries;
| is there also the possibility of gauge symmetries?
| nurple wrote:
| I think the most profound insight I've come across while
| studying this particular topic is the insight that information
| theory ended up being the answer to conserving the 2nd law with
| respect to Maxwell's demon thought experiment. Not to put too
| fine a point, but essentially the knowledge organized in the
| mind of the demon, about the particles in its system, was
| calculated to offset the creation of the energy gradient.
|
| I found the thinking of William Sidis to be a particularly
| thought-provoking perspective on Noether's benchmark work. In
| his paper The Animate and the Inanimate he posits--at a high
| level--that life is a "reversal of the second law of
| thermodynamics"; not that the 2nd law is a physical symmetry,
| but a mental one in an existence where energy reversibly flows
| between positive and negative states.
|
| Indeed, when considering machine learning, I think it's quite
| interesting to consider how the organizing of
| information/knowledge done during training in some real way
| mirrors the energy-creating information interred in the mind of
| Maxwell's demon.
|
| When taking into account the possible transitive benefits of
| knowledge organized via machine learning, and its attendant
| oracle through application, it's easy to see a world where this
| results in a net entropy loss, the creation of a previously
| non-existent energy gradient.
|
| In my mind this has interesting implications for Fermi's
| paradox as it seems to imply the inevitability of the
| organization of information. Taken further into my own personal
| dogma, I think it's inevitable that we create--what we would
| consider--a sentient being as I believe this is the cycle of
| our own origin in the larger evolutionary timeline.
| Jerrrry wrote:
| >at a high level--that life is a "reversal of the second law
| of thermodynamics";
|
| Life temporarily displaces entropy, locally.
|
| Life wins battles, chaos wins the war.
|
| >Indeed, when considering machine learning, I think it's
| quite interesting to consider how the organizing of
| information/knowledge done during training in some real way
| mirrors the energy-creating information interred in the mind
| of Maxwell's demon.
|
| This is our human bias favoring the common myth that ever-
| expanding complexity is an "inevitable" result of the passage
| of time; refer to Stephen Jay Gould's "Full House: The Spread
| of Excellence from Plato to Darwin"[0] for the only palatable
| rebuttal modern evolutionists can offer.
|
| >When taking into account the possible transitive benefits of
| knowledge organized via machine learning, and its attendant
| oracle through application, it's easy to see a world where
| this results in a net entropy loss, the creation of a
| previously non-existent energy gradient.
|
| Because it is. Randomness combined with a sieve, like a
| generator and a discriminator, like the primordial protein
| soup and our own existence as a selector, like chaos and
| order themselves, MAY - but DOES NOT have to - lead to
| temporary, localized areas of complexity, that we call
| 'life'.
|
| This "energy gradient" you speak of is literally gravity
| pulling baryonic matter forward through spacetime. All work
| requires a temperature gradient - Hawking's musings on the
| second law of thermodynamics and your own intuition can
| reason why.
|
| >In my mind this has interesting implications for Fermi's
| paradox as it seems to imply the inevitability of the
| organization of information. Taken further into my own
| personal dogma, I think it's inevitable that we create--what
| we would consider--a sentient being as I believe this is the
| cycle of our own origin in the larger evolutionary timeline.
|
| Over cosmological time spans, it is a near-mathematical
| certainty that we will either reach the universe's Omega
| point[1] of "our" own accord, or perish by our own hands, by
| our own creation's, or by our own sons'.
|
| [0]: https://www.amazon.com/Full-House-Spread-Excellence-
| Darwin/d...
|
| [1]: https://www.youtube.com/watch?v=eOxHRFN4rs0
| jungturk wrote:
| > now wondering...if there's some conserved quantity in the
| neural network that is _directly analogous_ to conserved
| quantities in physics
|
| Isn't the model attempting to conserve information during
| training? And isn't information a physical quantity?
| rnhmjoj wrote:
| Time is not special regarding symmetries and conserved
| quantities. In general you can consider any family of
| continuous transformations parametrised by some real variable
| s: be it translations by a distance x, rotations by an angle
| phi, etc. These are technically one-parameter subgroups of a Lie
| group.
|
| Then, if your dynamical system is symmetrical under these
| transformations you can construct a quantity whose derivative
| wrt s is zero.
| platz wrote:
| how do you direct what the network learns if it all comes from
| supervised learning training sets?
|
| How do you insert rules that aren't learned into what weights are
| learned?
| nrub wrote:
| There are promising methods being developed for physics-informed
| neural networks. Mathematical models can be integrated into the
| architecture of neural networks such that the parameters of the
| designed mathematical models can be learned. Examples include
| learning the frequency of a swinging pendulum from video,
| amongst more advanced ideas.
|
| https://en.wikipedia.org/wiki/Physics-informed_neural_networ...
| https://www.youtube.com/watch?v=JoFW2uSd3Uo
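|
| A stripped-down toy version of the pendulum example (my own
| sketch, not from the links above, and not an actual neural
| network; it just shows the core idea of making the governing
| equation part of the fit):
|
|     import numpy as np
|
|     # small-angle pendulum obeys theta'' + omega^2 * theta = 0;
|     # recover omega by least squares on the ODE residual
|     true_omega, dt = 2.0, 0.01
|     t = np.arange(0.0, 10.0, dt)
|     theta = 0.2 * np.cos(true_omega * t)          # "observed" angle
|
|     theta_dd = (theta[2:] - 2 * theta[1:-1] + theta[:-2]) / dt**2
|     omega_sq = -np.sum(theta_dd * theta[1:-1]) / np.sum(theta[1:-1]**2)
|     print(np.sqrt(omega_sq))                      # ~= 2.0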
| brodolage wrote:
| How does he create those animations? I'd like to make them as
| well for myself.
| smokel wrote:
| They seem to be built with some love by the author. Apparently
| they have written it in Haxe, judging from the comment in the
| page source.
| brodolage wrote:
| Oh that's way out of my league unfortunately. I wonder if
| there's a library or something that does something like this.
| riemannzeta wrote:
| Not quite what you're looking for, but worth pointing out
| that Grant Sanderson of 3Blue1Brown has published the
| "framework" he uses for his math videos on GitHub.
|
| https://github.com/3b1b/manim
| cgadski wrote:
| Haha yeah, took some love. I have a scrappy little
| "framework" that I've been adjusting since I started making
| interactive posts last year. Writing my interactive widgets
| feels a bit like doing a game jam now: just copy a template
| and start compiling+reloading the page, seeing what I can get
| onto the screen. I've just been using the canvas2d API.
|
| Besides figuring out a good way of dealing with reference
| frames, the only trick I'd pass on is to use CSS variables to
| change colors and sizes (line widths, arrow dimensions, etc.)
| interactively. It definitely helps to tighten the feedback
| loop on those decisions.
| aeonik wrote:
| I'd like to know too.
|
| I've been using Emmy from the Clojurescript ecosystem, which
| works pretty good, but has a few quirks.
|
| https://emmy-viewers.mentat.org/
| r34 wrote:
| As a complete amateur I was wondering if it could be possible to
| use that property of light ("to always choose the optimal
| route") to solve the traveling salesman problem (and the whole
| class of those problems as a consequence). Maybe not with an
| algorithmic approach, but rather some smart implementation of the
| machine itself.
| nkozyra wrote:
| This sounds a bit like LIDAR implementations, I assume you mean
| something similar at a smaller scale, where physical obstacles
| provide a "path" representation of a problem space?
| r34 wrote:
| Yup, something like that came to my mind first: create a
| physical representation (like a map) of the graph you want to
| solve and use physics to determine the shortest path. Once
| you have it you could easily compute the winning path's
| length etc.
| shiandow wrote:
| If somehow you can ensure that light can only reach a point by
| travelling through all other points then yes.
|
| It's basically the same way you could use light to solve a
| maze, just flood the exit with light and walk in the direction
| which is brightest. Works better for mirror mazes.
| pvg wrote:
| Google up 'soap film steiner tree' for a fun, well-known
| variant of this.
| jerf wrote:
| Then follow it up with
| https://www.scottaaronson.com/papers/npcomplete.pdf . While
| reality can "solve" these problems to some extent it turns
| out that people overestimate reality's ability to solve it
| _optimally_.
| richk449 wrote:
| Sounds like some of the trendy analog computing approaches,
| like this one for example:
|
| https://www.microsoft.com/en-us/research/uploads/prod/2023/0...
| samatman wrote:
| This is pretty likely, it's been done with DNA:
| https://pubmed.ncbi.nlm.nih.gov/15555757/
|
| Physics contains a lot of 'machinery' for solving for low
| energy states.
| Scene_Cast2 wrote:
| People have mentioned the discrete - continuous tradeoff. One way
| to bridge that gap would be to use
| https://arxiv.org/abs/1806.07366 - they draw an equivalence
| between vanilla (FC layer) neural nets of constant width and
| differential equations, and then use a differential equation
| solver to "train" a "neural net" (from what I remember - it's
| been years since that paper...).
|
| Another approach might be to take an information theoretic view
| with the infinite-width finite-entropy nets.
| scarmig wrote:
| Another angle to look at would be the S4 models, which admit
| both a continuous time and recurrent discrete representation.
| pmayrgundter wrote:
| I wonder if an energy and work metric could be derived for
| gradient descent. This might be useful for a more rigorous
| approach to hyperparameter development, and maybe for
| characterizing the data being learned. We say that some datasets
| are harder to learn, or measure difficulty by the overall compute
| needed to hit a quality benchmark. Something more essential would
| be a step forward.
|
| Like in ANN backprop, the gradient descent algorithm can use a
| momentum term to overcome getting stuck in local minima. This was
| heuristically physical when I learned it; perhaps it's been
| developed since. Maybe only allowing a "real" energy to the
| momentum term would then align it with an ability-to-do-work
| calculation. Might also help with ensemble/monte carlo methods,
| to maintain an energy account across the ensemble.
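|
| For reference, the usual "heavy ball" update (a toy sketch of
| the physical heuristic, not a proposal for the energy/work
| metric itself):
|
|     import numpy as np
|
|     def grad(w):                      # gradient of a bumpy quadratic bowl
|         return w + np.cos(10 * w)     # d/dw [0.5*w^2 + 0.1*sin(10*w)]
|
|     w = np.array([2.0, -1.5])
|     v = np.zeros_like(w)              # "velocity" carried between steps
|     lr, beta = 0.05, 0.9              # step size, momentum coefficient
|     for _ in range(200):
|         v = beta * v - lr * grad(w)   # accumulate momentum
|         w = w + v                     # momentum helps cross small bumps
|     print(w)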
| irchans wrote:
| I liked the article and I hope that I can understand it more with
| some study.
|
| I think the following sentence in the article is wrong "Applying
| Noether's theorem gives us three conserved quantities--one for
| each degree of freedom in our group of transformations--which
| turn out to be horizontal, vertical, and angular momentum."
|
| I think the correct statement is "Applying Noether's theorem
| gives us three conserved quantities--one for each degree of
| freedom in our group of transformations--which turn out to be
| translation, rotation, and time shifting."
|
| I think translation leads to conservation of momentum, rotation
| leads to conservation of angular momentum, and time shifting
| leads to conservation of energy (potential+kinetic). It's been a
| few decades since I saw the proof, so I might be wrong.
| chunky1994 wrote:
| Right, the rephrasing of the sentence is a tad more accurate.
| Your three entities are [invariant -> conserved quantity]:
| (translation -> momentum), (rotation -> angular momentum) and
| (time -> energy).
| nostrademons wrote:
| I think your last paragraph is correct, but the statement in
| the article is referring to the specific 2D 2-body example
| given, and its original phrasing is also correct. Translation,
| rotation, and time-shifting are _transformations_ (matrices),
| not _quantities_. Horizontal, vertical, and angular (2D)
| momentum are scalars. The article is saying that if you take
| the action potential given in the example, there exist scalar
| quantities (which we call horizontal momentum, vertical
| momentum, and angular momentum) that remain constant regardless
| of any horizontal, vertical, or rotational transformation of
| the coordinate system used to measure the 2-body problem.
| kurthr wrote:
| The application of Noether's theorem in this case refers only
| to the energy integral shown (KE = ME - GPE for 2D Kinetic
| Mechanical and Gravitational Potential Energies) over time.
| It's really only for that particular 2 body 2 dimensional
| problem.
|
| More generically, in 3 dimensions a transformation with 3
| translational, 2 rotational, and 1 time independence would
| provide conservation of 3 momenta, 2 angular momenta, and 1
| energy.
| cgadski wrote:
| Hi, thanks!
|
| In that sentence I was only talking about the translations and
| rotations of the plane as a group of invariances for the action
| of the two-body problem. This group is generated by one-
| parameter subgroups producing vertical translation, horizontal
| translation, and rotation about a particular point. Those are
| the "three degrees of freedom" I was counting.
|
| You're right about the correspondence from symmetries to
| conservation laws in general.
| esafak wrote:
| I'll be walking tall the day I can leisurely read articles like
| this! I wish I had studied this stuff; now time is short.
| iskander wrote:
| I love the simple but elegant formatting of this blog.
|
| cgadski: what did you use to make it?
| iskander wrote:
| Only clue in the source:
|
| <!-- this blog is proudly generated by, like, GNU make -->
| cgadski wrote:
| Thank you!
|
| In the beginning, I used kognise's water.css [1], so most of
| the smart decisions (background/text color, margins, line
| spacing I think) probably come from there. Since then it's been
| some amount of little adjustments. The font is by Jean Francois
| Porchez, called Le Monde Livre Classic [2].
|
| I draft in Obsidian [3] and build the site with a couple python
| scripts and KaTeX.
|
| [1] https://watercss.kognise.dev/
|
| [2] https://typofonderie.com/fr/fonts/le-monde-livre-classic
|
| [3] https://obsidian.md/
| scarmig wrote:
| A related paper I just found and am digesting:
| https://arxiv.org/abs/2012.04728
|
| Softmax gives rise to translation symmetry, batch normalization
| to scale symmetry, homogeneous activations to rescale symmetry.
| Each of those induces its own learning invariants through
| training.
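|
| The softmax case is easy to see numerically (a trivial check of
| the symmetry, not the paper's derivation): adding the same
| constant to every logit leaves the output unchanged.
|
|     import numpy as np
|
|     def softmax(z):
|         e = np.exp(z - z.max())
|         return e / e.sum()
|
|     z = np.array([1.0, -0.5, 2.0])
|     print(np.allclose(softmax(z), softmax(z + 3.7)))   # True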
| cgadski wrote:
| That's also a neat result! I'd just like to highlight that the
| conservation laws proved in that paper are functions of the
| parameters that hold over the course of gradient descent,
| whereas my post is talking about functions of the activations
| that are conserved from one layer to the next within an
| optimized network.
|
| By the way, maybe I'm being too much of a math snob, but I'd
| argue Kunin's result is only superficially similar to Noether's
| theorem. (In the paper they call it a "striking similarity"!)
| Geometrically, what they're saying is that, if a loss function
| is invariant under a non-zero vector field, then the trajectory
| of gradient descent will be tangent to the codimension-1
| distribution of vectors perpendicular to the vector field. If
| that distribution is integrable (in the sense of the Frobenius
| theorem), then any of its integrals is conserved under gradient
| descent. That's a very different geometric picture from
| Noether's theorem. For example, Noether's theorem gives a
| direct mapping from invariances to conserved quantities,
| whereas they need a special integrability condition to hold.
| But yes, it is a nice result, certainly worth keeping in mind
| when thinking about your gradient flows. :)
|
| By the way, you might be interested in [1], which also studies
| gradient descent from the point of view of mechanics and seems
| to really use Noether-like results.
|
| [1] Tanaka, Hidenori, and Daniel Kunin. "Noether's Learning
| Dynamics: Role of Symmetry Breaking in Neural Networks." In
| Advances in Neural Information Processing Systems, 34:25646-60.
| Curran Associates, Inc., 2021. https://papers.nips.cc/paper/202
| 1/hash/d76d8deea9c19cc9aaf22....
| jonathanyc wrote:
| Not GP, but thanks for your detailed comment and the paper
| reference.
| samatman wrote:
| I wouldn't call drawing a distinction between an isomorphism
| and an analogy to be maths snobbery. I would call it
| mathematics. :)
| emmynoether wrote:
| It has been shown that a finite difference implementation of wave
| propagation can be expressed as a deep neural network (e.g.,
| [1]). These networks can have thousands of layers and yet I don't
| think they suffer from the exploding/vanishing gradient problem,
| which I imagine is because in the physical system they model
| there are conservation laws such as conservation of energy.
|
| [1] https://arxiv.org/abs/1801.07232
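|
| A toy version of that picture (my own sketch, assuming a 1-D
| wave equation on a periodic grid rather than the setup in [1]):
| every leapfrog time step is a fixed linear "layer", and you can
| stack thousands of them without the state blowing up or dying
| out.
|
|     import numpy as np
|
|     n, c, dx = 200, 1.0, 1.0
|     dt = 0.5 * dx / c                        # CFL-stable time step
|     x = np.arange(n) * dx
|     u_prev = np.exp(-0.01 * (x - 50.0)**2)   # initial pulse
|     u_curr = u_prev.copy()                   # zero initial velocity
|
|     def layer(u_prev, u_curr):               # one time step = one "layer"
|         lap = np.roll(u_curr, 1) - 2*u_curr + np.roll(u_curr, -1)
|         return u_curr, 2*u_curr - u_prev + (c*dt/dx)**2 * lap
|
|     for _ in range(5000):                    # "5000 layers" deep
|         u_prev, u_curr = layer(u_prev, u_curr)
|
|     print(np.abs(u_curr).max())              # stays O(1), no blow-up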
| waveBidder wrote:
| so I think this is a great connection that deserves more thought.
| as well as an absolutely gorgeous write-up.
|
| The main problem I see with it is that most of the time you
| _don't_ want the optimum for your objective function, as that
| frequently results in overfitting. this leads to things like
| early stopping being typical.
| cgadski wrote:
| Thanks so much!
|
| And yes, that's quite true. When parameter gradients don't
| quite vanish, then the equation
|
| <g_x, d x / d eps> = <g_y, d y / d eps>
|
| becomes
|
| <g_x, d x / d eps> = <g_y, d y / d eps> - <g_theta, d theta / d
| eps>
|
| where g_theta is the gradient with respect to theta.
|
| In defense of my hypothesis that interesting approximate
| conservation laws exist in practice, I'd argue that maybe
| parameter gradients at early stopping are small enough that the
| last term is pretty small compared to the first two.
|
| On the other hand, stepping back, the condition that our
| network parameters are approximately stationary for a loss
| function feels pretty... shallow. My impression of deep
| learning is that an optimized model _cannot_ be understood as
| just "some solution to an optimization problem," but is more
| like a sample from a Boltzmann distribution which happens to
| concentrate a lot of its probability mass around _certain_
| minimizers of an energy. So, if we can prove something that is
| true for neural networks simply because they're "near
| stationary points", we probably aren't saying anything very
| fundamental about deep learning.
| riemannzeta wrote:
| Your work here is so beautiful, but perhaps one lesson is
| that growth and learning result where symmetries are broken.
| :-D
| raptortech wrote:
| See also "Noether Networks: Meta-Learning Useful Conserved
| Quantities" https://arxiv.org/abs/2112.03321 from 2021.
|
| Abstract: Progress in machine learning (ML) stems from a
| combination of data availability, computational resources, and an
| appropriate encoding of inductive biases. Useful biases often
| exploit symmetries in the prediction problem, such as
| convolutional networks relying on translation equivariance.
| Automatically discovering these useful symmetries holds the
| potential to greatly improve the performance of ML systems, but
| still remains a challenge. In this work, we focus on sequential
| prediction problems and take inspiration from Noether's theorem
| to reduce the problem of finding inductive biases to meta-
| learning useful conserved quantities. We propose Noether
| Networks: a new type of architecture where a meta-learned
| conservation loss is optimized inside the prediction function. We
| show, theoretically and experimentally, that Noether Networks
| improve prediction quality, providing a general framework for
| discovering inductive biases in sequential problems.
| klysm wrote:
| Completely irrelevant but I love the way the color theme on this
| blog feels like a chalk board
| calhoun137 wrote:
| Very nice article! I recently had a long chat with chatgpt on
| this topic, although from a slightly different perspective.
|
| A neural network is a type of machine that solves nonlinear
| optimization problems, and the principle of least action is also
| a nonlinear optimization problem that nature solves by some kind
| of natural law.
|
| This is the one thing that chatgpt mentioned which surprised me
| the most and which I had not previously considered.
|
| > Eigenvalues of the Hamiltonian in quantum mechanics correspond
| to energy states. In neural networks, the eigenvalues (principal
| components) of certain matrices, like the weight matrices in
| certain layers, can provide information about the dominant
| features or patterns. The notion of states or dominant features
| might be loosely analogous between the two domains.
|
| I am skeptical that any conserved quantity besides energy would
| have a corresponding conserved quantity in ML, and the Reynolds
| operator will likely be relevant for understanding any
| correspondence like this.
|
| iirc the Reynolds operator plays an important role in Noether's
| theorem, and it involves an averaging operation similar to what
| is described in the linked article.
___________________________________________________________________
(page generated 2024-03-02 23:01 UTC)