[HN Gopher] Modular Manifolds
___________________________________________________________________
Modular Manifolds
Author : babelfish
Score : 115 points
Date : 2025-09-26 17:06 UTC (5 hours ago)
(HTM) web link (thinkingmachines.ai)
(TXT) w3m dump (thinkingmachines.ai)
| jasonjmcghee wrote:
| The learning rates they demonstrate are crazy - though the
| standard when talking about CIFAR-10 is 94% accuracy iirc.
| Showing ~60% accuracy is weird.
|
| Has DAWNBench been done with manifold Muon (with a more
| appropriate architecture)?
| snake_doc wrote:
| Um.. the model is tiny: https://github.com/thinking-machines-
| lab/manifolds/blob/main...
| jasonjmcghee wrote:
| Yeah, it's just the wrong architecture for the job, so I
| found it to be a strange example.
|
| Here's the top model on DAWNBench -
| https://github.com/apple/ml-
| cifar-10-faster/blob/main/fast_c...
|
| Trains for 15 epochs and it, like all the others, is a 9-layer
| ResNet.
| srean wrote:
| Usually there's more to an ML / data-science idea (that's not
| yet a fully fledged journal paper) than beating a SOTA
| benchmark.
|
| In fact beating SOTA is often the least interesting part of
| an interesting paper and the SOTA-blind reviewers often use
| it as a gatekeeping device.
| jasonjmcghee wrote:
| Sure, of course. I wasn't asking "are you beating a SOTA
| benchmark?" I'm floating the idea of an ablation that
| matches a realistic scenario for the dataset / task.
| Personally curious how manifold Muon performs compared to
| AdamW in a thoroughly explored context. This is the first
| time I've seen a 3-layer MLP on CIFAR-10.
|
| I probably should have made the 9-layer ResNet part more
| front-and-center / central to my point.
| Jackson__ wrote:
| They say they train for ~3 epochs. Could it be that's just not
| long enough of a training run? I have no idea how many epochs
| are usually used in those models.
| pooooooooooooop wrote:
| it's a 3-layer MLP, as stated in the article
| snake_doc wrote:
| Hmmm... http://www.incompleteideas.net/IncIdeas/BitterLesson.html
| whimsicalism wrote:
| this is a bad example to claim the bitter lesson applies to;
| it's about the fundamentals of optimization techniques, not
| about tying the solution space to hand-crafted things.
| snake_doc wrote:
| Aren't they all optimization techniques at the end of the
| day? Now you're just debating semantics
| whimsicalism wrote:
| believe what you want, i guess
| ACCount37 wrote:
| Doesn't apply as long as the improvements obtained there scale
| with compute.
|
| Now, are there actual meaningful improvements to obtain, and do
| they stick around all the way to frontier runs? Unclear,
| really. So far, it looks like opening a can of hyperparameters.
| TimorousBestie wrote:
| Reminiscing, with a smile on my face, about an old HN comment
| arguing that differential geometry was irrelevant to machine
| learning.
|
| Happy to see this opinion expressed here, too. The more math
| skeptics there are out there, the longer I get to keep my job. :)
| deviation wrote:
| The world is full of useful shapes! No reason the math
| shouldn't be too :)
| srean wrote:
| "I have never had to integrate the arctan function by hand in
| my entire career" arguments are not worth engaging with.
|
| If people are happy with a job or a role that does not need
| math, that's fine.
|
| Familiarity with Maths lets you rise to the occasion, to
| become more than a replaceable cog.
|
| The thing is, unless you are trained in math you wouldn't even
| recognise the opportunity that a certain kind of Math could
| have been used here. In fact, even if you are trained in Math
| you may not see it till much later -- it takes a special eye
| and being in the right frame of mind at that moment.
|
| Polyhedra were studied century after century by top-notch
| mathematicians. All of them missed Euler's formula, except
| perhaps Descartes.
|
| Often what happens is some nontrivial branch of mathematics
| suddenly finds a novel and impactful application. Then crowds
| jump in to learn that Math. But it's mostly already a little
| too late for them, they have missed this bus.
|
| The best case is that you already know the Math beforehand,
| even though you don't know which part will come in handy. It
| helps if you love the subject and can afford to invest time
| to learn it for its own sake. Once in a while you then find
| yourself in the right place at the right time, with the right
| tools.
| gowld wrote:
| > Often what happens is some nontrivial branch of mathematics
| suddenly finds a novel and impactful application. Then crowds
| jump in to learn that Math. But it's mostly already a little
| too late for them, they have missed this bus.
|
| However, in the meantime, the experts in that math have
| "missed the bus" on the application area itself, which they
| don't know enough about because they were studying math
| instead.
| esafak wrote:
| > This post covers one appealing way to constrain the weight
| matrices of a neural network--by keeping the tensors constrained
| to submanifolds at each layer. This opens the door to re-thinking
| optimization, as we can co-design optimization algorithms with
| these manifold constraints. As an example, we propose a manifold
| version of the Muon optimizer whose weights are constrained to
| the Stiefel manifold: the manifold of matrices with unit
| condition number. We conclude the post by defining the idea of a
| modular manifold, which is a composable manifold that attempts to
| make it easier to scale up and train large networks.
|
| Very good presentation. Projected gradient methods were popular
| during the convex optimization craze two decades ago. The ideas
| advanced here have precedent and seem sensible to me. My concern
| is whether it helps much. The test accuracy in figure 6b shows a
| marginal increase, and a gentler transition to the overfitting
| regime, suggesting the regularization is working. The higher LR
| did not translate to a speed up: "Manifold Muon increased the
| wall clock time per step compared to AdamW..."
|
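| For reference, the vanilla projected-gradient recipe being
| alluded to is only a few lines (a minimal sketch using an
| SVD-based polar projection onto the Stiefel manifold; this is
| not the manifold Muon algorithm from the post, and the names
| are mine):
|
|     import torch
|
|     def project_to_stiefel(W):
|         # Replace W by its closest semi-orthogonal factor
|         # (all singular values set to 1), i.e. the polar
|         # projection onto the Stiefel manifold.
|         U, _, Vh = torch.linalg.svd(W, full_matrices=False)
|         return U @ Vh
|
|     def projected_sgd_step(W, grad, lr=0.1):
|         # Classic projected gradient descent: take a Euclidean
|         # step, then snap back onto the constraint set.
|         return project_to_stiefel(W - lr * grad)
|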
| More fundamentally, I am a bit skeptical that low test error
| is the right goal for LLMs, because statistical learning
| theory does not adequately model the macro-behavior of very
| large models.
| namibj wrote:
| > The test accuracy in figure 6b shows a marginal increase, and
| a gentler transition to the overfitting regime, suggesting the
| regularization is working.
|
| Sounds like it might help for online RL training regimes, as
| those are naturally quite vulnerable to overfitting.
| jpt4 wrote:
| > statistical learning theory does not adequately model the
| macro-behavior of very large models.
|
| Might you please elaborate on this? I recognize that
| "artificial neural networks are lossy de/compression
| algorithms" does not enumerate the nuances of these structures,
| but am curious whether anything in particular is both
| interesting and missing from SLT.
| esafak wrote:
| SLT typically uses empirical risk minimization, leading to
| the bias-variance decomposition and a unimodal extremum as
| the monotonically decreasing bias supposedly balances against
| the monotonically increasing variance. We now know this does
| not accurately model overparameterized models, which exhibit
| double descent, and other phenomena like grokking. To explain
| them you have to look past classical statistics to
| statistical mechanics.
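|
| For reference, the classical decomposition behind that
| single-optimum picture, in standard notation (the textbook
| identity, not something taken from the post):
|
|     \mathbb{E}[(y - \hat f(x))^2]
|       = (\mathbb{E}[\hat f(x)] - f(x))^2                  % bias^2
|       + \mathbb{E}[(\hat f(x) - \mathbb{E}[\hat f(x)])^2]  % variance
|       + \sigma^2                                           % noise
|
| Classical SLT predicts a U-shaped test error in model
| complexity from this trade-off, which double descent breaks.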
| p1esk wrote:
| _The test accuracy in figure 6b shows a marginal increase, and
| a gentler transition to the overfitting regime, suggesting the
| regularization is working._
|
| Higher LR does not mean there's overfitting.
| uoaei wrote:
| This is exactly the kind of out-of-the-box thinking that will get
| us past some of the limitations of the current crop of AI
| architectures. Bravo to the authors.
| SubiculumCode wrote:
| Curious why the authors chose the blog format over a research
| report?
| almostgotcaught wrote:
| you mean a paper? because it's not paper quality content?
| pooooooooooooop wrote:
| thinkingmachines likes to flex
| fmap wrote:
| Isn't this an old idea? E.g., here's a textbook on optimization
| algorithms for matrix manifolds https://press.princeton.edu/absil
| and here's a library that implements this in Python for the
| Stiefel manifold that's the subject of this blog post:
| https://pymanopt.org/docs/stable/manifolds.html#module-pyman...
|
| What is novel about the approach in the blog post? Serious
| question, I really can't tell after reading the post.
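|
| For context, the textbook recipe (e.g. in the Absil et al.
| book above) for one Riemannian gradient step on the Stiefel
| manifold is only a few lines. This is a sketch from those
| standard formulas, not pymanopt's actual API, and the
| function name is mine:
|
|     import numpy as np
|
|     def stiefel_gradient_step(W, euclid_grad, lr=0.1):
|         # Project the Euclidean gradient onto the tangent
|         # space at W:  grad_R = G - W * sym(W^T G)
|         sym = 0.5 * (W.T @ euclid_grad + euclid_grad.T @ W)
|         riem_grad = euclid_grad - W @ sym
|         # Retract the update back onto the manifold with a
|         # (reduced) QR decomposition.
|         Q, R = np.linalg.qr(W - lr * riem_grad)
|         # Fix QR's column-sign ambiguity so the retraction
|         # is well defined.
|         Q = Q * np.sign(np.sign(np.diag(R)) + 0.5)
|         return Q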
| cs702 wrote:
| I don't think it's been tried at scale, with large models.
|
| It remains to be seen if it works better than conventional
| training schemes.
| godelski wrote:
| > Isn't this an old idea?
|
| So are neural networks. So is attention.
|
| What's your point? Sometimes things need to be retried.
| Sometimes there are small subtle details that make or break an
| idea. So what's the point of acting dismissively? If an old
| idea that didn't work now works, then what's the problem?
| Besides, progress is typically iterative, not through leaps and
| bounds. The vast majority of things that look like leaps are
| just because we don't see the steps between.
|
| The reason I'm saying this is because that sentiment is often
| used to pass over working solutions and slows down their
| progress. So even if unintentional it should cause us to
| rethink how we respond. Otherwise we end up with such silly
| claims like "Einstein just used Tensors" and "Nash just used
| topology". In some sense these are accurate, but they are too
| high-level as descriptions (and these are real dismissals --
| which, again, so what? If it works, it works).
|
| Why is "novelty" even a question? Novelty is only ever in the
| eyes of the beholder.
|
| > What is novel about the approach in the blog post? Serious
| question, I really can't tell after reading the post.
|
| Honestly, I do not know, but I'll give you my best read on it.
|
| 1) Scale: Don't underestimate the importance of this. While I
| don't think scale is all you need, it certainly is a critical
| factor.
|
| 2) Different optimization: I may be missing something, but it
| looks like they are using a different optimizer. They mention
| that they're using the Muon optimizer constrained to a Stiefel
| manifold. Neither of those things is unique on its own, but
| is their combination? This is where I'm uncertain because such
| a thing would be easy to miss. Maybe someone did and was
| unsuccessful with it. Maybe someone did, but was not at scale.
| Maybe someone did, it worked, and just nobody noticed (that
| happens a lot!).
|
| So I think this is quite similar to how 99% of progress and
| breakthroughs are made: putting together ideas that seem
| unrelated and inventing some glue to generalize the process. At
| a high level this always looks like you're just putting
| existing things together, but that glue is really hard to make.
| And to continue that analogy, if we do a good enough job gluing
| things together then to anyone but an expert it'll look like
| there is no glue. It can be surprisingly difficult to tell if
| something is glued, welded, mated, milled, printed, or
| whatever. It usually takes a very keen eye to determine the
| answer non-destructively.
| fmap wrote:
| Apologies if this came across the wrong way. I really do want
| to know what the novel contributions of the post are, because
| the author implies that something about what they're doing is
| solving previously open problems:
|
| > I figured out how to solve manifold Muon in the square case
| late last year, but I was unable to solve the full
| rectangular case and thus posed the problem as an open
| problem on the Modula docs. Jianlin Su solved the problem
| this summer
|
| It sounds like the generalisation of projected gradient
| descent to "Muon" is what they're focusing on, but the
| derivation is all about the retraction map on the Stiefel
| manifold? I think I'm missing some background here.
| godelski wrote:
| > Apologies if this came across the wrong way
|
| I was uncertain but your other statements made me think
| that sentiment was unintentional. I just want to push back
| against it because it is too common and misused even with
| good intentions. I hope you don't see this as me saying
| anything about your character. Honestly, my impression is
| that you do care.
|
| > It sounds like the generalisation of projected gradient
| descent to "Muon"
|
| I'm not a niche expert here, but do have knowledge in
| adjacent/overlapping domains. It sounds like you're in a
| similar boat? I ask because this pulls back to what I was
| trying to say about sometimes needing an expert eye.
|
| If it helps, here's the "paper" for the Muon optimizer[0]
| and here's a follow-up[1]. Muon is definitely a gradient
| descent technique, but so are Adam, SGD, Ada, and many
| more[2].
|
| The big thing in Muon is the Newton-Schulz iteration (NS_5).
| So you update parameters with θ_{t-1} − η[NS_5(μB_{t-1} +
| ∇L(θ_{t-1}))] (I bracketed it so you can see that this is
| just a specific version of θ_{t-1} − ηF(∇L(θ_{t-1}), ...),
| and the standard gradient descent update -- θ − η∇L(θ) -- is
| in that class of functions, right?). So we should be careful
| not to overgeneralize and say that this is just gradient
| descent. You could even say [1] is "just [0] but with
| weight decay" (or go look at the Adam and AdamW algos ;)
|
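| In code, that update is roughly the following (condensed from
| the reference implementation in [0]; a sketch only, not the
| manifold-constrained version in this post, and I've dropped
| the bfloat16 handling from [0]):
|
|     import torch
|
|     def newton_schulz5(G, steps=5):
|         # Quintic Newton-Schulz iteration that pushes all
|         # singular values of G toward 1 (coefficients as in [0]).
|         a, b, c = 3.4445, -4.7750, 2.0315
|         X = G / (G.norm() + 1e-7)
|         transposed = G.size(0) > G.size(1)
|         if transposed:
|             X = X.T
|         for _ in range(steps):
|             A = X @ X.T
|             X = a * X + (b * A + c * A @ A) @ X
|         return X.T if transposed else X
|
|     def muon_step(theta, grad, buf, lr=0.02, momentum=0.95):
|         # buf is the momentum buffer B; the step direction is
|         # the orthogonalized momentum, matching
|         # θ ← θ − η·NS_5(μB + ∇L(θ)). lr/momentum are just
|         # illustrative defaults.
|         buf = momentum * buf + grad
|         return theta - lr * newton_schulz5(buf), buf
|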
| But one thing I should add is that gradient descent
| algorithms aren't topologically aware. I was able to find
| this post [3], which asks a related question: what are the
| conditions for a surface's geodesic to align with gradient
| descent (note that Newton's method differs from GD too)? I
| don't think this paper is creating a solution where the GD
| formulation results in following a geodesic to the minimum,
| but my take is that it is working in that direction. And to
| clarify, we'd want to follow the geodesic because it gives
| us the shortest or most energy-efficient path (whichever
| perspective you want to use). In optimization we want to
| accomplish these two things (and more!): 1) take the "best"
| path to the optima, 2) find the best optima. Unfortunately
| these are ill-defined and there aren't always objective
| answers to them. But in an ideal gradient descent algorithm
| we'd want it to go to the global minimum and take the
| fastest path, right? So it helps to be aware of the geometry
| (part of why people look at the Hessian, though that comes
| at the cost of increased computation, even if the additional
| information can get us there in fewer steps -- so it's not
| (always) "the best").
|
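| For concreteness, the two updates being contrasted above, in
| standard textbook form (not from the post):
|
|     \theta \leftarrow \theta - \eta \nabla L(\theta)
|         (gradient descent)
|
|     \theta \leftarrow \theta - \eta [\nabla^2 L(\theta)]^{-1}
|         \nabla L(\theta)   (Newton's method)
|
| The Hessian ∇²L(θ) is a d×d object for d parameters, which is
| where the extra memory and compute come from.
|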
| I know this isn't a full answer and maybe with more reading
| I'll have a better one for you. But I'm hoping my answer
| can at least help you see some of the underlying nuanced
| problems that (_I think_) the authors are trying to get at.
| Hopefully I'm not too far off base lol. I'm hoping someone
| with more expertise can jump in and provide
| corrections/clarifications in the meantime.
|
| [0] https://kellerjordan.github.io/posts/muon/
|
| [1] https://arxiv.org/abs/2502.16982
|
| [2] (far from a complete list)
| https://docs.pytorch.org/docs/stable/optim.html#algorithms
|
| [3] (I think similar types of questions may also be
| fruitful)
| https://mathoverflow.net/questions/42617/functions-whose-
| gra...
| aanet wrote:
| Not here to comment on the _content_ of the blog post...
|
| Just wanted to say the blog post design looks super nice.
| Beautifully laid out, very readable typography, clear graphics,
| approachable design with a welcoming UX, footnotes in the margin,
| etc.
|
| Anybody know how this is designed / styled? (I can see three.js
| being used, along with katex.js - but don't know more details)
|
| Thanks
| ddellacosta wrote:
| UX on the other hand...I hate it when sites hijack my key
| commands for moving backwards and forwards in my browser
| history. Please don't do this!
| manas96 wrote:
| I think the diagrams look very similar to what Keenan Crane
| uses in his papers; perhaps they used the same tool. I think his
| students have now fleshed it out for general use.
| spyder wrote:
| For me it's horrible: some script makes the scrolling very
| choppy, unusable... I had to disable scripts just to be able
| to scroll normally :-(
| cs702 wrote:
| TL;DR: The OP notes that we currently use all sorts of tricks of
| the trade, including applying normalization layers, to keep unit
| values in DNNs from getting too large or too small when we train
| them. Keeping unit values from getting too large or small
| prevents numerical underflow/overflow, and also helps speed up
| learning by keeping the magnitudes of updates small in relation
| to weights. The OP proposes that we should constrain weights to
| be in sub-manifolds with unit condition number[a] at each layer,
| and that we should modify/design SGD algorithms to work well
| within those manifolds.
|
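| For reference, the constraint in [a] in plain terms (standard
| definitions, not wording from the post): the condition number
| of a weight matrix W is
|
|     \kappa(W) = \sigma_{\max}(W) / \sigma_{\min}(W),
|
| and on the Stiefel manifold W^T W = I, so every singular value
| equals 1 and \kappa(W) = 1, which is the "unit condition
| number" constraint mentioned above.
|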
| I find the idea compelling, but it's too early to know if it will
| work well at scale, you know, with large models, in the real
| world.
|
| --
|
| [a] https://en.wikipedia.org/wiki/Condition_number
|
| --
|
| EDIT: On the other hand, yesterday I saw a paper about doing
| basically _the opposite_, letting unit values in DNNs get as big
| or small as they need to get... by mapping them to complex
| logarithms and _keeping them_ in that domain:
| https://openreview.net/forum?id=SUuzb0SOGu . I also found this
| opposing idea oddly compelling, but I don't know how well it
| works either, because it hasn't been tested at scale.
| robots0only wrote:
| so their way to differentiate themselves from the frontier
| labs is to write research blog posts (not papers). It will be
| interesting to see how this plays out. I don't think that
| anyone serious about developing frontier models would be
| putting anything useful out there for others. We already see
| this with all the incumbents -- Google, OAI, Anthropic, xAI,
| DeepSeek and the other Chinese labs.
| aghilmort wrote:
| Interesting. Modular manifolds are precisely what hypertokens use
| for prompt compiling.
|
| Specifically, we linearize the emergent KVQ operations of an
| arbitrary prompt in any arbitrary model by way of interleaving
| error-correcting code (ECC).
|
| ECC tokens are out-of-band tokens, e.g., Unicode's Private Use
| Area (PUA), interleaved with raw context tokens. This
| construction induces an in-context associative memory.
|
| Any sort of interleaved labeling basis, e.g., A1, quick brown
| fox, A2, jumped lazy dog, induces a similar effect for
| chaining recall & reasoning more reliably.
|
| This trick works because PUA tokens are generally untrained,
| hence their initial embedding is still random Gaussian w.h.p.
| Similar effects can be achieved by simply using token combos
| unlikely to exist, which are often more effective in practice,
| since PUA tokens like emojis or Mandarin characters are often
| 2, 3, or 4 tokens after tokenization vs. codeword combos like
| zy-qu-qwerty every k content tokens, where k can be variable.
|
| Building attention architectures using modular manifolds in
| white / gray-box models, as this new work does, vs. prompt-
| based black-box injection is a natural next step, so we can
| at least anecdotally validate what they're building ahead of
| the next paper or two.
|
| Which is all to say, absolutely great to see others building in
| this way!
| glowcoil wrote:
| The original article discusses techniques for constraining the
| weights of a neural network to a submanifold of weight space
| during training. Your comment discusses interleaving the tokens
| of an LLM prompt with Unicode PUA code points. These are two
| almost completely unrelated things, so it is very confusing to
| me that you are confidently asserting that they are the same
| thing. Can you please elaborate on why you think there is any
| connection at all between your comment and the original
| article?
| aghilmort wrote:
| Our ECC construction induces an emergent modular manifold
| during KVQ computation.
|
| Suppose we use 3 codeword lanes per codeword, which is our
| default. Each lane of tokens is based on some prime p, so
| collectively they form a CRT-driven codeword (Chinese
| Remainder Theorem). This is discretely equivalent to labeling
| every k tokens with a globally unique indexing grammar.
|
| That interleaving also corresponds to a triple of adjacent
| orthogonal embeddings, since those tokens still retain a
| random Gaussian embedding. The net effect is that we slice
| the latent space into a spaced chain of modular manifolds
| every k content tokens.
|
| We also refer to that interleaving as Stiefel frames, for
| similar reasons as the post gives. We began work this spring
| or so on injecting that construction inside the model, with
| early results in a similar direction to what the post
| describes. That's another way of saying this sort of approach
| lets us make that chained atlas (wc?) of modular manifolds as
| tight as possible within the dimensional limits of the
| embedding, floating-point precision, etc.
|
| We somewhat tongue-in-cheek refer to this as the
| retokenization group at the prompt level re: renormalization
| group / tensor nets / etc. Relayering group is the same net
| intuition or perhaps reconnection group at architecture
| level.
| glowcoil wrote:
| I'm sorry, but even if I am maximally charitable and assume
| that everything you are saying is meaningful and makes
| sense, it still has essentially nothing to do with the
| original article. The original article is about imposing
| _constraints_ on the _weights_ of a neural network, during
| training, so that they lie on a particular manifold inside
| the overall _weight space_. The "modular" part is about
| being able to specify these constraints separately for
| individual layers or modules of a network and then compose
| them together into a meaningful constraint for the global
| network.
|
| You are talking about _latent space_ during _inference_ ,
| not weight space during training, and you are talking about
| interleaving tokens with random Gaussian tokens, not
| constraining values to lie on a manifold within a larger
| space. Whether or not the thing you are describing is
| meaningful or useful, it is basically unrelated to the
| original article, and you are not using the term "modular
| manifold" to refer to the same thing.
| aghilmort wrote:
| hmm / hear you. my point wasn't that we are applying modular
| manifolds in the same way; it was that we are working on
| model reliability from two extremal ends using the same
| principle. there are various ways to induce modular manifolds
| in a model at various levels of resolution / power. we
| started at the outside, working in, so it works with any
| black-box model out of the box with zero knowledge needed --
| you don't even need to know the token dictionary to show the
| effect.
|
| We're already working on pushing the construction deeper into
| the model, both architecture and training. currently that's
| for fine-tuning, and ultimately full architecture shrinkage /
| pruning and raw training vs. just fine-tuning, etc.
|
| & it was just great to see someone else using modular
| manifolds even if they are using them at the training
| stage vs. inference stage. they're exploiting modular
| form at training, we're doing it at inference. cool to
| see.
| snake_doc wrote:
| Wot? Is this what AI-generated nonsense has come to? This is
| totally unrelated.
| aghilmort wrote:
| Nope. The construction induces ECC-driven emergent modular
| manifolds in latent space during the KVQ maths. You can't use
| any old ECC -- that's the crux of why it works. More in
| another reply.
| yodon wrote:
| Is the original Thinking Machines trademark[0] no longer active?
| They were the original AI company, back when AI was a completely
| different thing than it is today.
|
| [0]
| https://en.m.wikipedia.org/wiki/Thinking_Machines_Corporatio...
| gowld wrote:
| What does this mean?
| phoenicyan wrote:
| Well-done post, I'd like to read more of their work and it's
| exciting to see these new ideas. Though as other people have
| said, the one set of empirical results that they present is a
| bit... confusing? I'd think they'd have some more compelling
| examples to present given all the pretty math.
|
| Their modular norm paper (https://arxiv.org/abs/2405.14813) has
| several more examples; see their appendix D in particular, but
| these are also mystifying. Yes they're interested in how things
| scale but am I the only one to whom it seems that the training
| losses they report are just not competitive with things that are
| currently being used?
| nenenejej wrote:
| https://archive.is/bP3BG
|
| If you like to scroll on mobile :)
| nenenejej wrote:
| Nice! Posts like this make me regret not pursuing a
| mathematics career. I'm sure some of the notation is basic
| (as in undergrad) but I'd need an entire weekend to
| understand the post.
___________________________________________________________________
(page generated 2025-09-26 23:00 UTC)