[HN Gopher] Modular Manifolds
       ___________________________________________________________________
        
       Modular Manifolds
        
       Author : babelfish
       Score  : 115 points
       Date   : 2025-09-26 17:06 UTC (5 hours ago)
        
 (HTM) web link (thinkingmachines.ai)
 (TXT) w3m dump (thinkingmachines.ai)
        
       | jasonjmcghee wrote:
       | The learning rates they demonstrate are crazy - though the
       | standard when talking about CIFAR-10 is 94% accuracy iirc.
       | Showing ~60% accuracy is weird.
       | 
       | Has DAWNBench been done with manifold Muon (with a more
       | appropriate architecture)?
        
         | snake_doc wrote:
         | Um.. the model is tiny: https://github.com/thinking-machines-
         | lab/manifolds/blob/main...
        
           | jasonjmcghee wrote:
           | Yeah, it's just the wrong architecture for the job, so I
           | found it to be a strange example.
           | 
           | Here's the top model on DAWNBench -
           | https://github.com/apple/ml-
           | cifar-10-faster/blob/main/fast_c...
           | 
            | It trains for 15 epochs and, like all the others, is a
            | 9-layer ResNet.
        
             | srean wrote:
              | Usually there's more to an ML or data-science idea (one
              | that's not a fully fledged journal paper) than beating a
              | SOTA benchmark.
             | 
             | In fact beating SOTA is often the least interesting part of
             | an interesting paper and the SOTA-blind reviewers often use
             | it as a gatekeeping device.
        
               | jasonjmcghee wrote:
                | Sure, of course. I wasn't suggesting "are you beating a
                | SOTA benchmark?" I'm floating the idea of an ablation
                | that matches a realistic scenario for the dataset / task.
                | Personally curious how manifold Muon performs compared to
                | AdamW in a thoroughly explored context. This is the first
                | time I've seen a 3-layer MLP on CIFAR-10.
                | 
                | I probably should have made the 9-layer ResNet part more
                | front-and-center / central to my point.
        
         | Jackson__ wrote:
         | They say they train for ~3 epochs. Could it be that's just not
         | long enough of a training run? I have no idea how many epochs
         | are usually used in those models.
        
         | pooooooooooooop wrote:
          | it's a 3-layer MLP, as stated in the article
        
       | snake_doc wrote:
       | Hmmm... http://www.incompleteideas.net/IncIdeas/BitterLesson.html
        
         | whimsicalism wrote:
          | this is a bad example to claim the bitter lesson applies to;
          | it's about the fundamentals of optimization techniques, not
          | about tying the solution space to hand-crafted structure.
        
           | snake_doc wrote:
           | Aren't they all optimization techniques at the end of the
           | day? Now you're just debating semantics
        
             | whimsicalism wrote:
             | believe what you want, i guess
        
         | ACCount37 wrote:
         | Doesn't apply as long as the improvements obtained there scale
         | with compute.
         | 
         | Now, are there actual meaningful improvements to obtain, and do
         | they stick around all the way to frontier runs? Unclear,
         | really. So far, it looks like opening a can of hyperparameters.
        
       | TimorousBestie wrote:
       | Reminiscing about an old HN comment arguing that differential
       | geometry was irrelevant to machine learning with a smile on my
       | face.
       | 
       | Happy to see this opinion expressed here, too. The more math
       | skeptics there are out there, the longer I get to keep my job. :)
        
         | deviation wrote:
          | The world is full of useful shapes! No reason the math of
          | shapes shouldn't be useful too :)
        
         | srean wrote:
         | "I have never had to do integrate the "arctan" function by hand
         | in my entire career" arguments are not worth engaging with.
         | 
          | If people are happy with a job or a role that does not need
          | math, that's fine.
         | 
          | Familiarity with maths lets you rise to the occasion, to
          | become more than a replaceable cog.
         | 
          | The thing is, unless you are trained in math you wouldn't even
          | recognise the opportunity: that a certain kind of math could
          | have been used here. In fact, even if you are trained in math
          | you may not see it till much later -- it needs a special eye
          | and the right moment.
         | 
          | Polyhedra were studied century after century by top-notch
          | mathematicians. All missed Euler's formula, except perhaps
          | Descartes.
         | 
         | Often what happens is some nontrivial branch of mathematics
         | suddenly finds a novel and impactful application. Then crowds
         | jump in to learn that Math. But it's mostly already a little
         | too late for them, they have missed this bus.
         | 
          | The best case is to already know the math beforehand, even
          | though you don't know which part will be handy. It helps if you
          | love the subject and can afford to invest time to learn it for
          | its own sake. Once in a while you happen to find yourself in
          | the right place at the right time, with the right tools.
        
           | gowld wrote:
           | > Often what happens is some nontrivial branch of mathematics
           | suddenly finds a novel and impactful application. Then crowds
           | jump in to learn that Math. But it's mostly already a little
           | too late for them, they have missed this bus.
           | 
            | However, in the meantime, the experts in that math have
            | "missed the bus" on the application area itself, which they
            | don't know enough about because they were studying math
            | instead.
        
       | esafak wrote:
       | > This post covers one appealing way to constrain the weight
       | matrices of a neural network--by keeping the tensors constrained
       | to submanifolds at each layer. This opens the door to re-thinking
       | optimization, as we can co-design optimization algorithms with
       | these manifold constraints. As an example, we propose a manifold
       | version of the Muon optimizer whose weights are constrained to
       | the Stiefel manifold: the manifold of matrices with unit
       | condition number. We conclude the post by defining the idea of a
       | modular manifold, which is a composable manifold that attempts to
       | make it easier to scale up and train large networks.
       | 
       | Very good presentation. Projected gradient methods were popular
       | during the convex optimization craze two decades ago. The ideas
       | advanced here have precedent and seem sensible to me. My concern
       | is whether it helps much. The test accuracy in figure 6b shows a
       | marginal increase, and a gentler transition to the overfitting
       | regime, suggesting the regularization is working. The higher LR
       | did not translate to a speed up: "Manifold Muon increased the
       | wall clock time per step compared to AdamW..."
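        | 
        | For anyone unfamiliar, a projected gradient method takes an
        | ordinary gradient step and then maps the weights back onto the
        | constraint set. A minimal sketch in PyTorch (illustrative only,
        | not the post's algorithm; project_stiefel is a stand-in that
        | snaps all singular values to 1, which gives a unit condition
        | number):
        | 
        |     import torch
        | 
        |     def project_stiefel(W):
        |         # Replace W's singular values with 1 via the SVD;
        |         # the result has condition number exactly 1.
        |         U, _, Vh = torch.linalg.svd(W, full_matrices=False)
        |         return U @ Vh
        | 
        |     def projected_gradient_step(W, grad, lr=0.1):
        |         # Plain gradient step, then project back onto the
        |         # constraint manifold.
        |         return project_stiefel(W - lr * grad)
        | 
        | That naive project-after-step recipe is the classical approach I
        | mean; the post's pitch is to co-design the update itself with
        | the manifold constraint.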
       | 
        | More fundamentally, I am a bit skeptical that low test error is
        | the right goal in LLMs, because statistical learning theory does
        | not adequately model the macro-behavior of very large models.
        
         | namibj wrote:
         | > The test accuracy in figure 6b shows a marginal increase, and
         | a gentler transition to the overfitting regime, suggesting the
         | regularization is working.
         | 
          | Sounds like it might help for online RL training regimes, as
          | those are naturally quite vulnerable to overfitting.
        
         | jpt4 wrote:
          | > statistical learning theory does not adequately model the
         | macro-behavior of very large models.
         | 
         | Might you please elaborate on this? I recognize that
         | "artificial neural networks are lossy de/compression
         | algorithms" does not enumerate the nuances of these structures,
         | but am curious whether anything in particular is both
         | interesting and missing from SLT.
        
           | esafak wrote:
           | SLT typically uses empirical risk minimization, leading to
           | the bias-variance decomposition and a unimodal extremum as
           | the monotonically decreasing bias supposedly balances against
           | the monotonically increasing variance. We now know this does
           | not accurately model overparameterized models, which exhibit
           | double descent, and other phenomena like grokking. To explain
           | them you have to look past classical statistics to
           | statistical mechanics.
        
         | p1esk wrote:
         | _The test accuracy in figure 6b shows a marginal increase, and
         | a gentler transition to the overfitting regime, suggesting the
         | regularization is working._
         | 
         | Higher LR does not mean there's overfitting.
        
       | uoaei wrote:
       | This is exactly the kind of out-of-the-box thinking that will get
       | us past some of the limitations of the current crop of AI
       | architectures. Bravo to the authors.
        
       | SubiculumCode wrote:
       | Curious why the authors chose the blog format over a research
       | report?
        
         | almostgotcaught wrote:
         | you mean a paper? because it's not paper quality content?
        
         | pooooooooooooop wrote:
         | thinkingmachines likes to flex
        
       | fmap wrote:
       | Isn't this an old idea? E.g., here's a textbook on optimization
       | algorithms for matrix manifolds https://press.princeton.edu/absil
        | and here's a library that implements this in Python for the
       | Stiefel manifold that's the subject of this blog post:
       | https://pymanopt.org/docs/stable/manifolds.html#module-pyman...
       | 
       | What is novel about the approach in the blog post? Serious
       | question, I really can't tell after reading the post.
        
         | cs702 wrote:
         | I don't think it's been tried at scale, with large models.
         | 
         | It remains to be seen if it works better than conventional
         | training schemes.
        
         | godelski wrote:
         | > Isn't this an old idea?
         | 
         | So are neural networks. So is attention.
         | 
         | What's your point? Sometimes things need to be retried.
         | Sometimes there are small subtle details that make or break an
         | idea. So what's the point of acting dismissively? If an old
         | idea that didn't work now works, then what's the problem?
         | Besides, progress is typically iterative, not through leaps and
         | bounds. The vast majority of things that look like leaps are
         | just because we don't see the steps between.
         | 
         | The reason I'm saying this is because that sentiment is often
         | used to pass over working solutions and slows down their
         | progress. So even if unintentional it should cause us to
          | rethink how we respond. Otherwise we end up with silly claims
          | like "Einstein just used tensors" and "Nash just used
          | topology". In some sense these are accurate, but they are
          | descriptions at too high a level (and these are real
          | dismissals; to which, again: so what? If it works, it works).
         | 
         | Why is "novelty" even a question? Novelty is only ever in the
         | eyes of the beholder.                 > What is novel about the
         | approach in the blog post? Serious question, I really can't
         | tell after reading the post.
         | 
         | Honestly, I do not know, but I'll give you my best read on it.
         | 
         | 1) Scale: Don't underestimate the importance of this. While I
         | don't think scale is all you need, it certainly is a critical
         | factor.
         | 
         | 2) Different optimization: I may be missing something, but it
         | looks like they are using a different optimizer. They mention
          | that they're using the Muon optimizer constrained to a Stiefel
          | manifold. Neither of those things is unique on its own, but is
          | their combination? This is where I'm uncertain because such
         | a thing would be easy to miss. Maybe someone did and was
         | unsuccessful with it. Maybe someone did, but was not at scale.
         | Maybe someone did, it worked, and just nobody noticed (that
         | happens a lot!).
         | 
         | So I think this is quite similar to how 99% of progress and
         | breakthroughs are made: putting together ideas that seem
         | unrelated and inventing some glue to generalize the process. At
         | a high level this always looks like you're just putting
         | existing things together, but that glue is really hard to make.
         | And to continue that analogy, if we do a good enough job gluing
         | things together then to anyone but an expert it'll look like
         | there is no glue. It can be surprisingly difficult to tell if
         | something is glued, welded, mated, milled, printed, or
         | whatever. It usually takes a very keen eye to determine the
         | answer non-destructively.
        
           | fmap wrote:
           | Apologies if this came across the wrong way. I really do want
           | to know what the novel contributions of the post are, because
           | the author implies that something about what they're doing is
           | solving previously open problems:
           | 
           | > I figured out how to solve manifold Muon in the square case
           | late last year, but I was unable to solve the full
           | rectangular case and thus posed the problem as an open
           | problem on the Modula docs. Jianlin Su solved the problem
           | this summer
           | 
           | It sounds like the generalisation of projected gradient
           | decent to "Muon" is what they're focusing on, but the
           | derivation is all about the retraction map on the Stiefel
           | manifold? I think I'm missing some background here.
        
             | godelski wrote:
             | > Apologies if this came across the wrong way
             | 
             | I was uncertain but your other statements made me think
             | that sentiment was unintentional. I just want to push back
             | against it because it is too common and misused even with
             | good intentions. I hope you don't see this as me saying
             | anything about your character. Honestly, impressions are
              | that you do care.
              | 
              | > It sounds like the generalisation of projected gradient
              | descent to "Muon"
             | 
             | I'm not a niche expert here, but do have knowledge in
             | adjacent/overlapping domains. It sounds like you're in a
             | similar boat? I ask because this pulls back to what I was
             | trying to say about sometimes needing an expert eye.
             | 
             | If it helps, here's the "paper" for the Muon optimizer[0]
             | and here's a follow-up[1]. Muon is definitely a gradient
              | descent technique, but so are Adam, SGD, Ada, and many
             | more[2].
             | 
              | The big thing in Muon is the Newton-Schulz iteration
              | (NS_5). You update parameters with
              | 
              |     θ_t = θ_{t-1} - η·[NS_5(μ·B_{t-1} + ∇L(θ_{t-1}))]
              | 
              | (I bracketed it so you can see that this is just a specific
              | version of θ_{t-1} - η·F(∇L(θ_{t-1}), ...), and standard
              | gradient descent -- θ - η·∇L(θ) -- is in that same class of
              | functions, right?). So we should be careful not to
              | over-generalize and say that this is just gradient descent.
              | You could even say [1] is "just [0] but with weight decay"
              | (or go look at the Adam and AdamW algos ;)
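              | 
              | For concreteness, here's a rough PyTorch sketch of that
              | update (my own simplification, not the reference
              | implementation from [0]; the quintic coefficients below
              | are the ones reported there, and muon_step is just a toy
              | wrapper I made up):
              | 
              |     import torch
              | 
              |     def newton_schulz5(G, steps=5, eps=1e-7):
              |         # Approximately orthogonalize G, i.e. push its
              |         # singular values toward 1, with a quintic
              |         # Newton-Schulz iteration.
              |         a, b, c = 3.4445, -4.7750, 2.0315
              |         X = G / (G.norm() + eps)
              |         transpose = G.size(0) > G.size(1)
              |         if transpose:
              |             X = X.T
              |         for _ in range(steps):
              |             A = X @ X.T
              |             X = a * X + (b * A + c * A @ A) @ X
              |         return X.T if transpose else X
              | 
              |     def muon_step(theta, grad, buf, lr=0.02, mu=0.95):
              |         # B_t = μ·B_{t-1} + ∇L(θ_{t-1})
              |         # θ_t = θ_{t-1} - η·NS_5(B_t)
              |         buf = mu * buf + grad
              |         return theta - lr * newton_schulz5(buf), buf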
             | 
             | But one thing I should add is that gradient descent
             | algorithms aren't topologically aware. I was able to find
             | this post which asks a related question, trying to find
             | what the conditions are for a surface's geodesic to align
             | with gradient descent (note Newton differs from GD too). I
              | don't think this post is creating a solution where the GD
              | formulation results in following a geodesic to the minimum,
              | but my take is that it is working in that direction.
             | And to clarify, we'd want to follow the geodesic because
             | that gives us the shortest or most energy efficient path
              | (whichever perspective you want to use). In optimization we
              | want to try to accomplish these two things (and more!):
              | 1) take the "best" path to the optima, 2) find the best
              | optima. Unfortunately these are ill-defined and there aren't
              | always objective answers to them. But in an ideal gradient
             | descent algorithm we'd want it to go to the global minimum
             | and take the fastest path, right? So with that it helps to
             | be aware of the geometry (part of why people look at the
             | Hessian but that comes at the cost of increased computation
             | even if the additional information can get us there in
             | fewer steps. So that's not (always) "the best").
             | 
             | I know this isn't a full answer and maybe with more reading
             | I'll have a better one for you. But I'm hoping my answer
             | can at least help you see some of the underlying nuanced
             | problems that (_I think_) the authors are trying to get at.
             | Hopefully I'm not too far off base lol. I'm hoping someone
             | with more expertise can jump in and provide
              | corrections/clarifications in the meantime.
             | 
             | [0] https://kellerjordan.github.io/posts/muon/
             | 
             | [1] https://arxiv.org/abs/2502.16982
             | 
             | [2] (far from a complete list)
             | https://docs.pytorch.org/docs/stable/optim.html#algorithms
             | 
             | [3] (I think similar types of questions may also be
             | fruitful)
             | https://mathoverflow.net/questions/42617/functions-whose-
             | gra...
        
       | aanet wrote:
       | Not here to comment on the _content_ of the blog post...
       | 
       | Just wanted to say the blog post design looks super nice.
       | Beautifully laid out, very readable typography, clear graphics,
        | approachable design with a welcoming UX, footnotes in the
        | margin, etc.
       | 
       | Anybody know how this is designed / styled? (I can see three.js
       | being used, along with katex.js - but don't know more details)
       | 
       | Thanks
        
         | ddellacosta wrote:
         | UX on the other hand...I hate it when sites hijack my key
         | commands for moving backwards and forwards in my browser
         | history. Please don't do this!
        
         | manas96 wrote:
         | I think the diagrams look very similar to what Keenan Crane
         | uses in his papers, perhaps they used that tool. I think his
         | students have now fleshed it out for general use.
        
         | spyder wrote:
          | For me it's horrible: some scripts make the scrolling very
          | choppy, unusable... I had to disable scripts just to be able
          | to scroll normally :-(
        
       | cs702 wrote:
       | TL;DR: The OP notes that we currently use all sorts of tricks of
       | the trade, including applying normalization layers, to keep unit
       | values in DNNs from getting too large or too small when we train
       | them. Keeping unit values from getting too large or small
       | prevents numerical underflow/overflow, and also helps speed up
       | learning by keeping the magnitudes of updates small in relation
       | to weights. The OP proposes that we should constrain weights to
       | be in sub-manifolds with unit condition number[a] at each layer,
       | and that we should modify/design SGD algorithms to work well
       | within those manifolds.
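        | 
        | A toy check of the unit-condition-number idea (my own
        | illustration, not anything from the post): a semi-orthogonal
        | matrix has all singular values equal to 1, so its condition
        | number is exactly 1, while a random Gaussian matrix usually has
        | a much larger one.
        | 
        |     import torch
        | 
        |     W = torch.randn(256, 128)
        |     S = torch.linalg.svdvals(W)
        |     print(S.max() / S.min())    # typically well above 1
        | 
        |     # Nearest semi-orthogonal matrix: set singular values to 1.
        |     U, _, Vh = torch.linalg.svd(W, full_matrices=False)
        |     S2 = torch.linalg.svdvals(U @ Vh)
        |     print(S2.max() / S2.min())  # ~1.0: unit condition number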
       | 
       | I find the idea compelling, but it's too early to know if it will
       | work well at scale, you know, with large models, in the real
       | world.
       | 
       | --
       | 
       | [a] https://en.wikipedia.org/wiki/Condition_number
       | 
       | --
       | 
       | EDIT: On the other hand, yesterday I saw a paper about doing
        | basically _the opposite_, letting unit values in DNNs get as big
       | or small as they need to get... by mapping them to complex
       | logarithms and _keeping them_ in that domain:
       | https://openreview.net/forum?id=SUuzb0SOGu . I also found this
       | opposing idea oddly compelling, but I don't know how well it
       | works either, because it hasn't been tested at scale.
        
       | robots0only wrote:
        | so their way to differentiate from the frontier labs is to write
        | research blog posts (not papers). It will be interesting to see
        | how this plays out. I don't think anyone serious about developing
        | frontier models would put anything useful out there for others.
        | We already see this with all the incumbents -- Google, OAI,
        | Anthropic, xAI, DeepSeek and other Chinese labs.
        
       | aghilmort wrote:
       | Interesting. Modular manifolds are precisely what hypertokens use
       | for prompt compiling.
       | 
       | Specifically, we linearize the emergent KVQ operations of an
       | arbitrary prompt in any arbitrary model by way of interleaving
       | error-correcting code (ECC).
       | 
       | ECC tokens are out-of-band tokens, e.g., Unicode's Private Use
        | Area (PUA), interleaved with raw context tokens. This
        | construction induces an in-context associative memory.
       | 
        | Any sort of interleaved labeling basis, e.g., "A1, quick brown
        | fox, A2, jumped lazy dog", induces a similar effect for chaining
        | recall & reasoning more reliably.
       | 
        | This trick works because PUA tokens are generally untrained,
        | hence their initial embedding is still random Gaussian w.h.p.
        | Similar effects can be achieved by simply using token combos
        | unlikely to exist, which are often more effective in practice,
        | since PUA tokens (like emojis or Mandarin characters) are often
        | 2, 3, or 4 tokens after tokenization vs. codeword combos like
        | zy-qu-qwerty every k content tokens, where k can be variable.
       | 
        | Building attention architectures using modular manifolds in
        | white / gray-box models, as this new work shows, vs. prompt-based
        | black-box injection is a natural next step, and so we can at
        | least anecdotally validate what they're building ahead of the
        | next paper or two.
       | 
       | Which is all to say, absolutely great to see others building in
       | this way!
        
         | glowcoil wrote:
         | The original article discusses techniques for constraining the
         | weights of a neural network to a submanifold of weight space
         | during training. Your comment discusses interleaving the tokens
         | of an LLM prompt with Unicode PUA code points. These are two
         | almost completely unrelated things, so it is very confusing to
         | me that you are confidently asserting that they are the same
         | thing. Can you please elaborate on why you think there is any
         | connection at all between your comment and the original
         | article?
        
           | aghilmort wrote:
           | Our ECC construction induces an emergent modular manifold
           | during KVQ computation.
           | 
            | Suppose we use 3 codeword lanes per codeword, which is our
            | default. Each lane of tokens is based on some prime, p, so
            | collectively they form a CRT-driven codeword (Chinese
            | Remainder Theorem). This is discretely equivalent to labeling
            | every k tokens with a globally unique indexing grammar.
           | 
            | That interleaving also corresponds to a triple of adjacent
            | orthogonal embeddings, since those tokens still retain a
            | random Gaussian embedding. The net effect is that we
            | similarly slice the latent space into a spaced chain of
            | modular manifolds every k content tokens.
           | 
            | We also refer to that interleaving as Stiefel frames, for
            | similar reasons as in the post. We began work this spring or
            | so to inject that construction inside the model, with early
            | results in a similar direction to what the post describes.
            | That's another way of saying this sort of approach lets us
            | make that chained atlas (wc?) of modular manifolds as tight
            | as possible within the dimensional limits of the embedding,
            | floating point precision, etc.
           | 
           | We somewhat tongue-in-cheek refer to this as the
           | retokenization group at the prompt level re: renormalization
           | group / tensor nets / etc. Relayering group is the same net
           | intuition or perhaps reconnection group at architecture
           | level.
        
             | glowcoil wrote:
             | I'm sorry, but even if I am maximally charitable and assume
             | that everything you are saying is meaningful and makes
             | sense, it still has essentially nothing to do with the
             | original article. The original article is about imposing
             | _constraints_ on the _weights_ of a neural network, during
             | training, so that they lie on a particular manifold inside
             | the overall _weight space_. The  "modular" part is about
             | being able to specify these constraints separately for
             | individual layers or modules of a network and then compose
             | them together into a meaningful constraint for the global
             | network.
             | 
             | You are talking about _latent space_ during _inference_ ,
             | not weight space during training, and you are talking about
             | interleaving tokens with random Gaussian tokens, not
             | constraining values to lie on a manifold within a larger
             | space. Whether or not the thing you are describing is
             | meaningful or useful, it is basically unrelated to the
             | original article, and you are not using the term "modular
             | manifold" to refer to the same thing.
        
               | aghilmort wrote:
                | hmm / hear you. my point wasn't that we are applying
                | modular manifolds in the same way; it was that we are
                | working on model reliability from two extremal ends using
                | the same principle. there are various ways to induce
                | modular manifolds in a model at various levels of
                | resolution / power. we started at the outside-working-in
                | level, so it works with any black-box model out of the
                | box with zero knowledge needed; you don't even need to
                | know the token dictionary to show the effect.
               | 
                | We're already working on pushing the construction deeper
                | into the model, both architecture and training. currently
                | that's for fine-tuning, and ultimately full architecture
                | shrinkage / pruning and raw training vs. just fine-tuning
                | etc.
               | 
               | & it was just great to see someone else using modular
               | manifolds even if they are using them at the training
               | stage vs. inference stage. they're exploiting modular
               | form at training, we're doing it at inference. cool to
               | see.
        
         | snake_doc wrote:
          | Wot? Is this what AI-generated nonsense has come to? This is
          | totally unrelated.
        
           | aghilmort wrote:
            | Nope. The construction induces ECC-driven emergent modular
            | manifolds in latent space during the KVQ maths. You can't use
            | any ol' ECC; that's the crux of why it works. More in another
            | reply.
        
       | yodon wrote:
       | Is the original Thinking Machines trademark[0] no longer active?
       | They were the original AI company, back when AI was a completely
       | different thing than it is today.
       | 
       | [0]
       | https://en.m.wikipedia.org/wiki/Thinking_Machines_Corporatio...
        
       | gowld wrote:
       | What does this mean?
        
       | phoenicyan wrote:
        | Well-done post; I'd like to read more of their work, and it's
        | exciting to see these new ideas. Though as other people have
       | said, the one set of empirical results that they present is a
       | bit... confusing? I'd think they'd have some more compelling
       | examples to present given all the pretty math.
       | 
       | Their modular norm paper (https://arxiv.org/abs/2405.14813) has
       | several more examples; see their appendix D in particular, but
        | these are also mystifying. Yes, they're interested in how things
        | scale, but am I the only one who thinks the training losses they
        | report are just not competitive with what's currently being
        | used?
        
       | nenenejej wrote:
       | https://archive.is/bP3BG
       | 
       | If you like to scroll on mobile :)
        
       | nenenejej wrote:
        | Nice! Posts like this make me regret not pursuing a mathematics
        | career. I'm sure some of the notation is basic (as in undergrad)
        | but I'd need an entire weekend to understand the post.
        
       ___________________________________________________________________
       (page generated 2025-09-26 23:00 UTC)