[HN Gopher] The Tradeoffs of SSMs and Transformers
       ___________________________________________________________________
        
       The Tradeoffs of SSMs and Transformers
        
       Author : jxmorris12
       Score  : 30 points
       Date   : 2025-07-08 19:12 UTC (3 hours ago)
        
 (HTM) web link (goombalab.github.io)
 (TXT) w3m dump (goombalab.github.io)
        
       | macleginn wrote:
        | The part on tokenisation is not very convincing. Replacing BPE
        | with characters or even bytes will not "remove tokenisation" --
        | the atoms will still be tokens, and they relate to different
        | things in different cultures/writing traditions (a "Chinese
        | byte" is part of a Chinese character; an "English byte" is
        | basically a letter or a number) rather than to anything
        | fundamentally linguistic. BPE can be thought of as just another
        | way of representing linguistic sequences with symbols of some
        | kind; it builds in less inductive bias about how language is
        | used, but it is perhaps not categorically different from any
        | other kind of writing.
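        | 
        | As a quick illustration of that point (my own example, not from
        | the article): under UTF-8 the byte "alphabet" is already script-
        | dependent, so byte-level models still inherit a tokenisation
        | scheme of sorts.
        | 
        |     # Bytes per symbol differ by writing system under UTF-8:
        |     # a Latin letter is one byte, a Chinese character is three.
        |     print(list("a".encode("utf-8")))    # [97]
        |     print(list("中".encode("utf-8")))   # [228, 184, 173]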
        
       | Herring wrote:
       | I'm a bit bearish on SSMs (and hybrid SSM/transformers) because
       | the leading open weight models (DeepSeek, Qwen, Gemma, Llama) are
       | all transformers. There's just no way none of them tried SSMs.
        
         | visarga wrote:
          | Yes, I am reserved too until there is serious adoption, both
          | on SSMs and on diffusion-based LLMs.
        
         | nextos wrote:
         | Second-generation LSTMs (xLSTM) do have leading performance on
         | zero-shot time series forecasting:
         | https://arxiv.org/abs/2505.23719.
         | 
          | I think architectures other than the transformer might also
          | reach SOTA performance, but they remain somewhat underexplored.
        
         | programjames wrote:
          | I mean, everyone is still using variational autoencoders for
          | their latent flow models instead of the information bottleneck.
          | That's because it's cheaper (in founder time) to raise 10(0)x
          | more money than to design your own algorithms and
          | architectures around a novel idea that _might_ work in theory
          | but could be a dead end six months down the line. Just look at
          | LiquidAI. Brilliant idea, but it took them ~5 years to do all
          | the research and another year to get their first models to
          | market... which don't yet seem to be any better than models
          | with a similar compute requirement. I find it pretty plausible
          | that none of the "big" LLM companies seriously tried SSMs,
          | because they already have more than enough money to throw at
          | transformers, or took the quicker path to a big valuation.
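          | 
          | (For context on that contrast, stated roughly and not from the
          | article: a VAE is trained by maximising the ELBO,
          | 
          |     E_{q(z|x)}[ log p(x|z) ] - KL( q(z|x) || p(z) ),
          | 
          | whereas the information bottleneck instead optimises a mutual-
          | information tradeoff,
          | 
          |     min_{p(z|x)}  I(X; Z) - beta * I(Z; Y),
          | 
          | i.e. compress the input while keeping what is predictive of
          | the target.)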
        
       ___________________________________________________________________
       (page generated 2025-07-08 23:00 UTC)