[HN Gopher] Ask HN: Is anybody building an alternative transformer?
       ___________________________________________________________________
        
       Ask HN: Is anybody building an alternative transformer?
        
       Curious whether anybody out there is trying to build a new
       model/architecture that would succeed the transformer. I geek out
       on this subject in my spare time. Is anybody else doing the same,
       and are you willing to share your ideas?
        
       Author : taiboku256
       Score  : 76 points
       Date   : 2025-02-14 20:00 UTC (3 hours ago)
        
       | pestatije wrote:
       | please define transformer
        
         | jaylaal wrote:
         | Robots in disguise.
        
         | janalsncm wrote:
         | https://en.m.wikipedia.org/wiki/Transformer_(deep_learning_a...
        
           | cshimmin wrote:
           | Yeah, it's literally the most important practical development
           | in AI/ML of the decade. This is like reading an article (or
           | headline, more like) on HN and saying "please define git".
        
             | yukinon wrote:
             | Not everyone is aware of the details of AI/ML;
             | "transformer" is a specific term in the space that also
             | collides with "transformer" in other fields adjacent to
             | software development. This is when we all need to wear our
             | empathy hats and remind ourselves that we exist in a
             | bubble, so when we see an overloaded term we should add
             | even the most minimal context to help. OP could have added
             | "AI/ML" to the title for minimal effort and real estate.
             | Let's not veer towards the path of elitism.
             | 
             | Also, the majority of developers using version control are
             | using Git, but I guarantee the majority of developers
             | outside the AI/ML bubble do not know what a "transformer"
             | is.
        
               | cshimmin wrote:
               | Fair enough! Bubble or not, I certainly have very
               | regularly (weekly?) seen headlines on hn about
               | transformers for at least a few years now. Like how
               | bitcoin used to be on hn frontpage every week for a
               | couple years circa 2010 (to the derision of half of the
               | commenters). Not everyone is in the crypto space, but
               | they know what bitcoin is.
               | 
               | Anyhow, I suppose the existence of such questions on hn
               | is evidence that I'm in more of a bubble than I
               | estimated; thanks for the reality check :)
               | 
               | (also my comment was in defense of parent who linked the
               | wiki page, which defines transformer as per request, and
               | is being downvoted for that)
        
               | stavros wrote:
               | I, too, haven't seen the word "transformer" outside an ML
               | context in months. Didn't stop me from wondering if the
               | OP meant the thing that changes voltage.
        
             | happytoexplain wrote:
             | >This is like ... saying "please define git"
             | 
             | It's really not. "Git" has a single extremely strong
             | definition for tech people, and a single regional slang
             | definition. "Transformer" has multiple strong definitions
             | for tech people, and multiple strong definitions
             | colloquially.
             | 
             | Not that we can't infer the OP's meaning - just that it's
             | nowhere near as unambiguous as "git".
        
         | herpdyderp wrote:
         | Until I read this comment I thought we were talking about
         | https://en.wikipedia.org/wiki/Transformer and I was very
         | confused...
        
       | fred_is_fred wrote:
       | The one from cybertron? The one that changes voltage levels? The
       | AI algorithm one?
       | 
       | Edit: or perhaps you are working on a new insect sex regulation
       | gene? If so that would be a great discussion here -
       | https://en.wikipedia.org/wiki/Transformer_(gene)
        
       | vednig wrote:
       | I have a design in mind that is very simple and interesting,
       | but I don't know whether it would scale. Right now it's just a
       | superficial design inspired by Iron Man's JARVIS; I'm working
       | on preparing the architecture.
        
       | happytoexplain wrote:
       | I hate that popular domains take ownership of highly generic
       | words. Many years ago, I struggled for a while to understand that
       | when people say "frontend" they often mean a website frontend,
       | even without any further context.
        
         | perrygeo wrote:
         | The worst offender is "feature". In my domain (ML and geo) we
         | have three definitions.
         | 
         | Feature could be referring to some addition to the user-facing
         | product, a raster input to machine learning, or a vector entity
         | in GeoJSON. Context is the only tool we have to make the
         | distinction, and it gets really confusing when you're working
         | on features that involve querying the features with features.
        
           | janalsncm wrote:
           | You can say the same thing about "model" even in ML.
           | Depending on the context it can be quite confusing:
           | 
           | 1) an architecture described in a paper
           | 
           | 2) the trained weights of a specific instantiation of
           | architecture
           | 
           | 3) a chunk of code/neural net that accomplishes a task,
           | agnostic to the above definitions
        
       | nextos wrote:
       | The xLSTM could become a good alternative to transformers:
       | https://arxiv.org/abs/2405.04517. On very long contexts, such as
       | those arising in DNA models, these models perform really well.
       | 
       | There's a big state-space model comeback initiated by the
       | S4-Mamba saga. RWKV, which is a hybrid between classical RNNs and
       | transformers, is also worth mentioning.
        
         | bob1029 wrote:
         | I was just about to post this. There was a MLST podcast about
         | it a few days ago:
         | 
         | https://www.youtube.com/watch?v=8u2pW2zZLCs
         | 
         | Lots of related papers referenced in the description.
        
       | janalsncm wrote:
       | There are alternatives that optimize around the edges, like
       | DeepSeek's Multi-head Latent Attention or Grouped Query
       | Attention. DeepSeek also showed an optimization of Mixture of
       | Experts. These are all clear improvements to the Vaswani
       | architecture.
       | 
       | There are optimizations like extreme 1.58 bit quant that can be
       | applied to anything.
       | 
       | There are architectures that stray farther, like SSMs and some
       | attempts at bringing the RNN back from the dead, and even text
       | diffusion models that try to generate paragraphs the way we
       | generate images, i.e. not word by word.
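       | 
       | To make one of these concrete, here's a rough NumPy sketch of
       | grouped-query attention (shapes and names are illustrative, not
       | any particular implementation): several query heads share one
       | key/value head, which shrinks the KV cache.
       | 
       |   import numpy as np
       | 
       |   def gqa(q, k, v):
       |       # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
       |       # Each group of query heads shares one k/v head.
       |       n_q_heads, _, d = q.shape
       |       group = n_q_heads // k.shape[0]
       |       out = np.empty_like(q)
       |       for h in range(n_q_heads):
       |           kv = h // group  # index of the shared k/v head
       |           s = q[h] @ k[kv].T / np.sqrt(d)
       |           w = np.exp(s - s.max(-1, keepdims=True))
       |           w /= w.sum(-1, keepdims=True)  # softmax, mask omitted
       |           out[h] = w @ v[kv]
       |       return out
       | 
       |   q = np.random.randn(8, 16, 64)  # 8 query heads, but only
       |   k = np.random.randn(2, 16, 64)  # 2 k/v heads to compute
       |   v = np.random.randn(2, 16, 64)  # and cache
       |   print(gqa(q, k, v).shape)       # (8, 16, 64)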
        
         | dr_dshiv wrote:
         | Mixture of depths, too.
        
       | Analemma_ wrote:
       | Literally everybody doing cutting edge AI research is trying to
       | replace the transformer, because transformers have a bunch of
       | undesirable properties like being quadratic in context window
       | size. But they're also surprisingly resilient: despite the
       | billions of dollars and man-hours poured into the field and many
       | attempted improvements, cutting-edge models aren't all that
       | different architecturally from the original attention paper,
       | aside from their size and a few incidental details like the ReLU
       | activation function, because nobody has found anything better
       | yet.
       | 
       | I do expect transformers to be replaced eventually, but they do
       | seem to have their own "bitter lesson" where trying to outperform
       | them usually ends in failure.
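       | 
       | The quadratic cost is easy to see in code: self-attention
       | materializes an N x N score matrix, so doubling the context
       | quadruples the work. A minimal single-head NumPy sketch (mask
       | omitted, names illustrative):
       | 
       |   import numpy as np
       | 
       |   def self_attention(x, Wq, Wk, Wv):
       |       # x: (N, d) token embeddings
       |       q, k, v = x @ Wq, x @ Wk, x @ Wv
       |       s = q @ k.T / np.sqrt(k.shape[-1])  # (N, N) scores:
       |       # every step below touches all N^2 entries
       |       w = np.exp(s - s.max(-1, keepdims=True))
       |       w /= w.sum(-1, keepdims=True)
       |       return w @ v  # back to (N, d)
       | 
       |   d = 64
       |   Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
       |   for N in (1024, 2048):  # 2x tokens -> 4x score entries
       |       x = np.random.randn(N, d)
       |       print(N, self_attention(x, Wq, Wk, Wv).shape)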
        
         | PaulHoule wrote:
         | My guess is there is a cost-capability tradeoff such that the
         | O(N^2) really is buying you something you couldn't get for
         | O(N). Behind that, there really are intelligent-systems
         | problems that boil down to solving SAT and should be NP-
         | complete... LLMs may be able to short-circuit those problems
         | and get lucky guesses quite frequently; maybe the
         | 'hallucinations' won't go away for anything O(N^2).
        
       | ai-christianson wrote:
       | Not an alternative transformer like you asked for, but OptiLLM
       | looks interesting for squeezing more juice out of existing LLMs.
        
       | ipunchghosts wrote:
       | Yes. Happy to chat if you msg me. Using RL coupled with NNs to
       | integrate search directly into inference, instead of as an
       | afterthought like chain-of-thought and test-time training.
        
         | almosthere wrote:
         | Are we able to "msg" people on here?
        
           | Joel_Mckay wrote:
           | No, thank god... =3
        
           | viraptor wrote:
           | Only if they explicitly make the email public in the profile.
           | It's hidden by default.
        
       | htrp wrote:
       | Anyone know what the rwkv people are up to now?
       | 
       | https://arxiv.org/abs/2305.13048
        
         | viraptor wrote:
         | You can see all the development directly from them:
         | https://github.com/BlinkDL/RWKV-LM
         | 
         | Version 7 was released last week, and each release brings
         | significant improvements.
        
       | quantadev wrote:
       | Right now as long as the rocket's heading straight up, everyone's
       | on board with MLPs (Multilayer Perceptrons/Transformers)! Why not
       | stay on the same rocket for now!? We're almost at AGI already!
        
         | cshimmin wrote:
         | I wouldn't conflate MLPs with transformers, MLP is a small
         | building block of almost any standard neural architecture
         | (excluding spiking/neuromorphic types).
         | 
         | But to your point, the trend towards increasing inference-time
         | compute costs, ushered in by CoT/reasoning models, is one good
         | reason to look for equally capable models that can be
         | optimized for inference efficiency. Traditionally, training
         | was the main compute cost, so it's reasonable to ask whether
         | there's unexplored space there.
        
           | quantadev wrote:
           | What I meant by "NNs and Transformers" is that once we've
           | found the magical ingredient (and we've found it) people tend
           | to all be focused in the same area of research. Mankind just
           | got kinda lucky that all this can run on essentially game
           | graphics boards!
        
         | drdeca wrote:
         | Why are you conflating MLPs in general with specifically
         | transformers?
        
           | quantadev wrote:
           | I consider MLPs the building blocks of all this; they're
           | what makes something a neural net, as opposed to some other
           | data structure.
        
       | sgt101 wrote:
       | I'll see your architectural innovation and raise you a loss
       | function revolution.
       | 
       | https://arxiv.org/pdf/2412.21149
        
       | mvieira38 wrote:
       | Related: There was buzz last year about Kolmogorov-Arnold
       | Networks, and https://arxiv.org/abs/2409.10594 was claiming KANs
       | perform better than standard MLPs in the transformer
       | architecture. Does anyone know of these being explored in the LLM
       | space? KANs seem to have better properties regarding memory if
       | I'm not mistaken.
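       | 
       | For anyone who hasn't looked at them: the KAN idea is to make
       | the edge functions themselves learnable, instead of learning
       | only linear weights followed by a fixed activation. A toy NumPy
       | sketch (an RBF basis chosen for brevity; the paper uses
       | B-splines, and all names here are illustrative):
       | 
       |   import numpy as np
       | 
       |   def kan_layer(x, W, centers, width=1.0):
       |       # x: (d_in,); W: (d_out, d_in, n_basis) coefficients.
       |       # Each edge (j, i) applies its own learnable function
       |       # phi_ji(x_i) = sum_b W[j,i,b] * rbf(x_i - centers[b]);
       |       # node j then sums its incoming edges.
       |       r = np.exp(-((x[None, :, None] - centers) / width) ** 2)
       |       return (W * r).sum(axis=(1, 2))
       | 
       |   d_in, d_out, n_basis = 8, 4, 5
       |   centers = np.linspace(-2, 2, n_basis)
       |   W = np.random.randn(d_out, d_in, n_basis) * 0.1
       |   print(kan_layer(np.random.randn(d_in), W, centers).shape)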
        
         | pineapple_sauce wrote:
         | I believe KAN hype died off for practical reasons (e.g. the
         | FLOPs required by the implementation) and empirical results:
         | people reproduced KANs and found that the claims/results in
         | the original paper were misleading.
         | 
         | Here's a paper showing KANs are no better than MLPs; if
         | anything, they are typically worse when compared fairly.
         | https://arxiv.org/pdf/2407.16674
        
       | hztar wrote:
       | You have stuff like https://www.literal-labs.ai/tsetlin-machines/
       | and https://tsetlinmachine.org/ (European initiatives).
        
         | mvieira38 wrote:
         | 52x less energy is crazy. Seems like it's in the veeery early
         | stages, though; a quick search basically only yields the
         | original paper and articles about it. This comment from the
         | creator really sheds light on the novel approach, and I find
         | it oddly antagonistic towards Big Tech:
         | 
         | "Where the Tsetlin machine currently excels is energy-
         | constrained edge machine learning, where you can get up to
         | 10000x less energy consumption and 1000x faster inference
         | (https://www.mignon.ai). My goal is to create an alternative to
         | BigTech's black boxes: free, green, transparent, and logical
         | (http://cair.uia.no)." (https://www.reddit.com/r/MachineLearnin
         | g/comments/17xoj68/co...)
        
           | hztar wrote:
           | It's true that Tsetlin Machines are currently a fringe area
           | of ML research, especially compared to the focus on deep
           | learning advancements coming out of SF and China. It's early
           | days, but the energy efficiency potential is insane. I
           | believe further investment could yield significant results.
           | Having been supervised by the creator, I'm admittedly biased,
           | but the underlying foundation in Tsetlin's learning automata
           | gives it a solid theoretical grounding. Dedicated funding is
           | definitely needed to explore its full potential.
        
       | czhu12 wrote:
       | The MAMBA [1] model gained some traction as a potential
       | successor. It's basically an RNN without the non-linearity
       | applied across hidden states, which means the recurrence over a
       | sequence can be evaluated in logarithmic (instead of linear)
       | parallel time via a parallelizable scan [2].
       | 
       | It promises much faster inference with much lower compute costs,
       | and at up to 7B params I think it performs on par with
       | transformers. I've yet to see a 40B+ model trained.
       | 
       | The researchers behind MAMBA went on to start a company called
       | Cartesia [3], which applies MAMBA to voice models.
       | 
       | [1] https://jackcook.com/2024/02/23/mamba.html
       | 
       | [2] https://www.csd.uwo.ca/~mmorenom/HPC-Slides/Parallel_prefix_...
       | <- Pulled up a random example from google, but Stanford CS149
       | has an entire lecture devoted to parallel scan.
       | 
       | [3] https://cartesia.ai/
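       | 
       | A rough NumPy sketch of why the scan trick works (scalar state
       | for simplicity, purely illustrative): two steps of the linear
       | recurrence h_t = a_t*h_{t-1} + b_t compose into another step of
       | the same form, so the operator is associative and a parallel
       | prefix scan applies.
       | 
       |   import numpy as np
       | 
       |   def combine(l, r):
       |       # compose step l then step r into one equivalent step
       |       return (r[0] * l[0], r[0] * l[1] + r[1])
       | 
       |   def scan(pairs):
       |       # Hillis-Steele inclusive scan: O(log N) rounds, each
       |       # round fully parallel on a GPU (sequential here)
       |       n, step = len(pairs), 1
       |       while step < n:
       |           pairs = [pairs[i] if i < step
       |                    else combine(pairs[i - step], pairs[i])
       |                    for i in range(n)]
       |           step *= 2
       |       return pairs
       | 
       |   a, b = np.random.rand(8), np.random.randn(8)
       |   h, seq = 0.0, []
       |   for t in range(8):            # naive sequential RNN loop
       |       h = a[t] * h + b[t]
       |       seq.append(h)
       |   par = [B for (_, B) in scan(list(zip(a, b)))]
       |   print(np.allclose(seq, par))  # True: same h_t, log depth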
        
         | monroewalker wrote:
         | Oh that would be awesome for that to work. Thanks for sharing
        
           | stavros wrote:
           | If I'm not misremembering, Mistral released a model based on
           | MAMBA, but I haven't heard much about it since.
        
         | kla-s wrote:
         | Jamba 1.5 Large is 398B params (94B active) and weights are
         | available.
         | 
         | https://arxiv.org/abs/2408.12570
         | 
         | Credit https://news.ycombinator.com/user?id=sanxiyn for making
         | me aware
        
       | PaulHoule wrote:
       | Personally I think foundation models are for the birds: the cost
       | of developing one is immense, and the time involved is so great
       | that you can't do many run-break-fix cycles, so you will get
       | nowhere on a shoestring. (Though maybe you can get somewhere on
       | simple tasks and synthetic data.)
       | 
       | Personally I am working on a _reliable_ model trainer for
       | classification and sequence labeling tasks that uses something
       | like ModernBERT at the front end and some kind of LSTM on the
       | back end.
       | 
       | People who hold court on machine learning forums will swear by
       | fine-tuned BERT and similar things but they are not at all
       | interested in talking about the reliable bit. I've read a lot of
       | arXiv papers where somebody tries to fine-tune a BERT for a
       | classification task with some arbitrarily chosen hyperparameters
       | they got out of another paper, and it sort of works some of the
       | time.
       | 
       | It drives me up the wall that you can't use early stopping for
       | BERT fine-tuning like I've been using on neural nets _since 1990
       | or so_, and if I believe what I'm seeing, the networks I've been
       | using for BERT fine-tuning can't really benefit from training
       | sets with more than a few thousand examples, emphasis on the
       | "few".
       | 
       | My assumption is that everybody else is going to be working on
       | the flashy task of developing better foundation models and as
       | long as they emit an embedding-per-token I can plug a better
       | foundation model in and my models will perform better.
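       | 
       | For reference, the early stopping I mean is nothing exotic. A
       | minimal PyTorch-style sketch (a toy linear head stands in for
       | the encoder-plus-classifier; all names are illustrative):
       | 
       |   import copy, torch, torch.nn as nn
       | 
       |   def train_early_stop(model, train_dl, val_dl,
       |                        patience=3, max_epochs=50):
       |       opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
       |       loss_fn = nn.CrossEntropyLoss()
       |       best, state, bad = float("inf"), None, 0
       |       for epoch in range(max_epochs):
       |           model.train()
       |           for x, y in train_dl:
       |               opt.zero_grad()
       |               loss_fn(model(x), y).backward()
       |               opt.step()
       |           model.eval()
       |           with torch.no_grad():
       |               val = sum(loss_fn(model(x), y).item()
       |                         for x, y in val_dl)
       |           if val < best:     # improved: snapshot and reset
       |               best, bad = val, 0
       |               state = copy.deepcopy(model.state_dict())
       |           else:              # stop after `patience` epochs
       |               bad += 1       # without improvement
       |               if bad >= patience:
       |                   break
       |       model.load_state_dict(state)  # roll back to best epoch
       |       return model
       | 
       |   xs, ys = torch.randn(256, 768), torch.randint(0, 2, (256,))
       |   dl = lambda a, b: torch.utils.data.DataLoader(
       |       torch.utils.data.TensorDataset(a, b), batch_size=32)
       |   model = train_early_stop(nn.Linear(768, 2),
       |                            dl(xs[:200], ys[:200]),
       |                            dl(xs[200:], ys[200:]))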
        
         | mindcrime wrote:
         | > Personally I think foundation models are for the birds,
         | 
         | I might not go quite that far, but I have publicly said (and
         | will stand by the statement) that I think training
         | progressively larger and more complex foundation models is a
         | waste of resources. But my view of AI is rooted in a neuro-
         | symbolic approach, with emphasis on the "symbolic". I envision
         | neural networks not as the core essence of an AI, but mainly as
         | just adapters between different representations that are used
         | by different sub-systems. And possibly as "scaffolding" where
         | one can use the "intelligence" baked into an LLM as a bridge to
         | get the overall system to where it can learn, and then
         | eventually kick the scaffold down once it isn't needed anymore.
        
           | PaulHoule wrote:
           | I can sure talk your ear off about that one as I went way too
           | far into the semantic web rabbit hole.
           | 
           | Training LLMs to use 'tools' of various types is a great
           | idea, as it is to run them inside frameworks that check that
           | their output satisfies various constraints. Still, certain
           | problems remain: the NP-complete nature of SAT solving
           | (many intelligent-systems problems, such as word problems
           | you'd expect an A.I. to solve, boil down to SAT solving),
           | the halting problem, Gödel's theorem, and such. I
           | understand Doug Hofstadter has softened his positions
           | lately, but I think many of the problems set up in this
           | book
           | 
           | https://en.wikipedia.org/wiki/G%C3%B6del,_Escher,_Bach
           | 
           | (particularly the Achilles & Tortoise dialog) still stand
           | today, as cringey as that book seems to me in 2025.
        
           | dr_dshiv wrote:
           | Good old fashioned AI, amirite
        
           | tlb wrote:
           | We learned something pretty big and surprising from each new
           | generation of LLM, for a small fraction of the time and cost
           | of a new particle accelerator or space telescope. Compared to
           | other big science projects, they're giving pretty good bang
           | for the buck.
        
       | bravura wrote:
       | Check out "Attention as an RNN" by Feng et al (2024), with Bengio
       | as a co-author. https://arxiv.org/pdf/2405.13956
       | 
       | Abstract: The advent of Transformers marked a significant
       | breakthrough in sequence modelling, providing a highly performant
       | architecture capable of leveraging GPU parallelism. However,
       | Transformers are computationally expensive at inference time,
       | limiting their applications, particularly in low-resource
       | settings (e.g., mobile and embedded devices). Addressing this, we
       | (1) begin by showing that attention can be viewed as a special
       | Recurrent Neural Network (RNN) with the ability to compute its
       | many-to-one RNN output efficiently. We then (2) show that popular
       | attention-based models such as Transformers can be viewed as RNN
       | variants. However, unlike traditional RNNs (e.g., LSTMs), these
       | models cannot be updated efficiently with new tokens, an
       | important property in sequence modelling. Tackling this, we (3)
       | introduce a new efficient method of computing attention's many-
       | to-many RNN output based on the parallel prefix scan algorithm.
       | Building on the new attention formulation, we (4) introduce
       | Aaren, an attention-based module that can not only (i) be trained
       | in parallel (like Transformers) but also (ii) be updated
       | efficiently with new tokens, requiring only constant memory for
       | inferences (like traditional RNNs). Empirically, we show Aarens
       | achieve comparable performance to Transformers on 38 datasets
       | spread across four popular sequential problem settings:
       | reinforcement learning, event forecasting, time series
       | classification, and time series forecasting tasks while being
       | more time and memory-efficient.
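       | 
       | The constant-memory property in (4) is easiest to see in the
       | many-to-one case: softmax attention for a single query over a
       | token stream reduces to a few running sums. An illustrative
       | NumPy sketch (my reading of the formulation, not the paper's
       | code):
       | 
       |   import numpy as np
       | 
       |   class StreamingAttention:
       |       # one query attending over a stream in O(1) memory:
       |       # keep a running max, numerator, and denominator
       |       def __init__(self, q):
       |           self.q, self.m = q, -np.inf
       |           self.num, self.den = 0.0, 0.0
       |       def update(self, k, v):
       |           s = self.q @ k / np.sqrt(len(k))
       |           m = max(self.m, s)
       |           c = np.exp(self.m - m)  # rescale older terms
       |           self.num = self.num * c + np.exp(s - m) * v
       |           self.den = self.den * c + np.exp(s - m)
       |           self.m = m
       |           return self.num / self.den
       | 
       |   q = np.random.randn(16)
       |   ks, vs = np.random.randn(10, 16), np.random.randn(10, 8)
       |   att = StreamingAttention(q)
       |   for k, v in zip(ks, vs):
       |       out = att.update(k, v)    # updated token by token
       | 
       |   s = ks @ q / np.sqrt(16)      # full softmax attention
       |   w = np.exp(s - s.max()); w /= w.sum()
       |   print(np.allclose(out, w @ vs))  # True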
        
       | neom wrote:
       | This was on one of the LLM Discord servers a few weeks ago:
       | https://github.com/triadicresonance/triadic
        
       | SiddanthEmani wrote:
       | Titans takes a new approach to longer and faster memory than
       | transformers offer.
       | 
       | https://arxiv.org/html/2501.00663v1
        
       | mbloom1915 wrote:
       | AI aside, the world could also use an alternative electrical
       | transformer. The backlog from the main suppliers is 40+ weeks,
       | and they're far too expensive. There is a MAJOR manufacturing
       | and supply issue here, as all new-build construction competes
       | for the same equipment...
        
       | jostmey wrote:
       | My guess is that new architectures will be about doing more with
       | less compute. For example, are there architectures that can
       | operate at lower bit precision, or turn components on and off as
       | the task requires?
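       | 
       | On the low-precision point, the "1.58-bit" trick mentioned
       | upthread means ternary weights: log2(3) ~ 1.58 bits each. A
       | rough NumPy sketch of BitNet-style absmean quantization
       | (illustrative, not the exact paper recipe):
       | 
       |   import numpy as np
       | 
       |   def quantize_ternary(W, eps=1e-8):
       |       # scale by the mean |weight|, then round each
       |       # weight to the nearest of {-1, 0, +1}
       |       scale = np.abs(W).mean() + eps
       |       return np.clip(np.round(W / scale), -1, 1), scale
       | 
       |   W = np.random.randn(4, 8)
       |   Wq, scale = quantize_ternary(W)
       |   x = np.random.randn(8)
       |   # a matmul against {-1, 0, 1} needs only adds/subtracts;
       |   # one float multiply by `scale` rescales the result
       |   print(x @ (Wq * scale).T)  # approximates x @ W.T
       |   print(np.unique(Wq))       # subset of [-1., 0., 1.]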
        
       | freeone3000 wrote:
       | I'm working with a group on an RL core with models as tool use,
       | for explainable agentic tasks with actual discovery.
        
       | kolinko wrote:
       | Not a new model per se, but a new algorithm for inference -
       | https://kolinko.github.io/effort/
        
       ___________________________________________________________________
       (page generated 2025-02-14 23:00 UTC)