[HN Gopher] Ask HN: Is anybody building an alternative transformer?
___________________________________________________________________
Ask HN: Is anybody building an alternative transformer?
Curious if anybody out there is trying to build a new
model/architecture that would succeed the transformer? I geek out
on this subject in my spare time. Curious if anybody else is doing
so and if you're willing to share ideas?
Author : taiboku256
Score : 76 points
Date : 2025-02-14 20:00 UTC (3 hours ago)
| pestatije wrote:
| please define transformer
| jaylaal wrote:
| Robots in disguise.
| janalsncm wrote:
| https://en.m.wikipedia.org/wiki/Transformer_(deep_learning_a...
| cshimmin wrote:
| Yeah, it's literally the most important practical development
| in AI/ML of the decade. This is like reading an article (or
| headline, more like) on HN and saying "please define git".
| yukinon wrote:
| Not everyone is aware of the details of AI/ML;
| "transformer" is a specific term in the space that also
| overlaps with "transformer" in other fields adjacent to
| software development. This is when we all need to wear
| our empathy hats and remind ourselves that we exist in a
| bubble, so when we see an overloaded term, we should add
| even the most minimal context to help. OP could have added
| "AI/ML" to the title at minimal cost in effort and real
| estate. Let's not veer towards the path of elitism.
|
| Also, the majority of developers using version control are
| using Git. I guarantee the majority of developers outside
| the AI/ML bubble do not know what a "transformer" is.
| cshimmin wrote:
| Fair enough! Bubble or not, I certainly have very
| regularly (weekly?) seen headlines on hn about
| transformers for at least a few years now. Like how
| bitcoin used to be on hn frontpage every week for a
| couple years circa 2010 (to the derision of half of the
| commenters). Not everyone is in the crypto space, but
| they know what bitcoin is.
|
| Anyhow, I suppose the existence of such questions on hn is
| evidence that I'm in more of a bubble than I estimated;
| thanks for the reality check :)
|
| (also my comment was in defense of parent who linked the
| wiki page, which defines transformer as per request, and
| is being downvoted for that)
| stavros wrote:
| I, too, haven't seen the word "transformer" outside an ML
| context in months. Didn't stop me from wondering if the
| OP meant the thing that changes voltage.
| happytoexplain wrote:
| >This is like ... saying "please define git"
|
| It's really not. "Git" has a single extremely strong
| definition for tech people, and a single regional slang
| definition. "Transformer" has multiple strong definitions
| for tech people, and multiple strong definitions
| colloquially.
|
| Not that we can't infer the OP's meaning - just that it's
| nowhere near as unambiguous as "git".
| herpdyderp wrote:
| Until I read this comment I thought we were talking about
| https://en.wikipedia.org/wiki/Transformer and I was very
| confused...
| fred_is_fred wrote:
| The one from cybertron? The one that changes voltage levels? The
| AI algorithm one?
|
| Edit: or perhaps you are working on a new insect sex regulation
| gene? If so that would be a great discussion here -
| https://en.wikipedia.org/wiki/Transformer_(gene)
| vednig wrote:
| I have a design in mind which is very simple and interesting,
| but I don't know if it would be scalable at this stage; right
| now it's just a superficial design inspired by Iron Man's
| JARVIS. I'm working on preparing the architecture.
| happytoexplain wrote:
| I hate that popular domains take ownership of highly generic
| words. Many years ago, I struggled for a while to understand that
| when people say "frontend" they often mean a website frontend,
| even without any further context.
| perrygeo wrote:
| The worst offender is "feature". In my domain (ML and geo) we
| have three definitions.
|
| Feature could be referring to some addition to the user-facing
| product, a raster input to machine learning, or a vector entity
| in GeoJSON. Context is the only tool we have to make the
| distinction; it gets really confusing when you're working on
| features that involve querying the features with features.
| janalsncm wrote:
| You can say the same thing about "model" even in ML.
| Depending on the context it can be quite confusing:
|
| 1) an architecture described in a paper
|
| 2) the trained weights of a specific instantiation of
| architecture
|
| 3) a chunk of code/neural net that accomplishes a task,
| agnostic to the above definitions
| nextos wrote:
| The xLSTM could become a good alternative to transformers:
| https://arxiv.org/abs/2405.04517. On very long contexts, such as
| those arising in DNA models, these models perform really well.
|
| There's a big state-space model comeback initiated by the
| S4-Mamba saga. RWKV, which is a hybrid between classical RNNs
| and transformers, is also worth mentioning.
| bob1029 wrote:
| I was just about to post this. There was a MLST podcast about
| it a few days ago:
|
| https://www.youtube.com/watch?v=8u2pW2zZLCs
|
| Lots of related papers referenced in the description.
| janalsncm wrote:
| There are alternatives that optimize around the edges, like
| DeepSeek's Multi-head Latent Attention or Grouped Query
| Attention. DeepSeek also showed an optimization on Mixture of
| Experts. These are all clear improvements to the Vaswani
| architecture.
|
| There are optimizations, like extreme 1.58-bit quantization,
| that can be applied to anything (toy sketch below).
|
| There are architectures that stray farther, like SSMs and some
| attempts at bringing the RNN back from the dead. And even text
| diffusion models that try to generate paragraphs the way we
| generate images, i.e. not word by word.
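|
| For intuition, here's a minimal sketch of the 1.58-bit idea in
| Python (ternary weights via the absmean scheme described in the
| BitNet b1.58 paper; the helper name is mine, not a library API):
|
|     import numpy as np
|
|     def quantize_ternary(w, eps=1e-5):
|         # absmean quantization to {-1, 0, +1},
|         # i.e. log2(3) ~ 1.58 bits per weight
|         scale = np.abs(w).mean() + eps
|         q = np.clip(np.round(w / scale), -1.0, 1.0)
|         return q, scale
|
|     w = np.random.randn(4, 4)
|     q, scale = quantize_ternary(w)
|     w_hat = q * scale  # dequantized approximation of w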
| dr_dshiv wrote:
| Mixture of depths, too.
| Analemma_ wrote:
| Literally everybody doing cutting-edge AI research is trying to
| replace the transformer, because transformers have a bunch of
| undesirable properties, like being quadratic in context window
| size. But they're also surprisingly resilient: despite the
| billions of dollars and man-hours poured into the field and many
| attempted improvements, cutting-edge models aren't all that
| different architecturally from the original attention paper,
| aside from their size and a few incidental details like swapping
| out the ReLU activation function, because nobody has found
| anything better yet.
|
| I do expect transformers to be replaced eventually, but they do
| seem to have their own "bitter lesson" where trying to outperform
| them usually ends in failure.
| PaulHoule wrote:
| My guess is there is a cost-capability tradeoff such that the
| O(N^2) really is buying you something you couldn't get for
| O(N). Behind that, there really are intelligent systems
| problems that boil down to solving SAT and should be NP-
| complete... LLMs may be able to short-circuit those problems
| and get lucky guesses quite frequently; maybe the
| 'hallucinations' won't go away for anything O(N^2).
| ai-christianson wrote:
| Not an alternative transformer like you asked for, but OptiLLM
| looks interesting for squeezing more juice out of existing LLMs.
| ipunchghosts wrote:
| Yes. Happy to chat if you msg me. Using RL coupled with NNs to
| integrate search directly into inference, instead of as an
| afterthought like chain-of-thought and test-time training.
| almosthere wrote:
| Are we able to "msg" people on here?
| Joel_Mckay wrote:
| No, thank god... =3
| viraptor wrote:
| Only if they explicitly make the email public in the profile.
| It's hidden by default.
| htrp wrote:
| Anyone know what the rwkv people are up to now?
|
| https://arxiv.org/abs/2305.13048
| viraptor wrote:
| You can see all the development directly from them:
| https://github.com/BlinkDL/RWKV-LM
|
| Version 7 was released last week, and they make significant
| improvements with every release.
| quantadev wrote:
| Right now as long as the rocket's heading straight up, everyone's
| on board with MLPs (Multilayer Perceptrons/Transformers)! Why not
| stay on the same rocket for now!? We're almost at AGI already!
| cshimmin wrote:
| I wouldn't conflate MLPs with transformers, MLP is a small
| building block of almost any standard neural architecture
| (excluding spiking/neuromorphic types).
|
| But to your point, the trend towards increasing inference-time
| compute costs, ushered in by CoT/reasoning models, is one
| good reason to look for equally capable models that can be
| optimized for inference efficiency. Traditionally, training was
| the main compute cost, so it's reasonable to ask if there's
| unexplored space there.
| quantadev wrote:
| What I meant by "NNs and Transformers" is that once we've
| found the magical ingredient (and we've found it), people tend
| to all focus on the same area of research. Mankind just
| got kinda lucky that all this can run on what are essentially
| gaming graphics cards!
| drdeca wrote:
| Why are you conflating MLPs in general with specifically
| transformers?
| quantadev wrote:
| I consider MLPs the building blocks of all this; they are what
| makes something a neural net, as opposed to some other data
| structure.
| sgt101 wrote:
| I'll see your architectural innovation and raise you a loss
| function revolution.
|
| https://arxiv.org/pdf/2412.21149
| mvieira38 wrote:
| Related: There was buzz last year about Kolmogorov-Arnold
| Networks, and https://arxiv.org/abs/2409.10594 claimed KANs
| perform better than standard MLPs in the transformer
| architecture. Does anyone know of these being explored in the
| LLM space? KANs seem to have better memory properties, if
| I'm not mistaken.
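|
| For intuition, here's a toy KAN-style layer in Python: each
| edge carries its own learnable univariate function, here
| parameterized with a Gaussian RBF basis instead of the paper's
| B-splines (a sketch of the idea, not any official code):
|
|     import numpy as np
|
|     class ToyKANLayer:
|         # y_j = sum_i phi_ij(x_i), with each phi_ij a learnable
|         # univariate function: a weighted sum of K fixed bumps.
|         def __init__(self, d_in, d_out, K=8, seed=0):
|             rng = np.random.default_rng(seed)
|             self.centers = np.linspace(-2, 2, K)  # shared grid
|             self.coef = 0.1 * rng.standard_normal((d_in, d_out, K))
|
|         def __call__(self, x):  # x: (batch, d_in)
|             # basis[b, i, k] = exp(-(x[b, i] - centers[k])^2)
|             basis = np.exp(-(x[:, :, None] - self.centers) ** 2)
|             return np.einsum('bik,iok->bo', basis, self.coef)
|
|     layer = ToyKANLayer(d_in=3, d_out=2)
|     y = layer(np.random.randn(5, 3))  # -> shape (5, 2)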
| pineapple_sauce wrote:
| I believe the KAN hype died off for practical reasons (e.g.
| FLOPs from the implementation) and empirical results: people
| reproduced KANs and found the claims/results made in the
| original paper were misleading.
|
| Here's a paper showing KANs are no better than MLPs; if
| anything, they are typically worse when compared fairly:
| https://arxiv.org/pdf/2407.16674
| hztar wrote:
| You have stuff like https://www.literal-labs.ai/tsetlin-machines/
| and https://tsetlinmachine.org/ - European initiatives.
| mvieira38 wrote:
| 52x less energy is crazy. It seems to be in the veeery early
| stages, though; a quick search basically only yields the
| original paper and articles about it. This comment from the
| creator really sheds light on the novel approach, which I
| find oddly antagonistic towards Big Tech:
|
| "Where the Tsetlin machine currently excels is energy-
| constrained edge machine learning, where you can get up to
| 10000x less energy consumption and 1000x faster inference
| (https://www.mignon.ai). My goal is to create an alternative to
| BigTech's black boxes: free, green, transparent, and logical
| (http://cair.uia.no)." (https://www.reddit.com/r/MachineLearning/comments/17xoj68/co...)
| hztar wrote:
| It's true that Tsetlin Machines are currently a fringe area
| of ML research, especially compared to the focus on deep
| learning advancements coming out of SF and China. It's early
| days, but the energy efficiency potential is insane. I
| believe further investment could yield significant results.
| Having been supervised by the creator, I'm admittedly biased,
| but the underlying foundation in Tsetlin's learning automata
| gives it a solid theoretical grounding. Dedicated funding is
| definitely needed to explore its full potential.
| czhu12 wrote:
| The MAMBA [1] model gained some traction as a potential
| successor. It's basically an RNN without the nonlinearity
| applied across hidden states, which lets the sequence be
| computed in logarithmic parallel time (instead of linear
| sequential time) via a parallelizable scan [2].
|
| It promises much faster inference with much lower compute costs,
| and I think up to 7B params, performs on par with transformers.
| I've yet to see a 40B+ model trained.
|
| The researchers behind MAMBA went on to start a company called
| Cartesia [3], which applies MAMBA to voice models.
|
| [1] https://jackcook.com/2024/02/23/mamba.html
|
| [2] https://www.csd.uwo.ca/~mmorenom/HPC-Slides/Parallel_prefix_...
| <- Pulled up a random example from google, but Stanford CS149
| has an entire lecture devoted to parallel scan.
|
| [3] https://cartesia.ai/
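|
| To make the scan trick concrete: the linear recurrence h_t =
| a_t * h_{t-1} + b_t has an associative combine rule, so with
| enough parallel workers the whole sequence takes O(log n)
| steps. A toy illustration (mine, not Mamba's actual kernel):
|
|     import numpy as np
|
|     def combine(l, r):
|         # compose step (a_l, b_l) followed by (a_r, b_r):
|         # h = a_r * (a_l * h0 + b_l) + b_r
|         return (r[0] * l[0], r[0] * l[1] + r[1])
|
|     def scan(pairs):
|         # inclusive scan; associativity of combine is what
|         # lets a parallel version run in O(log n) steps
|         if len(pairs) == 1:
|             return pairs
|         mid = len(pairs) // 2
|         left, right = scan(pairs[:mid]), scan(pairs[mid:])
|         return left + [combine(left[-1], p) for p in right]
|
|     a, b = np.random.rand(8), np.random.rand(8)
|     out = scan(list(zip(a, b)))  # out[t] = (A_t, B_t)
|     h = 0.0
|     for t in range(8):           # sequential reference
|         h = a[t] * h + b[t]
|     assert np.isclose(out[-1][1], h)  # h_0 = 0, so h_t = B_t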
| monroewalker wrote:
| Oh, it would be awesome for that to work. Thanks for sharing.
| stavros wrote:
| If I'm not misremembering, Mistral released a model based on
| MAMBA, but I haven't heard much about it since.
| kla-s wrote:
| Jamba 1.5 Large is 398B params (94B active) and weights are
| available.
|
| https://arxiv.org/abs/2408.12570
|
| Credit to https://news.ycombinator.com/user?id=sanxiyn for
| making me aware of it.
| PaulHoule wrote:
| Personally, I think foundation models are for the birds: the
| cost of developing one is immense, and the time involved is so
| great that you can't do many run-break-fix cycles, so you will
| get nowhere on a shoestring. (Though maybe you can get
| somewhere on simple tasks and synthetic data.)
|
| Personally I am working on a _reliable_ model trainer for
| classification and sequence labeling tasks that uses something
| like ModernBERT at the front end and some kind of LSTM on the
| back end.
|
| People who hold court on machine learning forums will swear by
| fine-tuned BERT and similar things but they are not at all
| interested in talking about the reliable bit. I've read a lot of
| arXiv papers where somebody tries to fine-tune a BERT for a
| classification task, runs some arbitrarily chosen parameters they
| got out of another paper and it sort-of works some of the time.
|
| It drives me up the wall that you can't use early stopping for
| BERT fine-tuning like I've been using on neural nets _since 1990
| or so_, and if I believe what I'm seeing, I don't think the
| networks I've been using for BERT fine-tuning can really benefit
| from training sets with more than a few thousand examples,
| emphasis on the "few".
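|
| For reference, the kind of early-stopping loop I mean, as a
| generic sketch (train_one_epoch, eval_loss, and the state
| methods are placeholders for whatever framework you use):
|
|     def fit(model, train, dev, max_epochs=100, patience=3):
|         # stop once dev loss hasn't improved for `patience`
|         # consecutive epochs, then roll back to the best state
|         best_loss, best_state, bad = float('inf'), None, 0
|         for epoch in range(max_epochs):
|             train_one_epoch(model, train)   # placeholder
|             loss = eval_loss(model, dev)    # placeholder
|             if loss < best_loss:
|                 best_loss, best_state, bad = loss, model.state(), 0
|             else:
|                 bad += 1
|                 if bad >= patience:
|                     break
|         model.load(best_state)              # placeholder
|         return model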
|
| My assumption is that everybody else is going to be working on
| the flashy task of developing better foundation models, and as
| long as they emit an embedding per token, I can plug a better
| foundation model in and my models will perform better.
| mindcrime wrote:
| > Personally I think foundation models are for the birds,
|
| I might not go quite that far, but I have publicly said (and
| will stand by the statement) that I think training
| progressively larger and more complex foundation models is a
| waste of resources. But my view of AI is rooted in a neuro-
| symbolic approach, with emphasis on the "symbolic". I envision
| neural networks not as the core essence of an AI, but mainly as
| just adapters between different representations that are used
| by different sub-systems. And possibly as "scaffolding" where
| one can use the "intelligence" baked into an LLM as a bridge to
| get the overall system to where it can learn, and then
| eventually kick the scaffold down once it isn't needed anymore.
| PaulHoule wrote:
| I can sure talk your ear off about that one as I went way too
| far into the semantic web rabbit hole.
|
| Training LLMs to use 'tools' of various types is a great
| idea, as is running them inside frameworks that check that
| their output satisfies various constraints. Still, certain
| problems remain: the NP-complete nature of SAT solving (many
| intelligent-systems problems, such as word problems you'd
| expect an A.I. to solve, boil down to SAT solving), the
| halting problem, Gödel's theorem, and the like. I understand
| Doug Hofstadter has softened his positions lately, but I
| think many of the problems set up in this book
|
| https://en.wikipedia.org/wiki/G%C3%B6del,_Escher,_Bach
|
| (particularly the Achilles & Tortoise dialog) still stand
| today, as cringey as that book seems to me in 2025.
| dr_dshiv wrote:
| Good old fashioned AI, amirite
| tlb wrote:
| We learned something pretty big and surprising from each new
| generation of LLM, for a small fraction of the time and cost
| of a new particle accelerator or space telescope. Compared to
| other big science projects, they're giving pretty good bang
| for the buck.
| bravura wrote:
| Check out "Attention as an RNN" by Feng et al (2024), with Bengio
| as a co-author. https://arxiv.org/pdf/2405.13956
|
| Abstract: The advent of Transformers marked a significant
| breakthrough in sequence modelling, providing a highly performant
| architecture capable of leveraging GPU parallelism. However,
| Transformers are computationally expensive at inference time,
| limiting their applications, particularly in low-resource
| settings (e.g., mobile and embedded devices). Addressing this, we
| (1) begin by showing that attention can be viewed as a special
| Recurrent Neural Network (RNN) with the ability to compute its
| many-to-one RNN output efficiently. We then (2) show that popular
| attention-based models such as Transformers can be viewed as RNN
| variants. However, unlike traditional RNNs (e.g., LSTMs), these
| models cannot be updated efficiently with new tokens, an
| important property in sequence modelling. Tackling this, we (3)
| introduce a new efficient method of computing attention's
| many-to-many RNN output based on the parallel prefix scan
| algorithm.
| Building on the new attention formulation, we (4) introduce
| Aaren, an attention-based module that can not only (i) be trained
| in parallel (like Transformers) but also (ii) be updated
| efficiently with new tokens, requiring only constant memory for
| inferences (like traditional RNNs). Empirically, we show Aarens
| achieve comparable performance to Transformers on 38 datasets
| spread across four popular sequential problem settings:
| reinforcement learning, event forecasting, time series
| classification, and time series forecasting tasks while being
| more time and memory-efficient.
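|
| Point (1) is easy to see in code: for a single query, softmax
| attention over a token stream reduces to a constant-memory
| recurrence with a running max for numerical stability (my own
| sketch of the idea, not the paper's Aaren module):
|
|     import numpy as np
|
|     def attend_streaming(q, kvs):
|         # carry (running max, softmax denominator, weighted
|         # value sum) across tokens: attention as an RNN
|         m, s, o = -np.inf, 0.0, 0.0
|         for k, v in kvs:
|             score = q @ k
|             m_new = max(m, score)
|             r = np.exp(m - m_new)        # rescale old terms
|             w = np.exp(score - m_new)
|             s, o, m = s * r + w, o * r + w * v, m_new
|         return o / s
|
|     d = 4
|     q, ks, vs = np.random.randn(d), np.random.randn(10, d), np.random.randn(10)
|     p = np.exp(ks @ q - (ks @ q).max()); p /= p.sum()
|     assert np.isclose(attend_streaming(q, zip(ks, vs)), p @ vs)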
| neom wrote:
| https://github.com/triadicresonance/triadic this was on one of
| the LLM discord servers a few weeks ago
| SiddanthEmani wrote:
| Titans takes a new approach to longer and faster memory
| compared to transformers.
|
| https://arxiv.org/html/2501.00663v1
| mbloom1915 wrote:
| AI aside, the world could also use an alternative electric
| transformer. The backlog from the main suppliers is 40+ weeks,
| and they're far too expensive. There is a MAJOR manufacturing
| and supply issue here, as all new-build construction competes
| for the same equipment...
| jostmey wrote:
| My guess is that new architectures will be about doing more with
| less compute. For example, are there architectures that can
| operate at lower bit precision, or better turn components off
| and on as required by the task? (Toy sketch of the latter below.)
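|
| A toy sketch of the "turn components off and on" part, in the
| spirit of mixture-of-experts routing with a hard top-1 gate
| (my own illustration, not any particular paper's method):
|
|     import numpy as np
|
|     def top1_moe(x, gate_w, experts):
|         # route each row to one expert, so only ~1/len(experts)
|         # of the FFN compute actually runs per token
|         choice = np.argmax(x @ gate_w, axis=-1)
|         out = np.empty_like(x)
|         for e, expert in enumerate(experts):
|             mask = choice == e
|             if mask.any():
|                 out[mask] = expert(x[mask])  # only active rows
|         return out
|
|     rng = np.random.default_rng(0)
|     d, n_exp = 8, 4
|     gate_w = rng.standard_normal((d, n_exp))
|     experts = [(lambda W: (lambda h: np.tanh(h @ W)))
|                (rng.standard_normal((d, d))) for _ in range(n_exp)]
|     y = top1_moe(rng.standard_normal((16, d)), gate_w, experts)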
| freeone3000 wrote:
| I'm working with a group on an RL core that uses models as
| tools, for explainable agentic tasks with actual discovery.
| kolinko wrote:
| Not a new model per se, but a new algorithm for inference -
| https://kolinko.github.io/effort/
___________________________________________________________________
(page generated 2025-02-14 23:00 UTC)