[HN Gopher] Byte latent transformer: Patches scale better than tokens
___________________________________________________________________
Byte latent transformer: Patches scale better than tokens
Author : dlojudice
Score : 79 points
Date : 2025-05-12 16:55 UTC (6 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| dlojudice wrote:
| This BLT approach is why "AI research is stalling" takes are
| wrong. Dynamic byte-level patches instead of tokens seem
| genuinely innovative, not just scaling up the same architecture.
| Better efficiency AND better handling of edge cases? Actual
| progress. The field is still finding clever ways to rethink
| fundamentals.
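|
| Roughly, the patching idea: a small byte-level LM scores the
| entropy of each next byte, and a new patch starts wherever
| entropy spikes. A toy sketch of that rule (the bigram "LM" and
| the threshold here are illustrative stand-ins, not the paper's
| actual model):
|
|   import math
|   from collections import Counter
|
|   def next_byte_entropy(prefix: bytes, corpus: bytes) -> float:
|       # Stand-in for BLT's small byte LM: entropy of the next
|       # byte estimated from bigram counts in a reference corpus.
|       if not prefix:
|           return 8.0  # max uncertainty before any context
|       followers = Counter(corpus[i + 1]
|                           for i in range(len(corpus) - 1)
|                           if corpus[i] == prefix[-1])
|       total = sum(followers.values())
|       if total == 0:
|           return 8.0
|       return -sum((c / total) * math.log2(c / total)
|                   for c in followers.values())
|
|   def entropy_patches(text: bytes, corpus: bytes,
|                       threshold: float = 2.0):
|       # Start a new patch wherever entropy exceeds the
|       # threshold, so predictable spans become long patches
|       # and surprising spans become short ones.
|       patches, start = [], 0
|       for i in range(1, len(text)):
|           if next_byte_entropy(text[:i], corpus) > threshold:
|               patches.append(text[start:i])
|               start = i
|       patches.append(text[start:])
|       return patches
|
|   data = b"the cat sat on the mat"
|   print(entropy_patches(data, corpus=data))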
| zamalek wrote:
| I think the sentiment (at least my sentiment) is that
| "mainstream ML" has fallen into the transformer local minimum,
| and given the weight of the players in that space it will take
| a huge amount of force to move them out of it.
|
| The likes of this, Mercury Coder, and even RWKV are definitely
| hopeful - but there's a pitch black shadow of hype and
| speculation to outshine.
| anon291 wrote:
| I disagree. Most AI innovation today is around things like
| agents, integrations, and building out use cases. This is
| possible because transformers have made human-like AI
| possible for the first time in the history of humanity. These
| use cases will remain the same even if the underlying
| architecture changes. The number of people working on new
| architectures today is far greater than the number working on
| neural networks in 2017, when 'Attention Is All You Need' came
| out. Nevertheless, actual ML model researchers are only a small
| portion of the total ML/AI community, and this is fine.
| janalsncm wrote:
| > AI innovation today
|
| I think you are talking about something else. In my
| opinion, integration is very different from fundamental ML
| research.
| anon291 wrote:
| There is more fundamental ML research today than at any
| other point in history, including in non-transformer
| architectures. That is my point. It doesn't seem that way
| because 90%+ of 'ML research' has nothing to do with
| fundamental ML and is instead research around
| applications, which are indifferent to the underlying
| model at the end of the day.
| Retric wrote:
| The sheer scale of computation and data available is what's
| pushing AI to near-human levels. The same algorithms in
| 1980 wouldn't be nearly as useful.
| mdaniel wrote:
| I've secretly wondered if the next (ahem) quantum leap in
| output quality will arrive with quantum computing wherein
| answering 10,000 if statements simultaneously would
| radically change the inference pipeline
|
| But I am also open to the fact that I may be thinking of
| this in terms of 'faster horses' and not the right
| question
| spindump8930 wrote:
| It's not clear how your perception of quantum computing
| would lead to 'faster horses' in the current view of NN
| architectures - keep in mind that the common view of
| 'exploring many paths simultaneously' is at best an
| oversimplification (https://scottaaronson.blog/?p=2026).
|
| That said, perhaps advances in computing fundamentals
| would lead to something entirely new (and not at all
| horselike).
| anon291 wrote:
| If you could tie the loss function of a neural network
| to the excitation state of a quantum system, then
| presumably letting the system settle at its energy
| minimum would be equivalent to a training step, but
| perhaps much faster.
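|
| That's essentially the quantum-annealing framing: encode the
| loss as the energy of an Ising/QUBO system and let physics
| settle it. A classical stand-in (simulated annealing on a toy
| Ising energy; the couplings J are made up for illustration):
|
|   import math, random
|
|   # Toy Ising energy E(s) = -sum_{i<j} J[i][j]*s[i]*s[j],
|   # spins s[i] in {-1,+1}. Here it plays the role of the loss.
|   J = [[0, 1, -2],
|        [1, 0, 1],
|        [-2, 1, 0]]
|
|   def energy(s):
|       n = len(s)
|       return -sum(J[i][j] * s[i] * s[j]
|                   for i in range(n) for j in range(i + 1, n))
|
|   def anneal(steps=2000, temp=2.0, cooling=0.995, seed=0):
|       rng = random.Random(seed)
|       s = [rng.choice((-1, 1)) for _ in range(len(J))]
|       for _ in range(steps):
|           i = rng.randrange(len(s))
|           before = energy(s)
|           s[i] *= -1                     # propose one spin flip
|           dE = energy(s) - before
|           # Metropolis rule: always keep downhill moves,
|           # occasionally keep uphill ones while still hot.
|           if dE > 0 and rng.random() >= math.exp(-dE / temp):
|               s[i] *= -1                 # reject: flip back
|           temp *= cooling
|       return s, energy(s)
|
|   print(anneal())  # settles near the energy minimum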
| anon291 wrote:
| It's true, but you can't deny the importance of the
| architecture. It's pretty clear that using simple
| perceptrons would _not_ have led us down the same path.
| Retric wrote:
| Sure, but I think a reasonable corollary is that new
| algorithms and architectures will show their strengths
| when new realms of computation become available.
| spindump8930 wrote:
| If you consider most of the dominant architectures in
| deep-learning approaches, transformers are remarkably
| generic. If you reduce transformer-like architectures to
| "position independent iterated self attention with
| intermediate transformations", they can support ~all
| modalities and incorporate other representations (e.g.
| convolutions, CLIP style embeddings, graphs or sequences
| encoded with additional position embeddings). On top of
| that, they're very compute friendly.
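|
| As a concrete reading of that description (a minimal sketch;
| layer norms and multi-head splitting omitted, and the shapes
| are arbitrary):
|
|   import numpy as np
|
|   def softmax(x, axis=-1):
|       e = np.exp(x - x.max(axis=axis, keepdims=True))
|       return e / e.sum(axis=axis, keepdims=True)
|
|   def block(x, Wq, Wk, Wv, Wo, W1, W2):
|       # One round of "self attention with intermediate
|       # transformations". Nothing here depends on position:
|       # order only enters through whatever embeddings x
|       # already carries.
|       q, k, v = x @ Wq, x @ Wk, x @ Wv
|       attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
|       x = x + attn @ Wo                       # residual
|       return x + np.maximum(0, x @ W1) @ W2   # MLP + residual
|
|   rng = np.random.default_rng(0)
|   d, n = 16, 5        # model width, sequence length (tiny)
|   Ws = [rng.normal(0, 0.1, (d, d)) for _ in range(4)]
|   Ws += [rng.normal(0, 0.1, (d, 4 * d)),
|          rng.normal(0, 0.1, (4 * d, d))]
|   x = rng.normal(size=(n, d))  # bytes, image patches, nodes...
|   for _ in range(3):           # iterate the same generic block
|       x = block(x, *Ws)
|   print(x.shape)               # (5, 16): same shape in and out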
|
| Two of the largest weaknesses seem to be auto-regressive
| sampling (not unique to the base architecture) and
| expensive self attention over very long contexts (whether
| sequence shaped or generic graph shaped). Many researchers
| are focusing efforts there!
|
| Also see: https://www.isattentionallyouneed.com/
| anon291 wrote:
| Transformers are very close to some types of feed-forward
| networks. The difference is that transformers can be
| trained in parallel without the need for auto-regression
| (which is slow for training, but kind of nice for
| streaming, low-latency inference). It's a mathematical
| trick. RWKV makes it obvious.
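|
| One way to see the trick: with a linear (softmax-free)
| attention kernel, the exact same outputs come from either a
| parallel pass over the whole sequence (training) or a fixed-
| size recurrent state (streaming inference). A toy sketch, not
| RWKV's actual formulation:
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   n, d = 6, 4
|   q, k, v = rng.normal(size=(3, n, d))
|
|   # Parallel / training view: causal linear attention via an
|   # explicit lower-triangular mask.
|   mask = np.tril(np.ones((n, n)))
|   out_parallel = (mask * (q @ k.T)) @ v
|
|   # Recurrent / inference view: carry a d x d state and
|   # process one token at a time.
|   state = np.zeros((d, d))           # running sum of k_j v_j^T
|   out_recurrent = np.zeros((n, d))
|   for i in range(n):
|       state += np.outer(k[i], v[i])
|       out_recurrent[i] = q[i] @ state
|
|   print(np.allclose(out_parallel, out_recurrent))  # True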
| janalsncm wrote:
| I think DeepSeek (v3 and r1) showed us that there's still a ton
| of meat on the bone for fundamental research and optimization.
| Lerc wrote:
| Absolutely, I have seen so many good ideas that have not yet
| made it into notable trained models.
|
| A lot of that is because you need to have a lot more faith
| than "seems like a good idea" before you spend a few million
| on a training run that depends on it.
|
| Some of it is because when the models released now began
| training, a lot of those ideas hadn't been published yet.
|
| Time will resolve most of that: cheaper and more performant
| hardware will allow a lot of those ideas to be tested without
| the massive commitment required to build the leading-edge
| models.
| Workaccount2 wrote:
| The big guys are almost certainly incinerating millions a
| day on training "maybe it could show some promise"
| techniques. With the way things are right now, they are
| probably green lighting everything to find an edge.
| joe_the_user wrote:
| I don't think you're understanding what the "stall" arguments
| are saying.
|
| Certainly tweaks to performance continue, but as I understand
| it, the stalling argument looks at the tendency of broad,
| "subjective" LLM performance to not get beyond a certain level.
| Basically, the massive projects to throw more data and
| training at the thing result in more marginal _apparent_
| improvements than the jumps we saw with GPT 2-3-3.5-4.
|
| The situation imo is that at some point, once you've ingested
| and trained on all the world's digitized books, all the
| coherent parts of the Internet, etc., you hit a limit on what
| you get with just "predict next" training. More information
| after this is more of the same at a higher level.
|
| But again, no doubt, progress on the level of algorithms will
| continue (DeepSeek was an indication of what's possible). But
| the situation is that such progress essentially gives us
| adequate LLMs faster, rather than any progress towards
| "general intelligence".
|
| Edit: clarity and structure
| gwern wrote:
| It is pretty much the same scaling, though:
| https://arxiv.org/pdf/2412.09871#page=10 It just lets you avoid
| some of the pathologies of BPEs.
| spindump8930 wrote:
| This paper is very cool, comes from respected authors, and is a
| very nice idea with good experiments (FLOP-controlled
| comparisons). It shouldn't be seen as a wall-breaking
| innovation though. From the paper:
|
| > Existing transformer libraries and codebases are designed to
| be highly efficient for tokenizer-based transformer
| architectures. While we present theoretical flop matched
| experiments and also use certain efficient implementations
| (such as FlexAttention) to handle layers that deviate from the
| vanilla transformer architecture, our implementations may yet
| not be at parity with tokenizer-based models in terms of wall-
| clock time and may benefit from further optimizations.
|
| And unfortunately, wall-clock deficiencies mean that any
| quality improvement needs to overcome that additional scaling
| barrier before any big (meaning expensive) runs can risk using
| it.
| armcat wrote:
| This was previously discussed 5 months ago:
| https://news.ycombinator.com/item?id=42415122 (84 comments).
|
| As an aside - I am a big fan of Luke Zettlemoyer and his team at
| the University of Washington. They've been doing cool NLP
| research for years!
| entilzha wrote:
| Great to see our paper here again! Since the paper release, we've
| also released model weights here for anyone interested in
| building on top of it: https://huggingface.co/facebook/blt. We
| also added HF Hub code to easily load the model
| https://github.com/facebookresearch/blt?tab=readme-ov-file#l....
___________________________________________________________________
(page generated 2025-05-12 23:00 UTC)