[HN Gopher] Byte latent transformer: Patches scale better than t...
       ___________________________________________________________________
        
       Byte latent transformer: Patches scale better than tokens
        
       Author : dlojudice
       Score  : 79 points
       Date   : 2025-05-12 16:55 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | dlojudice wrote:
       | This BLT approach is why "AI research is stalling" takes are
       | wrong. Dynamic byte-level patches instead of tokens seems
       | genuinely innovative, not just scaling up the same architecture.
        | Better efficiency AND better edge-case handling? Actual
       | progress. The field is still finding clever ways to rethink
       | fundamentals.
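        | 
        | Roughly, my reading of the patching idea (a sketch, not their
        | code): a small byte-level LM scores each position, and a new
        | patch starts wherever its next-byte entropy crosses a threshold.
        | Here small_lm and the threshold value are placeholders, not the
        | paper's actual entropy model or hyperparameters:
        | 
        |     import math
        | 
        |     def entropy(probs):
        |         return -sum(p * math.log(p) for p in probs if p > 0)
        | 
        |     # small_lm(prefix) -> next-byte probability distribution
        |     # (placeholder for the paper's small byte-level LM)
        |     def patch_boundaries(byte_seq, small_lm, threshold=2.0):
        |         patches, current = [], []
        |         for i, b in enumerate(byte_seq):
        |             probs = small_lm(byte_seq[:i])
        |             # high entropy = surprising byte => start a new patch
        |             if current and entropy(probs) > threshold:
        |                 patches.append(bytes(current))
        |                 current = []
        |             current.append(b)
        |         if current:
        |             patches.append(bytes(current))
        |         return patches
        | 
        | The big latent transformer then works on one representation per
        | patch instead of one per token, so predictable stretches of
        | bytes cost fewer steps than surprising ones.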
        
         | zamalek wrote:
         | I think the sentiment (at least my sentiment) is that
         | "mainstream ML" has fallen into the transformer local minimum,
         | and given the weight of the players in that space it will take
         | a huge amount of force to move them out of it.
         | 
          | The likes of this, Mercury Coder, and even RWKV are definitely
         | hopeful - but there's a pitch black shadow of hype and
         | speculation to outshine.
        
           | anon291 wrote:
           | I disagree. Most AI innovation today is around things like
           | agents, integrations, and building out use cases. This is
           | possible because transformers have made human-like AI
            | possible for the first time in the history of humanity. These
           | use-cases will remain the same even if the underlying
            | architecture changes. Far more people are working on new
            | architectures today than were working on neural networks in
            | 2017 when 'attention is all you need' came out.
           | Nevertheless, actual ML model researchers are only a small
           | portion of the total ML/AI community, and this is fine.
        
             | janalsncm wrote:
             | > AI innovation today
             | 
             | I think you are talking about something else. In my
             | opinion, integration is very different from fundamental ML
             | research.
        
               | anon291 wrote:
               | There is more fundamental ML research today than at any
               | other point in history, including in non-transformer
               | architectures. That is my point. It doesn't seem that way
               | because 90%+ of 'ML research' has nothing to do with
               | fundamental ML and is instead research around
               | applications, which are indifferent to the underlying
               | model at the end of the day. That was the point of my
               | comment.
        
             | Retric wrote:
             | The sheer scale of computation and data available is what's
             | pushing AI to near human levels. The same algorithms in
             | 1980 wouldn't be nearly as useful.
        
               | mdaniel wrote:
               | I've secretly wondered if the next (ahem) quantum leap in
               | output quality will arrive with quantum computing wherein
               | answering 10,000 if statements simultaneously would
               | radically change the inference pipeline
               | 
               | But I am also open to the fact that I may be thinking of
               | this in terms of 'faster horses' and not the right
               | question
        
               | spindump8930 wrote:
               | It's not clear how your perception of quantum computing
               | would lead to 'faster horses' in the current view of NN
               | architectures - keep mind that the common view of
               | 'exploring many paths simultaneously' is at best an
               | oversimplification (https://scottaaronson.blog/?p=2026).
               | 
               | That said, perhaps advances in computing fundamentals
               | would lead to something entirely new (and not at all
               | horselike).
        
               | anon291 wrote:
                | If you can tie the loss function of a neural network
                | to the energy state of a quantum system,
               | then presumably, letting the system settle at the energy
               | minimum would be equivalent to a training step, but
               | perhaps much faster.
        
               | anon291 wrote:
               | It's true, but you can't deny the importance of the
               | architecture. It's pretty clear that using simple
               | perceptrons would _not_ have led us down the same path.
        
               | Retric wrote:
               | Sure, but I think a reasonable corollary is that new
               | algorithms and architectures will show their strengths
               | when new realms of computation become available.
        
             | spindump8930 wrote:
              | If you consider most of the dominant architectures in
              | deep-learning-style approaches, transformers are remarkably
              | generic. If you reduce transformer-like architectures to
             | "position independent iterated self attention with
             | intermediate transformations", they can support ~all
             | modalities and incorporate other representations (e.g.
             | convolutions, CLIP style embeddings, graphs or sequences
             | encoded with additional position embeddings). On top of
             | that, they're very compute friendly.
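              | 
              | A bare-bones sketch of that reduction (single head, no
              | norms or masking, numpy purely for illustration, weights
              | passed in as stand-ins): without position information in
              | x, permuting the rows of x just permutes the output the
              | same way, which is the "position independent" part.
              | 
              |     import numpy as np
              | 
              |     def softmax(x, axis=-1):
              |         e = np.exp(x - x.max(axis=axis, keepdims=True))
              |         return e / e.sum(axis=axis, keepdims=True)
              | 
              |     # one self-attention + transformation step
              |     def block(x, Wq, Wk, Wv, W1, W2):
              |         q, k, v = x @ Wq, x @ Wk, x @ Wv
              |         a = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
              |         h = x + a                              # residual
              |         return h + np.maximum(0, h @ W1) @ W2  # MLP
              | 
              |     x = np.random.randn(5, 8)          # 5 items, d=8
              |     Ws = [np.random.randn(8, 8) * 0.1 for _ in range(5)]
              |     y = block(x, *Ws)                  # shape (5, 8)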
             | 
             | Two of the largest weaknesses seem to be auto-regressive
             | sampling (not unique to the base architecture) and
             | expensive self attention over very long contexts (whether
             | sequence shaped or generic graph shaped). Many researchers
             | are focusing efforts there!
             | 
             | Also see: https://www.isattentionallyouneed.com/
        
               | anon291 wrote:
               | Transformers are very close to some types of feed forward
               | networks. The difference is that transformers can be
               | trained in parallel without the need for auto-regression
               | (which is slow, for training, but kind of nice for
                | streaming, low-latency inference). It's a mathematical
               | trick. RWKV makes it obvious.
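                | 
                | A toy version of the trick (plain unnormalized linear
                | attention, not RWKV or a real transformer, just to show
                | the equivalence): the same causal map can run as a per-
                | step recurrence for streaming, or be computed for every
                | position at once for training.
                | 
                |     import numpy as np
                | 
                |     def recurrent(q, k, v):   # O(1) state per step
                |         S = np.zeros((k.shape[-1], v.shape[-1]))
                |         out = []
                |         for t in range(len(q)):
                |             S = S + np.outer(k[t], v[t])
                |             out.append(q[t] @ S)
                |         return np.stack(out)
                | 
                |     def parallel(q, k, v):    # whole sequence at once
                |         S = np.cumsum(k[:,:,None] * v[:,None,:], axis=0)
                |         return np.einsum('td,tdm->tm', q, S)
                | 
                |     q, k, v = (np.random.randn(6, 4) for _ in range(3))
                |     r, p = recurrent(q, k, v), parallel(q, k, v)
                |     assert np.allclose(r, p)  # same outputs either way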
        
         | janalsncm wrote:
         | I think DeepSeek (v3 and r1) showed us that there's still a ton
         | of meat on the bone for fundamental research and optimization.
        
           | Lerc wrote:
           | Absolutely, I have seen so many good ideas that have not yet
           | made it into notable trained models.
           | 
           | A lot of that is because you need to have a lot more faith
           | than "seems like a good idea" before you spend a few million
           | in training that depends upon it.
           | 
           | Some of it is because when the models released now began
            | training, a lot of those ideas hadn't been published yet.
           | 
            | Time will resolve most of that: cheaper and more performant
           | hardware will allow a lot of those ideas to be tested without
           | the massive commitment required to build the leading edge
           | models.
        
             | Workaccount2 wrote:
             | The big guys are almost certainly incinerating millions a
             | day on training "maybe it could show some promise"
             | techniques. With the way things are right now, they are
             | probably green lighting everything to find an edge.
        
         | joe_the_user wrote:
         | I don't think you're understanding what the "stall" arguments
         | are saying.
         | 
          | Certainly tweaks to performance continue, but as I understand
          | it, the stalling argument looks at the tendency of broad,
          | "subjective" LLM performance to not get beyond a certain level.
          | Basically, that the massive projects to throw more data and
          | training at the thing result in more marginal _apparent_
          | improvements than the jumps we saw with GPT 2 -> 3 -> 3.5 -> 4.
         | 
          | The situation imo is that at some point, once you've ingested
          | and trained on all the world's digitized books, all the
          | coherent parts of the Internet, etc., you hit a limit to what
          | you get with just "predict the next token" training. More data
          | after that is just more of the same at a higher level.
         | 
          | But again, no doubt, progress on the level of algorithms will
          | continue (DeepSeek was an indication of what's possible). But
          | such progress essentially gets you to adequate LLMs faster,
          | rather than making any progress towards "general intelligence".
         | 
         | Edit: clarity and structure
        
         | gwern wrote:
         | It is pretty much the same scaling, though:
         | https://arxiv.org/pdf/2412.09871#page=10 It just lets you avoid
         | some of the pathologies of BPEs.
        
         | spindump8930 wrote:
         | This paper is very cool, comes from respected authors, and is a
          | very nice idea with good experiments (FLOP-matched for
          | compute). It shouldn't be seen as a wall-breaking innovation
         | though. From the paper:
         | 
         | > Existing transformer libraries and codebases are designed to
         | be highly efficient for tokenizer-based transformer
         | architectures. While we present theoretical flop matched
         | experiments and also use certain efficient implementations
         | (such as FlexAttention) to handle layers that deviate from the
         | vanilla transformer architecture, our implementations may yet
         | not be at parity with tokenizer-based models in terms of wall-
         | clock time and may benefit from further optimizations.
         | 
         | And unfortunately wall-clock deficiencies mean that any quality
         | improvement needs to overcome that additional scaling barrier
          | before any big (i.e. expensive) runs can risk using it.
        
       | armcat wrote:
       | This was previously reported 5 months ago:
       | https://news.ycombinator.com/item?id=42415122 (84 comments).
       | 
       | As an aside - I am a big fan of Luke Zettlemoyer and his team at
       | the University of Washington. They've been doing cool NLP
       | research for years!
        
       | entilzha wrote:
       | Great to see our paper here again! Since the paper release, we've
        | also released model weights here for anyone interested in
       | building on top of it: https://huggingface.co/facebook/blt. We
       | also added HF Hub code to easily load the model
       | https://github.com/facebookresearch/blt?tab=readme-ov-file#l....
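        | 
        | If you just want to pull the files down, something like the
        | snippet below should work (it only downloads the checkpoint;
        | the BLT-specific loading code is in the repo README linked
        | above):
        | 
        |     from huggingface_hub import snapshot_download
        | 
        |     # fetch weights/config for facebook/blt to the local cache
        |     local_dir = snapshot_download(repo_id="facebook/blt")
        |     print(local_dir)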
        
       ___________________________________________________________________
       (page generated 2025-05-12 23:00 UTC)