[HN Gopher] Byte Latent Transformer: Patches Scale Better Than Tokens
       ___________________________________________________________________
        
       Byte Latent Transformer: Patches Scale Better Than Tokens
        
       Author : zxexz
       Score  : 264 points
       Date   : 2024-12-14 06:36 UTC (16 hours ago)
        
 (HTM) web link (ai.meta.com)
 (TXT) w3m dump (ai.meta.com)
        
       | bloomingkales wrote:
       | I thought we're supposed to be plateauing!?
        
         | ArnoVW wrote:
          | We are. Plateauing doesn't mean you make no progress at all;
          | arguably, that is exactly what you would call "plateaued".
         | 
         | The argument of plateauing is not that AI is fundamentally
         | impossible. The argument is that just dumping more data and
         | more compute on the problem, using the same approach, has
         | diminishing returns.
         | 
         | It's that statistical inference is not how the human mind works
         | (not exclusively) and thus that we are not _guaranteed_ to be
         | able to replicate all traits of human intelligence by brute
         | forcing.
         | 
         | Of course we can and will still improve the algorithms. But the
         | question remains whether tweaks like these, as cool and useful
         | they may be to solve certain issues, will be enough by
         | themselves.
         | 
         | Since it remains statistical in nature, my position is "no".
        
           | logicchains wrote:
           | > that we are not guaranteed to be able to replicate all
           | traits of human intelligence by brute forcing.
           | 
           | We know from complexity theory that transformers with chain
           | of thought are guaranteed to be able to reproduce a
           | significant fraction of human reasoning, anything in the
           | complexity class PTIME: https://arxiv.org/abs/2310.07923
        
             | fl0id wrote:
              | I don't think this paper says what you claim. It says chain
              | of reasoning and its length can improve transformer
              | performance, not that this represents a significant fraction
              | of human reasoning, or that it is even reasoning at all.
        
         | random3 wrote:
         | who's "we"?
        
       | qouteall wrote:
       | Related quote from Karpathy:
       | 
       | Tokenization is at the heart of much weirdness of LLMs. Do not
       | brush it off.
       | 
       | * Why can't LLM spell words? Tokenization.
       | 
       | * Why can't LLM do super simple string processing tasks like
       | reversing a string? Tokenization.
       | 
       | * Why is LLM worse at non-English languages (e.g. Japanese)?
       | Tokenization.
       | 
       | * Why is LLM bad at simple arithmetic? Tokenization.
       | 
       | * Why did GPT-2 have more than necessary trouble coding in
       | Python? Tokenization.
       | 
       | * Why did my LLM abruptly halt when it sees the string
       | "<|endoftext|>"? Tokenization.
       | 
       | * What is this weird warning I get about a "trailing whitespace"?
       | Tokenization.
       | 
       | * Why the LLM break if I ask it about "SolidGoldMagikarp"?
       | Tokenization.
       | 
       | * Why should I prefer to use YAML over JSON with LLMs?
       | Tokenization.
       | 
       | * Why is LLM not actually end-to-end language modeling?
       | Tokenization.
       | 
       | * What is the real root of suffering? Tokenization.
        
         | withinboredom wrote:
         | It's weird because I'm pretty sure my brain does something
         | similar when I speed read. I don't actually, usually, read the
         | words; instead I recognize the shape of the words (most common
         | words) then I jump to the subject of the paragraphs and break
         | down the meaning of the whole page in a second or so.
        
           | Timwi wrote:
           | That's generally true, but you also have the ability to stop
           | and look closer if you want to. If someone asks you to count
           | the letters in a word, you will stop to look at the letters
           | individually. If you see an unfamiliar word like
           | SolidGoldMagikarp, you can stop and break it apart.
           | Tokenization prevents LLMs from doing this.
        
             | kimixa wrote:
              | Generally the current crop of LLMs seem like pretty good
              | analogues of the "scan reading" immediate, instinctual
              | response to a stimulus, but they seem to completely lack
              | the higher level that can then go "Wait, that doesn't seem
              | right, let's go back over that again". Like hallucinations,
              | or seeing "faces" in dark shadows until you look again,
              | it's doing a pretty good emulation of some level of
              | consciousness.
              | 
              | Is that a fundamental difference in the level of
              | processing? I haven't seen that sort of second-tier logic
              | emerge from increasing scale yet, but will that come with
              | time? I'm not sure.
        
               | visarga wrote:
               | You can prompt the model to do that kind of "stream of
               | mind" process. It will maximize modeling uncertainty.
               | This is my prompt:
               | 
               | > Write in a raw, real-time stream-of-consciousness
               | style, as if actively solving a problem. Your response
               | should feel like unpolished notes--messy, exploratory,
               | and authentic. Show your full thought process, including
               | missteps, dead ends, and course corrections. Use markers
               | to signal mental states: Insights: "Wait -", "Hold on -",
               | "Oh -", "Suddenly seeing -", "This connects to -".
               | Testing: "Testing with -", "Breaking this down -",
               | "Running an example -", "Checking if -". Problems: "Stuck
               | on -", "This doesn't work because -", "Need to figure out
               | -", "Not quite adding up -". Progress: "Making headway
               | -", "Starting to see the pattern -", "Explains why -",
               | "Now it makes sense -". Process: "Tracing the logic -",
               | "Following this thread -", "Unpacking this idea -",
               | "Exploring implications -". Uncertainty: "Maybe -",
               | "Could be -", "Not sure yet -", "Might explain -".
               | Transitions: "This leads to -", "Which means -",
               | "Building on that -", "Connecting back to -". Lean into
               | real-time realizations: "Wait, that won't work
               | because..." or "Ah, I missed this..." Show evolving
               | understanding through short paragraphs, with natural
               | pauses where ideas shift. Structure your thought
               | evolution as follows: Begin with an initial take: "This
               | might work because..." or "At first glance..." Identify
               | problems or angles: "Actually, this doesn't hold up
               | because..." Test examples or counterexamples: "Let me try
               | -", "What happens if -". Seek deeper patterns: "I'm
               | seeing a connection -", "This ties back to -". Link
               | broader implications: "This means -", "If this holds,
               | then -". Admit confusion openly: "I don't get this yet",
               | "Something's missing here". Reveal partial understanding:
               | "I see why X, but not Y". Show failures and iterations:
               | "Still not right - trying another approach". Embrace a
               | debugging mindset, treating ideas like code--break them
               | into steps, test logic, reveal failure modes, and
               | iterate. Skip introductions and conclusions. Stop when
               | you solve the problem or find clear next steps. Use
               | short, direct sentences to mimic real-time thinking. The
               | goal is to capture the messy, evolving nature of problem-
               | solving and thought refinement.
               | 
               | Just try this, you can insert at any point in a LLM chat
               | session. I built it by reverse engineering the QwQ-32B
               | model responses with Claude. QwQ itself is based on the
               | GPT-o1 method.
        
               | astrange wrote:
               | I've tried prompts like this with Claude, but it can get
               | so nitpicky of itself that it runs out of space for the
               | actual answer. It seems it does help to train the model
               | to do it.
        
             | PaulHoule wrote:
              | I've often wanted to talk with an LLM about its
              | tokenization (e.g. how many tokens are there in "the
              | simplest of phrases"). I wonder, if you fed it information
              | about its tokenization (text like "rabbit is spelled r, a,
              | b, b, i, t"), whether it could talk about it.
        
           | dr_dshiv wrote:
           | Well said!!
           | 
            | I'm waiting for reading studies on AI-generated text; that's
            | a different kind of speed reading.
        
           | entilzha wrote:
           | (Author Here)
           | 
            | In editing we couldn't find a good place for this, so we cut
            | it from the current version, but at one point we had discussed
            | a parallel with the information density of speech as described
            | in one paper. Essentially, the paper found that in languages
            | that are less information dense per syllable, speakers speak
            | faster to achieve an information rate similar to languages
            | with higher density per syllable. You could see patching by
            | entropy as paralleling this, if you consider that low-entropy
            | bytes (in terms of Shannon entropy) are less information dense.
        
         | orbital-decay wrote:
         | Meta's approach doesn't seem to throw out character grouping
         | entirely, it just makes it dynamic.
        
         | saurik wrote:
          | In all seriousness: why has it been years now and it feels like
          | there is no incremental engineering-level progress on these
          | issues? It seems like some manual intervention in the
          | tokenization, to at least remove exceptional tokens and add
          | some semantics to how numbers are broken up, would be a quick
          | win.
        
           | falcor84 wrote:
           | >In all seriousness: why has it been years now and it feels
           | like there is no incremental engineering-level progress on
           | these issues?
           | 
           | From where I'm standing, LLMs appear to be the fastest moving
           | technological field in history.
        
             | PaulHoule wrote:
             | A field can seem to be going quickly and going nowhere at
             | the same time. Or rather a new technique can be invented
             | and then exhausted in the time it takes somebody to get a
             | PhD. (See
             | https://en.wikipedia.org/wiki/Renormalization_group applied
             | to phase transitions, which turned up just in time for the
             | physics job crisis of 1970)
             | 
             | I didn't ever believe that there was going to be a GPT-5
             | trained with exponentially more text and resources. Not
             | only is there not enough text, but that's the path to ruin.
             | Why?
             | 
              | Cycle time. Two years ago we had little idea of how those
              | models work, so I knew there was huge room for improving
              | performance. That gets the cost down, lets you put the
              | models on your device, and speeds up development. If I
              | can train 10 models in the time it takes you to train 1
              | model, I can make much faster progress.
             | 
              | However, even a GPT-15 trained with a Dyson sphere is going
              | to struggle to sort things. (Structurally, a pure LLM can't
              | do that!) My #1 beef with Microsoft's Copilot is this: ask
              | it whether it can sort a certain list of items (either a
              | list you are discussing with it, or say "states of the
              | United States ordered by percent water area") and it will
              | say yes; ask it what it thinks the probability is that it
              | will get the list in the right order and it will say "very
              | high"; but when you try it, the list comes out totally
              | wrong.
              | 
              | It is equally unable to "help me make an atom bomb", except
              | that in the bomb case it will say that it can't, while in
              | the sorting case it says it can.
             | 
             | The obvious answer is that it should use tools to sort.
             | That's right but the problem of "knowing what you can
             | really do with your tools" is philosophically challenged.
             | (With problems so intractable it leads people like Roger
             | Penrose to conclude "I couldn't do math if I wasn't a
             | thetan")
        
               | JustAndy wrote:
                | I'm not really sure I understand your sorting example;
                | maybe try it out in GPT and post the link to show exactly
                | what you mean.
                | 
                | The refusal of the model is something trained into the
                | model by the process of RLHF, and it can also be
                | untrained, by the process of abliteration [1].
                | 
                | Also, LLMs are already capable of using tools at this
                | very moment [2].
               | 
               | [1]: https://huggingface.co/blog/mlabonne/abliteration
               | [2]: https://www.anthropic.com/news/analysis-tool
        
               | PaulHoule wrote:
               | I'm deliberately blurring refusal with having an accurate
               | picture of its own abilities and, past that, having an
               | accurate picture of of what it can do given tools. Both
               | are tested by                  "Can you X?"
               | 
               | With refusal you find just how shallow it is because it
               | really will answer all sorts of questions that are
               | "helpful" in making a nuclear bomb but when you ask it
               | directly it shuts up. In another sense nothing it does is
               | "helpful" because it's not going to hunt down some people
               | in central asia who have 50kg of U235 burning a hole in
               | their pocket for you, which is what would actually
               | "help".
               | 
                | I use tool-using LLMs frequently, but I find they often
                | need help using their tools. It is a lot of fun to talk
                | to Windsurf about the struggles it has with its tools,
                | and it feels strangely satisfying to help it out.
        
             | saurik wrote:
              | You totally ignored "on these issues". You are essentially
              | saying there is no need to work on that because they worked
              | on something else, which is an extremely strange response
              | to something that feels like a really trivial win, and
              | should be shocking.
              | 
              | Whether you like it or not, it is entirely fair to look at
              | an entire ecosystem and ask why some trivial thing that
              | everyone talks about all the time hasn't seen any attention,
              | even if the entire ecosystem is getting widespread
              | advancement.
              | 
              | Like, I think it would also be fair to complain about how
              | bad the hinge on AirPods is, causing the case to explode
              | when dropped and your earbuds to fly everywhere
              | (potentially getting very dirty), as well as wear out and
              | cause spurious activation (leading to audio routing issues
              | and rapid battery drain).
             | 
              | To then point out that this is one of the most successful
              | consumer devices in recent years, was a remarkable
              | improvement over what came before, and is a continuing
              | achievement of engineering, as they do in fact get better
              | in amazing ways every couple of years, is more than just a
              | non sequitur: it is frankly just annoying.
        
           | entilzha wrote:
           | (Author Here)
           | 
            | There is at least some work on character-based modeling, but
            | it hasn't scaled well before. The challenge, I think, with
            | something more ad hoc for exceptional tokens is that it's hard
            | to see gains, since they are by definition infrequent. If the
            | text is rare enough, BPE should produce many single-byte
            | tokens, so current models actually expend more compute on
            | these rare sequences.
            | 
            | BLT scales well because it expends less compute (by patching)
            | on more predictable (low-entropy) byte sequences. Current
            | models only get this benefit to some degree, if it's a larger
            | BPE token, but that only goes so far.
            | 
            | So they are really two related, but different, motivations.
        
         | rjtavares wrote:
         | Goodbye tokenization problems, hello encoding problems!
        
         | Vetch wrote:
         | !Long post warning!
         | 
          | Tokenization is often scapegoated for many transformer
          | limitations. I suppose it's because reading about the many
          | limitations of the transformer architecture is harder than
          | dumping everything on tokenization (which, to be fair, is often
          | indirectly involved in or exacerbates some deeper issue).
         | 
         | > Why can't LLM spell words? Tokenization.
         | 
         | LLMs can spell if you ask them to though. And there have been
         | investigations into this capability (ref:2). Tokenization makes
         | computations that involve spelling more difficult, but this is
         | downstream of deeper computational limitations of the
         | architecture.
         | 
         | > Why can't LLM do super simple string processing tasks like
         | reversing a string?
         | 
         | Ditto.
         | 
         | > Why is LLM worse at non-English languages (e.g. Japanese)?
         | Tokenization.
         | 
          | Tokenization is also implicitly performing compression. If your
          | tokenizer's corpus is focused only on English, basic
          | information theory explains why it'll be less efficient for
          | other languages. The net effect is longer sequences whose
          | tokens are, on average, less information dense for non-English
          | languages.
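          | 
          | As a quick illustration (rough sketch; assumes the tiktoken
          | package, and the encoding name and sample strings are just
          | examples, not anything from the paper):
          | 
          |   import tiktoken
          | 
          |   enc = tiktoken.get_encoding("cl100k_base")
          |   samples = [
          |       "The quick brown fox jumps over the lazy dog.",
          |       "素早い茶色の狐がのろまな犬を飛び越える。",
          |   ]
          |   for text in samples:
          |       # Non-English text typically needs more tokens per
          |       # character, i.e. each token carries less information.
          |       print(len(text), "chars ->",
          |             len(enc.encode(text)), "tokens")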
         | 
         | > Why is LLM bad at simple arithmetic? Tokenization.
         | 
          | Tokenization could treat digits separately, and I believe
          | llama2 did this. But OpenAI built tiktoken, which does not, and
          | llama3 uses tiktoken.
         | 
         | The transformer architecture also has limitations that make
         | (default) arithmetic computations involving carries difficult
         | to learn. You can read more about this in (ref:1).
         | 
         | > Why did my LLM abruptly halt when it sees the string
         | "<|endoftext|>"? Tokenization.
         | 
          | Why should it not? Either way, it doesn't have to halt, as the
          | sampler can just ignore this. But the distribution will still
          | condition on this as a change of topic. The question should
          | probably be: why did the LLM suddenly assign high probability
          | to a stop token before finishing whatever it was writing?
         | 
         | > What is this weird warning I get about a "trailing
         | whitespace"? Tokenization.
         | 
          | Modeling decisions for how to treat whitespace are upstream of
          | tokenization. These choices affect how the LLM models word
          | boundaries. Things can be fine most of the time, until they
          | aren't.
         | 
         | There's also the issue of softmax. The way softmax is typically
         | applied forces the model to always assign importance to some
         | tokens, even when no strong relationships exist between them.
         | This in turn leads to the model disproportionately dumping its
         | focus on often semantically unimportant tokens like whitespace
         | or punctuation. Misallocating attention in this manner can lead
         | to wasting representational capacity due to overemphasizing
         | unimportant tokens, perhaps inducing spurious correlations on
         | whitespace. This issue propagates through the model, possibly
         | leading to unexpected negative downstream effects.
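          | 
          | A tiny demonstration of that forced allocation (plain PyTorch,
          | nothing model-specific):
          | 
          |   import torch
          | 
          |   # Even when every key is equally irrelevant, softmax still
          |   # hands out a full unit of attention mass.
          |   scores = torch.tensor([-9.0, -9.0, -9.0, -9.0])
          |   print(torch.softmax(scores, dim=-1))  # uniform 0.25 each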
         | 
         | > Why the LLM break if I ask it about "SolidGoldMagikarp"?
         | Tokenization.
         | 
         | One step down, it's really a result of high dimensional random
         | vectors.
         | 
         | > Why should I prefer to use YAML over JSON with LLMs?
         | Tokenization.
         | 
         | > Why did GPT-2 have more than necessary trouble coding in
         | Python? Tokenization.
         | 
          | Tokenization does make counting more difficult, but the net
          | benefit to programming languages where whitespace can be
          | semantically meaningful is a strong positive. Even when
          | whitespace is not meaningful, long runs of it are often
          | encountered. Not being careful about devoting tokenization
          | effort to whitespace will significantly degrade code modeling
          | ability in LLMs.
         | 
         | > Why is LLM not actually end-to-end language modeling?
         | Tokenization.
         | 
         | This is correct, but it is not necessarily the case that a
         | character or byte based model will automatically be better. The
         | issue is that LLMs as currently devised spend the same amount
         | of computation per token. This creates the immediate problem of
         | making meaningful sequences, which will now be substantially
         | longer, substantially more expensive to compute, generate and
         | store in memory. This is what the posted paper seeks to address
         | over naive byte level modeling. Although it's unclear from the
         | provided tables if what's claimed is actually what's occurring.
         | 
         | Character level modeling will also make learning long ranged
         | dependencies harder. Subword tokenization also aids in
         | memorization, which can be useful in learning from the tail of
         | the distribution. The following idea is based on (ref:5).
         | 
         | Next-token prediction can be modeled as a hierarchical sampling
         | process where problem instances (topics, natural language
         | tasks), which are mixture distributions, are drawn from a
         | metadistribution, and then data points (eg various strings) are
         | sampled from specific subpopulations (ie clusters of task
         | types) within those instances. Here, memorization is a key
         | strategy since there's initial uncertainty about which features
         | are relevant for predicting the next token. Particularly for
         | rare examples, memorizing their details acts as a starting
         | point for associating particular patterns with specific
         | subpopulations, in turn allowing more accurate prediction of
         | new points.
         | 
         | From that starting point, the model can eventually refine its
         | associations as it encounters more data. This is key for
         | example, when sampling from the tail of the distribution where
         | data about subpopulations will be more limited. Making
         | memorization and learning longer dependencies more challenging
         | can lead to final models that face more difficulty during ICL
         | inference, which depends, among other things, on the ability to
          | infer which task is being sampled from a mixture distribution.
         | 
         | > What is the real root of suffering? Tokenization.
         | 
         | A better candidate is over-generalization.
         | 
         | 1: https://arxiv.org/abs/2310.16028
         | 
          | 2: What do tokens know about their characters and how do they
          | know it? (https://aclanthology.org/2022.naacl-main.179.pdf)
         | 
         | 3: https://arxiv.org/abs/2406.10851
         | 
         | 4: Between words and characters: A Brief History of Open-
         | Vocabulary Modeling and Tokenization in NLP
         | (https://arxiv.org/abs/2112.10508)
         | 
         | 5: https://arxiv.org/abs/2012.06421
        
       | fabmilo wrote:
        | I am gonna read this paper and the other latent-sentence paper
        | later today. I have always advocated for this kind of solution;
        | together with latent sentence search, it should get us to the
        | next level of AI. Amazing work from Meta.
        
         | CuriousSkeptic wrote:
          | The sentence thing being this one?
          | https://ai.meta.com/research/publications/large-concept-mode...
          | 
          | I don't get it, isn't this concept modelling exactly what's
          | going on in the deeper layers of current LLMs?
        
           | kmacdough wrote:
            | Perhaps it does some similar grouping of content, but this
            | more directly incentivizes longer-term grouping of tokens
            | into abstract concepts. I agree that it's not obvious this
            | would perform better than letting the model build its own
            | structures for grouping tokens, but the proof is in the
            | pudding; the technique led to improved results for a given
            | model & training size. This newer approach gives the model
            | the freedom to build its own breakpoints, but still bakes
            | the idea into the algorithm itself.
           | 
           | What it means is a harder question. Perhaps transformers are
           | simply an inefficient computational structure for this
           | process? Perhaps a more flexible computational structure
           | would integrate this step more efficiently? Perhaps
           | Transformers are efficient enough, but our
           | learning/densifying isn't? Or perhaps it's such a core
           | powerful step that it might as well be built into the algo
           | regardless? Much to learn.
        
       | flimflamm wrote:
       | To create a patch, a small model is used to predict the
       | likelihood for the next character in the input string. Input
       | string: 'Lazy dog jumped over a fence.' Use the model to predict
       | the likelihood of each character.
       | 
        | For example: 100% sure the next character is 'a'. Or maybe it's
        | 10% sure it's 'a', 10% sure it's 'b', and so on.
       | 
       | Then we chunk character estimates together. How many characters?
       | Enough characters so that the total uncertainty (entropy) in each
       | chunk is about the same. And there you have your 'patch' (or
       | 'token').
        
         | yorwba wrote:
         | > How many characters? Enough characters so that the total
         | uncertainty (entropy) in each chunk is about the same.
         | 
         | That's not how it's described in Section 2.3 of the paper. They
         | only use the entropy of the next byte and whether it exceeds a
         | threshold (Global Constraint) or is larger than the preceding
         | byte's entropy by another threshold (Approx. Monotonic
         | Constraint).
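          | 
          | A rough sketch of what those two boundary rules could look
          | like (the thresholds and the interface to the small entropy
          | model are made up, not the paper's implementation):
          | 
          |   import math
          | 
          |   def entropy(dist):
          |       # Shannon entropy (nats) of a next-byte distribution
          |       return -sum(p * math.log(p) for p in dist if p > 0)
          | 
          |   def patch_starts(next_byte_dists, mode="global",
          |                    theta_g=2.0, theta_r=0.5):
          |       # next_byte_dists[i]: predicted distribution for byte i
          |       # returns the indices where a new patch begins
          |       H = [entropy(d) for d in next_byte_dists]
          |       starts = [0]
          |       for i in range(1, len(H)):
          |           if mode == "global":
          |               new = H[i] > theta_g           # Global Constraint
          |           else:
          |               new = H[i] - H[i-1] > theta_r  # Approx. Monotonic
          |           if new:
          |               starts.append(i)
          |       return starts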
         | 
         | That does mean that long repetitive sequences can result in
         | pathologically long patches, as demonstrated in Appendix E.
         | 
         | But what I'm really curious about is the "small CNN byte-level
         | model with 2-byte context" in Figure 3 (f), because it's never
         | mentioned in any other part of the paper.
        
           | flimflamm wrote:
           | "That's not how it's described" - Thanks for the correction!
        
           | entilzha wrote:
           | (Author Here)
           | 
            | Good description! Maybe what the parent got mixed up on is
            | that an alternate way to view this is as trying to chunk
            | bytes so each chunk carries roughly similar information.
            | E.g., we initially tried a bunch of patching schemes, such as
            | keeping a running total of entropy until the total exceeds a
            | threshold, but ended up finding that simple things worked
            | better.
            | 
            | I'll see if we can add more information about the small CNN
            | in the next update to the arXiv paper.
        
             | psb217 wrote:
             | One way of thinking about the "Approximate Monotonic
             | Constraint" is that you're running a quick and dirty edge
             | detector on the entropy. Ie, you're clipping based on the
             | gradient of per-byte entropy wrt timestep compared to
             | detecting an edge based on gradient of per-pixel intensity
             | wrt pixel coordinates. It would be interesting to look at
             | the raw sequences of per-byte entropies to see how strongly
             | these sorts of "edges" correlate with human interpretable
             | boundaries (words, prefixes, suffixes, etc).
        
               | yorwba wrote:
               | Figure 4 plots the entropy of each byte in "Daenerys
                | Targaryen is in Game of Thrones, a fantasy epic by George
               | R.R. Martin."
        
             | cschmidt wrote:
             | I'm curious if you're aware of some papers from around 2005
             | on using contextual entropy to do unsupervised word
             | segmentation on Chinese, and other languages that don't use
             | spaces for word boundaries.
             | 
             | https://aclanthology.org/Y03-1017/
             | https://aclanthology.org/I05-1009/
             | https://aclanthology.org/P06-2056/
             | 
             | Exactly the same approach of segmenting a word when the
             | entropy goes up compared to the previous byte.
        
               | entilzha wrote:
               | At least I wasn't aware of this work, but thanks for the
               | refs! I'm always curious to read papers from 10-20+ years
               | ago that have similarly inspired ideas. If it makes
               | sense, we'll mention those in the next related work
               | update.
        
               | ted_dunning wrote:
               | It is also quite similar to Carl de Marcken's work for
               | segmenting text and speech. He phrased everything in
               | terms of minimum description length (MDL), but that is
               | trivially the same thing as local entropy.
               | 
               | https://dspace.mit.edu/handle/1721.1/7191?show=full
        
         | dv_dt wrote:
          | So a variant might be to try using some standard compression
          | algorithm to train with?
        
       | paraschopra wrote:
       | My notes:
       | 
        | It's a 3-component model.
        | 
        | - Encoder: Takes byte groupings and outputs a hidden
        | state/encoding called patches
        | 
        | - Transformer: Takes these patch encodings in autoregressive
        | fashion
        | 
        | - Decoder: Takes the encodings processed by the transformer and
        | outputs bytes
        | 
        | Loss is byte-to-byte cross-entropy (next-byte prediction).
        | 
        | How they group bytes:
        | 
        | - Use entropy thresholds: if a sequence of bytes has entropy
        | lower than a threshold, group them
        | 
        | - This is a learned model (from data)
        | 
        | Why this helps over current byte-pair tokenization in LLMs:
        | 
        | - The encoder/decoder essentially act as a "learnable"
        | tokenization scheme
        | 
        | - Better efficiency tradeoffs (for highly predictable sequences
        | of bytes, the encoder can "offload" computation effort from the
        | main transformer)
        | 
        | - History teaches us that end-to-end learned systems beat
        | human-designed mechanisms
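        | 
        | A toy sketch of that three-component shape (sizes are made up,
        | mean-pooling stands in for the paper's cross-attention between
        | bytes and patches, and the decoder is collapsed to a linear
        | head):
        | 
        |   import torch
        |   import torch.nn as nn
        | 
        |   D, VOCAB = 512, 256  # 256 possible byte values
        | 
        |   class ToyBLT(nn.Module):
        |       def __init__(self):
        |           super().__init__()
        |           self.byte_emb = nn.Embedding(VOCAB, D)
        |           # 1. local encoder over raw bytes
        |           self.encoder = nn.TransformerEncoderLayer(
        |               D, nhead=8, batch_first=True)
        |           # 2. global transformer over the shorter patch sequence
        |           self.latent = nn.TransformerEncoderLayer(
        |               D, nhead=8, batch_first=True)
        |           # 3. local decoder producing next-byte logits
        |           self.decoder = nn.Linear(D, VOCAB)
        | 
        |       def forward(self, byte_ids, starts):
        |           # byte_ids: [B, T] ints in 0..255
        |           # starts: patch start indices from the entropy model
        |           h = self.encoder(self.byte_emb(byte_ids))  # [B, T, D]
        |           ends = list(starts[1:]) + [byte_ids.shape[1]]
        |           patches = torch.stack(
        |               [h[:, s:e].mean(dim=1)
        |                for s, e in zip(starts, ends)], dim=1)
        |           z = self.latent(patches)                   # [B, P, D]
        |           # next-byte logits per patch; train with byte-level
        |           # cross-entropy against the true next bytes
        |           return self.decoder(z)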
        
         | CuriousSkeptic wrote:
          | > History teaches us that end-to-end learned systems beat
          | human-designed mechanisms
         | 
         | I think this may need some qualifiers
         | 
          | Even byte representations are human-designed encodings. I would
          | think a human-designed decoder of such encodings must be more
          | efficient than learning one. Sure, bytes encoding a stream of
          | Unicode code points map fairly easily to useful information.
          | But bytes representing a zip-compressed collection of PDF files?
          | 
          | I did wonder, though, whether training on pixel encodings
          | rather than text encodings (essentially brute-forcing OCR, like
          | humans do) will be more flexible in the end than being limited
          | to text encodings.
        
           | paraschopra wrote:
           | >Even byte representations are human designed encodings
           | 
           | The point is that it can model any sequence of bytes. It's
           | what-follows-what that matters, not how we're encoding it.
        
         | hiddencost wrote:
          | > History teaches us that end-to-end learned systems beat
          | human-designed mechanisms
         | 
          | Depends how far back you go. History teaches us that everything
          | is a trade-off between model size, inference time, training
          | time, and training data size, once you're at the Pareto
          | frontier. And that cheap approximations can allow you to trade
          | for more expensive computation elsewhere.
          | 
          | That lesson has been obscured for the last decade because of
          | (1) "the bitter lesson" of scaling and (2) the fact that we're
          | blowing past benchmarks too quickly.
          | 
          | I do agree that learned models are better if they're free
          | (compare the distribution of filter banks learned by a neural
          | acoustic model to those approximated by mel-frequency cepstral
          | coefficients), but once you start hitting scaling limits, cheap
          | heuristics start creeping back in.
          | 
          | BPE was a huge advancement over a fixed vocab, for example.
        
           | entilzha wrote:
           | (Author Here)
           | 
            | Related thought: I think BPE is quite a good, cheap inductive
            | bias to have in a model, which is part of what made it
            | challenging to scale better against. I also suspect this is
            | part of why, with less training FLOPs, BPE is better (left
            | side of Figure 1): BLT has to expend some of its FLOPs budget
            | to recover/learn some of this useful bias. With more training
            | FLOPs this becomes a smaller fraction of the budget, though,
            | leading to better scaling.
        
       | dr_dshiv wrote:
       | Does this mean AI can pre-train on binaries?
        
         | bloomingkales wrote:
          | Some believe AI can now output compiled binaries (e.g. "update
          | Notepad.exe with this feature").
          | 
          | We all think AI writing code for us will be the end, but it
          | might be an even simpler takeover.
        
           | 8n4vidtmkvmk wrote:
           | That just sounds worse though? We can't validate the change
           | is correct if we can't read the code. It is interesting
           | though
        
             | hackernewds wrote:
             | at some point you can't or won't be allowed to do any
             | validations
        
       | dewijones92 wrote:
       | notebooklm
       | https://notebooklm.google.com/notebook/77fe83ee-35b3-4a9a-a3...
        
         | ricardobeat wrote:
         | Interesting, this is one of the worst NotebookLM examples I've
         | seen so far. They are interjecting way too often and breaking
         | the rhythm. Is generation quality going down due to the
         | popularity of the service?
        
           | bratao wrote:
           | Yeah, super strange. One cannot finish a sentence without the
           | other interjecting.
        
           | stuartjohnson12 wrote:
           | Big successful launch, hype for the product lead, product
           | lead moves on, product goes to shit. Another classic for the
           | Google graveyard.
        
             | marviel wrote:
             | Core team just moved on to something else:
             | https://werebuilding.ai/
        
               | 8n4vidtmkvmk wrote:
               | That has to be the worst landing page ever
        
               | marviel wrote:
               | no comment; not my site, just sharing
        
             | throwaway20222 wrote:
             | We are working directly with the Notebook team from the
             | outside, and while they have lost the original product
             | lead, the team in general is seemingly really well
             | supported, staffed with talented folks, and actively trying
             | to understand what the end user wants from the product.
             | Hardly a day goes by that they are not actively trying to
             | get more feedback and share where they are heading.
             | 
             | I do think it is fair to say they had been caught off guard
             | by the success of the program and are trying to catch up.
             | Maybe this is just a bit of drift as they are figuring it
             | all out? Or maybe I am too charitable.
        
               | mrbungie wrote:
               | > and while they have lost the original product lead
               | 
               | Doesn't matter how talented the team is, that is a
               | massive red flag as not even six months have passed since
               | the launch.
        
             | refulgentis wrote:
             | Uh, the product manager left last week.
        
         | yeahwhatever10 wrote:
         | People like this?
        
       | amelius wrote:
       | Why can't the tokenization be implicit, so we only feed bytes (or
       | characters) to the model?
        
         | killerstorm wrote:
         | It can work, but you have more tokens / weaker performance.
         | 
         | People tested it and it was worse.
        
         | entilzha wrote:
         | (Author Here)
         | 
          | Not sure what you mean by implicit? If you mean just treating
          | bytes as tokens, one issue you run into is that your sequence
          | lengths get quite long, so compared to a regular token LLM you
          | can't pack as many bytes into a batch, which makes you pretty
          | FLOP-inefficient and so you scale worse. You could make the
          | model smaller to compensate, but then the model isn't as good.
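          | 
          | A back-of-envelope sketch of that cost (the ~4 bytes per BPE
          | token ratio and the sizes below are assumptions, not numbers
          | from the paper):
          | 
          |   def layer_flops(L, d):
          |       attn = 4 * L**2 * d   # QK^T and AV matmuls, roughly
          |       ffn = 16 * L * d**2   # two 4x-wide MLP matmuls, roughly
          |       return attn + ffn
          | 
          |   d, tokens = 4096, 2048
          |   raw_bytes = tokens * 4    # assume ~4 bytes per BPE token
          |   print(layer_flops(raw_bytes, d) / layer_flops(tokens, d))
          |   # roughly 5x the compute per layer for the same text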
        
       | PaulHoule wrote:
       | The summer that BERT came out I was working at a startup that was
       | using character-based CNN models for classification. We were
        | thinking a lot about alternate representations; other members of
        | the team were keen on word vectors, but I wasn't, particularly
        | because it seemed the documents we were working on frequently
        | had out-of-dictionary words, because those words were important,
        | and because discarding them would lead to failure.
       | 
       | (We were working on "foundation models" too, so it's not just
       | being out-of-dictionary in the final model that's a problem but
       | being out-of-dictionary in the foundation model which is more
       | expensive to train.)
       | 
       | We were doing OK with character based models for classification
       | but people believed that storing the "dictionary" inside the
       | neural net was not a good use of the neural net so there was a
       | lot of enthusiasm for tokens.
       | 
       | Meanwhile I felt so sure that schemes like Word2Vec were doomed
       | that I had left an earlier project using RNNs where the goal was
       | text understanding with a foundation model made by training an
       | RNN to write fake abstracts for case reports from PubMed.
       | 
       | When byte-pair encoding was introduced I remember telling people
       | in a meeting that it was the first tokenization scheme we'd
       | looked at that I could endorse.
       | 
       | I have to admit though that I wish we could work at the character
        | level.
        
         | yndoendo wrote:
          | Do you mean that all produced output must be a chain of words
          | found in a dictionary?
          | 
          | The real world has humans creating and using non-dictionary
          | words to communicate daily. A good example is "notify", which
          | is defined in the dictionary, versus "notifier", which is not
          | and is used to describe "a means to notify someone". The code
          | to send an email notification is an "email notifier"; then
          | there are text message, voice call, and call-center call-back
          | notifiers...
          | 
          | All industries and organizations have jargon, custom-defined
          | words not found in a dictionary, and non-distinctive acronyms.
          | 
          | How would ML output be useful if it cannot handle real-world
          | communication and is limited to lab-sanitized, in-dictionary-
          | only responses?
        
           | phh wrote:
            | That's the OP's point. At the time, the community was split
            | between word-level, which has the shortcomings you're
            | describing, and byte-level, which is uselessly compute
            | intensive. BPE was the first reasonable in-between. BLT
            | improves on BPE by making the compression learnable rather
            | than precomputed.
        
           | entilzha wrote:
           | (Author here)
           | 
            | If I understand your question right, this is one of the
            | reasons BPE is nice and why the parent liked it. For any
            | character sequence, provided the characters are in the
            | alphabet used to create the BPE vocab, there are no unknown
            | words/sequences. One downside of some previous tokenization
            | methods, e.g. dictionary-based methods, is that you could
            | have unknown/UNK tokens.
            | 
            | In our paper with bytes, we also avoid the UNK issue, since
            | we can have an embedding for every possible byte (there
            | aren't that many), and for sequences of bytes we use hash
            | embeddings (although we did test n-gram lookups for the top-K
            | frequent byte n-grams in the training data).
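            | 
            | A minimal sketch of the hash-embedding idea (bucket count,
            | n-gram size, and the pooling are made-up choices, not the
            | paper's):
            | 
            |   import zlib
            |   import torch
            |   import torch.nn as nn
            | 
            |   NUM_BUCKETS, DIM = 500_000, 512
            |   table = nn.Embedding(NUM_BUCKETS, DIM)
            | 
            |   def ngram_embedding(seq: bytes, n: int = 3):
            |       # CRC32 is just a cheap deterministic hash; collisions
            |       # are tolerated, so the table stays a fixed size no
            |       # matter how many distinct n-grams show up
            |       ids = [zlib.crc32(seq[i:i + n]) % NUM_BUCKETS
            |              for i in range(len(seq) - n + 1)]
            |       return table(torch.tensor(ids)).mean(dim=0)
            | 
            |   vec = ngram_embedding("SolidGoldMagikarp".encode())  # no UNK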
        
             | cs702 wrote:
             | Nice work. Thank you for commenting on HN!
             | 
             | Did you guys try using an RNN or some other kind of DNN to
             | encode the patches?
        
               | entilzha wrote:
                | I don't believe so, or at least if someone tried it, it
                | didn't work well enough that I remember :). Some of the
                | motivation for the architecture changes in encoding
                | patches stemmed from finding FLOP-efficient ways to
                | express relationships between byte sequences. E.g.,
                | having a long context window makes sense when dealing
                | with tokens, but you don't need as long an attention
                | window if you're attending over byte sequences to make
                | patch representations, since the patch representations
                | will implicitly be part of a longer context window in
                | terms of number of patches.
        
               | cs702 wrote:
               | Thanks for the quick reply!
               | 
               | Interesting. I would have thought one of those "minimum
               | viable" RNNs (like https://arxiv.org/abs/2410.01201)
               | would have been ideal for this. I might tinker a bit with
               | this :-)
        
         | binarymax wrote:
         | I was really excited for CANINE [1] but it never really went
         | anywhere. Tokens are a hack. They work for the most part, but
         | it's clear when they don't.
         | 
         | [1] https://arxiv.org/abs/2103.06874
        
       | nodja wrote:
        | From my understanding this not only removes tokenization but also
        | sampling, correct?
        | 
        | Sampling can be a pain point of LLMs, but it also enables
        | interesting usages, like forcing a grammar so the model always
        | outputs valid JSON, tuning temperature to get a more varied
        | distribution, XTC sampling, etc.
        | 
        | What would be the equivalent of these in a BLT?
        | 
        | I can only think of providing the decoder an extra input of
        | allowed/prohibited bytes and running the decode over and over
        | until it outputs something valid; maybe there's a simpler and
        | more obvious approach.
        
         | yorwba wrote:
         | It doesn't remove sampling, and forcing grammar by specifying
         | allowed/prohibited bytes doesn't require running the decoder
         | over and over, you just compute the softmax at the output layer
         | over allowed bytes only and sample from those accordingly, same
         | as with BPE-based models.
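          | 
          | A minimal sketch of that masked softmax over the 256-way byte
          | distribution (the grammar step and byte values are just
          | illustrative):
          | 
          |   import torch
          | 
          |   def sample_allowed(logits, allowed, temperature=1.0):
          |       # logits: [256] next-byte logits from the decoder
          |       mask = torch.full_like(logits, float("-inf"))
          |       mask[allowed] = 0.0           # keep grammar-legal bytes
          |       probs = torch.softmax((logits + mask) / temperature,
          |                             dim=-1)
          |       return int(torch.multinomial(probs, 1))  # no retries
          | 
          |   # e.g. right after '{' a JSON grammar might allow only
          |   # whitespace or a double quote
          |   nxt = sample_allowed(torch.randn(256),
          |                        allowed=[0x20, 0x0A, 0x22])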
        
       | modeless wrote:
       | I really hope this works out. Death to tokenizers!
       | 
       | Interesting that it's a hierarchical structure but only two
       | levels of hierarchy. Stacking more levels seems like an obvious
       | direction for further research.
       | 
       | Note: I posted this comment on another related story[1] and the
       | author replied:
       | 
       | "Author here :), I do think it's a good direction to look into!
       | That said, aside from it being a bit too much to do at once,
       | you'd also have to be careful about how you distributed your FLOP
       | budget across the hierarchy. With two levels, you can make one
       | level (bytes/local encoder) FLOP efficient and the other
       | (patches/global encoder) FLOP intensive. You'd also need to find
       | a way to group patches into larger units. But ya, there are many
       | directions to go from here!"
       | 
       | [1] https://news.ycombinator.com/item?id=42413430
        
       ___________________________________________________________________
       (page generated 2024-12-14 23:00 UTC)