[HN Gopher] Byte Latent Transformer: Patches Scale Better Than T...
___________________________________________________________________
Byte Latent Transformer: Patches Scale Better Than Tokens
Author : zxexz
Score : 264 points
Date : 2024-12-14 06:36 UTC (16 hours ago)
(HTM) web link (ai.meta.com)
(TXT) w3m dump (ai.meta.com)
| bloomingkales wrote:
| I thought we're supposed to be plateauing!?
| ArnoVW wrote:
| We are. Plateauing doesn't mean you don't make any progress.
| Arguably that is what you would call "plateaued".
|
| The argument of plateauing is not that AI is fundamentally
| impossible. The argument is that just dumping more data and
| more compute on the problem, using the same approach, has
| diminishing returns.
|
| It's that statistical inference is not how the human mind works
| (not exclusively) and thus that we are not _guaranteed_ to be
| able to replicate all traits of human intelligence by brute
| forcing.
|
| Of course we can and will still improve the algorithms. But the
| question remains whether tweaks like these, as cool and useful
| as they may be for solving certain issues, will be enough by
| themselves.
|
| Since it remains statistical in nature, my position is "no".
| logicchains wrote:
| > that we are not guaranteed to be able to replicate all
| traits of human intelligence by brute forcing.
|
| We know from complexity theory that transformers with chain
| of thought are guaranteed to be able to reproduce a
| significant fraction of human reasoning, anything in the
| complexity class PTIME: https://arxiv.org/abs/2310.07923
| fl0id wrote:
| I don't think this paper says what you claim. It says chain
| of thought and its length can improve transformer
| performance, not that this represents a significant fraction
| of human reasoning, or that it's even reasoning at all.
| random3 wrote:
| who's "we"?
| qouteall wrote:
| Related quote from Karpathy:
|
| Tokenization is at the heart of much weirdness of LLMs. Do not
| brush it off.
|
| * Why can't LLM spell words? Tokenization.
|
| * Why can't LLM do super simple string processing tasks like
| reversing a string? Tokenization.
|
| * Why is LLM worse at non-English languages (e.g. Japanese)?
| Tokenization.
|
| * Why is LLM bad at simple arithmetic? Tokenization.
|
| * Why did GPT-2 have more than necessary trouble coding in
| Python? Tokenization.
|
| * Why did my LLM abruptly halt when it sees the string
| "<|endoftext|>"? Tokenization.
|
| * What is this weird warning I get about a "trailing whitespace"?
| Tokenization.
|
| * Why the LLM break if I ask it about "SolidGoldMagikarp"?
| Tokenization.
|
| * Why should I prefer to use YAML over JSON with LLMs?
| Tokenization.
|
| * Why is LLM not actually end-to-end language modeling?
| Tokenization.
|
| * What is the real root of suffering? Tokenization.
| withinboredom wrote:
| It's weird because I'm pretty sure my brain does something
| similar when I speed read. I don't actually, usually, read the
| words; instead I recognize the shape of the words (most common
| words) then I jump to the subject of the paragraphs and break
| down the meaning of the whole page in a second or so.
| Timwi wrote:
| That's generally true, but you also have the ability to stop
| and look closer if you want to. If someone asks you to count
| the letters in a word, you will stop to look at the letters
| individually. If you see an unfamiliar word like
| SolidGoldMagikarp, you can stop and break it apart.
| Tokenization prevents LLMs from doing this.
| kimixa wrote:
| Generally the current crop of LLMs seem like pretty good
| analogues of the "scan reading" immediate instinctual
| response to stimulus, but they seem to completely lack the
| higher level that can then go "Wait, that doesn't seem
| right, let's go back over that again". Like hallucinations,
| or seeing "faces" in dark shadows until you look again,
| it's like it's doing a pretty good emulation of some level
| of consciousness.
|
| Is that a fundamental difference in the level of
| processing? I haven't seen that sort of second-tier logic
| pop up as emergent behavior from increasing scale
| yet, but will that come with time? I'm not sure.
| visarga wrote:
| You can prompt the model to do that kind of "stream of
| mind" process. It will maximize modeling uncertainty.
| This is my prompt:
|
| > Write in a raw, real-time stream-of-consciousness
| style, as if actively solving a problem. Your response
| should feel like unpolished notes--messy, exploratory,
| and authentic. Show your full thought process, including
| missteps, dead ends, and course corrections. Use markers
| to signal mental states: Insights: "Wait -", "Hold on -",
| "Oh -", "Suddenly seeing -", "This connects to -".
| Testing: "Testing with -", "Breaking this down -",
| "Running an example -", "Checking if -". Problems: "Stuck
| on -", "This doesn't work because -", "Need to figure out
| -", "Not quite adding up -". Progress: "Making headway
| -", "Starting to see the pattern -", "Explains why -",
| "Now it makes sense -". Process: "Tracing the logic -",
| "Following this thread -", "Unpacking this idea -",
| "Exploring implications -". Uncertainty: "Maybe -",
| "Could be -", "Not sure yet -", "Might explain -".
| Transitions: "This leads to -", "Which means -",
| "Building on that -", "Connecting back to -". Lean into
| real-time realizations: "Wait, that won't work
| because..." or "Ah, I missed this..." Show evolving
| understanding through short paragraphs, with natural
| pauses where ideas shift. Structure your thought
| evolution as follows: Begin with an initial take: "This
| might work because..." or "At first glance..." Identify
| problems or angles: "Actually, this doesn't hold up
| because..." Test examples or counterexamples: "Let me try
| -", "What happens if -". Seek deeper patterns: "I'm
| seeing a connection -", "This ties back to -". Link
| broader implications: "This means -", "If this holds,
| then -". Admit confusion openly: "I don't get this yet",
| "Something's missing here". Reveal partial understanding:
| "I see why X, but not Y". Show failures and iterations:
| "Still not right - trying another approach". Embrace a
| debugging mindset, treating ideas like code--break them
| into steps, test logic, reveal failure modes, and
| iterate. Skip introductions and conclusions. Stop when
| you solve the problem or find clear next steps. Use
| short, direct sentences to mimic real-time thinking. The
| goal is to capture the messy, evolving nature of problem-
| solving and thought refinement.
|
| Just try this, you can insert at any point in a LLM chat
| session. I built it by reverse engineering the QwQ-32B
| model responses with Claude. QwQ itself is based on the
| OpenAI o1 method.
| astrange wrote:
| I've tried prompts like this with Claude, but it can get
| so nitpicky of itself that it runs out of space for the
| actual answer. It seems it does help to train the model
| to do it.
| PaulHoule wrote:
| I've often wanted to talk with an LLM about its
| tokenization (e.g. how many tokens are there in "the
| simplest of phrases"?). I wonder, if you fed it information
| about its tokenization (text like "rabbit is spelled r, a,
| b, b, i, t"), whether it could talk about it.
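|
| One way to at least check the ground truth outside the model is
| a minimal sketch like this, assuming the tiktoken package is
| installed (the encoding name is just an example):
|
|   import tiktoken  # pip install tiktoken
|
|   enc = tiktoken.get_encoding("cl100k_base")
|   ids = enc.encode("the simplest of phrases")
|   # How many tokens the phrase becomes, and what each one covers
|   print(len(ids), [enc.decode([i]) for i in ids])
|
| Feeding that kind of spelling/token information back in is then
| just more text for the model to condition on.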
| dr_dshiv wrote:
| Well said!!
|
| I'm waiting for reading studies on AI-generated text; that's
| a different kind of speed reading.
| entilzha wrote:
| (Author Here)
|
| In editing we couldn't find a good place for this, so we cut
| it from the current version, but at one point we had discussed
| a parallel with the information density of speech as described
| in one paper. Essentially, the paper found that in languages
| that were less information dense per syllable, speakers spoke
| faster to achieve an information density similar to languages
| with higher density per syllable. You could see patching by
| entropy as paralleling this, if you consider that low entropy
| bytes (in the Shannon sense) are less information dense.
| orbital-decay wrote:
| Meta's approach doesn't seem to throw out character grouping
| entirely, it just makes it dynamic.
| saurik wrote:
| In all seriousness: why has it been years now and it feels like
| there is no incremental engineering-level progress on these
| issues? Like, it seems like manually intervening in the
| tokenization, at least to remove exceptional tokens and to add
| some semantics to how numbers are broken up, would be a quick
| win.
| falcor84 wrote:
| >In all seriousness: why has it been years now and it feels
| like there is no incremental engineering-level progress on
| these issues?
|
| From where I'm standing, LLMs appear to be the fastest moving
| technological field in history.
| PaulHoule wrote:
| A field can seem to be going quickly and going nowhere at
| the same time. Or rather a new technique can be invented
| and then exhausted in the time it takes somebody to get a
| PhD. (See
| https://en.wikipedia.org/wiki/Renormalization_group applied
| to phase transitions, which turned up just in time for the
| physics job crisis of 1970)
|
| I didn't ever believe that there was going to be a GPT-5
| trained with exponentially more text and resources. Not
| only is there not enough text, but that's the path to ruin.
| Why?
|
| Cycle time. Two years ago we had little idea of how those
| models work, so I knew there was huge room for improving
| performance. It gets the cost down, it lets you put the
| models on your device, and it speeds up development. If I
| can train 10 models in the time it takes you to train 1
| model I can make much faster progress.
|
| However even a GPT-15 trained with a Dyson sphere is going
| to struggle to sort things. (Structurally a pure LLM can't
| do that!) My #1 beef with Microsoft's Copilot is that if you
| ask it whether it can sort a certain list of items (either a
| list you are discussing with it, or say "states of the
| United States ordered by percent water area") it will say
| yes, and if you ask it what it thinks the probability is
| that it will get them in the right order it will say "very
| high", but when you try it the list comes out totally wrong.
|
| It is equally unable to "help me make an atom bomb" except
| in the bomb case it will say that it can't but in the
| sorting case it says it can.
|
| The obvious answer is that it should use tools to sort.
| That's right, but the problem of "knowing what you can
| really do with your tools" is philosophically challenging.
| (With problems so intractable it leads people like Roger
| Penrose to conclude "I couldn't do math if I wasn't a
| thetan")
| JustAndy wrote:
| I'm not really sure I understand your sorting example,
| maybe try it out in gpt and post the link to show exactly
| what you mean.
|
| The refusal of the model is something trained into the
| model by the process of rlhf, and it can also be
| untrained, by the process of abliteration [1].
|
| Also, LLMs are capable of using tools in this very moment
| [2].
|
| [1]: https://huggingface.co/blog/mlabonne/abliteration
| [2]: https://www.anthropic.com/news/analysis-tool
| PaulHoule wrote:
| I'm deliberately blurring refusal with having an accurate
| picture of its own abilities and, past that, having an
| accurate picture of what it can do given tools. Both
| are tested by "Can you X?"
|
| With refusal you find just how shallow it is because it
| really will answer all sorts of questions that are
| "helpful" in making a nuclear bomb but when you ask it
| directly it shuts up. In another sense nothing it does is
| "helpful" because it's not going to hunt down some people
| in Central Asia who have 50kg of U235 burning a hole in
| their pocket for you, which is what would actually
| "help".
|
| I use tool-using LLMs frequently, but I find they often
| need help using their tools. It is a lot of fun to talk to
| Windsurf about the struggles it has with its tools, and it
| feels strangely satisfying to help it out.
| saurik wrote:
| You totally ignored "on these issues" and are essentially
| saying there is no need to work on that because they worked on
| something else, which is extremely strange for a thing
| that feels like a really trivial win, and that should be
| shocking.
|
| Whether you like it or not, it is entirely fair to look at
| an entire ecosystem and ask why some trivial thing that
| everyone talks about all the time hasn't seen any attention
| even if the entire ecosystem is getting widespread
| advancement.
|
| Like, I think it would also be fair to complain about how
| bad the hinge on AirPods is, causing the case to explode
| when dropped and your earbuds to fly everywhere
| (potentially getting very dirty), as well as to wear out and
| cause spurious activation (leading to audio routing issues
| and rapid battery drain).
|
| To then point out that this is one of the most successful
| consumer devices in recent years and was a remarkable
| improvement over what came before, as well as a continuing
| achievement of engineering (they do in fact get better in
| amazing ways every couple of years), is more than just a non
| sequitur: it is frankly just annoying.
| entilzha wrote:
| (Author Here)
|
| There is at least some work on character-based modeling, but
| it hasn't scaled well before. The challenge, I think, with
| something more ad hoc for exceptional tokens is that it's hard
| to see gains, since they are by definition infrequent. If the
| text is rare enough, BPE should produce many single-byte
| tokens, so current models actually expend more compute on
| these rare sequences.
|
| BLT scales well because it expends less compute (by patching)
| on more predictable (low entropy) byte sequences. Current
| models only get this benefit to some degree, when it's a larger
| BPE token, but that only goes so far.
|
| So it's really two related, but different motivations.
| rjtavares wrote:
| Goodbye tokenization problems, hello encoding problems!
| Vetch wrote:
| !Long post warning!
|
| Tokenization is often scapegoated for many transformer
| limitations. I suppose it's because reading about the many
| limitations of the transformer architecture is harder than
| dumping everything on tokenization (which, to be fair, is often
| indirectly involved in or exacerbates some deeper issue).
|
| > Why can't LLM spell words? Tokenization.
|
| LLMs can spell if you ask them to though. And there have been
| investigations into this capability (ref:2). Tokenization makes
| computations that involve spelling more difficult, but this is
| downstream of deeper computational limitations of the
| architecture.
|
| > Why can't LLM do super simple string processing tasks like
| reversing a string?
|
| Ditto.
|
| > Why is LLM worse at non-English languages (e.g. Japanese)?
| Tokenization.
|
| Tokenization is also implicitly performing compression. If your
| tokenizer's corpus is focused only on English, basic
| information theory explains why it'll be less efficient for
| other languages. The net effect is longer sequences where
| tokens are less information dense for non-English languages, on
| average.
|
| > Why is LLM bad at simple arithmetic? Tokenization.
|
| Tokenization could treat digits separately, and I believe
| llama2 did this. But OpenAI built tiktoken, which does not do
| this, and llama3 uses tiktoken.
|
| The transformer architecture also has limitations that make
| (default) arithmetic computations involving carries difficult
| to learn. You can read more about this in (ref:1).
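|
| A quick way to see how a given tokenizer actually treats digits,
| assuming the tiktoken package (the split you get depends on the
| encoding):
|
|   import tiktoken
|
|   enc = tiktoken.get_encoding("cl100k_base")
|   # Show the pieces a number is broken into
|   print([enc.decode([i]) for i in enc.encode("12345 + 67890")])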
|
| > Why did my LLM abruptly halt when it sees the string
| "<|endoftext|>"? Tokenization.
|
| Why should it not? Either way, it doesn't have to halt, as the
| sampler can just ignore this. But the distribution will still
| condition on this as a change of topic switch. The question
| should probably be, why did the LLM suddenly assign high
| probability to a stop token before finishing whatever it was
| writing?
|
| > What is this weird warning I get about a "trailing
| whitespace"? Tokenization.
|
| Modeling decisions about how to treat whitespace are upstream of
| tokenization. These choices affect how the LLM models word
| boundaries. Things can be fine most of the time until they
| aren't.
|
| There's also the issue of softmax. The way softmax is typically
| applied forces the model to always assign importance to some
| tokens, even when no strong relationships exist between them.
| This in turn leads to the model disproportionately dumping its
| focus on often semantically unimportant tokens like whitespace
| or punctuation. Misallocating attention in this manner can lead
| to wasting representational capacity due to overemphasizing
| unimportant tokens, perhaps inducing spurious correlations on
| whitespace. This issue propagates through the model, possibly
| leading to unexpected negative downstream effects.
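|
| A toy illustration of that forced allocation, in plain NumPy:
| even when every key is equally irrelevant, softmax still hands
| out 100% of the attention mass.
|
|   import numpy as np
|
|   scores = np.array([-9.0, -9.0, -9.0, -9.0])  # all "irrelevant"
|   weights = np.exp(scores) / np.exp(scores).sum()
|   print(weights)  # [0.25 0.25 0.25 0.25], mass must go somewhere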
|
| > Why the LLM break if I ask it about "SolidGoldMagikarp"?
| Tokenization.
|
| One step down, it's really a result of high dimensional random
| vectors.
|
| > Why should I prefer to use YAML over JSON with LLMs?
| Tokenization.
|
| > Why did GPT-2 have more than necessary trouble coding in
| Python? Tokenization.
|
| Tokenization does make counting more difficult but the net
| benefit to programming languages where whitespace can be
| semantically meaningful is a strong positive. Even when
| whitespace is not meaningful, long strings of them can often be
| encountered. Not being careful about devoting tokenization
| effort to whitespace will significantly degrade code modeling
| ability in LLMs.
|
| > Why is LLM not actually end-to-end language modeling?
| Tokenization.
|
| This is correct, but it is not necessarily the case that a
| character or byte based model will automatically be better. The
| issue is that LLMs as currently devised spend the same amount
| of computation per token. This creates the immediate problem of
| making meaningful sequences, which will now be substantially
| longer, substantially more expensive to compute, generate and
| store in memory. This is what the posted paper seeks to address
| over naive byte level modeling. Although it's unclear from the
| provided tables if what's claimed is actually what's occurring.
|
| Character-level modeling will also make learning long-range
| dependencies harder. Subword tokenization also aids in
| memorization, which can be useful in learning from the tail of
| the distribution. The following idea is based on (ref:5).
|
| Next-token prediction can be modeled as a hierarchical sampling
| process where problem instances (topics, natural language
| tasks), which are mixture distributions, are drawn from a
| metadistribution, and then data points (eg various strings) are
| sampled from specific subpopulations (ie clusters of task
| types) within those instances. Here, memorization is a key
| strategy since there's initial uncertainty about which features
| are relevant for predicting the next token. Particularly for
| rare examples, memorizing their details acts as a starting
| point for associating particular patterns with specific
| subpopulations, in turn allowing more accurate prediction of
| new points.
|
| From that starting point, the model can eventually refine its
| associations as it encounters more data. This is key for
| example, when sampling from the tail of the distribution where
| data about subpopulations will be more limited. Making
| memorization and learning longer dependencies more challenging
| can lead to final models that face more difficulty during ICL
| inference, which depends, among other things, on the ability to
| infer which task is being drawn from a mixture distribution.
|
| > What is the real root of suffering? Tokenization.
|
| A better candidate is over-generalization.
|
| 1: https://arxiv.org/abs/2310.16028
|
| 2: What do tokens know about their characters and how do they
| know it? (https://aclanthology.org/2022.naacl-main.179.pdf)
|
| 3: https://arxiv.org/abs/2406.10851
|
| 4: Between words and characters: A Brief History of Open-
| Vocabulary Modeling and Tokenization in NLP
| (https://arxiv.org/abs/2112.10508)
|
| 5: https://arxiv.org/abs/2012.06421
| fabmilo wrote:
| I am gonna read this paper and the other latent sentence paper
| later today. I have always advocated for this kind of solution;
| together with latent sentence search it should get us to the
| next level of AI. Amazing work from Meta.
| CuriousSkeptic wrote:
| Sentence thing being this one?
| https://ai.meta.com/research/publications/large-concept-mode...
|
| I don't get it, isn't this concept modelling exactly what's
| going on in the deeper layers of current LLMs?
| kmacdough wrote:
| Perhaps it does some similar grouping of content, but this
| more directly incentivizes longer-term grouping of tokens
| into abstract concepts. I agree that it's not obvious this
| would perform better than letting the model build its own
| structures for grouping tokens, but the proof is in the
| pudding; the technique led to improved results for a given
| model & training size. This newer approach gives the model
| the freedom to build its own breakpoints, but still bakes
| the idea into the algorithm itself.
|
| What it means is a harder question. Perhaps transformers are
| simply an inefficient computational structure for this
| process? Perhaps a more flexible computational structure
| would integrate this step more efficiently? Perhaps
| Transformers are efficient enough, but our
| learning/densifying isn't? Or perhaps it's such a core
| powerful step that it might as well be built into the algo
| regardless? Much to learn.
| flimflamm wrote:
| To create a patch, a small model is used to predict the
| likelihood for the next character in the input string. Input
| string: 'Lazy dog jumped over a fence.' Use the model to predict
| the likelihood of each character.
|
| For example: 100% sure the next character is
| 'a'. Or maybe it's 10% sure it's 'a', 10% sure it's 'b',
| and so on.
|
| Then we chunk character estimates together. How many characters?
| Enough characters so that the total uncertainty (entropy) in each
| chunk is about the same. And there you have your 'patch' (or
| 'token').
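|
| As a toy illustration of the entropy part (independent of how
| the chunking itself is done), the uncertainty of the small
| model's predicted next-character distribution is just Shannon
| entropy:
|
|   import math
|
|   def shannon_entropy(probs):
|       # H = -sum(p * log2(p)) over the predicted distribution
|       return -sum(p * math.log2(p) for p in probs if p > 0)
|
|   print(shannon_entropy([1.0]))       # 0.0, fully predictable
|   print(shannon_entropy([0.1] * 10))  # ~3.32 bits, quite uncertain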
| yorwba wrote:
| > How many characters? Enough characters so that the total
| uncertainty (entropy) in each chunk is about the same.
|
| That's not how it's described in Section 2.3 of the paper. They
| only use the entropy of the next byte and whether it exceeds a
| threshold (Global Constraint) or is larger than the preceding
| byte's entropy by another threshold (Approx. Monotonic
| Constraint).
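|
| Roughly, in code (a sketch of those two rules as described above;
| the thresholds are made-up placeholders and the entropies would
| come from the small byte-level LM):
|
|   def patch_starts(entropy, theta=1.5, theta_rel=0.1,
|                    monotonic=False):
|       starts = [0]
|       for i in range(1, len(entropy)):
|           if monotonic:
|               # Approx. Monotonic Constraint: entropy jumps up
|               # relative to the preceding byte
|               if entropy[i] - entropy[i - 1] > theta_rel:
|                   starts.append(i)
|           elif entropy[i] > theta:
|               # Global Constraint: entropy exceeds a fixed threshold
|               starts.append(i)
|       return starts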
|
| That does mean that long repetitive sequences can result in
| pathologically long patches, as demonstrated in Appendix E.
|
| But what I'm really curious about is the "small CNN byte-level
| model with 2-byte context" in Figure 3 (f), because it's never
| mentioned in any other part of the paper.
| flimflamm wrote:
| "That's not how it's described" - Thanks for the correction!
| entilzha wrote:
| (Author Here)
|
| Good description! Maybe what the parent got mixed up on is that
| an alternate way to view this is as trying to chunk bytes to
| have roughly similar information. E.g., we initially tried a
| bunch of patching schemes, such as keeping a running total of
| entropy until the total exceeds a threshold, but ended up
| finding that simple things worked better.
|
| I'll see if we can add more information about the small CNN
| in a next update to arXiv paper.
| psb217 wrote:
| One way of thinking about the "Approximate Monotonic
| Constraint" is that you're running a quick and dirty edge
| detector on the entropy. Ie, you're clipping based on the
| gradient of per-byte entropy wrt timestep compared to
| detecting an edge based on gradient of per-pixel intensity
| wrt pixel coordinates. It would be interesting to look at
| the raw sequences of per-byte entropies to see how strongly
| these sorts of "edges" correlate with human interpretable
| boundaries (words, prefixes, suffixes, etc).
| yorwba wrote:
| Figure 4 plots the entropy of each byte in "Daenerys
| Targaryen is in Game of Thrones, a fantasy epic by George
| R.R. Martin."
| cschmidt wrote:
| I'm curious if you're aware of some papers from around 2005
| on using contextual entropy to do unsupervised word
| segmentation on Chinese, and other languages that don't use
| spaces for word boundaries.
|
| https://aclanthology.org/Y03-1017/
| https://aclanthology.org/I05-1009/
| https://aclanthology.org/P06-2056/
|
| Exactly the same approach of segmenting a word when the
| entropy goes up compared to the previous byte.
| entilzha wrote:
| At least I wasn't aware of this work, but thanks for the
| refs! I'm always curious to read papers from 10-20+ years
| ago that have similarly inspired ideas. If it makes
| sense, we'll mention those in the next related work
| update.
| ted_dunning wrote:
| It is also quite similar to Carl de Marcken's work for
| segmenting text and speech. He phrased everything in
| terms of minimum description length (MDL), but that is
| trivially the same thing as local entropy.
|
| https://dspace.mit.edu/handle/1721.1/7191?show=full
| dv_dt wrote:
| So a variant might be to try using some standard compression
| algorithm to train with?
| paraschopra wrote:
| My notes:
|
| It's a 3 component model.
|
| - Encoder: Takes byte groupings and outputs a hidden
| state/encoding called patches
|
| - Transformer: Takes these encodings of patches in autoregressive
| fashion
|
| - Decoder: Takes the encodings processed by the transformer and
| outputs bytes
|
| Loss is byte-level cross-entropy (next-byte prediction)
|
| How they group bytes.
|
| - Use entropy thresholds: if a sequence of bytes has entropy
| lower than a threshold, group them
|
| - This is a learned model (from data)
|
| Why this helps over current byte-pair tokenization in LLMs.
|
| - Encoder/decoder essentially act as a "learnable" tokenization
| scheme
|
| - Better efficiency tradeoffs (for highly predictable sequences
| of bytes, the encoder can "offload" computation effort from the
| main transformer; see the sketch below)
|
| - History teaches us that end-to-end learned systems beat
| human-designed mechanisms
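|
| A toy sketch of the "offloading" point above (illustrative only,
| not the paper's actual modules): pool per-byte vectors into patch
| vectors at entropy-chosen boundaries, so the big transformer sees
| far fewer positions than the raw byte stream.
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   text = b"Lazy dog jumped over a fence."
|   byte_vecs = rng.normal(size=(len(text), 16))  # stand-in states
|   starts = [0, 5, 9, 16, 21, 23]                # hypothetical starts
|   ends = starts[1:] + [len(text)]
|   patches = np.stack([byte_vecs[s:e].mean(axis=0)
|                       for s, e in zip(starts, ends)])
|   print(byte_vecs.shape, "->", patches.shape)   # (29, 16) -> (6, 16)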
| CuriousSkeptic wrote:
| > History teaches us that end-to-end learned systems beat
| human-designed mechanisms
|
| I think this may need some qualifiers
|
| Even byte representations are human-designed encodings. I would
| think a human-designed decoder of such encodings must be more
| efficient than a learned one. Sure, bytes encoding a stream of
| Unicode code points map fairly easily to useful information. But
| bytes representing a zip-compressed collection of PDF files?
|
| I did wonder, though: training on text encodings vs. pixel
| encodings, perhaps brute-forcing OCR, like humans do, will be
| more flexible in the end than being limited to text encodings.
| paraschopra wrote:
| >Even byte representations are human designed encodings
|
| The point is that it can model any sequence of bytes. It's
| what-follows-what that matters, not how we're encoding it.
| hiddencost wrote:
| > History teaches us that end-to-end learned systems beat
| human-designed mechanisms
|
| Depends how far back you go. History teaches us that everything
| is a trade-off between model size, inference time, training
| time, and training data size, once you're at the Pareto
| frontier. And that cheap approximations can allow you to trade
| for more expensive computation elsewhere.
|
| That lesson has been obscured for the last decade because of
| (1) "the bitter lesson" of scaling, and (2) the fact that we're
| blowing past benchmarks too quickly.
|
| I do agree that learned models are better if they're free
| (compare the distribution of filter banks learned by a neural
| acoustic model to those approximated by mel-frequency cepstral
| coefficients), but once you start hitting scaling limits, cheap
| heuristics start creeping back in.
|
| BPE, e.g., was a huge advancement over a fixed vocabulary.
| entilzha wrote:
| (Author Here)
|
| Related thought, I think BPE is quite a good, cheap inductive
| bias to have in a model, which is part of what made it
| challenging to scale better against. I also suspect this is
| part of why, with less training FLOPs, BPE is better (left side
| of figure 1): BLT has to expend some of its FLOPs budget to
| recover/learn some of this useful bias. With more training
| FLOPs this becomes a smaller fraction of the budget, though,
| leading to better scaling.
| dr_dshiv wrote:
| Does this mean AI can pre-train on binaries?
| bloomingkales wrote:
| Some believe AI can now output compiled binaries (e.g. "update
| Notepad.exe with this feature").
|
| We all think AI writing code for us will be the end, but it
| might be an even simpler takeover.
| 8n4vidtmkvmk wrote:
| That just sounds worse though? We can't validate the change
| is correct if we can't read the code. It is interesting
| though
| hackernewds wrote:
| at some point you can't or won't be allowed to do any
| validations
| dewijones92 wrote:
| notebooklm
| https://notebooklm.google.com/notebook/77fe83ee-35b3-4a9a-a3...
| ricardobeat wrote:
| Interesting, this is one of the worst NotebookLM examples I've
| seen so far. They are interjecting way too often and breaking
| the rhythm. Is generation quality going down due to the
| popularity of the service?
| bratao wrote:
| Yeah, super strange. One cannot finish a sentence without the
| other interjecting.
| stuartjohnson12 wrote:
| Big successful launch, hype for the product lead, product
| lead moves on, product goes to shit. Another classic for the
| Google graveyard.
| marviel wrote:
| Core team just moved on to something else:
| https://werebuilding.ai/
| 8n4vidtmkvmk wrote:
| That has to be the worst landing page ever
| marviel wrote:
| no comment; not my site, just sharing
| throwaway20222 wrote:
| We are working directly with the Notebook team from the
| outside, and while they have lost the original product
| lead, the team in general is seemingly really well
| supported, staffed with talented folks, and actively trying
| to understand what the end user wants from the product.
| Hardly a day goes by that they are not actively trying to
| get more feedback and share where they are heading.
|
| I do think it is fair to say they had been caught off guard
| by the success of the program and are trying to catch up.
| Maybe this is just a bit of drift as they are figuring it
| all out? Or maybe I am too charitable.
| mrbungie wrote:
| > and while they have lost the original product lead
|
| Doesn't matter how talented the team is, that is a
| massive red flag as not even six months have passed since
| the launch.
| refulgentis wrote:
| Uh, the product manager left last week.
| yeahwhatever10 wrote:
| People like this?
| amelius wrote:
| Why can't the tokenization be implicit, so we only feed bytes (or
| characters) to the model?
| killerstorm wrote:
| It can work, but you have more tokens / weaker performance.
|
| People tested it and it was worse.
| entilzha wrote:
| (Author Here)
|
| Not sure what you mean by implicit? If you mean just treat
| bytes as tokens, one issue you run into is your sequence
| lengths get quite long, so compared to a regular token LLM, you
| can't pack as many bytes in a batch, which means you're pretty
| FLOP inefficient and so scale worse. You could make the model
| smaller to compensate, but then the model isn't as good.
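|
| Back-of-envelope illustration (the numbers here are assumptions,
| not from the paper): if a BPE token covers ~4 bytes on average,
| the same text is ~4x longer as a byte sequence, so attention
| cost grows ~16x and each batch holds ~4x less text.
|
|   bytes_per_token = 4               # assumed average, illustration
|   token_ctx = 4096                  # a typical token context length
|   byte_ctx = token_ctx * bytes_per_token
|   attn_cost_ratio = (byte_ctx / token_ctx) ** 2
|   print(byte_ctx, attn_cost_ratio)  # 16384 16.0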
| PaulHoule wrote:
| The summer that BERT came out I was working at a startup that was
| using character-based CNN models for classification. We were
| thinking a lot about alternate representations, other members of
| the team were keen on word vectors but I wasn't, particularly
| because it seemed the documents we were working on frequently
| had out-of-dictionary words, because those words were important,
| and because discarding them would lead to failure.
|
| (We were working on "foundation models" too, so it's not just
| being out-of-dictionary in the final model that's a problem but
| being out-of-dictionary in the foundation model which is more
| expensive to train.)
|
| We were doing OK with character based models for classification
| but people believed that storing the "dictionary" inside the
| neural net was not a good use of the neural net so there was a
| lot of enthusiasm for tokens.
|
| Meanwhile I felt so sure that schemes like Word2Vec were doomed
| that I had left an earlier project using RNNs where the goal was
| text understanding with a foundation model made by training an
| RNN to write fake abstracts for case reports from PubMed.
|
| When byte-pair encoding was introduced I remember telling people
| in a meeting that it was the first tokenization scheme we'd
| looked at that I could endorse.
|
| I have to admit though that I wish we could work at the character
| level.
| yndoendo wrote:
| Do you mean that all produced output must be a chain of words
| found in a dictionary?
|
| In the real world, humans create and use non-dictionary words
| to communicate daily. A good example is "notify", which is
| defined in the dictionary, versus "notifier", which is not, and
| is used to describe "a means to notify someone". The code to
| send an email notification is an "email notifier"; then there
| are text message, voice call, and call-center callback
| notifiers...
|
| All industries and organizations have jargon: custom-defined
| words not found in a dictionary, and non-distinctive acronyms.
|
| How would ML output be useful if it cannot handle real-world
| communication and produces only lab-sanitized, in-dictionary
| responses?
| phh wrote:
| That's the OP's point. At the time, the community was split
| between word-level, which has the shortcomings you're
| describing, and byte-level which is uselessly compute
| intensive. BPE was the first reasonable in-between. BLT
| improves on BPE by making the compression learnable
| rather than precomputed.
| entilzha wrote:
| (Author here)
|
| If I understand your question right, this is one of the
| reasons BPE is nice and the parent liked it. For any
| character sequence, provided the characters are in the
| alphabet used to create the BPE vocab, there are no unknown
| words/sequences. One downside of some previous tokenization
| methods, e.g. dictionary-based methods, is that you could have
| unknown/UNK tokens.
|
| In our paper with bytes, we also avoid the UNK issue, since
| we can have an embedding for every possible byte, as there
| aren't that many (and for sequences of bytes we use hash
| embeddings, although we did test n-gram lookups for the top K
| most frequent byte n-grams in the training data).
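|
| A minimal sketch of the hash-embedding idea (illustrative, not
| the exact scheme from the paper): every byte n-gram hashes into
| a fixed-size table, so nothing is ever out of vocabulary.
|
|   import hashlib
|   import numpy as np
|
|   TABLE_SIZE, DIM = 1 << 16, 32
|   table = np.random.default_rng(0).normal(size=(TABLE_SIZE, DIM))
|
|   def ngram_ids(data: bytes, n: int = 3):
|       # Deterministic hash of each byte n-gram into the table
|       digests = [hashlib.blake2b(data[i:i + n]).digest()
|                  for i in range(len(data) - n + 1)]
|       return [int.from_bytes(d[:8], "little") % TABLE_SIZE
|               for d in digests]
|
|   vecs = table[ngram_ids(b"SolidGoldMagikarp")]
|   print(vecs.shape)  # one 32-d embedding per 3-gram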
| cs702 wrote:
| Nice work. Thank you for commenting on HN!
|
| Did you guys try using an RNN or some other kind of DNN to
| encode the patches?
| entilzha wrote:
| I don't believe so, or at least if someone tried it, it
| didn't work well enough for me to remember :). Some of the
| motivation for the architecture changes in encoding
| patches stemmed from finding FLOP-efficient ways to
| express relationships between byte sequences. E.g.,
| having a long context window makes sense when dealing
| with tokens, but you don't need as long an attention
| window if you're attending over byte sequences to make patch
| representations, since the patch representations will
| implicitly be part of a longer context window in terms of
| the number of patches.
| cs702 wrote:
| Thanks for the quick reply!
|
| Interesting. I would have thought one of those "minimum
| viable" RNNs (like https://arxiv.org/abs/2410.01201)
| would have been ideal for this. I might tinker a bit with
| this :-)
| binarymax wrote:
| I was really excited for CANINE [1] but it never really went
| anywhere. Tokens are a hack. They work for the most part, but
| it's clear when they don't.
|
| [1] https://arxiv.org/abs/2103.06874
| nodja wrote:
| From my understanding this not only removes tokenization but
| also sampling, correct?
|
| Sampling can be a pain point of LLMs, but it can also enable
| interesting usages, like forcing a grammar so the model always
| outputs valid JSON, tuning temperature to get a more varied
| distribution, XTC sampling, etc.
|
| What would be the equivalent of these in a BLT?
|
| I can only think of providing the decoder an extra input of
| allowed/prohibited bytes and running the decoder over and over
| until it outputs something valid; maybe there's a simpler and
| more obvious approach.
| yorwba wrote:
| It doesn't remove sampling, and forcing grammar by specifying
| allowed/prohibited bytes doesn't require running the decoder
| over and over, you just compute the softmax at the output layer
| over allowed bytes only and sample from those accordingly, same
| as with BPE-based models.
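|
| A minimal sketch of that masking step in NumPy (assumes 256-way
| next-byte logits; the allowed set would come from your grammar):
|
|   import numpy as np
|
|   def sample_constrained(logits, allowed, temperature=1.0):
|       masked = np.full(256, -np.inf)
|       masked[allowed] = logits[allowed] / temperature
|       probs = np.exp(masked - masked.max())  # softmax over allowed
|       probs /= probs.sum()
|       return int(np.random.choice(256, p=probs))
|
|   logits = np.random.randn(256)
|   # e.g. bytes a JSON grammar might permit at some position
|   allowed = [ord(c) for c in '{}[]",:0123456789']
|   print(chr(sample_constrained(logits, allowed)))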
| modeless wrote:
| I really hope this works out. Death to tokenizers!
|
| Interesting that it's a hierarchical structure but only two
| levels of hierarchy. Stacking more levels seems like an obvious
| direction for further research.
|
| Note: I posted this comment on another related story[1] and the
| author replied:
|
| "Author here :), I do think it's a good direction to look into!
| That said, aside from it being a bit too much to do at once,
| you'd also have to be careful about how you distributed your FLOP
| budget across the hierarchy. With two levels, you can make one
| level (bytes/local encoder) FLOP efficient and the other
| (patches/global encoder) FLOP intensive. You'd also need to find
| a way to group patches into larger units. But ya, there are many
| directions to go from here!"
|
| [1] https://news.ycombinator.com/item?id=42413430
___________________________________________________________________
(page generated 2024-12-14 23:00 UTC)