[HN Gopher] What Happened to Bert and T5?
       ___________________________________________________________________
        
       What Happened to Bert and T5?
        
       Author : fzliu
       Score  : 103 points
       Date   : 2024-07-19 18:54 UTC (4 hours ago)
        
 (HTM) web link (www.yitay.net)
 (TXT) w3m dump (www.yitay.net)
        
       | caprock wrote:
       | Yi is a good source in this area, and a good follow on Twitter.
        
       | GaggiX wrote:
       | >If BERT worked so well, why not scale it?
       | 
        | I mean, the scaling already happened in 2019 with RoBERTa. My
        | guess is that these models are already good enough at what they
        | need to do (creating meaningful text embeddings), and making them
        | extremely large wasn't feasible for deployment.
        
         | PaulHoule wrote:
          | For text classification/clustering/retrieval I am pretty happy
          | with BERT-family models. It's only in the last few months that
          | I've seen better models come out that are practical (e.g. you
          | don't have to sell all your children to OpenAI to afford them).
        
           | murkt wrote:
           | What would you say are the better models nowadays that are
           | practical?
        
             | PaulHoule wrote:
             | For my recommender/object sorter I have not been in a hurry
             | to upgrade because I have other things to think about. This
              | table should give you some idea of the time-space-accuracy
              | trade-offs:
             | 
             | https://huggingface.co/spaces/mteb/leaderboard
             | 
             | In a lot of cases you will see two models with a huge
             | difference in size but a tiny difference in accuracy. I
             | could fit either the big or small Stella on my 4080.
        
           | msp26 wrote:
           | Classification is just too damn convenient with LLMs.
        
       | minimaxir wrote:
       | What happened is that "transformers go whrrrrrr." (yes, that's
       | the academic term)
       | 
        | In the end, LLMs using causal language modeling or masked
        | language modeling learn to best solve their objectives by
        | creating an efficient global model of language patterns. CLM is
        | actually the harder problem to solve, since MLM can leak
        | information through the surrounding context, and with the
        | transformer scaling-law research post-BERT/GPT it's not a
        | surprise CLM won out in the long run.
        
       | lalaland1125 wrote:
        | I think the big reason why BERT and T5 have fallen out of favor
        | is the lack of zero-shot (or few-shot) ability.
       | 
       | When you have hundreds or thousands of examples, BERT works
       | great. But that is very restricting.
        
         | jerrygenser wrote:
          | Yes, but you can use an LLM to label data and then train a
          | BERT model, which costs a small fraction of the time and money
          | to run compared to the original LLM.
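          | 
          | A minimal sketch of that recipe (the toy texts/labels stand in
          | for whatever your LLM labelling pass produced; nothing here is
          | tuned):
          | 
          |     from datasets import Dataset
          |     from transformers import (AutoTokenizer,
          |         AutoModelForSequenceClassification, Trainer,
          |         TrainingArguments)
          | 
          |     # Pretend these pairs came out of an LLM labelling run.
          |     ds = Dataset.from_dict({"text": ["great product", "junk"],
          |                             "label": [1, 0]})
          | 
          |     tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
          |     ds = ds.map(lambda x: tok(x["text"], truncation=True,
          |                               padding="max_length", max_length=64),
          |                 batched=True)
          | 
          |     model = AutoModelForSequenceClassification.from_pretrained(
          |         "distilbert-base-uncased", num_labels=2)
          |     trainer = Trainer(model=model,
          |                       args=TrainingArguments("out", num_train_epochs=1),
          |                       train_dataset=ds)
          |     trainer.train()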
        
           | hdhshdhshdjd wrote:
           | Shhh, don't tell everybody the secret. ;-)
        
         | byefruit wrote:
          | Yes, no zero-shot. Few-shot is possible for some use cases
          | with SetFit: https://github.com/huggingface/setfit and the
          | very recent FastFit: https://github.com/IBM/fastfit (
          | https://arxiv.org/pdf/2404.12365 )
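          | 
          | Roughly what SetFit looks like in use (this is the older
          | SetFitTrainer API; treat it as a sketch rather than current
          | docs):
          | 
          |     from datasets import Dataset
          |     from setfit import SetFitModel, SetFitTrainer
          | 
          |     # A handful of labelled examples is the whole point.
          |     train_ds = Dataset.from_dict({
          |         "text": ["refund please", "love this phone",
          |                  "arrived broken", "works great"],
          |         "label": [0, 1, 0, 1],
          |     })
          | 
          |     model = SetFitModel.from_pretrained(
          |         "sentence-transformers/paraphrase-mpnet-base-v2")
          |     trainer = SetFitTrainer(model=model, train_dataset=train_ds)
          |     trainer.train()
          |     print(model.predict(["never buying again", "five stars"]))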
        
       | andy_xor_andrew wrote:
       | I'm a bit embarrassed to admit, but I still don't understand
       | decoder vs encoder vs decoder/encoder models.
       | 
       | Is the input/output of these models any different? Are they all
       | just "text context goes in, scores for all tokens in the
       | vocabulary come out" ? Is the difference only in how they achieve
       | this output?
        
         | lalaland1125 wrote:
         | The key to understanding the difference is that transformers
         | are attention models where tokens can "attend" to different
         | tokens.
         | 
         | Encoder models allow all tokens to attend to every other token.
         | This increases the number of connections and makes it easier
         | for the model to reason, but requires all tokens at once to
         | produce any output. These models generally can't generate text.
         | 
          | Decoder models only allow tokens to attend to previous tokens
          | in the sequence. This reduces the number of connections, but
          | allows the model to be run incrementally, one token at a time.
         | This incremental processing is key to allowing the models to
         | generate text.
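          | 
          | A concrete way to see the difference is to look at the
          | attention masks themselves (minimal PyTorch sketch):
          | 
          |     import torch
          | 
          |     n = 5  # sequence length
          | 
          |     # Encoder-style (BERT): every token can attend to every
          |     # other token.
          |     encoder_mask = torch.ones(n, n)
          | 
          |     # Decoder-style (GPT): token i can only attend to tokens
          |     # 0..i, enforced with a lower-triangular ("causal") mask.
          |     decoder_mask = torch.tril(torch.ones(n, n))
          |     print(decoder_mask)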
        
         | mbowcut2 wrote:
         | The biggest difference is when you feed a sequence into a
         | decoder only model, it will only attend to previous tokens when
          | computing hidden states for the current token. So the hidden
          | states for the nth token are only based on tokens <n. This is
         | where you hear the talk about "causal masking", as the
         | attention matrix is masked to achieve this restriction. Encoder
         | architectures on the other hand allow for each position in the
         | sequence to attend to every other position in the sequence.
         | 
          | Encoder architectures have been used for semantic analysis and
          | feature extraction of sequences, and decoder-only
          | architectures for generation (i.e. next-token prediction).
        
         | kelseyfrog wrote:
         | You can think of encoder/decoder models as specifically
         | addressing the translation problem. They are also known as
         | sequence-to-sequence models.
         | 
         | Take the task of translation. A translator needs to keep in
         | mind the original text and the translation so far in order to
         | predict the next translated token. The original text is
         | encoded, and the translation so far is passed into the decoder
         | to generate the next translated token. The next token is
         | appended to the translation and the process repeats
         | autoregressively.
         | 
         | Decoder-only models use just the decoder architecture of
         | encoder/decoders. They are prompted and generate completions
         | autoregressively.
         | 
         | Encoder-only models use just the encoder architecture which you
          | can think of similarly to embedding. A task here is producing
         | vectors where vector distance is related to the semantic
         | similarity of the input documents. This can be useful for
         | retrieval tasks among other things.
         | 
         | You can of course translate using just the decoder, by
         | constructing a "please translate this from A to B, <original
         | text>" prompt and generating tokens just using the decoder.
          | I'll leave it to people with more expertise than I to describe
          | the pros and cons of these.
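          | 
          | For a concrete feel of the encoder/decoder version, this is
          | roughly what running T5 on a translation prompt looks like;
          | generate() runs the autoregressive decoder loop described
          | above:
          | 
          |     from transformers import (T5Tokenizer,
          |                               T5ForConditionalGeneration)
          | 
          |     tok = T5Tokenizer.from_pretrained("t5-small")
          |     model = T5ForConditionalGeneration.from_pretrained("t5-small")
          | 
          |     # The source text is encoded once; the decoder then emits
          |     # the translation token by token, attending to the encoder
          |     # outputs at every step.
          |     inputs = tok("translate English to German: The house is "
          |                  "wonderful.", return_tensors="pt")
          |     out = model.generate(**inputs, max_new_tokens=40)
          |     print(tok.decode(out[0], skip_special_tokens=True))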
        
         | ambrozk wrote:
         | Encoder: Text tokens -> Fixed representation vector
         | 
         | Decoder: Fixed representation vector + N decoded text tokens ->
         | N+1th text token
         | 
         | Encoder/Decoder architecture: You take some tokenized text, run
         | an encoder on it to get a fixed representation vector, and then
         | recursively apply the decoder to your fixed representation
         | vector and the 0...N tokens you've already produced to produce
         | the N+1th token.
         | 
         | Decoder-only architecture: You take some tokenized text, and
         | recursively apply a decoder to the 0...N tokens you've already
         | produced to produce the N+1th token (without ever using an
         | encoded representation vector).
         | 
         | Basically, an encoder produces this intermediate output which a
         | decoder knows how to combine with some existing output to
         | create more output (imagine, e.g., encoding a sentence in
         | French, and then feeding a decoder the vector representation of
         | that sentence plus the three words you've translated so far, so
         | that it can figure out the next word in the translation). A
         | decoder can be made to require an intermediate context vector,
         | or (this is how it's done in decoder-only architectures) it can
         | be made to require only the text produced so far.
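          | 
          | The "recursively apply the decoder" part, written out as an
          | explicit greedy loop with GPT-2 (in practice generate() does
          | this for you):
          | 
          |     import torch
          |     from transformers import GPT2LMHeadModel, GPT2Tokenizer
          | 
          |     tok = GPT2Tokenizer.from_pretrained("gpt2")
          |     model = GPT2LMHeadModel.from_pretrained("gpt2")
          | 
          |     ids = tok("The quick brown fox", return_tensors="pt").input_ids
          |     for _ in range(10):
          |         with torch.no_grad():
          |             logits = model(ids).logits      # [1, seq, vocab]
          |         next_id = logits[0, -1].argmax()    # most likely next token
          |         ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
          |     print(tok.decode(ids[0]))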
        
           | opprobium wrote:
            | Encoder in the T5 sense doesn't produce a fixed vector; it
            | produces one encoded vector for every step of input, and all
           | of that is given to the decoder.
           | 
           | The only difference between encoder/decoder and decoder-only
           | is masking:
           | 
           | In an encoder, none of the tokens are masked at any step, and
           | are all visible in both directions to the encoder. Each
           | output of the encoder can attend to any input of the encoder.
           | 
           | In the decoder, the tokens are masked causally - each N+1
           | token can only attend to the previous N tokens.
        
         | chant4747 wrote:
          | Don't be embarrassed. This article makes the mistake of
          | _saying_ it's going to catch the under-informed up to speed,
          | but then immediately dives all the way into the deep end.
        
         | thomasahle wrote:
         | If you look at the classical [transformer architecture picture]
         | (https://en.wikipedia.org/wiki/Transformer_(deep_learning_arc..
         | .) there is an "encoder" tower on the left and a "decoder"
         | tower on the right.
         | 
         | - Bert is encoder only.
         | 
         | - GPT is decoder only.
         | 
         | - T5 uses both the encoder and the decoder.
        
       | k8si wrote:
       | I believe many high-quality embedding models are still based on
       | BERT, even recent ones, so I don't think it's entirely fair to
       | characterize it as "deprecated".
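        | 
        | e.g. a typical one in use (all-MiniLM-L6-v2 is a small
        | BERT-style encoder fine-tuned for embeddings; just an example
        | checkpoint):
        | 
        |     from sentence_transformers import SentenceTransformer, util
        | 
        |     model = SentenceTransformer("all-MiniLM-L6-v2")
        |     emb = model.encode(["What happened to BERT?",
        |                         "Encoder-only models still power retrieval."])
        |     print(util.cos_sim(emb[0], emb[1]))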
        
       | htrp wrote:
       | feels like large language models sucked all the air out of the
       | room because it was a lot easier to scale compute and data, and
        | after RoBERTa, no one was willing to continue exploring.
        
         | nshm wrote:
          | No, there are mathematical reasons LLMs are better. They are
          | trained with a multi-objective loss (coding skills, translation
          | skills, etc.), so they understand the world much better than
          | MLM-only models. The original post discusses this, but with
          | more words and points than necessary.
        
           | Der_Einzige wrote:
            | Call it CLM vs MLM, not LLM vs MLM. Soon LMLMs will exist,
            | which will be LLMs too...
        
         | riku_iki wrote:
          | T5 is an LLM, I think one of the first.
        
       | bugglebeetle wrote:
       | Wasn't there a recent paper that demonstrated BERT models are
       | still competitive or beat LLMs in many tasks?
        
       | vintermann wrote:
       | For people like me who gave up trying to follow Arxiv ML papers
       | 3+ years ago, articles like these are gold. I would love a
        | YouTube channel or blog which does retrospectives on "big" papers
        | of the last decade (those that everyone paid attention to at the
        | time) and looks at where the ideas are today.
        
       | janalsncm wrote:
        | BERT didn't go anywhere and I have seen fine-tuned BERT backbones
        | everywhere. They are useful for generating embeddings to be used
        | downstream, and small enough to be handled on consumer (pre-
        | Ampere) hardware. One of the trends I have seen is scaling BERT
        | down rather than up: since BERT already gave good performance, we
        | want to be able to do it faster and cheaper. That gave rise to
        | RoBERTa, ALBERT and DistilBERT.
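        | 
        | The "embeddings used downstream" pattern is usually just a
        | (possibly frozen) backbone plus pooling; a minimal sketch with
        | DistilBERT:
        | 
        |     import torch
        |     from transformers import AutoTokenizer, AutoModel
        | 
        |     tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
        |     model = AutoModel.from_pretrained("distilbert-base-uncased")
        | 
        |     batch = tok(["an example document"], return_tensors="pt",
        |                 padding=True, truncation=True)
        |     with torch.no_grad():
        |         hidden = model(**batch).last_hidden_state  # [B, seq, 768]
        | 
        |     # Mean-pool over tokens: one vector per document for
        |     # downstream classifiers, clustering, ANN indexes, etc.
        |     emb = hidden.mean(dim=1)
        |     print(emb.shape)  # torch.Size([1, 768])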
       | 
        | T5 I have worked less with, but I would be curious about its
        | head-to-head performance with decoder-only models these days. My
        | guess is the downsides from before (context window limitations)
        | are less of a factor than they used to be.
        
         | hdhshdhshdjd wrote:
          | I tried some large-scale translation tasks with T5 and the
          | results were iffy at best. I'm going to try the same task with
          | the newest Mistral small models and compare. My guess is Mistral
          | will be better.
        
           | llm_trw wrote:
            | T5 is not BERT, and translation is not embedding.
        
             | hdhshdhshdjd wrote:
              | The article mentions T5, and translation is something T5
              | is supposedly good at - just sharing that I was less than
              | impressed.
        
       | hdhshdhshdjd wrote:
        | Maybe in SOTA ML/NLP research, but in the world of building
        | useful tools and products, BERT models are dead simple to tune,
        | work great if you have decent training data, and most importantly
        | are very very fast and very very cheap to run.
        | 
        | I have a small Swiss army collection of custom BERT fine-tunes
        | that are equal to or better than the best LLM and execute
        | document classification tasks in 2.4ms. Find me an LLM that can
        | do anything in 2.4ms.
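        | 
        | If you want to sanity-check latency in that ballpark yourself,
        | something like this works (distilbert-base-uncased is just a
        | stand-in for your own fine-tune; the timing is rough):
        | 
        |     import time
        |     import torch
        |     from transformers import (AutoTokenizer,
        |                               AutoModelForSequenceClassification)
        | 
        |     name = "distilbert-base-uncased"  # stand-in for a fine-tune
        |     tok = AutoTokenizer.from_pretrained(name)
        |     model = AutoModelForSequenceClassification.from_pretrained(name)
        |     model.eval()
        |     if torch.cuda.is_available():
        |         model.cuda()
        | 
        |     batch = tok("some document to classify", return_tensors="pt")
        |     batch = {k: v.to(model.device) for k, v in batch.items()}
        | 
        |     with torch.no_grad():
        |         for _ in range(10):           # warm-up
        |             model(**batch)
        |         if torch.cuda.is_available():
        |             torch.cuda.synchronize()
        |         start = time.perf_counter()
        |         for _ in range(100):
        |             model(**batch)
        |         if torch.cuda.is_available():
        |             torch.cuda.synchronize()
        |         print((time.perf_counter() - start) / 100 * 1000, "ms")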
        
         | llm_trw wrote:
          | Yeah, pretty much. When you have 2B files you need to trawl
          | through, good luck using anything but a vector database. Once
          | you do a level or two of pruning of the results, you can feed
          | them into an LLM for final classification.
        
       | jszymborski wrote:
       | > It is also worth to note that, generally speaking, an Encoder-
       | Decoders of 2N parameters has the same compute cost as a decoder-
       | only model of N parameters which gives it a different FLOP to
       | parameter count ratio.
       | 
       | Can someone explain this to me? I'm not sure how the compute
       | costs are the same between the 2N and N nets.
        
         | phillypham wrote:
         | You can break your sequence into two parts. One part goes
         | through the encoder and the other goes through the decoder, so
         | each token only goes through one transformer stack.
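          | 
          | Back-of-the-envelope version of that, using the usual ~2 *
          | params FLOPs-per-token-per-forward-pass approximation (and
          | ignoring cross-attention and other details):
          | 
          |     # Decoder-only, N params: all T tokens pass through all N
          |     # parameters.
          |     #   flops_dec ~= 2 * N * T
          |     #
          |     # Encoder-decoder, 2N params split N/N: input tokens pass
          |     # only through the encoder, output tokens only through the
          |     # decoder.
          |     #   flops_encdec ~= 2 * N * T_in + 2 * N * T_out = 2 * N * T
          |     N, T = 1_000_000_000, 2048
          |     print(2 * N * T)                            # decoder-only
          |     print(2 * N * (T // 2) + 2 * N * (T // 2))  # enc-dec, same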
        
       ___________________________________________________________________
       (page generated 2024-07-19 23:01 UTC)