[HN Gopher] A Replacement for BERT
       ___________________________________________________________________
        
       A Replacement for BERT
        
       Author : cubie
       Score  : 202 points
       Date   : 2024-12-19 16:53 UTC (6 hours ago)
        
 (HTM) web link (huggingface.co)
 (TXT) w3m dump (huggingface.co)
        
       | jbellis wrote:
        | Looks great, thanks for training this!
        | 
        | - Can I fine tune it with SentenceTransformers?
        | 
        | - I see ColBERT in the benchmarks, is there an
        | answerai-colbert-small-v2 coming soon?
        
         | gunalx wrote:
         | Seems like it. They even have example training scripts
         | available.
         | https://github.com/AnswerDotAI/ModernBERT/blob/main/examples...
         | 
         | Check out their documentation page linked on the bottom of the
         | article.
         | https://huggingface.co/docs/transformers/main/en/model_doc/m...
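          | 
          | For reference, a finetune with Sentence Transformers can be
          | sketched roughly like this (not the official example script,
          | just a minimal sketch; it assumes the checkpoint is published
          | as "answerdotai/ModernBERT-base" and that your transformers
          | and sentence-transformers (v3+) versions support ModernBERT):
          | 
          |     from datasets import load_dataset
          |     from sentence_transformers import (
          |         SentenceTransformer,
          |         SentenceTransformerTrainer,
          |         losses,
          |     )
          |     
          |     # Wraps the raw encoder with a mean-pooling layer so it
          |     # outputs one embedding per sentence.
          |     model = SentenceTransformer("answerdotai/ModernBERT-base")
          |     
          |     # Any (anchor, positive) pair dataset works with this
          |     # loss; AllNLI is just a convenient public example.
          |     train_dataset = load_dataset(
          |         "sentence-transformers/all-nli", "pair", split="train[:10000]"
          |     )
          |     loss = losses.MultipleNegativesRankingLoss(model)
          |     
          |     trainer = SentenceTransformerTrainer(
          |         model=model, train_dataset=train_dataset, loss=loss
          |     )
          |     trainer.train()
          |     model.save("modernbert-base-nli")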
        
         | jph00 wrote:
         | The creator of answerai-colbert-small-v2 (bclavie) is also the
         | person that launched the ModernBERT project, so yes, you can
         | expect to see a lot of activity in this space! :D
         | 
         | (Also yes, it works great with ST and we provide a full example
         | script.)
        
       | zelias wrote:
       | missed opportunity to call it ERNIE
        
         | amrrs wrote:
         | I remember back in the day there was an Ernie model
        
           | axpy906 wrote:
            | Don't forget ELMo, the bi-LSTM.
        
         | lrog wrote:
         | yep, too late:
         | https://huggingface.co/docs/transformers/en/model_doc/ernie
        
         | behnamoh wrote:
         | I never liked the names BERT and its derivatives. Of all the
          | names in the world, they chose words that are ugly, specific to
         | one culture, and frankly childish.
        
           | Cthulhu_ wrote:
           | Sesame Street has been broadcast in 140 countries; Bert (and
           | Ernie) have been localized to 18 languages, including Arabic,
           | Hindi, Japanese, Hebrew and Chinese, with China having an AI
           | called ERNIE because of course.
           | 
            | Or, to cut an overly worded / researched reply to a
            | petulant comment short: they are very much not specific to
            | one culture.
        
         | chriswarbo wrote:
         | Tangentially:
         | 
         | ERNIE is probably the most famous "computer" in the UK, which
         | has been picking winners for the UK's premium bonds scheme
         | since the 1950s. It was heavily marketed, to get the public
         | used to the new-fangled idea of electronics, and is sometimes
         | considered one of the first computers; though (a) it was more
          | of a special-purpose random number generator than a
         | computer, and (b) it descended from the earlier Colossus code-
         | breaking machines of World War II (though the latter's
         | existence was kept secret for decades). The latest ERNIE is
         | version 5, which uses quantum effects to generate its random
         | numbers (earlier versions used electrical and thermal noise).
         | 
         | https://en.wikipedia.org/wiki/Premium_Bonds#ERNIE
        
         | timClicks wrote:
          | More generally, the prefix "Modern" haunts every product
          | name that uses it. Technologies move fast, and "modern"
          | becomes antiquated very quickly.
        
       | Arcuru wrote:
        | I'm not sure I understand where exactly this slots in, but
       | isn't this an embedding model? Shouldn't they be comparing it to
       | a service like Voyage AI?
       | 
       | - https://docs.voyageai.com/docs/embeddings
        
         | spott wrote:
          | Embedding models are frequently based on BERT-style models,
          | but BERT models can be finetuned to do a lot more than just
          | embeddings.
          | 
          | So an embedding-focused finetune of ModernBERT should be
          | compared to something like Voyage AI, but not ModernBERT
          | itself.
        
           | KTibow wrote:
            | What are the people who keep downloading BERT doing then? Are
           | they the minority who directly use it for embeddings?
        
             | spott wrote:
              | I'm honestly not sure why bert-base-uncased is so
             | popular... the model isn't that useful on its own. From
             | their huggingface page:
             | 
             | > You can use the raw model for either masked language
             | modeling or next sentence prediction, but it's mostly
             | intended to be fine-tuned on a downstream task. See the
             | model hub to look for fine-tuned versions of a task that
             | interests you.
             | 
             | > Note that this model is primarily aimed at being fine-
             | tuned on tasks that use the whole sentence (potentially
             | masked) to make decisions, such as sequence classification,
             | token classification or question answering. For tasks such
             | as text generation you should look at model like GPT2.
        
               | metanonsense wrote:
                | I've been out of the game for a year or so (and was
                | never completely in the game), but back then BERT was
                | the basis for lots of interesting applications. The
                | original Vision Transformer (ViT) was based on (or at
                | least inspired by) BERT, and it was used for graph
                | transformers, visual language understanding, etc.
        
               | strangecasts wrote:
                | I think this comes down to the Huggingface libraries
                | defaulting to downloading the model from HF if they
                | cannot locate the weights - so "make your own text
                | classifier" tutorial notebooks default to bert-base-
                | uncased as a "standard" pretrained encoder you can put
                | a classification head on top of and finetune. In turn,
                | people run them in Google Colab and just download
                | another copy of the weights on startup, which counts
                | towards the total.
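                | 
                | That classification-head setup is roughly the following
                | (a generic sketch, not from any particular tutorial;
                | the two labels and the training loop are placeholders):
                | 
                |     from transformers import (
                |         AutoModelForSequenceClassification,
                |         AutoTokenizer,
                |     )
                |     
                |     # Downloads bert-base-uncased from the Hub if it is
                |     # not cached locally, then adds a randomly
                |     # initialised classification head on top.
                |     tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
                |     model = AutoModelForSequenceClassification.from_pretrained(
                |         "bert-base-uncased", num_labels=2
                |     )
                |     
                |     # The head (and usually the encoder too) is then
                |     # finetuned with Trainer or a plain PyTorch loop on
                |     # a labelled dataset.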
        
         | janalsncm wrote:
          | You're comparing SaaS to open weights. A SaaS will never
          | compete on the flexibility of adding a classification head to
          | BERT (where the gradients flow all the way back), training
          | it, transferring knowledge to a similar domain, distilling it
          | down, pruning layers, fine-tuning some more, etc., which is a
          | common ML workflow.
        
       | pantsforbirds wrote:
        | Awesome news and something I really want to check out for work.
       | Has anyone seen any RAG evals for ModernBERT yet?
        
         | cubie wrote:
          | Not yet - these are base models, or "foundational models".
          | They're great for molding into different use cases via
          | finetuning (better than common models like BERT, RoBERTa,
          | etc., in fact), but like those models, these ModernBERT
          | checkpoints can only do one thing out of the box: mask
          | filling.
          | 
          | For other tasks, such as retrieval, we still need people to
          | finetune them. The ModernBERT documentation has some
         | scripts for finetuning with Sentence Transformers and PyLate
         | for retrieval:
         | https://huggingface.co/docs/transformers/main/en/model_doc/m...
         | But people still need to make and release these models. I have
         | high hopes for them.
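          | 
          | For illustration, mask filling with the base checkpoint looks
          | roughly like this (a minimal sketch, assuming the checkpoint
          | is published as "answerdotai/ModernBERT-base" and that your
          | transformers version includes the ModernBERT architecture):
          | 
          |     from transformers import pipeline
          |     
          |     # The released checkpoints are masked language models:
          |     # given a [MASK] token, they predict plausible fillers.
          |     fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
          |     for candidate in fill("Paris is the [MASK] of France."):
          |         print(candidate["token_str"], candidate["score"])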
        
       | jph00 wrote:
       | Hi gang, Jeremy from Answer.AI here. Nice to see this on HN! :)
       | We're very excited about this model release -- it feels like it
       | could be the basis of all kinds of interesting new startups and
       | projects.
       | 
       | In fact, the stuff mentioned in the blog post is only the tip of
        | the iceberg. There are a lot of opportunities to fine-tune the
        | model in all kinds of ways, which I expect will go far beyond what
       | we've managed to achieve in our limited exploration so far.
       | 
       | Anyhoo, if anyone has any questions, feel free to ask!
        
         | ZQ-Dev8 wrote:
         | Jeremy, this is awesome! Personally excited for a new wave of
         | sentence transformers built off ModernBERT. A poster below
         | provided the link to a sample ST training script in the
         | ModernBERT repo, so that's great.
         | 
         | Do you expect the ModernBERT STs to carry the same advantages
         | over ModernBERT that BERT STs had over the original BERT? Or
         | would you expect caveats based on ModernBERT's updated
         | architecture and capabilities?
        
           | jph00 wrote:
           | Yes absolutely the same advantages -- in fact the maintainer
           | of ST is on the paper team, and it's been a key goal from day
           | one to make this work well.
        
         | derbaum wrote:
         | Hey Jeremy, very exciting release! I'm currently building my
         | first product with RoBERTa as one central component, and I'm
         | very excited to see how ModernBERT compares. Quick question:
         | When do you think the first multilingual versions will show up?
          | Any plans to train your own?
        
         | TheTaytay wrote:
         | Thank you for this. I can't wait to try this, especially on
         | GLiNER tasks.
        
         | querez wrote:
         | Two questions:
         | 
          | 1) Going by the Runtime vs GLUE graph, the ModernBERT-Base is
          | roughly as fast as BERT-Base. Given its architecture
          | (especially Alternating Attention), I'm curious why the model
          | is not considerably faster than its predecessor. Any insight
          | you could share on that?
          | 
          | 2) Most modern LLMs are Encoder+Decoder models. Why not chop
          | off the decoder of one of these (e.g. a small Llama or
          | Mistral or other liberally-licensed model) and train a short
          | head on top?
        
           | yorwba wrote:
           | Llama and Mistral are decoder-only models; there is no
           | encoder you could put a head on.
           | 
           | You could put it on the decoder instead, but then you have
           | the problem that in the causal language-modeling setting that
           | the model was trained for, every token can only attend to
           | preceding tokens and is blind to subsequent ones.
        
           | spott wrote:
           | ModernBERT-Base is larger than BERT-Base by 39M parameters.
        
           | janalsncm wrote:
           | On your second point, most modern LLMs are decoder only. And
           | as for why adding a classification head isn't optimal, the
           | decoders you're referring to have 10x the parameters, and
           | aren't trained on encoder-type tasks like MLM. So there's no
           | advantage on any dimension really.
        
           | cubie wrote:
           | Beyond what the others have said about 1) ModernBERT-base
           | being 149M parameters vs BERT-base's 110M and 2) most LLMs
           | being decoder-only models, also consider that alternating
           | attention (local vs global) only starts helping once you're
           | processing longer texts. With short texts, local attention is
           | equivalent to global attention. I'm not sure what length was
           | used in the picture, but GLUE is mostly pretty short text.
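            | 
            | A toy way to see that (a sketch with an arbitrary 128-token
            | local window, not ModernBERT's actual implementation):
            | 
            |     import torch
            |     
            |     def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
            |         # True where token i may attend to token j,
            |         # i.e. |i - j| <= window // 2.
            |         idx = torch.arange(seq_len)
            |         return (idx[:, None] - idx[None, :]).abs() <= window // 2
            |     
            |     # For short inputs the local mask is all ones, i.e.
            |     # identical to global attention; only longer inputs
            |     # actually get restricted.
            |     assert local_attention_mask(64, 128).all()
            |     assert not local_attention_mask(512, 128).all()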
        
         | LunaSea wrote:
         | Hi Jeremy, do you have plans to adapt this model for different
         | languages?
        
         | newfocogi wrote:
         | Thank you so much for doing this work. I expect many NLP
         | projects and organizations are going to benefit from this, and
         | I'm looking forward to all the models that will be derived from
         | this. I'm already imagining the things I might try to build
         | with it over the holiday break.
         | 
         | Tiny feedback maybe you can pass along to whoever maintains the
         | HuggingFace blog -- the GTE-en-MLM link is broken.
         | 
         | https://huggingface.co/thenlper/gte-en-mlm-large should be
         | https://huggingface.co/Alibaba-NLP/gte-multilingual-mlm-base
        
         | bertobugreport wrote:
          | Trying to fine-tune on a single-rig multi-GPU setup and it
          | crashes; going back down to 1 GPU fixes it and training
          | continues (excited to see its results).
          | 
          | The script is near identical to the one below, updated with
          | the new imports:
         | 
         | https://huggingface.co/docs/transformers/en/tasks/token_clas...
        
       | carschno wrote:
        | The model card says only English, is that correct? Are there any
       | plans to publish a multilingual model or monolingual ones for
       | other languages?
        
         | amunozo wrote:
          | Yes, the paper says it is English-only.
        
       | readthenotes1 wrote:
       | I guess the next release is going to be postmodern bert.
        
       | janalsncm wrote:
       | > encoder-only models add up to over a billion downloads per
       | month, nearly three times more than decoder-only models
       | 
       | This is partially because people using decoders aren't using
       | huggingface at all (they would use an API call) but also because
       | encoders are the unsung heroes of most serious ML applications.
       | 
       | If you want to do any ranking, recommendation, RAG, etc it will
       | probably require an encoder. And typically that meant something
       | in the BERT/RoBERTa/ALBERT family. So this is huge.
        
         | EGreg wrote:
         | Can you go into detail for those of us who aren't as well
         | versed in the tech?
         | 
         | What do the encoders do vs the decoders, in this ecosystem?
         | What are some good links to learn about these concepts on a
          | high level? I find most of the writing about different
         | layers and architectures a bit arcane and inscrutable,
         | especially when it comes to Attention and Self-Attention with
         | multiple heads.
        
           | cubie wrote:
           | On a very high level, for NLP:
           | 
           | 1. an encoder takes an input (e.g. text), and turns it into a
           | numerical representation (e.g. an embedding).
           | 
           | 2. a decoder takes an input (e.g. text), and then extends the
           | text.
           | 
            | (There are also encoder-decoders, but I won't go into those.)
           | 
           | These two simple definitions immediately give information on
           | how they can be used. Decoders are at the heart of text
           | generation models, whereas encoders return embeddings with
           | which you can do further computations. For example, if your
           | encoder model is finetuned for it, the embeddings can be fed
           | through another linear layer to give you classes (e.g. token
           | classification like NER, or sequence classification for full
           | texts). Or the embeddings can be compared with cosine
           | similarity to determine the similarity of questions and
           | answers. This is at the core of information retrieval/search
           | (see https://sbert.net/). Such similarity between embeddings
           | can also be used for clustering, etc.
           | 
           | In my humble opinion (but it's perhaps a dated opinion),
           | (encoder-)decoders are for when your output is text
           | (chatbots, summarization, translation), and encoders are for
           | when your output is literally anything else. Embeddings are
           | your toolbox, you can shape them into anything, and encoders
           | are the wonderful providers of these embeddings.
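            | 
            | As a concrete sketch of the encoder side (using an
            | off-the-shelf Sentence Transformers model here; any
            | embedding finetune would slot in the same way):
            | 
            |     from sentence_transformers import SentenceTransformer, util
            |     
            |     # An encoder turns each text into a fixed-size vector.
            |     model = SentenceTransformer("all-MiniLM-L6-v2")
            |     docs = [
            |         "How do I reset my password?",
            |         "Steps to change your account password",
            |         "Best pizza places nearby",
            |     ]
            |     embeddings = model.encode(docs)
            |     
            |     # Cosine similarity between embeddings drives search,
            |     # clustering, classification features, etc.
            |     print(util.cos_sim(embeddings[0], embeddings[1:]))
            |     # the two password sentences score far higher than the
            |     # pizza one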
        
       | dmezzetti wrote:
        | Great news here. It will take some time for it to trickle
        | downstream, but expect to see better vector embedding models,
        | entity extraction and more.
        
         | cubie wrote:
         | Spot on
        
       | crimsoneer wrote:
       | Answer.ai team are DELIVERING today. Well done Jeremy and team!
        
       | wenc wrote:
       | Can I ask where BERT models are used in production these days?
       | 
       | I was given to understand that they are a better alternative to
       | LLM type models for specific tasks like topic classification
       | because they are trained to discriminate rather than to generate
       | (plus they are bidirectional so they can "understand" context
       | better through lookahead). But LLMs are pretty strong so I wonder
       | if the difference is negligible?
        
       ___________________________________________________________________
       (page generated 2024-12-19 23:01 UTC)