[HN Gopher] A Replacement for BERT
___________________________________________________________________
A Replacement for BERT
Author : cubie
Score : 202 points
Date : 2024-12-19 16:53 UTC (6 hours ago)
(HTM) web link (huggingface.co)
(TXT) w3m dump (huggingface.co)
| jbellis wrote:
  | Looks great, thanks for training this!
  | 
  | - Can I fine tune it with SentenceTransformers?
  | 
  | - I see ColBERT in the benchmarks, is there an
  | answerai-colbert-small-v2 coming soon?
| gunalx wrote:
| Seems like it. They even have example training scripts
| available.
| https://github.com/AnswerDotAI/ModernBERT/blob/main/examples...
|
    | Check out their documentation page, linked at the bottom of
    | the article.
| https://huggingface.co/docs/transformers/main/en/model_doc/m...
| jph00 wrote:
    | The creator of answerai-colbert-small-v1 (bclavie) is also the
    | person who launched the ModernBERT project, so yes, you can
| expect to see a lot of activity in this space! :D
|
| (Also yes, it works great with ST and we provide a full example
| script.)
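    | 
    | (Untested sketch of the general shape, in case it's useful --
    | the example script in the repo is the authoritative version,
    | and the dataset and loss below are just illustrative.)
    | 
    |     from datasets import load_dataset
    |     from sentence_transformers import (
    |         SentenceTransformer,
    |         SentenceTransformerTrainer,
    |     )
    |     from sentence_transformers.losses import (
    |         MultipleNegativesRankingLoss,
    |     )
    |     
    |     # Wrapping the raw checkpoint adds mean pooling on top.
    |     model = SentenceTransformer("answerdotai/ModernBERT-base")
    |     
    |     # Any (query, positive) pair dataset works with this loss.
    |     train_dataset = load_dataset(
    |         "sentence-transformers/natural-questions",
    |         split="train",
    |     )
    |     loss = MultipleNegativesRankingLoss(model)
    |     
    |     trainer = SentenceTransformerTrainer(
    |         model=model, train_dataset=train_dataset, loss=loss
    |     )
    |     trainer.train()
    |     model.save_pretrained("modernbert-retriever")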
| zelias wrote:
| missed opportunity to call it ERNIE
| amrrs wrote:
| I remember back in the day there was an Ernie model
| axpy906 wrote:
      | Don't forget ELMo, the bi-LSTM.
| lrog wrote:
| yep, too late:
| https://huggingface.co/docs/transformers/en/model_doc/ernie
| behnamoh wrote:
| I never liked the names BERT and its derivatives. Of all the
  | names in the world, they chose words that are ugly, specific to
| one culture, and frankly childish.
| Cthulhu_ wrote:
| Sesame Street has been broadcast in 140 countries; Bert (and
| Ernie) have been localized to 18 languages, including Arabic,
| Hindi, Japanese, Hebrew and Chinese, with China having an AI
| called ERNIE because of course.
|
| Or to make an overly worded / researched reply to a petulant
| comment short, they are very much not specific to one
| culture.
| chriswarbo wrote:
| Tangentially:
|
| ERNIE is probably the most famous "computer" in the UK, which
| has been picking winners for the UK's premium bonds scheme
| since the 1950s. It was heavily marketed, to get the public
| used to the new-fangled idea of electronics, and is sometimes
| considered one of the first computers; though (a) it was more
    | of a special-purpose random number generator than a
| computer, and (b) it descended from the earlier Colossus code-
| breaking machines of World War II (though the latter's
| existence was kept secret for decades). The latest ERNIE is
| version 5, which uses quantum effects to generate its random
| numbers (earlier versions used electrical and thermal noise).
|
| https://en.wikipedia.org/wiki/Premium_Bonds#ERNIE
| timClicks wrote:
  | More generally, the prefix "Modern" haunts every product name
  | that uses it. Technologies move fast, and "modern" becomes
  | antiquated very quickly.
| Arcuru wrote:
  | I'm not sure I understand where exactly this slots in, but
| isn't this an embedding model? Shouldn't they be comparing it to
| a service like Voyage AI?
|
| - https://docs.voyageai.com/docs/embeddings
| spott wrote:
    | Embedding models are frequently based on BERT-style models,
    | but BERT models can be finetuned to do a lot more than just
    | embeddings.
    | 
    | So an embedding-focused finetune of ModernBERT should be
    | compared to something like Voyage AI, but not ModernBERT
    | itself.
| KTibow wrote:
| What are the people who keep downloading Bert doing then? Are
| they the minority who directly use it for embeddings?
| spott wrote:
        | I'm honestly not sure why bert-base-uncased is so
        | popular... the model isn't that useful on its own. From
| their huggingface page:
|
| > You can use the raw model for either masked language
| modeling or next sentence prediction, but it's mostly
| intended to be fine-tuned on a downstream task. See the
| model hub to look for fine-tuned versions of a task that
| interests you.
|
| > Note that this model is primarily aimed at being fine-
| tuned on tasks that use the whole sentence (potentially
| masked) to make decisions, such as sequence classification,
| token classification or question answering. For tasks such
| as text generation you should look at model like GPT2.
| metanonsense wrote:
          | I've been out of the game for a year or so (and was
          | never completely in the game), but back then BERT was
          | the basis for lots of interesting applications. The
          | original Vision Transformer (ViT) was based on (or at
          | least inspired by) BERT, and it was used for graph
          | transformers, visual language understanding, etc.
| strangecasts wrote:
          | I think this comes down to the Huggingface libraries
          | defaulting to downloading the model from HF if they
          | cannot locate the weights locally. "Make your own text
          | classifier" tutorial notebooks therefore default to
          | bert-base-uncased as a "standard" pretrained encoder
          | you can put a classification head on top of and
          | finetune, and in turn people run them in Google Colab
          | and just download another copy of the weights on
          | startup, which counts towards the total.
| janalsncm wrote:
    | You're comparing SaaS to open weights. A SaaS will never
    | compete on the flexibility of adding a classification head to
    | BERT (where the gradients flow all the way back), training it,
    | transferring knowledge to a similar domain, distilling it down,
    | pruning layers, fine-tuning some more, etc., which is a common
    | ML workflow.
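    | 
    | A rough sketch of the first step of that workflow (a
    | classification head on top of the encoder, trainable end to
    | end), using the standard transformers API; the model id and
    | label count below are placeholders, and ModernBERT needs a
    | recent transformers release:
    | 
    |     from transformers import (
    |         AutoModelForSequenceClassification,
    |         AutoTokenizer,
    |     )
    |     
    |     # Could also be "bert-base-uncased" or any BERT relative.
    |     model_name = "answerdotai/ModernBERT-base"
    |     tokenizer = AutoTokenizer.from_pretrained(model_name)
    |     # Adds a randomly initialized classification head on top
    |     # of the pretrained encoder.
    |     model = AutoModelForSequenceClassification.from_pretrained(
    |         model_name, num_labels=2
    |     )
    |     
    |     # The whole stack is trainable end to end (e.g. with
    |     # transformers.Trainer); here we just run a forward pass.
    |     inputs = tokenizer("This release looks great!",
    |                        return_tensors="pt")
    |     logits = model(**inputs).logits  # shape: (1, 2)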
| pantsforbirds wrote:
  | Awesome news and something I really want to check out for work.
| Has anyone seen any RAG evals for ModernBERT yet?
| cubie wrote:
    | Not yet - these are base models, or "foundational models".
    | They're great for molding into different use cases via
    | finetuning - better than common models like BERT, RoBERTa,
    | etc., in fact - but like those models, these ModernBERT
    | checkpoints can only do one thing out of the box: mask
    | filling.
|
| For other tasks, such as retrieval, we still need people to
| finetune them for it. The ModernBERT documentation has some
| scripts for finetuning with Sentence Transformers and PyLate
| for retrieval:
| https://huggingface.co/docs/transformers/main/en/model_doc/m...
| But people still need to make and release these models. I have
| high hopes for them.
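    | 
    | If you want to poke at the raw checkpoints, mask filling looks
    | roughly like this (assuming the answerdotai/ModernBERT-base id
    | and a transformers version recent enough to include
    | ModernBERT):
    | 
    |     from transformers import pipeline
    |     
    |     fill_mask = pipeline(
    |         "fill-mask", model="answerdotai/ModernBERT-base"
    |     )
    |     for pred in fill_mask("Paris is the [MASK] of France."):
    |         # Each prediction is a candidate token with a score.
    |         print(pred["token_str"], round(pred["score"], 3))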
| jph00 wrote:
| Hi gang, Jeremy from Answer.AI here. Nice to see this on HN! :)
| We're very excited about this model release -- it feels like it
| could be the basis of all kinds of interesting new startups and
| projects.
|
| In fact, the stuff mentioned in the blog post is only the tip of
  | the iceberg. There are a lot of opportunities to fine tune the
  | model in all kinds of ways, which I expect will go far beyond
  | what we've managed to achieve in our limited exploration so far.
|
| Anyhoo, if anyone has any questions, feel free to ask!
| ZQ-Dev8 wrote:
| Jeremy, this is awesome! Personally excited for a new wave of
| sentence transformers built off ModernBERT. A poster below
| provided the link to a sample ST training script in the
| ModernBERT repo, so that's great.
|
| Do you expect the ModernBERT STs to carry the same advantages
| over ModernBERT that BERT STs had over the original BERT? Or
| would you expect caveats based on ModernBERT's updated
| architecture and capabilities?
| jph00 wrote:
| Yes absolutely the same advantages -- in fact the maintainer
| of ST is on the paper team, and it's been a key goal from day
| one to make this work well.
| derbaum wrote:
| Hey Jeremy, very exciting release! I'm currently building my
| first product with RoBERTa as one central component, and I'm
| very excited to see how ModernBERT compares. Quick question:
| When do you think the first multilingual versions will show up?
    | Any plans to train your own?
| TheTaytay wrote:
| Thank you for this. I can't wait to try this, especially on
| GLiNER tasks.
| querez wrote:
| Two questions:
|
    | 1) Going by the Runtime vs GLUE graph, the ModernBERT-Base is
    | roughly as fast as BERT-Base. Given its architecture
    | (especially Alternating Attention), I'm curious why the model
    | is not considerably faster than its predecessor. Any insight
    | you could share on that?
    | 
    | 2) Most modern LLMs are Encoder+Decoder models. Why not chop
    | off the decoder of one of these (e.g. a small Llama or Mistral
    | or other liberally-licensed model) and train a short head on
    | top?
| yorwba wrote:
| Llama and Mistral are decoder-only models; there is no
| encoder you could put a head on.
|
| You could put it on the decoder instead, but then you have
| the problem that in the causal language-modeling setting that
| the model was trained for, every token can only attend to
| preceding tokens and is blind to subsequent ones.
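      | 
      | A tiny illustration of the difference, just the attention
      | masks (1 means "this position may attend to that one"):
      | 
      |     import torch
      |     
      |     n = 4  # sequence length
      |     # Encoder-style (bidirectional): every token attends to
      |     # every position.
      |     bidirectional = torch.ones(n, n)
      |     # Decoder-style (causal): token i attends only to
      |     # positions 0..i, never to later ones.
      |     causal = torch.tril(torch.ones(n, n))
      |     print(bidirectional)
      |     print(causal)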
| spott wrote:
| ModernBERT-Base is larger than BERT-Base by 39M parameters.
| janalsncm wrote:
| On your second point, most modern LLMs are decoder only. And
| as for why adding a classification head isn't optimal, the
| decoders you're referring to have 10x the parameters, and
| aren't trained on encoder-type tasks like MLM. So there's no
| advantage on any dimension really.
| cubie wrote:
| Beyond what the others have said about 1) ModernBERT-base
| being 149M parameters vs BERT-base's 110M and 2) most LLMs
| being decoder-only models, also consider that alternating
| attention (local vs global) only starts helping once you're
| processing longer texts. With short texts, local attention is
| equivalent to global attention. I'm not sure what length was
| used in the picture, but GLUE is mostly pretty short text.
| LunaSea wrote:
| Hi Jeremy, do you have plans to adapt this model for different
| languages?
| newfocogi wrote:
| Thank you so much for doing this work. I expect many NLP
| projects and organizations are going to benefit from this, and
| I'm looking forward to all the models that will be derived from
| this. I'm already imagining the things I might try to build
| with it over the holiday break.
|
| Tiny feedback maybe you can pass along to whoever maintains the
| HuggingFace blog -- the GTE-en-MLM link is broken.
|
| https://huggingface.co/thenlper/gte-en-mlm-large should be
| https://huggingface.co/Alibaba-NLP/gte-multilingual-mlm-base
| bertobugreport wrote:
    | Trying to fine tune on a single-rig multi-GPU setup and it
    | crashes; going back down to 1 GPU fixes it and training
    | continues (excited to see its results).
    | 
    | The script is near-identical to the one below, updated with
    | new imports:
|
| https://huggingface.co/docs/transformers/en/tasks/token_clas...
| carschno wrote:
  | The model card says English only; is that correct? Are there any
| plans to publish a multilingual model or monolingual ones for
| other languages?
| amunozo wrote:
    | Yes, the paper says that it is English-only.
| readthenotes1 wrote:
| I guess the next release is going to be postmodern bert.
| janalsncm wrote:
| > encoder-only models add up to over a billion downloads per
| month, nearly three times more than decoder-only models
|
| This is partially because people using decoders aren't using
| huggingface at all (they would use an API call) but also because
| encoders are the unsung heroes of most serious ML applications.
|
| If you want to do any ranking, recommendation, RAG, etc it will
| probably require an encoder. And typically that meant something
| in the BERT/RoBERTa/ALBERT family. So this is huge.
| EGreg wrote:
| Can you go into detail for those of us who aren't as well
| versed in the tech?
|
| What do the encoders do vs the decoders, in this ecosystem?
    | What are some good links to learn about these concepts at a
    | high level? I find almost all of the writing about different
    | layers and architectures a bit arcane and inscrutable,
    | especially when it comes to Attention and Self-Attention with
    | multiple heads.
| cubie wrote:
| On a very high level, for NLP:
|
| 1. an encoder takes an input (e.g. text), and turns it into a
| numerical representation (e.g. an embedding).
|
| 2. a decoder takes an input (e.g. text), and then extends the
| text.
|
      | (There are also encoder-decoders, but I won't go into those.)
|
| These two simple definitions immediately give information on
| how they can be used. Decoders are at the heart of text
| generation models, whereas encoders return embeddings with
| which you can do further computations. For example, if your
| encoder model is finetuned for it, the embeddings can be fed
| through another linear layer to give you classes (e.g. token
| classification like NER, or sequence classification for full
| texts). Or the embeddings can be compared with cosine
| similarity to determine the similarity of questions and
| answers. This is at the core of information retrieval/search
| (see https://sbert.net/). Such similarity between embeddings
| can also be used for clustering, etc.
|
| In my humble opinion (but it's perhaps a dated opinion),
| (encoder-)decoders are for when your output is text
| (chatbots, summarization, translation), and encoders are for
| when your output is literally anything else. Embeddings are
| your toolbox, you can shape them into anything, and encoders
| are the wonderful providers of these embeddings.
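      | 
      | A minimal sketch of that "embeddings as a toolbox" idea with
      | sentence-transformers (the model below is an existing
      | BERT-family checkpoint, used purely as a stand-in for the
      | ModernBERT-based models that will hopefully follow):
      | 
      |     from sentence_transformers import SentenceTransformer, util
      |     
      |     model = SentenceTransformer("all-MiniLM-L6-v2")
      |     query = model.encode("How do I reset my password?")
      |     docs = model.encode([
      |         "Click 'Forgot password' on the login page.",
      |         "Our office is open Monday through Friday.",
      |     ])
      |     # Cosine similarity between query and docs:
      |     # higher = more relevant.
      |     print(util.cos_sim(query, docs))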
| dmezzetti wrote:
  | Great news here. It will take some time for this to trickle
  | downstream, but expect to see better vector embedding models,
  | entity extraction and more.
| cubie wrote:
| Spot on
| crimsoneer wrote:
| Answer.ai team are DELIVERING today. Well done Jeremy and team!
| wenc wrote:
| Can I ask where BERT models are used in production these days?
|
| I was given to understand that they are a better alternative to
| LLM type models for specific tasks like topic classification
| because they are trained to discriminate rather than to generate
| (plus they are bidirectional so they can "understand" context
| better through lookahead). But LLMs are pretty strong so I wonder
| if the difference is negligible?
___________________________________________________________________
(page generated 2024-12-19 23:01 UTC)