[HN Gopher] SentenceTransformers: Python framework for sentence,...
       ___________________________________________________________________
        
       SentenceTransformers: Python framework for sentence, text and image
       embeddings
        
       Author : tosh
       Score  : 149 points
       Date   : 2024-04-07 10:23 UTC (12 hours ago)
        
 (HTM) web link (www.sbert.net)
 (TXT) w3m dump (www.sbert.net)
        
       | nmstoker wrote:
        | Should probably have "(2019)" appended to the title as per HN
        | conventions, given the dates in the cited papers (one from
        | 2019, two from 2020) and the fact that this site has been
        | around for quite a few years...
        
         | zwaps wrote:
          | It should be mentioned that the original author, Nils
          | Reimers, has moved on, and the repo was stale from around
          | 2021/2022. It recently (at the end of 2023) got a new team
          | and has since seen updates.
         | 
         | This is obviously quite significant given how important
         | sentence embedding models are.
        
       | PaulHoule wrote:
        | I use these all the time. For many of the classification tasks
        | I do, it works very well to pass an image or text through an
        | embedding model and then apply some kind of classical machine
        | learning like an SVM. Model training is super reliable and
        | takes maybe 3 minutes to train multiple models and cross-
        | validate. In maybe 45 minutes I can train a single fine-tuned
        | model, but the results are really hit-and-miss.
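        | 
        | A minimal sketch of that workflow (scikit-learn assumed; the
        | model name and toy data are purely illustrative):
        | 
        |     from sentence_transformers import SentenceTransformer
        |     from sklearn.model_selection import cross_val_score
        |     from sklearn.svm import SVC
        | 
        |     # Toy labeled data; substitute your own texts and labels.
        |     texts = ["great product, works perfectly",
        |              "broke after one day",
        |              "highly recommend it",
        |              "complete waste of money",
        |              "love it, five stars",
        |              "never buying this again"]
        |     labels = [1, 0, 1, 0, 1, 0]
        | 
        |     model = SentenceTransformer("all-MiniLM-L6-v2")
        |     X = model.encode(texts)        # (6, 384) float array
        | 
        |     clf = SVC(kernel="rbf")
        |     print(cross_val_score(clf, X, labels, cv=3))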
        
         | spxneo wrote:
          | Would love to try this out myself. It seems like there are
          | lighter solutions than relying on a giant LLM.
        
       | gillesjacobs wrote:
        | I've used this library quite a bit. I still have no idea if
        | there is a good reason this API is not packaged within
        | huggingface/transformers.
        | 
        | Probably historical reasons; in any case, a solid 9/10 API.
        
       | anon373839 wrote:
       | These are extremely useful embedding models, and some are small
       | enough to use in the frontend (with transformers.js) for on-
       | device semantic search.
       | 
       | One issue I've run into is that they produce a very good ranking,
       | but the actual cosine similarity scores appear to be meaningless.
       | (E.g., "chocolate chip cookies" and "PLS6;YJBXSRF&/" could have a
       | similarity of 0.8.) Consequently, I've had a hard time selecting
       | sensible cutoff values to balance recall and precision. Has
       | anyone found a good approach to this?
        
         | ilaksh wrote:
         | maybe the normalize_embeddings flag on encode?
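          | 
          | i.e. something like (given a model and a list of sentences):
          | 
          |     emb = model.encode(sentences, normalize_embeddings=True)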
        
           | Clueed wrote:
           | Not sure about the specific implementation here but the very
           | definition of cosine similarity includes normalization. [0]
           | 
           | [0] https://en.wikipedia.org/wiki/Cosine_similarity
        
         | eigenvalue wrote:
         | Yes, check out my library for vector similarity that has
         | various other measures which are more discriminative:
         | 
         | https://github.com/Dicklesworthstone/fast_vector_similarity
         | 
         | pip install fast_vector_similarity
        
         | refibrillator wrote:
         | Even if you choose a more sophisticated similarity measure as
         | suggested by another commenter, you'll still need to set a
         | threshold on that metric to perform binary classification.
         | 
         | In my experience there are two paths forward, the one I
         | recommend is to train an MLP classifier on the embeddings to
         | produce a binary classification (ie similar or not). The
         | advantage is that you no longer need to set a numeric threshold
         | on a distance metric, however you will need labeled training
         | data to define what is "similar" in the context of your use
         | case.
         | 
         | The other path is to calculate the statistics of pair-wise
         | distances for every record in some unlabeled dataset you have,
         | and use the resulting distribution to inform choice of
         | threshold. This will at least give you an estimate of what % of
         | records will be classified as similar for a given threshold
         | value.
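          | 
          | For instance, a sketch of the second path (numpy assumed;
          | the unlabeled corpus is illustrative):
          | 
          |     import numpy as np
          |     from sentence_transformers import SentenceTransformer
          | 
          |     model = SentenceTransformer("all-MiniLM-L6-v2")
          |     docs = ["the cat sat on the mat",
          |             "a dog barked at the mailman",
          |             "stock prices fell sharply",
          |             "kittens nap in the sun",
          |             "markets dropped again today"]
          |     E = model.encode(docs, normalize_embeddings=True)
          |     # Cosine similarity of every unique pair (unit vectors).
          |     sims = (E @ E.T)[np.triu_indices(len(docs), k=1)]
          | 
          |     # A threshold at the 95th percentile would classify ~5%
          |     # of pairs in this corpus as "similar".
          |     print(np.percentile(sims, 95))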
        
           | whakim wrote:
           | I've had a lot of success training a classifier on top of
           | these embeddings. An MLP works well, as does an SVM. If you
           | don't have labeled data, I've also found that various active
           | learning techniques can produce extremely strong results
           | while only requiring a small amount of human labeling.
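            | 
            | A minimal version of the MLP route (scikit-learn assumed;
            | the toy data and labels are illustrative):
            | 
            |     from sklearn.neural_network import MLPClassifier
            |     from sentence_transformers import SentenceTransformer
            | 
            |     texts = ["free crypto, click now",
            |              "meeting moved to 3pm",
            |              "you won a prize!!!",
            |              "lunch tomorrow?",
            |              "claim your reward today",
            |              "draft attached for review"]
            |     labels = [1, 0, 1, 0, 1, 0]   # 1 = spam-ish
            | 
            |     model = SentenceTransformer("all-MiniLM-L6-v2")
            |     X = model.encode(texts)
            |     clf = MLPClassifier(hidden_layer_sizes=(64,),
            |                         max_iter=500).fit(X, labels)
            |     print(clf.predict_proba(model.encode(["win money now"])))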
        
             | abhgh wrote:
              | I second the point about learning a classifier over them -
              | it is practically quite useful. But I'd caution against
              | believing Active Learning (AL) would be unequivocally
              | useful; most positive results seem to arise in very
              | specific circumstances [1]. Esp. see Table 1: a linear SVM
              | with MPNet (one of the embeddings sbert supports) does
              | have strong performance, but, in general, random sampling
              | (over an AL strategy) performs quite well!
             | 
             | [1] https://arxiv.org/pdf/2403.15744v1.pdf
        
             | lmeyerov wrote:
              | Yep, we do precisely this: st + SVM => pre-filtering /
              | routing => RAG pipelines for some of louie.ai.
              | 
              | In our cases, < 100 labels on modern models (per task) go
              | far. We could do better, but there are more useful
              | problems to solve. Impressive how far these have come!
        
           | kurt_goedel wrote:
            | Could you maybe tell me more about this approach? How would
            | I build my training dataset?
           | 
           | Any paper or other source about this approach?
        
           | anon373839 wrote:
           | These are excellent suggestions, and much appreciated.
           | Thanks!
           | 
           | If you train a binary classifier on the embeddings, have you
           | found that the resulting probabilities are also good for
           | ranking? Or do you stick with a distance measure for that?
        
         | pietz wrote:
         | I haven't noticed this behavior in such extreme ways as you.
         | I've worked with many different embeddings using this library
         | and the OpenAI API for data normalization, extraction and
         | structuring tasks.
         | 
         | On one hand it's super impressive what you can get out of these
         | embeddings. On the other I'm with you that there seems to be a
         | missing piece to make them work great.
         | 
         | An additional MLP or Siamese Network works really well but it
         | seems like there should be something easier and unsupervised
         | given your set of vectorized samples.
         | 
         | Thinking out loud, could PCA help with this by centering the
         | data and removing dimensions that mostly model noise?
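          | 
          | A quick way to test that idea (scikit-learn assumed; the
          | array sizes are illustrative stand-ins for real embeddings):
          | 
          |     import numpy as np
          |     from sklearn.decomposition import PCA
          | 
          |     E = np.random.randn(1000, 768)  # stand-in embeddings
          |     pca = PCA(n_components=256)     # centers, keeps top dims
          |     E_red = pca.fit_transform(E)
          |     # Re-normalize if cosine similarity is used afterwards.
          |     E_red /= np.linalg.norm(E_red, axis=1, keepdims=True)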
        
           | anon373839 wrote:
           | > I haven't noticed this behavior in such extreme ways as
           | you.
           | 
           | I should have clarified that it really depends on the model.
           | I have had this problem to a greater extent with the GTE and
           | BGE embeddings. But I use them anyway, because they're so
           | strong overall.
           | 
           | PCA is an interesting idea, and worth looking at.
        
         | YossarianFrPrez wrote:
         | Er, is it possible that you are using
         | `scipy.spatial.distance.cosine` to compute the similarity? If
         | so, note that this computes the cosine distance, and the cosine
         | similarity is defined as _1-cosine distance._
         | 
          | I tried out your example using the following code:
          | 
          |     from sentence_transformers import SentenceTransformer
          |     import scipy.spatial as ssp
          | 
          |     model = SentenceTransformer("all-mpnet-base-v2")
          |     A = model.encode(['chocolate chip cookies',
          |                       'PLS6;YJBXSRF&/'])
          |     CosineDistance = ssp.distance.cosine(A[0], A[1])
         | 
         | Where `CosineDistance == 0.953`
         | 
          | This means the model is actually working quite well: were
          | these two strings similar to each other, we'd expect
          | CosineDistance to be much closer to 0.
         | 
         | The other comments about such distances being useful for
          | _relative comparisons_ also apply: I've used
         | SentenceTransformers quite successfully for nearest-neighbor
         | searches.
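          | 
          | For nearest-neighbor search, the library's built-in helper
          | works well; a sketch (corpus and query are illustrative):
          | 
          |     from sentence_transformers import SentenceTransformer, util
          | 
          |     model = SentenceTransformer("all-mpnet-base-v2")
          |     corpus = ["chocolate chip cookies",
          |               "oatmeal raisin cookies",
          |               "bubonic plague", "PLS6;YJBXSRF&/"]
          |     corpus_emb = model.encode(corpus, convert_to_tensor=True)
          |     query_emb = model.encode("homemade cookie recipes",
          |                              convert_to_tensor=True)
          |     hits = util.semantic_search(query_emb, corpus_emb,
          |                                 top_k=3)[0]
          |     for h in hits:
          |         print(corpus[h["corpus_id"]], h["score"])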
        
         | refulgentis wrote:
          | Something is off if you're seeing behavior like that; with
          | MiniLM L6 V3, scores of 0.1-0.15 are good for "has any
          | relevancy whatsoever".
        
           | anon373839 wrote:
           | I should have added that it depends on the model. MiniLM
           | didn't exhibit this behavior, but it unfortunately didn't
           | perform as well on recall or ranking as other models that
           | did.
           | 
           | GTE comes to mind. You can try the demo widget on HuggingFace
           | and see this: https://huggingface.co/thenlper/gte-large
           | 
           | As an example, against "Chocolate chip cookies", "Oreos" has
           | a cosine similarity of .808, while "Bubonic plague" is at
           | .709.
        
         | fermisea wrote:
          | Bootstrap a p-value: create a set of thousands of random
          | words, calculate the metric for those, and either keep those
          | numbers explicitly to compute a rank, or fit a normal
          | distribution and use its mean and std to estimate the
          | probability that a pair is similar.
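          | 
          | A rough sketch of that idea (numpy/scipy assumed; the word
          | list is illustrative):
          | 
          |     import numpy as np
          |     from scipy.stats import norm
          |     from sentence_transformers import SentenceTransformer
          | 
          |     model = SentenceTransformer("all-MiniLM-L6-v2")
          |     vocab = ["apple", "engine", "cloud", "violin", "glacier",
          |              "token", "harbor", "pixel", "lantern", "orbit"]
          |     E = model.encode(vocab, normalize_embeddings=True)
          |     null = (E @ E.T)[np.triu_indices(len(vocab), k=1)]
          | 
          |     s = 0.8                  # observed similarity score
          |     p = 1 - norm.cdf(s, null.mean(), null.std())
          |     print(p)                 # P(similarity >= s) under null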
        
       | mapmeld wrote:
        | Major kudos to this library for supporting
        | Matryoshka/nested/adaptive embeddings
        | (https://huggingface.co/blog/matryoshka), which I needed
        | recently to train a model.
        
       | edshiro wrote:
       | I don't have much experience with embeddings...
       | 
       | Could someone more knowledgeable suggest when it would make sense
       | to use the SentenceTransformers library vs for instance relying
       | on the OpenAI API to get embeddings for a sentence?
        
         | rolisz wrote:
          | Up until a month ago, the OpenAI embeddings were very poor.
          | But they recently released a new model which is much better
          | than their previous one.
          | 
          | Now it depends on the specific use case (domain, language,
          | length of texts).
        
         | montebicyclelo wrote:
          | It's fairly easy to use and not that compute-intensive (e.g.
          | it can run on even a small-ish CPU VM), the embeddings tend
          | to perform well, and you can avoid sending your data to a
          | third party. Also, there are models fine-tuned for particular
          | domains on the HF Hub that can potentially give better
          | embeddings for content in that domain.
        
           | edshiro wrote:
            | I see - thanks for the clarifications.
            | 
            | I presume if your customers are enterprise companies, then
            | you may opt to use this library vs. sending their data to
            | OpenAI etc.
            | 
            | And you can get more customisation/fine-tuning from this
            | library too.
        
           | estreeper wrote:
            | Just to add to this, a great resource is the Massive Text
            | Embedding Benchmark (MTEB) leaderboard, which you can use
            | to find good models to evaluate. There are many open models
            | that outperform e.g. OpenAI's text-embedding-ada-002
            | (currently ranked #46 for retrieval), and which you can use
            | with SentenceTransformers.
           | 
           | https://huggingface.co/spaces/mteb/leaderboard
        
         | tinyhouse wrote:
          | Embeddings are one of those things for which using OpenAI (or
          | any other provider) isn't really necessary. There are many
          | small open-source embedding models that perform very well.
          | Plus, you can fine-tune them on your task. You can also run
          | them locally and not worry about all the constraints
          | (latency, rate limits, etc.) of using an external provider
          | endpoint. If performance is important for you, then you'll
          | need a GPU.
         | 
          | The main reason to use one of those providers is if you want
          | something that performs well out of the box without doing any
          | work and you don't mind paying for it. Companies like OpenAI,
          | Cohere and others have already done the work to make those
          | models perform well across various domains. They may also use
          | larger models that are not as easy to deal with yourself
          | (although, as I mentioned previously, a small embedding model
          | fine-tuned on your task is likely to perform as well as a
          | much bigger general model).
        
         | VHRanger wrote:
          | You should basically never use the OpenAI embeddings.
          | 
          | There isn't a single use case where they're better than the
          | free models, and they're slower, needlessly large, and
          | outrageously expensive for what they are.
        
       | marban wrote:
       | Also to be used with
       | https://maartengr.github.io/BERTopic/index.html
        
       | deepsdev wrote:
        | I have used this library for a few years and it is reliable.
        | As someone mentioned, sometimes two things that are not related
        | can have the same cosine similarity as things that are. Easy to
        | use and get started with.
        
       | estreeper wrote:
       | I'm curious how people are handling multi-lingual embeddings.
       | 
       | I've found LASER[1] which originally had the idea to embed all
       | languages in the same vector space, though it's a bit harder to
       | use than models available through SentenceTransformers. LASER2
       | stuck with this approach, but LASER3 switched to language-
       | specific models. However, I haven't found benchmarks for these
       | models, and they were released about 2 years ago.
       | 
       | Another alternative would be to translate everything before
       | embedding, which would introduce some amount of error, though
       | maybe it wouldn't be significant.
       | 
       | 1. https://github.com/facebookresearch/LASER
        
         | VHRanger wrote:
          | The transformer models handle multilingual text directly.
         | 
          | For good old embedding models (e.g. GloVe), you have a few
          | choices:
          | 
          | 1. LASER, as you mentioned. The performance tends to suck,
          | though.
          | 
          | 2. Language prediction + one embedding model per supported
          | language. Libraries like whichlang make this nice, and MUSE
          | has embedding models aligned across languages for ~100
          | languages.
          | 
          | Fastembed is a good library for this.
          | 
          | Note that for most people, 32-dimensional GloVe is all they
          | need, if you benchmark it.
         | 
          | As the length of the text you're embedding goes up, or as the
          | specificity goes up (e.g. you have only medical documents and
          | want to tell the difference between them), you'll need richer
          | embeddings (more dimensions, or a transformer model, or
          | both).
          | 
          | People never benchmark their embeddings, and I find it
          | incredible how they end up with needlessly overengineered
          | systems.
        
           | gregw134 wrote:
            | Any idea which model has the best performance across
            | languages? I'm checking out model performance on the
            | Huggingface leaderboard, and the top models for English
            | aren't even in the top 20 for Polish, French and Chinese.
        
             | VHRanger wrote:
              | Depends on what your use case is.
              | 
              | For the normal user that just wants something across
              | languages, the minilm-paraphrase-multilingual in the OP
              | library is great.
              | 
              | If you want better than that (either a bigger model, or
              | one specifically for a subset of languages, etc.), then
              | you need to think about your task, priorities, target
              | languages, etc.
        
       | andai wrote:
       | Has anyone used FlagEmbedding? I'm testing a model that comes
       | with examples for both SentenceTransformers and FlagEmbedding,
       | but it's hard to find any information about it.
        
       | riku_iki wrote:
        | It doesn't look like they fine-tune those models on any modern
        | foundation models, which likely produces a huge performance gap
        | compared to, for example, the OpenAI embeddings.
        
       | andai wrote:
        | What is everyone using embeddings for, and which models? I
        | built a RAG system last year (for searching long documents)
        | but found it a bit disappointing. I tested it with OpenAI and
        | with SentenceTransformers (instructor-xl or a smaller variant).
        | Apparently they've come a long way since then, though.
        | 
        | Currently I'm working on an old-fashioned search engine (tf-idf
        | + keyword expansion). Apparently that can work better than
        | vector databases in some cases:
       | 
       | https://news.ycombinator.com/item?id=38703943
        
       | itissid wrote:
        | I recall paragraph2vec. It was the earliest way to experiment
        | with paragraph embeddings that I recall reading about. At
        | first, when I read it, it felt kind of crazy/weird that it
        | worked:
        | 
        | You feed in the paragraph ID (yes, just an integer identifying
        | the paragraph where the word occurs) via its own layer, along
        | with the word, to a CBOW/skip-gram setup, then throw away that
        | part after training. Then, during inference, you attach a new,
        | randomly initialized layer where that old layer was and
        | "re-train" just that part before generating the embedding for
        | the new paragraph.
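        | 
        | gensim's Doc2Vec is a common implementation of this; a sketch
        | with an illustrative corpus:
        | 
        |     from gensim.models.doc2vec import Doc2Vec, TaggedDocument
        | 
        |     corpus = ["the cat sat on the mat",
        |               "stock prices fell sharply today"]
        |     docs = [TaggedDocument(words=t.split(), tags=[i])
        |             for i, t in enumerate(corpus)]
        |     model = Doc2Vec(docs, vector_size=50, min_count=1,
        |                     epochs=40)
        | 
        |     # Inference: a fresh paragraph vector is trained for unseen
        |     # text while the word weights stay frozen.
        |     vec = model.infer_vector("a kitten sat on the rug".split())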
        
       | Der_Einzige wrote:
        | >tfw it's 2024 and people STILL aren't using span compression
        | to implement a "medium-term" memory (RAG is long-term and the
        | context length is short-term) for LLMs.
        | 
        | >tfw it's 2024 and we just accept that the context "falls out"
        | of the model if we push it beyond its regular context length.
        | 
        | So everyone forgot that we can pack a large number of tokens
        | into a small number of embeddings because???
        
         | register wrote:
          | What do you mean by span compression? We have experimented
          | with various embedding context lengths and have found that
          | bigger embeddings aren't the ones providing the best recall.
          | We have hit the best results with something between 65% and
          | 75% of the maximum embedding context length. We have been
          | using OpenAI embedding models, though.
        
       | spxneo wrote:
        | How are people using this with/without an LLM?
        | 
        | Is the appeal more "reliable and accurate, without
        | hallucination" compared to an LLM?
        | 
        | Where are you using it? A text interface?
        
       | gregw134 wrote:
       | How are you guys deciding what parts of a document to turn into
       | embeddings? I've heard paragraph embeddings aren't that reliable,
       | so I'm planning on using tf-idf first to extract keywords from a
       | document, and then just create embeddings from those keywords.
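        | 
        | A sketch of that keyword-first plan (scikit-learn assumed; the
        | documents are illustrative):
        | 
        |     import numpy as np
        |     from sklearn.feature_extraction.text import TfidfVectorizer
        |     from sentence_transformers import SentenceTransformer
        | 
        |     docs = ["The quarterly report shows revenue grew 9%.",
        |             "Patients responded well to the new treatment."]
        |     vec = TfidfVectorizer(stop_words="english")
        |     X = vec.fit_transform(docs).toarray()
        |     terms = np.array(vec.get_feature_names_out())
        | 
        |     model = SentenceTransformer("all-MiniLM-L6-v2")
        |     for row in X:
        |         top = terms[row.argsort()[-5:]]   # top-5 keywords
        |         emb = model.encode(" ".join(top)) # embed keywords only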
        
         | nestorD wrote:
         | Take your document, cut it (cleanly!) into pieces small enough
         | to fit into your sentence embedder's context window, and
         | generate _several_ embeddings that all point to the same
         | document.
         | 
         | I would recommend against merging (averaging, etc.) the
         | embeddings (unless you want a blurry idea of what your document
         | contains), as well as feeding very large pieces of text to the
         | embedder (some models have massive context lengths, but the
         | result is similarly vague).
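          | 
          | A naive version of this, splitting on blank lines (names and
          | text are illustrative):
          | 
          |     from sentence_transformers import SentenceTransformer
          | 
          |     model = SentenceTransformer("all-MiniLM-L6-v2")
          |     doc_id = 42
          |     text = "First paragraph...\n\nSecond paragraph..."
          |     chunks = [c.strip() for c in text.split("\n\n")
          |               if c.strip()]
          |     embeddings = model.encode(chunks)
          | 
          |     # Every chunk embedding points back to the same document.
          |     index = [(doc_id, i, e) for i, e in enumerate(embeddings)]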
        
           | gregw134 wrote:
           | > [recommend against] feeding very large pieces of text to
           | the embedder
           | 
           | Sounds right, I've heard this from multiple sources. That's
           | why I'm leaning towards just embedding the keywords.
        
         | therealdrag0 wrote:
            | Keywords would leave you with the semantics of word
            | definitions but lose sentence meaning/context, right?
        
           | gregw134 wrote:
           | I'm sure it would lose a ton of meaning, but for me it's
           | easier to fit into a traditional search pipeline.
        
       ___________________________________________________________________
       (page generated 2024-04-07 23:00 UTC)