[HN Gopher] SentenceTransformers: Python framework for sentence,...
___________________________________________________________________
SentenceTransformers: Python framework for sentence, text and image
embeddings
Author : tosh
Score : 149 points
Date : 2024-04-07 10:23 UTC (12 hours ago)
(HTM) web link (www.sbert.net)
(TXT) w3m dump (www.sbert.net)
| nmstoker wrote:
| Should probably have "(2019)" appended to the title as per HN
| conventions, given the date in the citation paper links (1x 2019
| and 2x 2020) and the fact that this site has been around for
| quite a few years...
| zwaps wrote:
| It should be mentioned that the original author, Nils Reimers,
| has moved on and the repo was largely stale from 2021/2022
| onward. It got a new maintainer team around the end of 2023 and
| has since seen updates.
|
| This is obviously quite significant given how important
| sentence embedding models are.
| PaulHoule wrote:
| I use these all the time. For many of the classification tasks I
| do, it works very well to pass an image or text through an
| embedding model and then apply some kind of classical machine
| learning like an SVM. Model training is super reliable and takes
| maybe 3 minutes to train multiple models and cross-validate. In
| maybe 45 minutes I can train a single fine-tuned model instead,
| but the results are really hit and miss.
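|
| A rough sketch of that pattern (toy data and model name are
| placeholders, assuming scikit-learn is available):
|
      from sentence_transformers import SentenceTransformer
      from sklearn.model_selection import cross_val_score
      from sklearn.svm import LinearSVC

      # Toy labeled data; in practice use your own corpus.
      texts = ["great movie, loved it", "terrible plot, awful acting",
               "fantastic soundtrack", "boring and way too long"]
      labels = [1, 0, 1, 0]

      model = SentenceTransformer("all-MiniLM-L6-v2")
      X = model.encode(texts)              # shape: (n_texts, embedding_dim)

      # Classical ML on top of the embeddings: fast to train and validate.
      clf = LinearSVC()
      print(cross_val_score(clf, X, labels, cv=2))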
| spxneo wrote:
| would love to try this out myself. seems like there are lighter
| solutions than relying on a giant LLM
| gillesjacobs wrote:
| Used this library quite a bit. I still have no idea if there is a
| good reason this API is not packaged within
| huggingface/transformers.
|
| Probably historical reasons; anyway, solid 9/10 API.
| anon373839 wrote:
| These are extremely useful embedding models, and some are small
| enough to use in the frontend (with transformers.js) for on-
| device semantic search.
|
| One issue I've run into is that they produce a very good ranking,
| but the actual cosine similarity scores appear to be meaningless.
| (E.g., "chocolate chip cookies" and "PLS6;YJBXSRF&/" could have a
| similarity of 0.8.) Consequently, I've had a hard time selecting
| sensible cutoff values to balance recall and precision. Has
| anyone found a good approach to this?
| ilaksh wrote:
| maybe the normalize_embeddings flag on encode?
| Clueed wrote:
| Not sure about the specific implementation here but the very
| definition of cosine similarity includes normalization. [0]
|
| [0] https://en.wikipedia.org/wiki/Cosine_similarity
| eigenvalue wrote:
| Yes, check out my library for vector similarity that has
| various other measures which are more discriminative:
|
| https://github.com/Dicklesworthstone/fast_vector_similarity
|
| pip install fast_vector_similarity
| refibrillator wrote:
| Even if you choose a more sophisticated similarity measure as
| suggested by another commenter, you'll still need to set a
| threshold on that metric to perform binary classification.
|
| In my experience there are two paths forward, the one I
| recommend is to train an MLP classifier on the embeddings to
| produce a binary classification (ie similar or not). The
| advantage is that you no longer need to set a numeric threshold
| on a distance metric, however you will need labeled training
| data to define what is "similar" in the context of your use
| case.
|
| The other path is to calculate the statistics of pair-wise
| distances for every record in some unlabeled dataset you have,
| and use the resulting distribution to inform choice of
| threshold. This will at least give you an estimate of what % of
| records will be classified as similar for a given threshold
| value.
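|
| A rough sketch of the second path (placeholder texts and model,
| assuming scikit-learn for the pairwise similarities):
|
      import numpy as np
      from sklearn.metrics.pairwise import cosine_similarity
      from sentence_transformers import SentenceTransformer

      # Unlabeled sample of the corpus (placeholders).
      texts = ["chocolate chip cookies", "oatmeal raisin recipe",
               "bubonic plague history", "quarterly earnings report"]

      model = SentenceTransformer("all-mpnet-base-v2")
      emb = model.encode(texts)

      # All pairwise similarities, upper triangle only (no self-pairs).
      sims = cosine_similarity(emb)
      pairwise = sims[np.triu_indices(len(texts), k=1)]

      # The empirical distribution shows what fraction of pairs a given
      # threshold would call "similar".
      for t in (0.3, 0.5, 0.7):
          print(t, float((pairwise >= t).mean()))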
| whakim wrote:
| I've had a lot of success training a classifier on top of
| these embeddings. An MLP works well, as does an SVM. If you
| don't have labeled data, I've also found that various active
| learning techniques can produce extremely strong results
| while only requiring a small amount of human labeling.
| abhgh wrote:
| I second the point about learning a classifier over them -
| it is practically quite useful. But I'd caution against
| believing Active Learning (AL) would be unequivocally
| useful; most positive results seem to arise in very
| specific circumstances [1]. Esp. see Table 1: a Linear SVM
| with MPNet (one of the embeddings sbert supports) has
| indeed strong performance, but, in general, random sampling
| (over an AL strategy) performs quite well!
|
| [1] https://arxiv.org/pdf/2403.15744v1.pdf
| lmeyerov wrote:
| Yep, we do precisely this: sentence-transformers + SVM => pre-
| filtering / routing => RAG pipelines for some of louie.ai.
|
| In our cases, < 100 labels per task on modern models goes
| far. We could do better, but there are more useful problems to
| solve. Impressive how far these have come!
| kurt_goedel wrote:
| Could you maybe tell me more about this approach? How do I
| have to build my training dataset?
|
| Any paper or other source about this approach?
| anon373839 wrote:
| These are excellent suggestions, and much appreciated.
| Thanks!
|
| If you train a binary classifier on the embeddings, have you
| found that the resulting probabilities are also good for
| ranking? Or do you stick with a distance measure for that?
| pietz wrote:
| I haven't noticed this behavior in such extreme ways as you.
| I've worked with many different embeddings using this library
| and the OpenAI API for data normalization, extraction and
| structuring tasks.
|
| On one hand it's super impressive what you can get out of these
| embeddings. On the other I'm with you that there seems to be a
| missing piece to make them work great.
|
| An additional MLP or Siamese Network works really well but it
| seems like there should be something easier and unsupervised
| given your set of vectorized samples.
|
| Thinking out loud, could PCA help with this by centering the
| data and removing dimensions that mostly model noise?
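|
| One way to try the PCA idea, as a rough sketch (whether it helps
| is exactly the open question; the corpus and variance cutoff are
| placeholders):
|
      from sklearn.decomposition import PCA
      from sklearn.metrics.pairwise import cosine_similarity
      from sentence_transformers import SentenceTransformer

      corpus = ["chocolate chip cookies", "oatmeal raisin recipe",
                "bubonic plague history", "quarterly earnings report",
                "double chocolate brownies"]

      model = SentenceTransformer("all-mpnet-base-v2")
      emb = model.encode(corpus)

      # fit_transform centers the data; keeping 95% of the variance drops
      # the trailing components, which are assumed (not proven) to be noise.
      pca = PCA(n_components=0.95, svd_solver="full")
      reduced = pca.fit_transform(emb)

      print(cosine_similarity(reduced[:1], reduced[1:]))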
| anon373839 wrote:
| > I haven't noticed this behavior in such extreme ways as
| you.
|
| I should have clarified that it really depends on the model.
| I have had this problem to a greater extent with the GTE and
| BGE embeddings. But I use them anyway, because they're so
| strong overall.
|
| PCA is an interesting idea, and worth looking at.
| YossarianFrPrez wrote:
| Er, is it possible that you are using
| `scipy.spatial.distance.cosine` to compute the similarity? If
| so, note that this computes the cosine distance, and the cosine
| similarity is defined as _1-cosine distance._
|
| I tried out your example using the following code:

      from sentence_transformers import SentenceTransformer
      import scipy.spatial as ssp

      model = SentenceTransformer("all-mpnet-base-v2")
      A = model.encode(['chocolate chip cookies', 'PLS6;YJBXSRF&/'])
      CosineDistance = ssp.distance.cosine(A[0], A[1])
|
| Where `CosineDistance == 0.953`
|
| This means the model is actually working quite well: were these
| two strings similar to each other, we'd expect CosineDistance to
| be much closer to 0.
|
| The other comments about such distances being useful for
| _relative comparisons_ also apply: I've used
| SentenceTransformers quite successfully for nearest-neighbor
| searches.
| refulgentis wrote:
| Something is off if you're seeing behavior like that; 0.1-0.15
| with MiniLM L6 V3 is a good threshold for "has any relevancy
| whatsoever".
| anon373839 wrote:
| I should have added that it depends on the model. MiniLM
| didn't exhibit this behavior, but it unfortunately didn't
| perform as well on recall or ranking as other models that
| did.
|
| GTE comes to mind. You can try the demo widget on HuggingFace
| and see this: https://huggingface.co/thenlper/gte-large
|
| As an example, against "Chocolate chip cookies", "Oreos" has
| a cosine similarity of .808, while "Bubonic plague" is at
| .709.
| fermisea wrote:
| Bootstrap a p-value: create a set of thousands of random
| words, calculate the metric for those, and either keep these
| numbers explicitly to compute a rank, or fit a normal
| distribution and use its mean and std to estimate the
| probability that a given pair is truly similar.
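|
| A rough sketch of that idea (model name, query, and the number
| of random words are placeholders):
|
      import random
      import string
      from sklearn.metrics.pairwise import cosine_similarity
      from sentence_transformers import SentenceTransformer

      model = SentenceTransformer("all-mpnet-base-v2")

      def random_word(n=10):
          return "".join(random.choices(string.ascii_lowercase, k=n))

      # Null distribution: similarity of the query against random gibberish.
      query_emb = model.encode(["chocolate chip cookies"])
      noise_emb = model.encode([random_word() for _ in range(1000)])
      null_sims = cosine_similarity(query_emb, noise_emb).ravel()
      mu, sigma = null_sims.mean(), null_sims.std()

      # Score a candidate as standard deviations above the noise baseline.
      cand_emb = model.encode(["Oreos"])
      z = (cosine_similarity(query_emb, cand_emb).item() - mu) / sigma
      print(z)   # large z => unlikely to be this similar by chance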
| mapmeld wrote:
| Major kudos to this library for supporting
| Matryoshka/nested/adaptive embeddings
| (https://huggingface.co/blog/matryoshka), which I needed to train
| a model recently
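|
| For anyone unfamiliar: Matryoshka-trained models are set up so
| that a prefix of the embedding dimensions still works as a
| smaller embedding. A rough sketch of inference-time truncation
| (the model name is a stand-in; this only pays off with a model
| actually trained with a Matryoshka objective):
|
      import numpy as np
      from sentence_transformers import SentenceTransformer

      model = SentenceTransformer("all-mpnet-base-v2")   # stand-in model name
      emb = model.encode(["chocolate chip cookies", "Oreos"])

      # Keep only the first 256 dimensions, then re-normalize before
      # computing cosine similarities.
      small = emb[:, :256]
      small = small / np.linalg.norm(small, axis=1, keepdims=True)
      print(small @ small.T)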
| edshiro wrote:
| I don't have much experience with embeddings...
|
| Could someone more knowledgeable suggest when it would make sense
| to use the SentenceTransformers library vs for instance relying
| on the OpenAI API to get embeddings for a sentence?
| rolisz wrote:
| Up until a month ago, the OpenAI embeddings were very poor.
| But they recently released a new model which is much better
| than their previous one.
|
| Now it depends on the specific use case (domain, language,
| length of texts).
| montebicyclelo wrote:
| It's fairly easy to use, not that compute-intensive (e.g. it can
| run on even a small-ish CPU VM), the embeddings tend to perform
| well, and you can avoid sending your data to a third party.
| Also, there are models fine-tuned for particular domains on the
| HF Hub that can potentially give better embeddings for content
| in that domain.
| edshiro wrote:
| I see - thanks for the clarifications
|
| I presume if your customers are enterprise companies then you
| may opt to use this library vs sending their data to OpenAI
| etc.
|
| And you can get more customisation/fine-tuning from this
| library too.
| estreeper wrote:
| Just to add to this, a great resource is the Massive Text
| Embedding Benchmark (MTEB) leaderboard which you can use to
| find good models to evaluate, and there are many open models
| that outperform e.g. OpenAI's text-embedding-ada-002,
| currently ranked #46 for retrieval, which you can use with
| SentenceTransformers.
|
| https://huggingface.co/spaces/mteb/leaderboard
| tinyhouse wrote:
| Embeddings are one of those things where using OpenAI (or any
| other provider) isn't really necessary. There are many small
| open source embedding models that perform very well. Plus, you
| can finetune them on your task. You can also run locally and
| not worry about all the constraints (latency, rate limits etc)
| of using an external provider endpoint. If performance is
| important for you, then you'll need a GPU.
|
| The main reason to use one of those providers is if you want
| something that performs well out of the box without doing any
| work and you don't mind paying for it. Those companies like
| OpenAI, Cohere and others, already did the work to make those
| models work well on various domains. They may also use larger
| models that are not as easy to deal with yourself. (although as
| I mentioned previously, a small embeddings model fine-tuned on
| your task is likely to perform as well as a much bigger general
| model)
| VHRanger wrote:
| You should basically never use the openAI embeddings.
|
| There isn't a single usecase where they're better than the free
| models, and they're slower, needlessly large, and outrageously
| expensive for what they are.
| marban wrote:
| Also to be used with
| https://maartengr.github.io/BERTopic/index.html
| deepsdev wrote:
| I have used this library for a few years and it is reliable. As
| someone mentioned, sometimes two things that are not related can
| have a similarly high cosine similarity. Easy to use and get
| started with.
| estreeper wrote:
| I'm curious how people are handling multi-lingual embeddings.
|
| I've found LASER[1] which originally had the idea to embed all
| languages in the same vector space, though it's a bit harder to
| use than models available through SentenceTransformers. LASER2
| stuck with this approach, but LASER3 switched to language-
| specific models. However, I haven't found benchmarks for these
| models, and they were released about 2 years ago.
|
| Another alternative would be to translate everything before
| embedding, which would introduce some amount of error, though
| maybe it wouldn't be significant.
|
| 1. https://github.com/facebookresearch/LASER
| VHRanger wrote:
| The transformer models handle multilingual directly.
|
| For good old embedding models (e.g. GloVe), you have a few
| choices:
|
| 1. LASER as you mentioned. The performance tends to suck
| though.
|
| 2. Language prediction + one embedding model per supported
| language. Libraries like whichlang make this nice, and MUSE has
| embedding models aligned per language for 100ish languages.
|
| Fastembed is a good library for this.
|
| Note that for most people, 32-dimensional GloVe is all they need
| if you benchmark it.
|
| As the length of the text you're embedding goes up, or as the
| specificity goes up (e.g. you have only medical documents and
| want to distinguish between them), you'll need richer embeddings
| (more dimensions, or a transformer model, or both).
|
| People never benchmark their embeddings and I find it
| incredible how they end up with needlessly overengineered
| systems.
| gregw134 wrote:
| Any idea which model has the best performance across
| languages? I'm checking out model performance on the
| Huggingface leaderboard and the top models for English aren't
| even in the top 20 for Polish, French and Chinese
| VHRanger wrote:
| Depends on what your usecase is.
|
| For the normal user that just wants something across
| languages, the minilm-paraphrase-multilingual in the OP
| library is great.
|
| If you want better than that (either a bigger model, or one
| specifically for a subset of languages, etc.), then you need
| to think about your task, priorities, target languages,
| etc.
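|
| A quick sketch with that model (presumably referring to
| paraphrase-multilingual-MiniLM-L12-v2; the sentences are
| placeholders):
|
      from sentence_transformers import SentenceTransformer, util

      # Multilingual model: sentences in different languages share one space.
      model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
      emb = model.encode(["The cat sits on the mat",
                          "Le chat est assis sur le tapis",
                          "Quarterly earnings rose 5%"])

      # The French translation should score far higher than the unrelated text.
      print(util.cos_sim(emb[0:1], emb[1:]))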
| andai wrote:
| Has anyone used FlagEmbedding? I'm testing a model that comes
| with examples for both SentenceTransformers and FlagEmbedding,
| but it's hard to find any information about it.
| riku_iki wrote:
| It doesn't look like they fine-tune those models on any modern
| foundation models, which likely produces a huge performance gap
| compared to OpenAI embeddings, for example.
| andai wrote:
| What is everyone using embeddings for, and which models? I built a
| RAG last year (for searching long documents) but found it a bit
| disappointing. I tested it with OpenAI and with
| SentenceTransformers (instructor-xl or a smaller variant).
| Apparently they've come a long way since then though.
|
| Currently I'm working on an old-fashioned search engine (tf-idf +
| keyword expansion). Apparently that can work better than vector
| databases in some cases:
|
| https://news.ycombinator.com/item?id=38703943
| itissid wrote:
| I recall paragraph2Vec. It was the earliest experiment with
| paragraph embeddings that I remember reading about. At first
| when I read it, it felt kind of crazy/weird that it worked:
|
| You feed in the paragraph ID (yes, just an integer identifying
| the paragraph the word came from) via its own layer, along with
| the word, to a CBOW/SkipGram setup, then throw that part away
| after training. Then, during inference, you attach a new
| randomly initialized layer where that old layer was and
| "re-train" just that part before generating the embedding for
| the new paragraph.
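|
| Roughly what that looks like with gensim's Doc2Vec (an
| implementation of paragraph2vec); the parameters and toy texts
| here are arbitrary:
|
      from gensim.models.doc2vec import Doc2Vec, TaggedDocument

      # Each paragraph gets an integer tag -- the "paragraph ID" fed in
      # alongside the words during training.
      texts = ["the cat sat on the mat", "stock markets fell sharply today"]
      docs = [TaggedDocument(words=t.split(), tags=[i])
              for i, t in enumerate(texts)]

      model = Doc2Vec(docs, vector_size=50, window=3, min_count=1, epochs=40)

      # Inference initializes a fresh paragraph vector and trains only that
      # part against the frozen weights -- the "re-train just that part" step.
      vec = model.infer_vector("the dog sat on the rug".split())
      print(vec[:5])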
| Der_Einzige wrote:
| >tfw it's 2024 and people STILL aren't using span compression to
| implement a "medium term" memory (RAG is long term and the
| context length is short term) for LLMs.
|
| >tfw it's 2024 and we just accept that the context "falls out" of
| the model if we push it beyond its regular context length
|
| So everyone forgot that we can put large N number of tokens into
| small N number of embeddings because???
| register wrote:
| What do you mean by span compression? We have experimented with
| various embedding context lengths and we have found that the
| longest inputs aren't the ones providing the best recall. We
| have hit the best results with something between 65% and 75% of
| the maximum embedding context length. We have been using OpenAI
| embedding models, though.
| spxneo wrote:
| how are people using this with/without an LLM?
|
| is the appeal more that it's "reliable and accurate without
| hallucination" compared to an LLM?
|
| where are you using it? a text interface?
| gregw134 wrote:
| How are you guys deciding what parts of a document to turn into
| embeddings? I've heard paragraph embeddings aren't that reliable,
| so I'm planning on using tf-idf first to extract keywords from a
| document, and then just create embeddings from those keywords.
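|
| A rough sketch of that keyword-first approach (toy documents, an
| arbitrary top-10 cutoff, assuming scikit-learn for tf-idf):
|
      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sentence_transformers import SentenceTransformer

      docs = ["a long document about chocolate chip cookie recipes and baking",
              "a long document about vector databases and semantic search"]

      # Rank terms per document by tf-idf, keep the top few as "keywords".
      vectorizer = TfidfVectorizer(stop_words="english")
      tfidf = vectorizer.fit_transform(docs).toarray()
      terms = np.array(vectorizer.get_feature_names_out())

      model = SentenceTransformer("all-MiniLM-L6-v2")
      doc_embeddings = []
      for row in tfidf:
          keywords = terms[np.argsort(row)[::-1][:10]]
          doc_embeddings.append(model.encode(" ".join(keywords)))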
| nestorD wrote:
| Take your document, cut it (cleanly!) into pieces small enough
| to fit into your sentence embedder's context window, and
| generate _several_ embeddings that all point to the same
| document.
|
| I would recommend against merging (averaging, etc.) the
| embeddings (unless you want a blurry idea of what your document
| contains), as well as feeding very large pieces of text to the
| embedder (some models have massive context lengths, but the
| result is similarly vague).
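|
| A rough sketch of that setup (naive word-count chunking; real
| code should cut on clean boundaries like paragraphs or
| sentences):
|
      from sentence_transformers import SentenceTransformer

      model = SentenceTransformer("all-MiniLM-L6-v2")

      def chunk(text, max_words=120):
          words = text.split()
          return [" ".join(words[i:i + max_words])
                  for i in range(0, len(words), max_words)]

      documents = {"doc-1": "first long document text ...",
                   "doc-2": "second long document text ..."}

      # Several embeddings can point back to the same document.
      index = []
      for doc_id, text in documents.items():
          for piece in chunk(text):
              index.append((model.encode(piece), doc_id))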
| gregw134 wrote:
| > [recommend against] feeding very large pieces of text to
| the embedder
|
| Sounds right, I've heard this from multiple sources. That's
| why I'm leaning towards just embedding the keywords.
| therealdrag0 wrote:
| Keywords would leave you with the semantics of word definitions
| but lose sentence-level meaning/context, right?
| gregw134 wrote:
| I'm sure it would lose a ton of meaning, but for me it's
| easier to fit into a traditional search pipeline.
___________________________________________________________________
(page generated 2024-04-07 23:00 UTC)