[HN Gopher] Show HN: Model2vec-Rs - Fast Static Text Embeddings ...
___________________________________________________________________
Show HN: Model2vec-Rs - Fast Static Text Embeddings in Rust
Hey HN! We've just open-sourced model2vec-rs, a Rust crate for
loading and running Model2Vec static embedding models with zero
Python dependencies. It lets you embed text at very high
throughput, for example in a Rust-based microservice or CLI tool,
and can be used for semantic search, retrieval, RAG, or any other
text embedding use case.
Main features:
- Rust-native inference: load any Model2Vec model from Hugging
  Face or a local path with StaticModel::from_pretrained(...).
- Tiny footprint: the crate itself is only ~1.7 MB, with embedding
  models between 7 and 30 MB.
Performance (single-threaded on CPU):
- Python: ~4650 embeddings/sec
- Rust: ~8000 embeddings/sec (~1.7x speedup)
First open-source project in Rust for us, so it would be great to
get some feedback!
Author : Tananon
Score : 46 points
Date : 2025-05-18 15:01 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
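For context, a minimal usage sketch of the crate, based on its
README at the time of the post; the exact from_pretrained argument
order (repo or path, token, normalize, subfolder) and the use of
anyhow for error handling are assumptions, so check the crate docs:

    use anyhow::Result;
    use model2vec_rs::model::StaticModel;

    fn main() -> Result<()> {
        // Load a Model2Vec model from the Hugging Face Hub (or a local path).
        // Assumed arguments: (repo_or_path, hf_token, normalize, subfolder).
        let model = StaticModel::from_pretrained(
            "minishlab/potion-base-8M", None, None, None,
        )?;

        // Embed a batch of sentences; each embedding is a Vec<f32>.
        let sentences = vec![
            "Hello world".to_string(),
            "Static embeddings are fast".to_string(),
        ];
        let embeddings = model.encode(&sentences);
        println!(
            "{} embeddings of dimension {}",
            embeddings.len(),
            embeddings[0].len()
        );
        Ok(())
    }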
| noahbp wrote:
| What is your preferred static text embedding model?
|
| For someone looking to build a large embedding search, fast
| static embeddings seem like a good deal, but almost too good to
| be true. What quality tradeoff are you seeing with these models
| versus embedding models with attention mechanisms?
| Tananon wrote:
| It depends a bit on the task and language, but my go-to is
| usually minishlab/potion-base-8M for every task except
| retrieval (classification, clustering, etc.). For retrieval,
| minishlab/potion-retrieval-32M works best. If performance is
| critical, minishlab/potion-base-32M is best, although it's a bit
| bigger (~100 MB).
|
| There's definitely a quality trade-off. We have extensive
| benchmarks here: https://github.com/MinishLab/model2vec/blob/ma
| in/results/REA.... potion-base-32M reaches ~92% of the
| performance of MiniLM while being much faster (about 70x faster
| on CPU). It depends a bit on your constraints: if you have
| limited hardware and very high throughput requirements, these
| models still let you produce decent-quality embeddings, but of
| course an attention-based model will be better, albeit more
| expensive.
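A rough sketch of how the retrieval model mentioned above could
back a small semantic search, assuming the same from_pretrained and
encode API as in the earlier sketch; the cosine helper and argument
order are illustrative, not the crate's own API:

    use anyhow::Result;
    use model2vec_rs::model::StaticModel;

    // Cosine similarity between two embedding vectors.
    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (na * nb + 1e-12)
    }

    fn main() -> Result<()> {
        // Assumed arguments: (repo_or_path, hf_token, normalize, subfolder).
        let model = StaticModel::from_pretrained(
            "minishlab/potion-retrieval-32M", None, None, None,
        )?;

        let docs = vec![
            "Rust is a systems programming language.".to_string(),
            "The weather in Paris is mild in spring.".to_string(),
        ];
        let doc_embs = model.encode(&docs);

        let query = vec!["Which language is good for systems work?".to_string()];
        let query_embs = model.encode(&query);
        let query_emb = &query_embs[0];

        // Rank documents by cosine similarity to the query.
        let mut ranked: Vec<(usize, f32)> = doc_embs
            .iter()
            .map(|e| cosine(query_emb, e))
            .enumerate()
            .collect();
        ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        println!("best match: {}", docs[ranked[0].0]);
        Ok(())
    }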
| refulgentis wrote:
| Thanks man, this is incredible work; really appreciate the
| details you went into.
|
| I've been chewing on whether there was a miracle that could
| make embeddings 10x faster for my search app that uses
| minilmv3, and it sounds like there is :) I never would have
| dreamed. I'll definitely be trying potion-base in my library
| for Flutter x ONNX.
|
| EDIT: I was thanking you for thorough benchmarking, then it
| dawned on me you were on the team that built the model -
| _fantastic_ work, I can't wait to try this.
|
| I am but a poor caveman mobile developer by origin; I convert
| minilm to ONNX using some old one-off notebook from 3 years
| ago. Dunno if that even matters, but if it does, I'd love
| "official" / "blessed" ONNX versions from you all if it's easy.
| Havoc wrote:
| Surprised it is so much faster. I would have thought the Python
| one was C under the hood.
| Tananon wrote:
| Indeed, I also didn't expect it to be so much faster! I think
| it's because most of the time is actually spent on tokenization
| (which also happens in Rust in the Python package), but there
| is some transfer overhead between Rust and Python. The other
| operations should be about the same speed, I think.
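To illustrate the point about where the time goes: a static
embedding model reduces inference to a table lookup plus mean
pooling over token vectors, so tokenization is the only nontrivial
step. A toy sketch of that pooling step, with a made-up vocabulary
and dimensions (not the crate's actual code):

    // Mean-pool the embedding-table rows selected by the token ids.
    fn embed(token_ids: &[u32], table: &[Vec<f32>], dim: usize) -> Vec<f32> {
        let mut out = vec![0.0f32; dim];
        for &id in token_ids {
            for (o, v) in out.iter_mut().zip(&table[id as usize]) {
                *o += *v;
            }
        }
        if !token_ids.is_empty() {
            let n = token_ids.len() as f32;
            for o in out.iter_mut() {
                *o /= n;
            }
        }
        out
    }

    fn main() {
        // Hypothetical 2-dimensional table for a 3-token vocabulary.
        let table = vec![
            vec![1.0, 0.0],
            vec![0.0, 1.0],
            vec![0.5, 0.5],
        ];
        let ids = vec![0u32, 2];
        println!("{:?}", embed(&ids, &table, 2)); // [0.75, 0.25]
    }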
___________________________________________________________________
(page generated 2025-05-18 23:00 UTC)