[HN Gopher] The best way to use text embeddings portably is with...
___________________________________________________________________
The best way to use text embeddings portably is with Parquet and
Polars
Author : minimaxir
Score : 84 points
Date : 2025-02-24 18:27 UTC (4 hours ago)
(HTM) web link (minimaxir.com)
(TXT) w3m dump (minimaxir.com)
| whinvik wrote:
| Since we are talking about an embedded solution shouldn't the
| benchmark be something like sqlite with a vector extension or
| lancedb?
| minimaxir wrote:
| I mention sqlite + sqlite-vec at the end, noting it requires
| technical overhead and it's not as easy as read_parquet() and
| write_parquet().
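|
| As a minimal sketch of that pattern (file and column names are
| placeholders, and it assumes a recent Polars that maps a 2D
| NumPy array to the fixed-size Array dtype):
|
|     import numpy as np
|     import polars as pl
|
|     embeddings = np.random.rand(33_000, 768).astype(np.float32)
|     df = pl.DataFrame({"id": np.arange(33_000),
|                        "embedding": embeddings})
|     df.write_parquet("embeddings.parquet")
|
|     df = pl.read_parquet("embeddings.parquet")
|     query = embeddings[0]
|     sims = df["embedding"].to_numpy() @ query  # brute-force dot products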
|
| I just became aware of lancedb and am looking into that,
| although from glancing at the README it has similar issues to
| faiss with regard to usability for casual use; it is much better
| than faiss, though, in that it can work with colocated metadata.
| 0cf8612b2e1e wrote:
| My natural point of comparison would actually be DuckDB plus
| their vector search extension.
| thelastbender12 wrote:
| This is pretty neat.
|
| IMO a hindrance to this was the lack of built-in fixed-size list
| array support in the Arrow format until recently. Some
| implementations/clients supported it, while others didn't.
| Otherwise, it could have been used as the default storage format
| for numpy arrays and torch tensors, too.
|
| (You could always store arrays as variable-length list arrays
| with fixed strides and handle the conversion.)
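|
| For illustration, the fixed-size list type in PyArrow (shapes
| and names here are made up):
|
|     import numpy as np
|     import pyarrow as pa
|
|     embeddings = np.random.rand(1_000, 768).astype(np.float32)
|
|     # wrap the flat values buffer as 768-wide fixed-size lists,
|     # without copying the underlying float32 data
|     flat = pa.array(embeddings.ravel())
|     fixed = pa.FixedSizeListArray.from_arrays(flat, 768)
|     table = pa.table({"id": pa.array(range(1_000)),
|                       "embedding": fixed})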
| banku_brougham wrote:
| Is your example of a float32 number correct, taking 24 ASCII
| chars to represent? I had thought single precision would be 7
| digits plus the exponent, sign, and exponent sign: something
| like 7+2+1+1, or about a 10-char ASCII representation, rather
| than the 24 you mentioned?
| minimaxir wrote:
| It depends on the default print format. The example string I
| mentioned is pulled from what np.savetxt() does (fmt='%.18e')
| and there isn't any precision loss in that number. But I admit
| I'm not a sprintf() guru.
|
| In practice, numbers with that much precision are overkill and
| verbose, so tools don't print float32s to that level of
| precision.
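|
| For the curious, a quick check of the width (this mirrors the
| np.savetxt() default, not any code from the post):
|
|     import numpy as np
|
|     x = np.float32(0.12345678)
|     s = "%.18e" % x   # np.savetxt's default float format
|     print(len(s), s)  # 24 chars: 1 digit, '.', 18 digits, 'e-01'
|     print(x.nbytes)   # the same value is 4 bytes in binary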
| PaulHoule wrote:
| One of the things I remember from my PhD work is that you can
| do a stupendous number of FLOPs on floating point numbers in
| the time it takes to serialize/deserialize them to ASCII.
| banku_brougham wrote:
| Really cool article, I've enjoyed your work for a long time. You
| might add a note for those jumping into a sqlite implementation
| that duckdb reads parquet and recently shipped a few vector
| similarity functions which cover this use case perfectly:
|
| https://duckdb.org/2024/05/03/vector-similarity-search-vss.h...
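|
| A sketch against a parquet file of embeddings (file name, column
| names, and the 768 width are placeholders; an HNSW index from
| the vss extension is optional on top of brute force):
|
|     import duckdb
|
|     query_vec = [0.0] * 768  # stand-in for a real query embedding
|     top = duckdb.execute("""
|         SELECT id,
|                array_cosine_similarity(
|                    embedding::FLOAT[768], $q::FLOAT[768]) AS sim
|         FROM read_parquet('embeddings.parquet')
|         ORDER BY sim DESC
|         LIMIT 5
|     """, {"q": query_vec}).fetchall()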
| jt_b wrote:
| I have tinkered with using DuckDB as a poor man's vector
| database for a POC and had great results.
|
| One thing I'd love to see is some sort of row-group-level
| metadata statistics for embeddings within a parquet file -
| something that would allow various readers to push predicates
| down to the HTTP-request level and completely avoid loading
| non-relevant rows into the database from a remote file,
| particularly one stored on S3-compatible storage that supports
| byte-range requests. I'm not sure what the implementation would
| look like: how to define the sorting algorithm that organizes
| the "close" rows together, how the metadata would be calculated,
| or what the reader would look like. But I'd love to be able to
| implement some of the same patterns with vector search as with
| geoparquet.
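|
| Purely hypothetical, but the reader side could look something
| like this if per-row-group centroids were stashed in the
| file-level metadata at write time (every name here is invented):
|
|     import json
|     import numpy as np
|     import pyarrow.parquet as pq
|
|     pf = pq.ParquetFile("embeddings.parquet")
|     # assume the writer stored one centroid per row group under
|     # a "centroids" key in the file's key-value metadata
|     centroids = np.array(
|         json.loads(pf.metadata.metadata[b"centroids"]),
|         dtype=np.float32)
|
|     query = np.random.rand(768).astype(np.float32)
|     keep = [i for i in range(pf.num_row_groups)
|             if np.dot(centroids[i], query) > 0.3]  # arbitrary cutoff
|     table = pf.read_row_groups(keep)  # fetch only promising groups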
| kernelsanderz wrote:
| For another library with great performance and features like
| full-text indexing and the ability to version changes, I'd
| recommend lancedb: https://lancedb.github.io/lancedb/
|
| Yes, it's a vector database and has more complexity. But you can
| use it without creating indexes, and it also has excellent
| zero-copy Arrow support for polars and pandas.
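|
| A minimal taste of the API (table and file names are invented;
| to_polars() needs a reasonably recent lancedb):
|
|     import lancedb
|     import polars as pl
|
|     db = lancedb.connect("./lancedb_data")
|     df = pl.read_parquet("embeddings.parquet")  # id + embedding
|     tbl = db.create_table("cards", data=df.to_arrow())
|
|     query_vec = [0.0] * 768  # stand-in query embedding
|     hits = tbl.search(query_vec).limit(5).to_polars()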
| esafak wrote:
| Lance is made for this stuff; parquet is not.
| daveguy wrote:
| Since a lot of ML data is stored as parquet, I found this to be
| a useful tidbit from lancedb's documentation:
|
| > Data storage is columnar and is interoperable with other
| columnar formats (such as Parquet) via Arrow
|
| https://lancedb.github.io/lancedb/concepts/data_management/
|
| Edit: That said, I am personally a fan of parquet, arrow, and
| ibis. So many data wrangling options out there it's easy to get
| analysis paralysis.
| stephantul wrote:
| Check out Unum's usearch. It beats anything, and is super easy to
| use. It just does exactly what you need.
|
| https://github.com/unum-cloud/usearch
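|
| The core loop is only a few lines (sizes here are arbitrary):
|
|     import numpy as np
|     from usearch.index import Index
|
|     vectors = np.random.rand(33_000, 768).astype(np.float32)
|
|     index = Index(ndim=768, metric="cos")        # HNSW under the hood
|     index.add(np.arange(len(vectors)), vectors)  # keys + vectors
|     matches = index.search(vectors[0], 10)       # top-10 neighbors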
| esafak wrote:
| Have you tested it against Lance? Does it do predicate pushdown
| for filtering?
| stephantul wrote:
| Usearch is a vector store afaik, not a vector db. At least
| that's how I use it.
|
| I haven't compared it to lancedb, I reached for it here
| because the author mentioned Faiss being difficult to use and
| install. usearch is a great alternative to Faiss.
|
| But thanks for the suggestion, I'll check it out
| ashvardanian wrote:
| USearch author here :)
|
| The engine supports arbitrary predicates for C, C++, and Rust
| users. In higher level languages it's hard to combine
| callbacks and concurrent state management.
|
| In terms of scalability and efficiency, the only tool I've
| seen coming close is Nvidia's cuVS if you have GPUs
| available. FAISS HNSW implementation can easily be 10x slower
| and most commercial & venture-backed alternatives are even
| slower: https://www.unum.cloud/blog/2023-11-07-scaling-
| vector-search...
|
| In this use-case, I believe SimSIMD raw kernels may be a
| better choice. Just replace NumPy and enjoy speedups. It
| provides hundreds of hand-written SIMD kernels for all kinds
| of vector-vector operations for AVX, AVX-512, NEON, and SVE
| across F64, F32, BF16, F16, I8, and binary vectors, mostly
| operating in mixed precision to avoid overflow and
| instability: https://github.com/ashvardanian/SimSIMD
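|
| e.g. as a drop-in for the NumPy dot-product step (dimensions are
| arbitrary; note cosine returns a distance, not a similarity):
|
|     import numpy as np
|     import simsimd
|
|     a = np.random.rand(768).astype(np.float32)
|     b = np.random.rand(768).astype(np.float32)
|     dist = simsimd.cosine(a, b)  # 1 - cosine similarity
|
|     # all-pairs distances between two batches, like scipy's cdist
|     queries = np.random.rand(10, 768).astype(np.float32)
|     corpus = np.random.rand(33_000, 768).astype(np.float32)
|     dists = simsimd.cdist(queries, corpus, metric="cosine")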
| kipukun wrote:
| To the second footnote: you could utilize Polars' LazyFrame API
| to do that cosine similarity in a streaming fashion for large
| files.
| minimaxir wrote:
| That would get around memory limitations but I still think that
| would be slow.
| kipukun wrote:
| You'd be surprised. As long as your query is using Polars
| natives and not a UDF (which drops it down to Python), you
| may get good results.
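|
| Something like this stays in native expressions end to end (the
| column name and 768 width are assumptions, and the horizontal
| sum over per-dimension products is just one way to express the
| dot product):
|
|     import numpy as np
|     import polars as pl
|
|     query = np.random.rand(768).astype(np.float32)
|     query /= np.linalg.norm(query)
|
|     dot = pl.sum_horizontal(
|         pl.col("embedding").arr.get(i) * float(query[i])
|         for i in range(768))
|
|     top = (pl.scan_parquet("embeddings.parquet")
|              .with_columns(sim=dot)
|              .sort("sim", descending=True)
|              .head(5)
|              .collect(engine="streaming"))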
| jtrueb wrote:
| Polars + Parquet is awesome for portability and performance. This
| post focused on python portability, but Polars has an easy-to-use
| Rust API for embedding the engine all over the place.
| robschmidt90 wrote:
| Nice read. I agree that for a lot of hobby use cases you can just
| load the embeddings from parquet and compute the similarities in-
| memory.
|
| To find similarity between my blog posts [1] I wanted to
| experiment with a local vector database and found ChromaDB fairly
| easy to use (similar to SQLite: just a file on your machine).
|
| [1] https://staticnotes.org/posts/how-recommendations-work/
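|
| For anyone curious, the whole ChromaDB setup is a few lines (ids,
| vectors, and paths here are placeholders):
|
|     import chromadb
|
|     client = chromadb.PersistentClient(path="./chroma")
|     posts = client.get_or_create_collection("blogposts")
|     posts.add(ids=["post-1"],
|               embeddings=[[0.1] * 768],
|               documents=["How recommendations work"])
|     hits = posts.query(query_embeddings=[[0.1] * 768], n_results=1)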
| jononor wrote:
| At 33k items, in-memory is quite fast; 10 ms is very responsive.
| Since brute force scales linearly, at 10x (330k items) on the
| same hardware the expected time is 100 ms, and at 100x (3.3M)
| around 1 second. That might be too slow for some applications
| (but not all). Especially if one just retrieves a rather small
| number of matches, an index will help a lot for 100k++ datasets.
| noahbp wrote:
| Wow! How much did this cost you in GPU credits? And did you
| consider using your MacBook?
| minimaxir wrote:
| It took 1:17 to encode all ~32k cards using a preemptible L4
| GPU on Google Cloud Platform (g2-standard-4) at ~$0.28/hour,
| costing < $0.01 overall: https://github.com/minimaxir/mtg-
| embeddings/blob/main/mtg_em...
|
| The base ModernBERT uses CUDA tricks not available in MPS, so I
| suspect it would take much longer.
|
| For the 2D UMAP, it took 3:33 because I wanted to do 1 million
| epochs to be _thorough_: https://github.com/minimaxir/mtg-
| embeddings/blob/main/mtg_em...
| rcarmo wrote:
| I'm a huge fan of polars, but I hadn't considered using it to
| store embeddings in this way (I've been fiddling with sqlite-
| vec). Seems like an interesting idea indeed.
| thomasfromcdnjs wrote:
| Lots of great findings
|
| ---
|
| I'm curious if anyone knows whether it is better to pass
| structured or unstructured data to embedding APIs? If I ask
| ChatGPT, it says it is better to send unstructured data.
| (Looking at the author's GitHub, it looks like he generated
| embeddings from JSON strings.)
|
| My use case is jsonresume: I am creating embeddings by sending
| full JSON versions as strings, but I've been experimenting with
| using models to translate resume.json files into full-text
| versions first before creating embeddings. The results seem
| better, but I haven't seen any concrete opinions on this.
|
| My understanding is that unstructured data is better because it
| carries textual/semantic meaning in natural language, i.e.
|
|     skills: ['Javascript', 'Python']
|
| is worse than
|
|     Thomas excels at Javascript and Python.
|
| Another question: what if the search query were also a JSON
| embedding? JSON <> JSON embeddings could also be great?
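|
| One cheap way to test this on your own data (the model here is
| just a stand-in, not what the author used):
|
|     from sentence_transformers import SentenceTransformer, util
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|
|     structured = '{"skills": ["Javascript", "Python"]}'
|     unstructured = "Thomas excels at Javascript and Python."
|     query = "frontend developer with strong Javascript skills"
|
|     s, u, q = model.encode([structured, unstructured, query])
|     print(util.cos_sim(q, s), util.cos_sim(q, u))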
| minimaxir wrote:
| In general I like to send structured data (see the input format
| here: https://github.com/minimaxir/mtg-embeddings), but the
| ModernBERT base for the embedding model used here _specifically_
| handles structured data better implicitly than previous models
| did. That's worth another blog post explaining why.
___________________________________________________________________
(page generated 2025-02-24 23:00 UTC)