[HN Gopher] The best way to use text embeddings portably is with...
___________________________________________________________________
The best way to use text embeddings portably is with Parquet and
Polars
Author : minimaxir
Score : 237 points
Date : 2025-02-24 18:27 UTC (1 day ago)
(HTM) web link (minimaxir.com)
(TXT) w3m dump (minimaxir.com)
| whinvik wrote:
| Since we are talking about an embedded solution, shouldn't the
| benchmark be something like sqlite with a vector extension or
| lancedb?
| minimaxir wrote:
| I mention sqlite + sqlite-vec at the end, noting it requires
| technical overhead and it's not as easy as read_parquet() and
| write_parquet().
|
| I just became aware of lancedb and am looking into it, although
| from glancing at the README it has similar usability issues to
| faiss for casual use. It is much better than faiss, though, in
| that it can work with colocated metadata.
| 0cf8612b2e1e wrote:
| My natural point of comparison would actually be DuckDB plus
| their vector search extension.
| thelastbender12 wrote:
| This is pretty neat.
|
| IMO a hindrance to this was the lack of built-in fixed-size list
| array support in the Arrow format, until recently. Some
| implementations/clients supported it, while others didn't.
| Otherwise, it could have been used as the default storage format
| for numpy arrays and torch tensors, too.
|
| (You could always store arrays as variable-length list arrays
| with fixed strides and handle the conversion.)
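|
| For illustration, a minimal sketch of the now-supported path in
| pyarrow (column name and dimensionality are made up):
|
|     import numpy as np
|     import pyarrow as pa
|     import pyarrow.parquet as pq
|
|     # a (num_rows, dim) float32 embedding matrix
|     emb = np.random.rand(1000, 768).astype(np.float32)
|
|     # pack the flattened values into a fixed-size list column
|     arr = pa.FixedSizeListArray.from_arrays(pa.array(emb.ravel()), 768)
|     pq.write_table(pa.table({"embedding": arr}), "embeddings.parquet")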
| banku_brougham wrote:
| Is your example of a float32 number correct, holding a 24-char
| ASCII representation? I had thought single precision would be 7
| digits plus the exponent, sign, and exponent sign. Something
| like 7+2+1+1, or a 10-char ASCII representation, rather than
| the 24 you mentioned?
| minimaxir wrote:
| It depends on the default print format. The example string I
| mentioned is pulled from what np.savetxt() does (fmt='%.18e')
| and there isn't any precision loss in that number. But I admit
| I'm not a sprintf() guru.
|
| In practice, numbers with that much precision are overkill and
| verbose, so tools don't print float32s to that level of
| precision.
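|
| A quick sketch of where the 24 ASCII chars come from (the file
| name is arbitrary):
|
|     import numpy as np
|
|     vec = np.random.rand(4).astype(np.float32)
|     np.savetxt("vec.txt", vec)  # default fmt="%.18e"
|
|     # each value prints like 6.916606426239013672e-01:
|     # 1 digit + '.' + 18 digits + 'e' + sign + 2-digit exponent
|     # = 24 ASCII chars per number, before the newline
|     print(open("vec.txt").read())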
| PaulHoule wrote:
| One of the things I remember from my PhD work is that you can
| do a stupendous number of FLOPs on floating point numbers in
| the time it takes to serialize/deserialize them to ASCII.
| banku_brougham wrote:
| Really cool article, I've enjoyed your work for a long time. You
| might add a note for those jumping into a sqlite implementation:
| duckdb reads parquet and recently launched a few vector
| similarity functions which cover this use case perfectly:
|
| https://duckdb.org/2024/05/03/vector-similarity-search-vss.h...
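|
| A minimal sketch of that combination from Python (the file,
| column names, and dimensionality are assumptions; the embedding
| column must be a fixed-size FLOAT[n] array for the VSS
| functions):
|
|     import duckdb
|
|     query_vector = [0.1] * 768  # stand-in for a real embedding
|
|     top5 = duckdb.execute("""
|         SELECT name,
|                array_cosine_similarity(
|                    embedding, $q::FLOAT[768]
|                ) AS similarity
|         FROM 'mtg_embeddings.parquet'
|         ORDER BY similarity DESC
|         LIMIT 5
|     """, {"q": query_vector}).fetchall()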
| jt_b wrote:
| I have tinkered with using DuckDB as a poor man's vector
| database for a POC and had great results.
|
| One thing I'd love to see is being able to do some sort of row
| group level metadata statistics for embeddings within a parquet
| file - something that would allow various readers to push
| predicates down to an HTTP request metadata level and
| completely avoid loading in non-relevant rows to the database
| from a remote file - particularly one stored on S3 compatible
| storage that supports byte-range requests. I'm not sure what
| the implementation would look like: how to define the sorting
| algorithm that organizes the "close" rows together, how the
| metadata would be calculated, or what the reader implementation
| would look like. But I'd love to be able to implement some of
| the same patterns with vector search as with geoparquet.
| jt_b wrote:
| I thought about this some more and did some research - and
| found an indexing approach using HNSW, serialized to parquet,
| and queried from the browser here:
|
| https://github.com/jasonjmcghee/portable-hnsw
|
| It opens up efficient query patterns for larger datasets in RAG
| projects where you may not have the resources to run an
| expensive vector database.
| jasonjmcghee wrote:
| Hey that's my little research project- lmk if you're
| interested in chatting about this stuff.
|
| As others have mentioned in other threads, parquet isn't a
| great tool for the job here, but you could theoretically
| build a different file format that lends itself better to
| the problem of static file(s) representing a vector
| database.
| kernelsanderz wrote:
| For another library that has great performance and features like
| full-text indexing and the ability to version changes, I'd
| recommend lancedb: https://lancedb.github.io/lancedb/
|
| Yes, it's a vector database and has more complexity. But you can
| use it without creating indexes, and it also has excellent
| zero-copy Arrow support for polars and pandas.
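|
| A minimal sketch of that index-free usage (table name, column
| name, and file are assumptions):
|
|     import lancedb
|     import polars as pl
|
|     df = pl.read_parquet("mtg_embeddings.parquet")
|
|     db = lancedb.connect("./lance_data")
|     tbl = db.create_table("cards", df)  # ingested via Arrow
|
|     # brute-force search works without building an index
|     query_vector = [0.1] * 768  # stand-in for a real embedding
|     top5 = (
|         tbl.search(query_vector, vector_column_name="embedding")
|         .limit(5)
|         .to_pandas()
|     )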
| esafak wrote:
| Lance is made for this stuff; parquet is not.
| daveguy wrote:
| Since a lot of ML data is stored as parquet, I found this to be
| a useful tidbit from lancedb's documentation:
|
| > Data storage is columnar and is interoperable with other
| columnar formats (such as Parquet) via Arrow
|
| https://lancedb.github.io/lancedb/concepts/data_management/
|
| Edit: That said, I am personally a fan of parquet, arrow, and
| ibis. So many data wrangling options out there it's easy to get
| analysis paralysis.
| 3abiton wrote:
| How well does it scale?
| stephantul wrote:
| Check out Unum's usearch. It beats anything, and is super easy to
| use. It just does exactly what you need.
|
| https://github.com/unum-cloud/usearch
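|
| For reference, a minimal sketch of usearch with numpy (the
| dimensionality and key scheme are arbitrary):
|
|     import numpy as np
|     from usearch.index import Index
|
|     vectors = np.random.rand(33000, 768).astype(np.float32)
|
|     index = Index(ndim=768, metric="cos")
|     index.add(np.arange(len(vectors)), vectors)
|
|     matches = index.search(vectors[0], 10)  # top-10 neighbors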
| esafak wrote:
| Have you tested it against Lance? Does it do predicate pushdown
| for filtering?
| stephantul wrote:
| Usearch is a vector store afaik, not a vector db. At least
| that's how I use it.
|
| I haven't compared it to lancedb, I reached for it here
| because the author mentioned Faiss being difficult to use and
| install. usearch is a great alternative to Faiss.
|
| But thanks for the suggestion, I'll check it out
| ashvardanian wrote:
| USearch author here :)
|
| The engine supports arbitrary predicates for C, C++, and Rust
| users. In higher level languages it's hard to combine
| callbacks and concurrent state management.
|
| In terms of scalability and efficiency, the only tool I've
| seen coming close is Nvidia's cuVS if you have GPUs
| available. FAISS HNSW implementation can easily be 10x slower
| and most commercial & venture-backed alternatives are even
| slower: https://www.unum.cloud/blog/2023-11-07-scaling-
| vector-search...
|
| In this use-case, I believe SimSIMD raw kernels may be a
| better choice. Just replace NumPy and enjoy speedups. It
| provides hundreds of hand-written SIMD kernels for all kinds
| of vector-vector operations for AVX, AVX-512, NEON, and SVE
| across F64, F32, BF16, F16, I8, and binary vectors, mostly
| operating in mixed precision to avoid overflow and
| instability: https://github.com/ashvardanian/SimSIMD
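|
| A quick sketch of the drop-in usage (shapes and metric choice
| are illustrative):
|
|     import numpy as np
|     import simsimd
|
|     a = np.random.rand(768).astype(np.float32)
|     b = np.random.rand(768).astype(np.float32)
|
|     # SIMD-accelerated cosine distance for a single pair
|     dist = simsimd.cosine(a, b)
|
|     # or one query against a whole matrix of candidates
|     candidates = np.random.rand(33000, 768).astype(np.float32)
|     dists = simsimd.cdist(a.reshape(1, -1), candidates,
|                           metric="cosine")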
| kipukun wrote:
| To the second footnote: you could utilize Polars' LazyFrame API
| to do that cosine similarity in a streaming fashion for large
| files.
| minimaxir wrote:
| That would get around memory limitations but I still think that
| would be slow.
| kipukun wrote:
| You'd be surprised. As long as your query is using Polars
| natives and not a UDF (which drops it down to Python), you
| may get good results.
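|
| A sketch of the native-expression version (assuming the
| embeddings are unit-normalized, so a dot product equals cosine
| similarity, and are stored as a fixed-size pl.Array column named
| "embedding"; with a List column you'd use .list.get instead):
|
|     import numpy as np
|     import polars as pl
|
|     query = np.random.rand(768).astype(np.float32)
|     query /= np.linalg.norm(query)
|
|     # build the dot product from native expressions only
|     dot = pl.sum_horizontal(
|         pl.col("embedding").arr.get(i) * float(q)
|         for i, q in enumerate(query)
|     )
|
|     top = (
|         pl.scan_parquet("embeddings.parquet")
|         .with_columns(dot.alias("similarity"))
|         .sort("similarity", descending=True)
|         .head(10)
|         .collect(streaming=True)
|     )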
| jononor wrote:
| A (simple) benchmark would be great to figure out where the
| practical limits of such an approach are. Runtime is
| expected to grow with O(n^2), which will get painful at some
| point.
| jtrueb wrote:
| Polars + Parquet is awesome for portability and performance. This
| post focused on python portability, but Polars has an easy-to-use
| Rust API for embedding the engine all over the place.
| blooalien wrote:
| Gotta love stuff that has multiple language bindings. Always
| really enjoyed finding powerful libraries in Python and then
| seeing they also have matching bindings for Go and Rust. Nice
| to have easy portability and cross-language compatibility.
| robschmidt90 wrote:
| Nice read. I agree that for a lot of hobby use cases you can just
| load the embeddings from parquet and compute the similarities in-
| memory.
|
| To find similarity between my blogposts [1] I wanted to
| experiment with a local vector database and found ChromaDB fairly
| easy to use (similar to SQLite, just a file on your machine).
|
| [1] https://staticnotes.org/posts/how-recommendations-work/
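|
| A minimal sketch of that local setup (collection name, IDs, and
| dimensionality are made up):
|
|     import chromadb
|
|     client = chromadb.PersistentClient(path="./chroma_data")
|     posts = client.get_or_create_collection("blogposts")
|
|     posts.add(
|         ids=["post-1", "post-2"],
|         embeddings=[[0.1] * 384, [0.2] * 384],
|         documents=["How recommendations work", "Another post"],
|     )
|
|     similar = posts.query(query_embeddings=[[0.1] * 384],
|                           n_results=2)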
| jononor wrote:
| At 33k items, in-memory is quite fast; 10 ms is very responsive.
| With 10x / 330k items on the same hardware, the expected time is
| 1 second. That might be too slow for some applications (but not
| all). Especially if one just retrieves a rather small number of
| matches, an index will help a lot for 100k++ datasets.
| noahbp wrote:
| Wow! How much did this cost you in GPU credits? And did you
| consider using your MacBook?
| minimaxir wrote:
| It took 1:17 to encode all ~32k cards using a preemptible L4
| GPU on Google Cloud Platform (g2-standard-4) at ~$0.28/hour,
| costing < $0.01 overall: https://github.com/minimaxir/mtg-
| embeddings/blob/main/mtg_em...
|
| The base ModernBERT uses CUDA tricks not available in MPS, so I
| suspect it would take much longer.
|
| For the 2D UMAP, it took 3:33 because I wanted to do 1 million
| epochs to be _thorough_: https://github.com/minimaxir/mtg-
| embeddings/blob/main/mtg_em...
| rcarmo wrote:
| I'm a huge fan of polars, but I hadn't considered using it to
| store embeddings in this way (I've been fiddling with sqlite-
| vec). Seems like an interesting idea indeed.
| thomasfromcdnjs wrote:
| Lots of great findings
|
| ---
|
| I'm curious if anyone knows whether it is better to pass
| structured data or unstructured data to embedding APIs. If I ask
| ChatGPT, it says it is better to send unstructured data.
| (Looking at the author's GitHub, it looks like he generated
| embeddings from JSON strings.)
|
| My use case is for jsonresume, I am creating embeddings by
| sending full json versions as strings, but I've been
| experimenting with using models to translate resume.json's into
| full text versions first before creating embeddings. The results
| seem to be better but I haven't seen any concrete opinions on
| this.
|
| My understanding is that unstructured data is better because it
| contains textual/semantic meaning from natural language, i.e.
|
|     skills: ['Javascript', 'Python']
|
| is worse than:
|
|     Thomas excels at Javascript and Python
|
| Another question: What if the search was also a json embedding?
| JSON <> JSON embeddings could also be great?
| minimaxir wrote:
| In general I like to send structured data (see the input format
| here: https://github.com/minimaxir/mtg-embeddings), but the
| ModernBERT base for the embedding model used here _specifically_
| handles structured data better than previous models did. That's
| worth another blog post explaining why.
| notpublic wrote:
| please do explain why
| minimaxir wrote:
| tl;dr the base ModernBERT was trained with code in mind,
| unlike most encoder-only models (so it was presumably also
| trained on JSON/YAML objects), and it also includes a
| custom tokenizer to support that, which is why I mention
| that indentation is important: different levels of
| indentation map to different single tokens.
|
| This is mostly theoretical and does require a deeper dive to
| confirm.
| vunderba wrote:
| I'd say the more important consideration is "consistency"
| between incoming query input and stored vectors.
|
| I have a huge vector database that gets updated/regenerated
| from a personal knowledge store (markdown library). Since the
| user is most likely to input a comparison query in the form of
| a question "Where does X factor into the Y system?" - I use a
| small 7b parameter LLM to pregenerate a list of a dozen
| possible theoretical questions a user might pose to a given
| embedding chunk. These are saved as 1536 dimension sized
| embeddings into the vector database (Qdrant) and linked to the
| chunks.
|
| The real question you need to ask is - what's the input query
| that you'll be comparing to the embeddings? If it's incoming as
| structured, then store structured, etc.
|
| I've also seen (anecdotally) similarity degradation for smaller
| chunks, so keep that in mind as well.
| octernion wrote:
| or you could just use postgres + pgvector? which many apps
| already have installed by default.
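|
| For reference, a minimal sketch with psycopg and the pgvector
| Python adapter (table and column names are made up):
|
|     import numpy as np
|     import psycopg
|     from pgvector.psycopg import register_vector
|
|     conn = psycopg.connect("dbname=app")
|     register_vector(conn)
|
|     query_vector = np.random.rand(768).astype(np.float32)
|
|     # <=> is pgvector's cosine distance operator
|     rows = conn.execute(
|         "SELECT id FROM items ORDER BY embedding <=> %s LIMIT 5",
|         (query_vector,),
|     ).fetchall()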
| jononor wrote:
| Many ways to skin a cat, at least at this size (33k items). At
| this scale, standing up a database would have no advantages,
| which I believe is the main point of the post! If you have a
| simple problem, use a simple solution.
|
| If one had instead 1M items, the situation would be completely
| different.
| dwagnerkc wrote:
| If you want to try it out, you can lazily load from HF and apply
| filtering this way:
|
|     df = (
|         pl.scan_parquet(
|             "hf://datasets/minimaxir/mtg-embeddings/mtg_embeddings.parquet"
|         )
|         .filter(
|             pl.col("type").str.contains("Sorcery"),
|             pl.col("manaCost").str.contains("B"),
|         )
|         .collect()
|     )
|
| Polars is awesome to use, would highly recommend. Single node it
| is excellent at saturating CPUs, if you need to distribute the
| work put it in a Ray Actor with some POLARS_MAX_THREADS applied
| depending on how much it saturates a single node.
| WatchDog wrote:
| Parquet is fine and all, but I love the simplicity and broad
| interoperability of CSV.
|
| You can save a huge amount of overhead just by base64-encoding
| the vectors; they aren't exactly human readable anyway.
|
| I imagine the resulting file would only be approximately 33%
| larger than the pickle version.
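|
| A sketch of the round trip (the overhead figure follows from
| base64's 4-bytes-per-3 expansion):
|
|     import base64
|     import numpy as np
|
|     vec = np.random.rand(768).astype(np.float32)  # 3072 raw bytes
|
|     encoded = base64.b64encode(vec.tobytes()).decode("ascii")
|     # 4096 chars: one CSV cell instead of 768 float columns
|
|     decoded = np.frombuffer(base64.b64decode(encoded),
|                             dtype=np.float32)
|     assert np.array_equal(vec, decoded)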
| llm_trw wrote:
| >The second incorrect method to save a matrix of embeddings to
| disk is to save it as a Python pickle object [...] But it comes
| with two major caveats: pickled files are a massive security risk
| as they can execute arbitrary code, and the pickled file may not
| be guaranteed to be able to be opened on other machines or Python
| versions. It's 2025, just stop pickling if you can.
|
| Security: absolutely.
|
| Portability: who cares? Frameworks move so quickly that unless
| you carry your whole dependency graph between machines you will
| not get bit compatible results with even minor version changes.
| It's a dirty secret that no one seems to want to fix or care
| about.
|
| In short: everything is so fucked that pickle + conda is more
| than good enough for whatever project you want to serve to
| >10,000 users.
| PaulHoule wrote:
| In 2017 I was working on a model trainer for text classification
| and sequence labeling [1] that had limited success because the
| models weren't good enough.
|
| I have a minilm + pooling + svm classifier which works pretty
| well for some things (topics, "will I like this article?") but
| doesn't work so well for sentiment, emotional tone and other
| things where the order of the words matter. I'm planning to
| upgrade my current classifier's front end to use ModernBert and
| add an LSTM-based back end that I think will equal or beat fine-
| tuned BERT and, more importantly, can be trained reliably with
| early stopping. I'd like to open source the thing, focused on
| reliability, because I'm an application programmer at heart.
|
| I want it to provide an interface which is text-in and labels-out
| and hide the embeddings from most users but I'm definitely
| thinking about how to handle them. The worse problem here is
| that the LSTM needs a vector for each token, not each document,
| so text gets puffed up by a factor of 1000 or so, which is not
| insurmountable (1 MB of training text puffs up to 1 GB of
| vectors).
|
| Since it's expensive to compute the embeddings _and_ expensive
| to store them, I'm thinking about whether and how to cache
| them, considering that I expect to present the same samples to
| the trainer multiple times and to do a lot of model selection
| in the process of model development (e.g. what exact shape of
| LSTM to use) and in the case of end-user training (it will
| probably try a few models, not least do a shootout between the
| expensive model and a cheap model).
|
| [1] think of a "magic magic marker" which learns to mark up text
| the same way you do; this could mark "needless words" you could
| delete from a title, parts of speech, named entities, etc.
| intalentive wrote:
| The problem with Parquet is it's static. Not good for use cases
| that involve continuous writes and updates. Although I have had
| good results with DuckDB and Parquet files in object storage.
| Fast load times.
|
| If you host your own embedding model, then you can transmit numpy
| float32 compressed arrays as bytes, then decode back into numpy
| arrays.
|
| Personally I prefer using SQLite with the usearch extension:
| binary vectors, then rerank the top 100 with float32. It's about
| 2 ms for ~20k items, which beats LanceDB in my tests. Maybe
| Lance wins on bigger collections. But for my use case it works
| great, as each user has their own dedicated SQLite file.
|
| For portability there's Litestream.
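|
| The quantize-then-rerank idea is easy to sketch in plain numpy
| (thresholds and counts are arbitrary; usearch does the distance
| work inside the SQLite extension instead):
|
|     import numpy as np
|
|     emb = np.random.randn(20000, 768).astype(np.float32)
|     bits = np.packbits(emb > 0, axis=1)  # 96 bytes per vector
|
|     q = np.random.randn(768).astype(np.float32)
|     q_bits = np.packbits(q > 0)
|
|     # coarse pass: Hamming distance on the binary codes
|     hamming = np.unpackbits(bits ^ q_bits, axis=1).sum(axis=1)
|     top100 = np.argsort(hamming)[:100]
|
|     # fine pass: rerank the top 100 with float32 dot products
|     reranked = top100[np.argsort(-(emb[top100] @ q))]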
| jt_b wrote:
| > The problem with Parquet is it's static. Not good for use
| cases that involve continuous writes and updates. Although I
| have had good results with DuckDB and Parquet files in object
| storage. Fast load times.
|
| You can use glob patterns in DuckDB to query remote parquets
| though to get around this? Maybe break things up using a hive
| partitioning scheme or similar.
| memhole wrote:
| I like the pattern described too. The only snag is deletes and
| updates. IME, you have to delete the underlying file or
| create and maintain a view that handles the data you want
| visible.
| dijksterhuis wrote:
| > The problem with Parquet is it's static. Not good for use
| cases that involve continuous writes and updates.
|
| parquet is columnar storage, so its use case is lots of heavy
| filtering/aggregation within analytical workloads (OLAP).
|
| consistent writes/updates, i.e. basically transactional
| (OLTP) use cases, are never going to have great performance in
| columnar storage. it's the wrong format to use for that.
|
| for faster writes/updates you'd want row-based storage, i.e.
| CSV or an actual database. which i'm glad to see is where you
| kind of ended up anyway.
| yorwba wrote:
| There's no reason why an update query that doesn't change the
| file layout and only twiddles some values in place couldn't
| be made fast with columnar storage.
|
| When you run a read query, there's one phase that determines
| the offsets where values are stored and another that reads
| the value at a given offset. For an update query that doesn't
| change the offsets, you can change the direction from reading
| the value at an offset to writing a new value to that
| location instead, and it should be plenty fast.
|
| Parquet libraries just don't seem to consider that use case
| worth supporting for some reason and expect people to
| generate an entire new file with mostly the same content
| instead. Which definitely doesn't have great performance!
| rbetts wrote:
| Columnar storage systems rarely store the raw value at
| fixed position. They store values as run length encoded,
| dictionary encoded, delta encoded, etc... and then store
| metadata about chunk of values for pruning at query time.
| So rarely can you seek to an offset and update a value. The
| compression achieved means less data to read from disk when
| doing large scans, and lower storage costs for very large
| datasets that are largely immutable: some of the important
| benefits of columnar storage.
|
| Also, many applications that require updates also update
| conditionally (update a where b = c). This requires re-
| synthesizing (at least some of) the row to make a
| comparison, another relatively expensive operation for a
| column store.
| lmeyerov wrote:
| Also, values are typically stored with binary compression
| (snappy, zlib) on top of those encodings. In-memory formats
| might only use the semantic encodings, e.g., arrow.
|
| But it's... fine? Batch writes and rewrite dirty parts.
| Most of our cases are either appending events, or
| enriching with new columns, which can be modeled
| columnarly. It is a bit more painful in GPU land because we
| like big chunks (250MB-1GB) for saturating reads, but CPU
| land is generally fine for us.
|
| We have been eyeing iceberg and friends as a way to
| automate that, so I've been curious how much of the
| optimization, if any, they take for us
| csunbird wrote:
| Parquet files being immutable is not a bug, it is a feature.
| That is how you accomplish good compression and keep the
| columnar data organized.
|
| Yes, it is not useful for continuous writes and updates, but
| that is not what it was designed for. Use a database (e.g.
| SQLite, just like you suggested) if you want to ingest real-
| time/streaming data.
| pantsforbirds wrote:
| I've had great luck using either Athena or DuckDB with parquet
| files in s3 using a few partitions. You can query across the
| partitions pretty efficiently and if date/time is one of your
| partitions, then it's very efficient to add new data.
| th24o3j4324234 wrote:
| The trouble with Parquet (and columnar storage) in ML is:
|
| 1. You don't really care too much about accessing subsets of
| columns.
|
| 2. You can't easily append stuff to closed Parquet files.
|
| 3. Batched-row access is presumably slower due to lower cache
| hit rates.
|
| It's okay for map-reduce style stuff where this doesn't matter,
| but in ML these limitations are an annoyance.
|
| HDF5 (or Zarr, less portably) solves some/many of these issues
| but it's not quite a settled affair.
| jononor wrote:
| Re 2: Parquet can easily be used with chunked/partitioned
| files. Then appending is just adding another file/chunk, as
| sketched below.
|
| The case of 1 really depends on the workload. For embeddings
| etc., selecting column subsets is rare. In other cases, where
| one has a bunch of separate features, doing column subsetting
| might be rather common. But yes, it is far from every case.
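|
| A sketch of the append-by-adding-files pattern with pyarrow
| (paths and the partition column are made up); DuckDB or Polars
| can then glob-read the whole directory:
|
|     import pyarrow as pa
|     import pyarrow.parquet as pq
|
|     batch = pa.table({"id": [1, 2], "set_code": ["LEA", "LEA"]})
|
|     # each call adds new files under embeddings/set_code=.../
|     pq.write_to_dataset(batch, "embeddings",
|                         partition_cols=["set_code"])
|
|     # readers see the union of all chunks, e.g. in polars:
|     #   pl.scan_parquet("embeddings/**/*.parquet")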
| ismailmaj wrote:
| Parquet is only a mess if you try to mutate it; usually you
| consider the files immutable and store the data across many
| files.
|
| Also, the cost of batched-row access is negligible given the
| compression benefits you get with the columnar format, which is
| probably why it's still king in ML, I think, given what I'm
| seeing in the industry and recent trends (e.g. Velox).
| mhh__ wrote:
| I still don't like dataframes but oh my God Polars is so much
| better than pandas.
|
| I was doing some time series calculations, simple equity price
| adjustments basically, in Polars and my two thoughts were:
|
| - WTF, I can actually read the code and test it.
|
| - it's running so fast it seems like it's broken.
| eskaytwo wrote:
| There's some nice plugins too, some are finance related:
| https://github.com/ddotta/awesome-polars
| mhh__ wrote:
| The one thing I really want is for someone to make it so I
| can use it in F#. Presumably it's possible given how the
| python bit is implemented under the hood?
| whyever wrote:
| It uses pyo3 to generate the bindings, so you would have to
| find a similar crate for F#/.NET and port the polars Python
| FFI to it. If such a crate does not exist, it will be even
| more work.
| LaurensBER wrote:
| Yeah, the readability difference is immense. I worked for years
| with Pandas and I still cannot "scan" it as quickly as with a
| "normal" programming language or SQL. Then there's the whole
| issue with (multi)-indexes, serialisation, etc.
|
| Polars makes programming fun again instead of a chore.
| k2so wrote:
| A neat trick in the Vespa (vector DB, among other things)
| documentation is to use a hex representation of vectors after
| converting them to binary.
|
| This trick can be used to reduce your payload sizes. In Vespa,
| they support this format which is particularly useful when the
| same vectors are referenced multiple times in a document. For
| ColBERT or ColPaLi like cases (where you have many embedding
| vectors), this can reduce the size of the vectors stored on disk
| massively.
|
| https://docs.vespa.ai/en/reference/document-json-format.html...
|
| Not sure why this is not more commonly adopted though
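|
| The encoding itself is a near one-liner in numpy (the dimension
| and sign threshold are illustrative):
|
|     import numpy as np
|
|     vec = np.random.randn(768).astype(np.float32)
|
|     bits = np.packbits(vec > 0)     # 768 dims -> 96 bytes
|     hex_str = bits.tobytes().hex()  # 192 hex chars, vs ~4 KB of
|                                     # float32 JSON text
|
|     back = np.unpackbits(np.frombuffer(bytes.fromhex(hex_str),
|                                        dtype=np.uint8))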
___________________________________________________________________
(page generated 2025-02-25 23:02 UTC)