[HN Gopher] The best way to use text embeddings portably is with...
___________________________________________________________________
The best way to use text embeddings portably is with Parquet and
Polars
Author : minimaxir
Score : 237 points
Date : 2025-02-24 18:27 UTC (1 day ago)
(HTM) web link (minimaxir.com)
(TXT) w3m dump (minimaxir.com)
| whinvik wrote:
| Since we are talking about an embedded solution, shouldn't the
| benchmark be something like sqlite with a vector extension or
| lancedb?
| minimaxir wrote:
| I mention sqlite + sqlite-vec at the end, noting it requires
| technical overhead and it's not as easy as read_parquet() and
| write_parquet().
|
| I just became aware of lancedb and am looking into it, although
| from glancing at the README it has similar usability issues to
| faiss for casual use. It is much better than faiss, though, in
| that it can work with colocated metadata.
| 0cf8612b2e1e wrote:
| My natural point of comparison would actually be DuckDB plus
| their vector search extension.
| thelastbender12 wrote:
| This is pretty neat.
|
| IMO a hindrance to this was the lack of built-in fixed-size list
| array support in the Arrow format, until recently. Some
| implementations/clients supported it, while others didn't.
| Otherwise, it could have been used as the default storage format
| for numpy arrays and torch tensors, too.
|
| (You could always store arrays as variable-length list arrays
| with fixed strides and handle the conversion.)
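|
| For illustration, a minimal sketch of the now-supported path in
| pyarrow (column name and dimensionality are made up):
|
|     import numpy as np
|     import pyarrow as pa
|     import pyarrow.parquet as pq
|
|     # a (num_rows, dim) float32 embedding matrix
|     emb = np.random.rand(1000, 768).astype(np.float32)
|
|     # pack the flattened values into a fixed-size list column
|     arr = pa.FixedSizeListArray.from_arrays(pa.array(emb.ravel()), 768)
|     pq.write_table(pa.table({"embedding": arr}), "embeddings.parquet")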
| banku_brougham wrote:
| Is your example of a float32 number correct, holding a 24-char
| ASCII representation? I had thought single precision would be 7
| digits plus the exponent, sign, and exponent sign. Something
| like 7+2+1+1, or a 10-char ASCII representation, rather than
| the 24 you mentioned?
| minimaxir wrote:
| It depends on the default print format. The example string I
| mentioned is pulled from what np.savetxt() does (fmt='%.18e')
| and there isn't any precision loss in that number. But I admit
| I'm not a sprintf() guru.
|
| In practice, numbers with that much precision are overkill and
| verbose, so tools don't print float32s to that level of
| precision.
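|
| A quick sketch of where the 24 ASCII chars come from (the file
| name is arbitrary):
|
|     import numpy as np
|
|     vec = np.random.rand(4).astype(np.float32)
|     np.savetxt("vec.txt", vec)  # default fmt="%.18e"
|
|     # each value prints like 6.916606426239013672e-01:
|     # 1 digit + '.' + 18 digits + 'e' + sign + 2-digit exponent
|     # = 24 ASCII chars per number, before the newline
|     print(open("vec.txt").read())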
| PaulHoule wrote:
| One of the things I remember from my PhD work is that you can
| do a stupendous number of FLOPs on floating point numbers in
| the time it takes to serialize/deserialize them to ASCII.
| banku_brougham wrote:
| Really cool article, I've enjoyed your work for a long time. You
| might add a note for those jumping into a sqlite implementation:
| duckdb reads parquet and recently launched a few vector
| similarity functions which cover this use case perfectly:
|
| https://duckdb.org/2024/05/03/vector-similarity-search-vss.h...
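|
| A minimal sketch of that combination from Python (the file,
| column names, and dimensionality are assumptions; the embedding
| column must be a fixed-size FLOAT[n] array for the VSS
| functions):
|
|     import duckdb
|
|     query_vector = [0.1] * 768  # stand-in for a real embedding
|
|     top5 = duckdb.execute("""
|         SELECT name,
|                array_cosine_similarity(
|                    embedding, $q::FLOAT[768]
|                ) AS similarity
|         FROM 'mtg_embeddings.parquet'
|         ORDER BY similarity DESC
|         LIMIT 5
|     """, {"q": query_vector}).fetchall()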
| jt_b wrote:
| I have tinkered with using DuckDB as a poor man's vector
| database for a POC and had great results.
|
| One thing I'd love to see is being able to do some sort of row
| group level metadata statistics for embeddings within a parquet
| file - something that would allow various readers to push
| predicates down to an HTTP request metadata level and
| completely avoid loading in non-relevant rows to the database
| from a remote file - particularly one stored on S3 compatible
| storage that supports byte-range requests. I'm not sure what
| the implementation would look like: how to define the sorting
| algorithm that organizes the "close" rows together, how the
| metadata would be calculated, or what the reader implementation
| would look like. But I'd love to be able to implement some of
| the same patterns with vector search as with geoparquet.
| jt_b wrote:
| I thought about this some more and did some research - and
| found an indexing approach using HNSW, serialized to parquet,
| and queried from the browser here:
|
| https://github.com/jasonjmcghee/portable-hnsw
|
| It opens up efficient query patterns for larger datasets in RAG
| projects where you may not have the resources to run an
| expensive vector database.
| jasonjmcghee wrote:
| Hey that's my little research project- lmk if you're
| interested in chatting about this stuff.
|
| As others have mentioned in other threads, parquet isn't a
| great tool for the job here, but you could theoretically
| build a different file format that lends itself better to
| the problem of static file(s) representing a vector
| database.
| kernelsanderz wrote:
| For another library that has great performance and features like
| full-text indexing and the ability to version changes, I'd
| recommend lancedb: https://lancedb.github.io/lancedb/
|
| Yes, it's a vector database and has more complexity. But you can
| use it without creating indexes, and it also has excellent
| zero-copy Arrow support for polars and pandas.
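|
| A minimal sketch of that index-free usage (table name, column
| name, and file are assumptions):
|
|     import lancedb
|     import polars as pl
|
|     df = pl.read_parquet("mtg_embeddings.parquet")
|
|     db = lancedb.connect("./lance_data")
|     tbl = db.create_table("cards", df)  # ingested via Arrow
|
|     # brute-force search works without building an index
|     query_vector = [0.1] * 768  # stand-in for a real embedding
|     top5 = (
|         tbl.search(query_vector, vector_column_name="embedding")
|         .limit(5)
|         .to_pandas()
|     )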
| esafak wrote:
| Lance is made for this stuff; parquet is not.
| daveguy wrote:
| Since a lot of ML data is stored as parquet, I found this to be
| a useful tidbit from lancedb's documentation:
|
| > Data storage is columnar and is interoperable with other
| columnar formats (such as Parquet) via Arrow
|
| https://lancedb.github.io/lancedb/concepts/data_management/
|
| Edit: That said, I am personally a fan of parquet, arrow, and
| ibis. So many data wrangling options out there it's easy to get
| analysis paralysis.
| 3abiton wrote:
| How well does it scale?
| stephantul wrote:
| Check out Unum's usearch. It beats anything, and is super easy to
| use. It just does exactly what you need.
|
| https://github.com/unum-cloud/usearch
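|
| For reference, a minimal sketch of usearch with numpy (the
| dimensionality and key scheme are arbitrary):
|
|     import numpy as np
|     from usearch.index import Index
|
|     vectors = np.random.rand(33000, 768).astype(np.float32)
|
|     index = Index(ndim=768, metric="cos")
|     index.add(np.arange(len(vectors)), vectors)
|
|     matches = index.search(vectors[0], 10)  # top-10 neighbors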
| esafak wrote:
| Have you tested it against Lance? Does it do predicate pushdown
| for filtering?
| stephantul wrote:
| Usearch is a vector store afaik, not a vector db. At least
| that's how I use it.
|
| I haven't compared it to lancedb, I reached for it here
| because the author mentioned Faiss being difficult to use and
| install. usearch is a great alternative to Faiss.
|
| But thanks for the suggestion, I'll check it out
| ashvardanian wrote:
| USearch author here :)
|
| The engine supports arbitrary predicates for C, C++, and Rust
| users. In higher level languages it's hard to combine
| callbacks and concurrent state management.
|
| In terms of scalability and efficiency, the only tool I've
| seen coming close is Nvidia's cuVS if you have GPUs
| available. FAISS HNSW implementation can easily be 10x slower
| and most commercial & venture-backed alternatives are even
| slower: https://www.unum.cloud/blog/2023-11-07-scaling-
| vector-search...
|
| In this use-case, I believe SimSIMD raw kernels may be a
| better choice. Just replace NumPy and enjoy speedups. It
| provides hundreds of hand-written SIMD kernels for all kinds
| of vector-vector operations for AVX, AVX-512, NEON, and SVE
| across F64, F32, BF16, F16, I8, and binary vectors, mostly
| operating in mixed precision to avoid overflow and
| instability: https://github.com/ashvardanian/SimSIMD
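|
| A quick sketch of the drop-in usage (shapes and metric choice
| are illustrative):
|
|     import numpy as np
|     import simsimd
|
|     a = np.random.rand(768).astype(np.float32)
|     b = np.random.rand(768).astype(np.float32)
|
|     # SIMD-accelerated cosine distance for a single pair
|     dist = simsimd.cosine(a, b)
|
|     # or one query against a whole matrix of candidates
|     candidates = np.random.rand(33000, 768).astype(np.float32)
|     dists = simsimd.cdist(a.reshape(1, -1), candidates,
|                           metric="cosine")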
| kipukun wrote:
| To the second footnote: you could utilize Polars' LazyFrame API
| to do that cosine similarity in a streaming fashion for large
| files.
| minimaxir wrote:
| That would get around memory limitations but I still think that
| would be slow.
| kipukun wrote:
| You'd be surprised. As long as your query is using Polars
| natives and not a UDF (which drops it down to Python), you
| may get good results.
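|
| A sketch of the native-expression version (assuming the
| embeddings are unit-normalized, so a dot product equals cosine
| similarity, and are stored as a fixed-size pl.Array column named
| "embedding"; with a List column you'd use .list.get instead):
|
|     import numpy as np
|     import polars as pl
|
|     query = np.random.rand(768).astype(np.float32)
|     query /= np.linalg.norm(query)
|
|     # build the dot product from native expressions only
|     dot = pl.sum_horizontal(
|         pl.col("embedding").arr.get(i) * float(q)
|         for i, q in enumerate(query)
|     )
|
|     top = (
|         pl.scan_parquet("embeddings.parquet")
|         .with_columns(dot.alias("similarity"))
|         .sort("similarity", descending=True)
|         .head(10)
|         .collect(streaming=True)
|     )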
| jononor wrote:
| A (simple) benchmark would be great to figure out where the
| practical limits of such an approach are. Runtime is
| expected to grow with O(n^2), which will get painful at some
| point.
| jtrueb wrote:
| Polars + Parquet is awesome for portability and performance. This
| post focused on python portability, but Polars has an easy-to-use
| Rust API for embedding the engine all over the place.
| blooalien wrote:
| Gotta love stuff that has multiple language bindings. Always
| really enjoyed finding powerful libraries in Python and then
| seeing they also have matching bindings for Go and Rust. Nice
| to have easy portability and cross-language compatibility.
| robschmidt90 wrote:
| Nice read. I agree that for a lot of hobby use cases you can just
| load the embeddings from parquet and compute the similarities in-
| memory.
|
| To find similarity between my blogposts [1] I wanted to
| experiment with a local vector database and found ChromaDB fairly
| easy to use (similar to SQLite, just a file on your machine).
|
| [1] https://staticnotes.org/posts/how-recommendations-work/
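|
| A minimal sketch of that local setup (collection name, IDs, and
| dimensionality are made up):
|
|     import chromadb
|
|     client = chromadb.PersistentClient(path="./chroma_data")
|     posts = client.get_or_create_collection("blogposts")
|
|     posts.add(
|         ids=["post-1", "post-2"],
|         embeddings=[[0.1] * 384, [0.2] * 384],
|         documents=["How recommendations work", "Another post"],
|     )
|
|     similar = posts.query(query_embeddings=[[0.1] * 384],
|                           n_results=2)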
| jononor wrote:
| At 33k items, in-memory is quite fast; 10 ms is very responsive.
| With 10x / 330k items on the same hardware, the expected time is
| 1 second. That might be too slow for some applications (but not
| all). Especially if one just retrieves a rather small number of
| matches, an index will help a lot for 100k++ datasets.
| noahbp wrote:
| Wow! How much did this cost you in GPU credits? And did you
| consider using your MacBook?
| minimaxir wrote:
| It took 1:17 to encode all ~32k cards using a preemptible L4
| GPU on Google Cloud Platform (g2-standard-4) at ~$0.28/hour,
| costing < $0.01 overall: https://github.com/minimaxir/mtg-
| embeddings/blob/main/mtg_em...
|
| The base ModernBERT uses CUDA tricks not available in MPS, so I
| suspect it would take much longer.
|
| For the 2D UMAP, it took 3:33 because I wanted to do 1 million
| epochs to be _thorough_: https://github.com/minimaxir/mtg-
| embeddings/blob/main/mtg_em...
| rcarmo wrote:
| I'm a huge fan of polars, but I hadn't considered using it to
| store embeddings in this way (I've been fiddling with sqlite-
| vec). Seems like an interesting idea indeed.
| thomasfromcdnjs wrote:
| Lots of great findings
|
| ---
|
| I'm curious if anyone knows whether it is better to pass
| structured data or unstructured data to embedding APIs. If I ask
| ChatGPT, it says it is better to send unstructured data.
| (Looking at the author's GitHub, it looks like he generated
| embeddings from JSON strings.)
|
| My use case is for jsonresume, I am creating embeddings by
| sending full json versions as strings, but I've been
| experimenting with using models to translate resume.json's into
| full text versions first before creating embeddings. The results
| seem to be better but I haven't seen any concrete opinions on
| this.
|
| My understanding is that unstructured data is better because it
| contains textual/semantic meaning from natural language, i.e.
|
|     skills: ['Javascript', 'Python']
|
| is worse than:
|
|     Thomas excels at Javascript and Python
|
| Another question: What if the search was also a json embedding?
| JSON <> JSON embeddings could also be great?
| minimaxir wrote:
| In general I like to send structured data (see the input format
| here: https://github.com/minimaxir/mtg-embeddings), but the
| ModernBERT base for the embedding model used here _specifically_
| handles structured data better than previous models did. That's
| worth another blog post explaining why.
| notpublic wrote:
| please do explain why
| minimaxir wrote:
| tl;dr the base ModernBERT was trained with code in mind,
| unlike most encoder-only models (so it was presumably also
| trained on JSON/YAML objects), and it also includes a
| custom tokenizer to support that, which is why I mention
| that indentation is important: different levels of
| indentation map to different single tokens.
|
| This is mostly theoretical and does require a deeper dive to
| confirm.
| vunderba wrote:
| I'd say the more important consideration is "consistency"
| between incoming query input and stored vectors.
|
| I have a huge vector database that gets updated/regenerated
| from a personal knowledge store (markdown library). Since the
| user is most likely to input a comparison query in the form of
| a question "Where does X factor into the Y system?" - I use a
| small 7b parameter LLM to pregenerate a list of a dozen
| possible theoretical questions a user might pose to a given
| embedding chunk. These are saved as 1536 dimension sized
| embeddings into the vector database (Qdrant) and linked to the
| chunks.
|
| The real question you need to ask is - what's the input query
| that you'll be comparing to the embeddings? If it's incoming as
| structured, then store structured, etc.
|
| I've also seen (anecdotally) similarity degradation for smaller
| chunks, so keep that in mind as well.
| octernion wrote:
| or you could just use postgres + pgvector? which many apps
| already have installed by default.
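|
| For reference, a minimal sketch with psycopg and the pgvector
| Python adapter (table and column names are made up):
|
|     import numpy as np
|     import psycopg
|     from pgvector.psycopg import register_vector
|
|     conn = psycopg.connect("dbname=app")
|     register_vector(conn)
|
|     query_vector = np.random.rand(768).astype(np.float32)
|
|     # <=> is pgvector's cosine distance operator
|     rows = conn.execute(
|         "SELECT id FROM items ORDER BY embedding <=> %s LIMIT 5",
|         (query_vector,),
|     ).fetchall()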
| jononor wrote:
| Many ways to skin a cat, at least at this size (33k items). At
| this scale, standing up a database would have no advantages,
| which I believe is the main point of the post! If you have a
| simple problem, use a simple solution.
|
| If one had instead 1M items, the situation would be completely
| different.
| dwagnerkc wrote:
| If you want to try it out, you can lazily load from HF and apply
| filtering this way:
|
|     df = (
|         pl.scan_parquet(
|             "hf://datasets/minimaxir/mtg-embeddings/mtg_embeddings.parquet"
|         )
|         .filter(
|             pl.col("type").str.contains("Sorcery"),
|             pl.col("manaCost").str.contains("B"),
|         )
|         .collect()
|     )
|
| Polars is awesome to use, would highly recommend. Single node it
| is excellent at saturating CPUs, if you need to distribute the
| work put it in a Ray Actor with some POLARS_MAX_THREADS applied
| depending on how much it saturates a single node.
| WatchDog wrote:
| Parquet is fine and all, but I love the simplicity and broad
| interoperability of CSV.
|
| You can save a huge amount of overhead just by base64-encoding
| the vectors; they aren't exactly human readable anyway.
|
| I imagine the resulting file would only be approximately 33%
| larger than the pickle version.
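|
| A sketch of the round trip (the overhead figure follows from
| base64's 4-bytes-per-3 expansion):
|
|     import base64
|     import numpy as np
|
|     vec = np.random.rand(768).astype(np.float32)  # 3072 raw bytes
|
|     encoded = base64.b64encode(vec.tobytes()).decode("ascii")
|     # 4096 chars: one CSV cell instead of 768 float columns
|
|     decoded = np.frombuffer(base64.b64decode(encoded),
|                             dtype=np.float32)
|     assert np.array_equal(vec, decoded)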
| llm_trw wrote:
| >The second incorrect method to save a matrix of embeddings to
| disk is to save it as a Python pickle object [...] But it comes
| with two major caveats: pickled files are a massive security risk
| as they can execute arbitrary code, and the pickled file may not
| be guaranteed to be able to be opened on other machines or Python
| versions. It's 2025, just stop pickling if you can.
|
| Security: absolutely.
|
| Portability: who cares? Frameworks move so quickly that unless
| you carry your whole dependency graph between machines you will
| not get bit compatible results with even minor version changes.
| It's a dirty secret that no one seems to want to fix or care
| about.
|
| In short: everything is so fucked that pickle + conda is more
| than good enough for whatever project you want to serve to
| >10,000 users.
| PaulHoule wrote:
| In 2017 I was working on a model trainer for text classification
| and sequence labeling [1] that had limited success because the
| models weren't good enough.
|
| I have a minilm + pooling + svm classifier which works pretty
| well for some things (topics, "will I like this article?") but
| doesn't work so well for sentiment, emotional tone and other
| things where the order of the words matter. I'm planning to
| upgrade my current classifier's front end to use ModernBert and
| add an LSTM-based back end that I think will equal or beat fine-
| tuned BERT and, more importantly, can be trained reliably with
| early stopping. I'd like to open source the thing, focused on
| reliability, because I'm an application programmer at heart.
|
| I want it to provide an interface which is text-in and labels-out
| and hide the embeddings from most users but I'm definitely
| thinking about how to handle them. The worse problem here is
| that the LSTM needs a vector for each token, not each document,
| so text gets puffed up by a factor of 1000 or so, which is not
| insurmountable (1 MB of training text puffs up to 1 GB of
| vectors).
|
| Since it's expensive to compute the embeddings _and_ expensive
| to store them, I'm thinking about whether and how to cache
| them, considering that I expect to present the same samples to
| the trainer multiple times and to do a lot of model selection
| in the process of model development (e.g. what exact shape of
| LSTM to use) and in the case of end-user training (it will
| probably try a few models, not least do a shootout between the
| expensive model and a cheap model).
|
| [1] think of a "magic magic marker" which learns to mark up text
| the same way you do; this could mark "needless words" you could
| delete from a title, parts of speech, named entities, etc.
| intalentive wrote:
| The problem with Parquet is it's static. Not good for use cases
| that involve continuous writes and updates. Although I have had
| good results with DuckDB and Parquet files in object storage.
| Fast load times.
|
| If you host your own embedding model, then you can transmit numpy
| float32 compressed arrays as bytes, then decode back into numpy
| arrays.
|
| Personally I prefer using SQLite with the usearch extension:
| binary vectors, then rerank the top 100 with float32. It's about
| 2 ms for ~20k items, which beats LanceDB in my tests. Maybe
| Lance wins on bigger collections. But for my use case it works
| great, as each user has their own dedicated SQLite file.
|
| For portability there's Litestream.
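|
| The quantize-then-rerank idea is easy to sketch in plain numpy
| (thresholds and counts are arbitrary; usearch does the distance
| work inside the SQLite extension instead):
|
|     import numpy as np
|
|     emb = np.random.randn(20000, 768).astype(np.float32)
|     bits = np.packbits(emb > 0, axis=1)  # 96 bytes per vector
|
|     q = np.random.randn(768).astype(np.float32)
|     q_bits = np.packbits(q > 0)
|
|     # coarse pass: Hamming distance on the binary codes
|     hamming = np.unpackbits(bits ^ q_bits, axis=1).sum(axis=1)
|     top100 = np.argsort(hamming)[:100]
|
|     # fine pass: rerank the top 100 with float32 dot products
|     reranked = top100[np.argsort(-(emb[top100] @ q))]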
| jt_b wrote:
| > The problem with Parquet is it's static. Not good for use
| cases that involve continuous writes and updates. Although I
| have had good results with DuckDB and Parquet files in object
| storage. Fast load times.
|
| You can use glob patterns in DuckDB to query remote parquets
| though to get around this? Maybe break things up using a hive
| partitioning scheme or similar.
| memhole wrote:
| I like the pattern described too. The only snag is deletes and
| updates. IME, you have to delete the underlying file or
| create and maintain a view that handles the data you want
| visible.
| dijksterhuis wrote:
| > The problem with Parquet is it's static. Not good for use
| cases that involve continuous writes and updates.
|
| parquet is columnar storage, so its use case is lots of heavy
| filtering/aggregation within analytical workloads (OLAP).
|
| consistent writes/updates, i.e. basically transactional
| (OLTP) use cases, are never going to have great performance in
| columnar storage. it's the wrong format to use for that.
|
| for faster writes/updates you'd want row-based storage, i.e.
| CSV or an actual database. which i'm glad to see is where you
| kind of ended up anyway.
| yorwba wrote:
| There's no reason why an update query that doesn't change the
| file layout and only twiddles some values in place couldn't
| be made fast with columnar storage.
|
| When you run a read query, there's one phase that determines
| the offsets where values are stored and another that reads
| the value at a given offset. For an update query that doesn't
| change the offsets, you can change the direction from reading
| the value at an offset to writing a new value to that
| location instead, and it should be plenty fast.
|
| Parquet libraries just don't seem to consider that use case
| worth supporting for some reason and expect people to
| generate an entire new file with mostly the same content
| instead. Which definitely doesn't have great performance!
| rbetts wrote:
| Columnar storage systems rarely store the raw value at
| fixed position. They store values as run length encoded,
| dictionary encoded, delta encoded, etc... and then store
| metadata about chunk of values for pruning at query time.
| So rarely can you seek to an offset and update a value. The
| compression achieved means less data to read from disk when
| doing large scans, and lower storage costs for very large
| datasets that are largely immutable: some of the important
| benefits of columnar storage.
|
| Also, many applications that require updates also update
| conditionally (update a where b = c). This requires re-
| synthesizing (at least some of) the row to make a
| comparison, another relatively expensive operation for a
| column store.
| lmeyerov wrote:
| Also, values are typically stored with binary compression
| (snappy, zlib) on top of those encodings. In-memory formats
| might only use the semantic encodings, e.g., arrow.
|
| But it's... fine? Batch writes and rewrite dirty parts.
| Most of our cases are either appending events, or
| enriching with new columns, which can be modeled
| columnarly. It is a bit more painful in GPU land because we
| like big chunks (250MB-1GB) for saturating reads, but CPU
| land is generally fine for us.
|
| We have been eyeing iceberg and friends as a way to
| automate that, so I've been curious how much of the
| optimization, if any, they take for us
| csunbird wrote:
| Parquet files being immutable is not a bug, it is a feature.
| That is how you accomplish good compression and keep the
| columnar data organized.
|
| Yes, it is not useful for continuous writes and updates, but
| that is not what it was designed for. Use a database (e.g.
| SQLite, just like you suggested) if you want to ingest real-
| time/streaming data.
| pantsforbirds wrote:
| I've had great luck using either Athena or DuckDB with parquet
| files in s3 using a few partitions. You can query across the
| partitions pretty efficiently and if date/time is one of your
| partitions, then it's very efficient to add new data.
| th24o3j4324234 wrote:
| The trouble with Parquet (and columnar storage) in ML is:
|
| 1. You don't really care too much about accessing subsets of
| columns.
|
| 2. You can't easily append stuff to closed Parquet files.
|
| 3. Batched-row access is presumably slower due to lower cache
| hit rates.
|
| It's okay for map-reduce style stuff where this doesn't matter,
| but in ML these limitations are an annoyance.
|
| HDF5 (or Zarr, less portably) solves some/many of these issues
| but it's not quite a settled affair.
| jononor wrote:
| Re 2: Parquet can easily be used with chunked/partitioned
| files. Then appending is just adding another file/chunk, as
| sketched below.
|
| The case of 1 really depends on the workload. For embeddings
| etc., selecting column subsets is rare. In other cases, where
| one has a bunch of separate features, doing column subsetting
| might be rather common. But yes, it is far from every case.
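|
| A sketch of the append-by-adding-files pattern with pyarrow
| (paths and the partition column are made up); DuckDB or Polars
| can then glob-read the whole directory:
|
|     import pyarrow as pa
|     import pyarrow.parquet as pq
|
|     batch = pa.table({"id": [1, 2], "set_code": ["LEA", "LEA"]})
|
|     # each call adds new files under embeddings/set_code=.../
|     pq.write_to_dataset(batch, "embeddings",
|                         partition_cols=["set_code"])
|
|     # readers see the union of all chunks, e.g. in polars:
|     #   pl.scan_parquet("embeddings/**/*.parquet")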
| ismailmaj wrote:
| Parquet is only a mess if you try to mutate it; usually you
| consider the files immutable and store the data across many
| files.
|
| Also, the cost of batched-row access is negligible given the
| compression benefits you get with the columnar format, which is
| probably why it's still king in ML, I think, given what I'm
| seeing in the industry and recent trends (e.g. Velox).
| mhh__ wrote:
| I still don't like dataframes but oh my God Polars is so much
| better than pandas.
|
| I was doing some time series calculations, simple equity price
| adjustments basically, in Polars and my two thoughts were:
|
| - WTF, I can actually read the code and test it.
|
| - it's running so fast it seems like it's broken.
| eskaytwo wrote:
| There's some nice plugins too, some are finance related:
| https://github.com/ddotta/awesome-polars
| mhh__ wrote:
| The one thing I really want is for someone to make it so I
| can use it in F#. Presumably it's possible given how the
| python bit is implemented under the hood?
| whyever wrote:
| It uses pyo3 to generate the bindings, so you would have to
| find a similar crate for F#/.NET and port the polars Python
| FFI to it. If such a crate does not exist, it will be even
| more work.
| LaurensBER wrote:
| Yeah, the readability difference is immense. I worked for years
| with Pandas and I still cannot "scan" it as quickly as with a
| "normal" programming language or SQL. Then there's the whole
| issue with (multi)-indexes, serialisation, etc.
|
| Polars makes programming fun again instead of a chore.
| k2so wrote:
| A neat trick in the Vespa (vector DB, among other things)
| documentation is to use a hex representation of vectors after
| converting them to binary.
|
| This trick can be used to reduce your payload sizes. In Vespa,
| they support this format which is particularly useful when the
| same vectors are referenced multiple times in a document. For
| ColBERT or ColPaLi like cases (where you have many embedding
| vectors), this can reduce the size of the vectors stored on disk
| massively.
|
| https://docs.vespa.ai/en/reference/document-json-format.html...
|
| Not sure why this is not more commonly adopted though
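|
| The encoding itself is a near one-liner in numpy (the dimension
| and sign threshold are illustrative):
|
|     import numpy as np
|
|     vec = np.random.randn(768).astype(np.float32)
|
|     bits = np.packbits(vec > 0)     # 768 dims -> 96 bytes
|     hex_str = bits.tobytes().hex()  # 192 hex chars, vs ~4 KB of
|                                     # float32 JSON text
|
|     back = np.unpackbits(np.frombuffer(bytes.fromhex(hex_str),
|                                        dtype=np.uint8))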
___________________________________________________________________
(page generated 2025-02-25 23:02 UTC)