[HN Gopher] The best way to use text embeddings portably is with...
       ___________________________________________________________________
        
       The best way to use text embeddings portably is with Parquet and
       Polars
        
       Author : minimaxir
       Score  : 84 points
       Date   : 2025-02-24 18:27 UTC (4 hours ago)
        
 (HTM) web link (minimaxir.com)
 (TXT) w3m dump (minimaxir.com)
        
       | whinvik wrote:
        | Since we are talking about an embedded solution, shouldn't the
        | benchmark be something like sqlite with a vector extension, or
        | lancedb?
        
         | minimaxir wrote:
         | I mention sqlite + sqlite-vec at the end, noting it requires
         | technical overhead and it's not as easy as read_parquet() and
         | write_parquet().
         | 
          | I just became aware of lancedb and am looking into it. From
          | glancing at the README, it has usability issues similar to
          | faiss for casual use, although it is much better than faiss in
          | that it can work with colocated metadata.
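          | 
          | For context, the Parquet round-trip really is a one-liner each
          | way. A minimal sketch, assuming a recent Polars where a 2D
          | NumPy array maps to a fixed-width Array column (file and
          | column names are illustrative):
          | 
          |       import numpy as np
          |       import polars as pl
          | 
          |       # Toy stand-in for real embeddings: 3 rows of 4-dim vectors.
          |       embeddings = np.random.rand(3, 4).astype(np.float32)
          |       df = pl.DataFrame({"id": [1, 2, 3], "embedding": embeddings})
          |       df.write_parquet("embeddings.parquet")
          | 
          |       # Read back; the column converts straight to a (3, 4) matrix.
          |       emb = pl.read_parquet("embeddings.parquet")["embedding"].to_numpy()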
        
         | 0cf8612b2e1e wrote:
          | My natural point of comparison would actually be DuckDB plus
          | their vector search extension.
        
       | thelastbender12 wrote:
       | This is pretty neat.
       | 
        | IMO a hindrance to this was the lack of built-in fixed-size list
        | array support in the Arrow format until recently. Some
        | implementations/clients supported it, while others didn't.
        | Otherwise, it could have been used as the default storage format
        | for numpy arrays and torch tensors, too.
        | 
        | (You could always store arrays as variable-length list arrays
        | with fixed strides and handle the conversion yourself.)
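        | 
        | For reference, declaring the fixed-size type in pyarrow is now
        | straightforward; a small sketch:
        | 
        |       import pyarrow as pa
        | 
        |       # Passing a length to pa.list_() yields a FixedSizeList type.
        |       vec_type = pa.list_(pa.float32(), 4)
        |       arr = pa.array([[0.1, 0.2, 0.3, 0.4],
        |                       [0.5, 0.6, 0.7, 0.8]], type=vec_type)
        |       print(arr.type)  # fixed_size_list<item: float>[4]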
        
       | banku_brougham wrote:
        | Is your example of a float32 number correct, taking 24 ASCII
        | chars to represent? I had thought single-precision would be 7
        | digits plus the exponent, sign, and exponent sign - something
        | like 7+2+1+1, or about 10 ASCII chars, rather than the 24 you
        | mentioned?
        
         | minimaxir wrote:
          | It depends on the default print format. The example string I
          | mentioned is pulled from what np.savetxt() does (fmt='%.18e'),
          | and there isn't any precision loss in that number. But I admit
          | I'm not a sprintf() guru.
          | 
          | In practice, numbers with that much precision are overkill and
          | verbose, so tools don't print float32s to that level of
          | precision.
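          | 
          | The length is easy to check against np.savetxt()'s documented
          | default format:
          | 
          |       import numpy as np
          | 
          |       x = np.float32(0.123456789)
          |       s = "%.18e" % x   # np.savetxt's default fmt
          |       print(s, len(s))  # 1.234567910432815552e-01 24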
        
         | PaulHoule wrote:
         | One of the things I remember from my PhD work is that you can
         | do a stupendous number of FLOPs on floating point numbers in
         | the time it takes to serialize/deserialize them to ASCII.
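          | 
          | A rough, unscientific way to see the gap (Unix-only because of
          | /dev/null; numbers vary by machine):
          | 
          |       import timeit
          |       import numpy as np
          | 
          |       v = np.random.rand(1_000_000).astype(np.float32)
          |       flops = timeit.timeit(lambda: v * 2.0 + 1.0, number=10) / 10
          |       ascii_ = timeit.timeit(lambda: np.savetxt("/dev/null", v), number=1)
          |       print(flops, ascii_)  # the ASCII pass is typically far slower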
        
       | banku_brougham wrote:
       | Really cool article, I've enjoyed your work for a long time. You
       | might add a note for those jumping into a sqlite implementation,
       | that duckdb reads parquet and launched a few vector similarity
       | functions which cover this use-case perfectly:
       | 
       | https://duckdb.org/2024/05/03/vector-similarity-search-vss.h...
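        | 
        | A minimal sketch from Python - recent DuckDB ships
        | list_cosine_similarity() in core, and the VSS extension adds
        | HNSW indexes on top (file and column names are illustrative):
        | 
        |       import duckdb
        | 
        |       q = [0.1, 0.2, 0.3, 0.4]  # stand-in query embedding
        |       rows = duckdb.execute("""
        |           SELECT id, list_cosine_similarity(embedding, ?) AS score
        |           FROM read_parquet('embeddings.parquet')
        |           ORDER BY score DESC
        |           LIMIT 5
        |       """, [q]).fetchall()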
        
         | jt_b wrote:
         | I have tinkered with using DuckDB as a poor man's vector
         | database for a POC and had great results.
         | 
         | One thing I'd love to see is being able to do some sort of row
         | group level metadata statistics for embeddings within a parquet
         | file - something that would allow various readers to push
         | predicates down to an HTTP request metadata level and
         | completely avoid loading in non-relevant rows to the database
         | from a remote file - particularly one stored on S3 compatible
         | storage that supports byte-range requests. I'm not sure what
         | the implementation would look like to define sorting the
         | algorithm to organize the "close" rows together, how the
         | metadata would be calculated, or what the reader implementation
         | would look like, but I'd love to be able to implement some of
         | the same patterns with vector search as with geoparquet.
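          | 
          | For reference, Parquet already records per-row-group
          | statistics for scalar columns, which is what predicate
          | pushdown keys off today; a sketch of inspecting them with
          | pyarrow (hypothetical file name):
          | 
          |       import pyarrow.parquet as pq
          | 
          |       pf = pq.ParquetFile("embeddings.parquet")
          |       for i in range(pf.metadata.num_row_groups):
          |           stats = pf.metadata.row_group(i).column(0).statistics
          |           print(i, stats.min, stats.max)  # e.g. min/max of "id"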
        
       | kernelsanderz wrote:
       | For another library that has great performance and features like
       | full text indexing and the ability to version changes I'd
       | recommend lancedb https://lancedb.github.io/lancedb/
       | 
       | Yes, it's a vector database and has more complexity. But you can
       | use it without creating indexes and it has excellent polars and
       | pandas zero copy arrow support also.
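        | 
        | Roughly, usage looks like this (local path and schema are
        | illustrative):
        | 
        |       import lancedb
        |       import numpy as np
        | 
        |       db = lancedb.connect("./lance-demo")
        |       data = [{"id": i, "vector": np.random.rand(4).astype(np.float32)}
        |               for i in range(100)]
        |       tbl = db.create_table("cards", data=data)
        | 
        |       # Brute-force kNN works without building an index first.
        |       hits = tbl.search(np.random.rand(4)).limit(5).to_arrow()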
        
         | esafak wrote:
         | Lance is made for this stuff; parquet is not.
        
         | daveguy wrote:
         | Since a lot of ML data is stored as parquet, I found this to be
         | a useful tidbit from lancedb's documentation:
         | 
         | > Data storage is columnar and is interoperable with other
         | columnar formats (such as Parquet) via Arrow
         | 
         | https://lancedb.github.io/lancedb/concepts/data_management/
         | 
         | Edit: That said, I am personally a fan of parquet, arrow, and
         | ibis. So many data wrangling options out there it's easy to get
         | analysis paralysis.
        
       | stephantul wrote:
       | Check out Unum's usearch. It beats anything, and is super easy to
       | use. It just does exactly what you need.
       | 
       | https://github.com/unum-cloud/usearch
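        | 
        | A minimal sketch (dimensions and keys are illustrative):
        | 
        |       import numpy as np
        |       from usearch.index import Index
        | 
        |       index = Index(ndim=4, metric="cos")
        |       vectors = np.random.rand(100, 4).astype(np.float32)
        |       index.add(np.arange(100), vectors)
        | 
        |       matches = index.search(np.random.rand(4).astype(np.float32), 5)
        |       print(matches.keys, matches.distances)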
        
         | esafak wrote:
         | Have you tested it against Lance? Does it do predicate pushdown
         | for filtering?
        
           | stephantul wrote:
           | Usearch is a vector store afaik, not a vector db. At least
           | that's how I use it.
           | 
           | I haven't compared it to lancedb, I reached for it here
           | because the author mentioned Faiss being difficult to use and
           | install. usearch is a great alternative to Faiss.
           | 
           | But thanks for the suggestion, I'll check it out
        
           | ashvardanian wrote:
           | USearch author here :)
           | 
           | The engine supports arbitrary predicates for C, C++, and Rust
           | users. In higher level languages it's hard to combine
           | callbacks and concurrent state management.
           | 
           | In terms of scalability and efficiency, the only tool I've
           | seen coming close is Nvidia's cuVS if you have GPUs
           | available. FAISS HNSW implementation can easily be 10x slower
           | and most commercial & venture-backed alternatives are even
           | slower: https://www.unum.cloud/blog/2023-11-07-scaling-
           | vector-search...
           | 
           | In this use-case, I believe SimSIMD raw kernels may be a
           | better choice. Just replace NumPy and enjoy speedups. It
           | provides hundreds of hand-written SIMD kernels for all kinds
           | of vector-vector operations for AVX, AVX-512, NEON, and SVE
           | across F64, F32, BF16, F16, I8, and binary vectors, mostly
           | operating in mixed precision to avoid overflow and
           | instability: https://github.com/ashvardanian/SimSIMD
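            | 
            | A minimal sketch of the drop-in usage:
            | 
            |       import numpy as np
            |       import simsimd
            | 
            |       a = np.random.rand(1024).astype(np.float32)
            |       b = np.random.rand(1024).astype(np.float32)
            | 
            |       dist = simsimd.cosine(a, b)  # SIMD-accelerated distance
            |       # Batch form, a stand-in for scipy's cdist:
            |       D = simsimd.cdist(np.stack([a, b]), np.stack([b, a]),
            |                         metric="cosine")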
        
       | kipukun wrote:
        | To the second footnote: you could use Polars' LazyFrame API to
        | do that cosine similarity in a streaming fashion for large
        | files.
        
         | minimaxir wrote:
         | That would get around memory limitations but I still think that
         | would be slow.
        
           | kipukun wrote:
            | You'd be surprised. As long as your query uses Polars-native
            | expressions and not a UDF (which drops down into Python),
            | you may get good results.
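            | 
            | A sketch with native expressions only, assuming a fixed-
            | width Array column named "embedding" and a pre-normalized
            | query vector:
            | 
            |       import numpy as np
            |       import polars as pl
            | 
            |       q = np.random.rand(4).astype(np.float32)
            |       q /= np.linalg.norm(q)
            | 
            |       # Dot product and norm as sums of per-index products:
            |       # plain expressions, so no Python UDF in the hot path.
            |       dot = sum(pl.col("embedding").arr.get(i) * float(q[i])
            |                 for i in range(len(q)))
            |       norm = sum(pl.col("embedding").arr.get(i) ** 2
            |                  for i in range(len(q))).sqrt()
            | 
            |       top = (pl.scan_parquet("embeddings.parquet")
            |              .with_columns((dot / norm).alias("cosine_sim"))
            |              .top_k(5, by="cosine_sim")
            |              .collect(engine="streaming"))  # older: streaming=True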
        
       | jtrueb wrote:
       | Polars + Parquet is awesome for portability and performance. This
       | post focused on python portability, but Polars has an easy-to-use
       | Rust API for embedding the engine all over the place.
        
       | robschmidt90 wrote:
       | Nice read. I agree that for a lot of hobby use cases you can just
       | load the embeddings from parquet and compute the similarities in-
       | memory.
       | 
        | To find similarity between my blog posts [1], I wanted to
        | experiment with a local vector database and found ChromaDB
        | fairly easy to use (similar to SQLite: just a file on your
        | machine).
       | 
       | [1] https://staticnotes.org/posts/how-recommendations-work/
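        | 
        | For anyone curious, the ChromaDB flow is roughly (names and
        | vectors are illustrative):
        | 
        |       import chromadb
        | 
        |       client = chromadb.PersistentClient(path="./chroma-demo")
        |       posts = client.get_or_create_collection("posts")
        |       posts.add(ids=["a", "b"],
        |                 embeddings=[[0.1, 0.2], [0.3, 0.4]],
        |                 documents=["post a", "post b"])
        |       res = posts.query(query_embeddings=[[0.1, 0.2]], n_results=2)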
        
       | jononor wrote:
       | At 33k items in memory is quite fast, 10 ms is very responsive.
       | With 10x/330k items given same hardware the expected time is 1
       | second. That might be too slow for some applications (but not
       | all). Especially if one just does retrieval of a rather small
       | amount of matches, an index will help a lot for 100k++ datasets.
        
       | noahbp wrote:
       | Wow! How much did this cost you in GPU credits? And did you
       | consider using your MacBook?
        
         | minimaxir wrote:
          | It took 1:17 (m:ss) to encode all ~32k cards using a
          | preemptible L4 GPU
         | GPU on Google Cloud Platform (g2-standard-4) at ~$0.28/hour,
         | costing < $0.01 overall: https://github.com/minimaxir/mtg-
         | embeddings/blob/main/mtg_em...
         | 
         | The base ModernBERT uses CUDA tricks not available in MPS, so I
         | suspect it would take much longer.
         | 
          | For the 2D UMAP, it took 3:33 because I wanted to do 1 million
          | epochs to be _thorough_: https://github.com/minimaxir/mtg-
          | embeddings/blob/main/mtg_em...
        
       | rcarmo wrote:
       | I'm a huge fan of polars, but I hadn't considered using it to
       | store embeddings in this way (I've been fiddling with sqlite-
       | vec). Seems like an interesting idea indeed.
        
       | thomasfromcdnjs wrote:
       | Lots of great findings
       | 
       | ---
       | 
       | I'm curious if anyone knows whether it is better to pass
       | structured data or unstructured data to embedding api's? If I ask
       | ChatGPT, it says it is better to send unstructured data. (looking
       | at the authors github, it looks like he generated embeddings from
       | json strings)
       | 
       | My use case is for jsonresume, I am creating embeddings by
       | sending full json versions as strings, but I've been
       | experimenting with using models to translate resume.json's into
       | full text versions first before creating embeddings. The results
       | seem to be better but I haven't seen any concrete opinions on
       | this.
       | 
       | My understanding is that unstructured data is better because it
       | contains textual/semantic meaning because of natural lanaguage
       | aka                 skills: ['Javascript', 'Python']
       | 
       | is worse than;                 Thomas excels at Javascript and
       | Python
       | 
       | Another question: What if the search was also a json embedding?
       | JSON <> JSON embeddings could also be great?
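        | 
        | For concreteness, the two inputs I'm comparing look roughly
        | like this (embed() is a stand-in for any embedding API):
        | 
        |       import json
        | 
        |       resume = {"name": "Thomas", "skills": ["Javascript", "Python"]}
        | 
        |       raw = json.dumps(resume)  # option A: embed the JSON string
        |       # option B: render to natural language first
        |       nl = f"{resume['name']} excels at {' and '.join(resume['skills'])}."
        |       # compare embed(raw) vs. embed(nl) downstream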
        
         | minimaxir wrote:
         | In general I like to send structured data (see the input format
         | here: https://github.com/minimaxir/mtg-embeddings), but the
         | ModernBERT base for the embedding model used here
         | _specifically_ has better benefits implicitly for structured
         | data compared to previous models. That 's worth another blog
         | post explaining why.
        
       ___________________________________________________________________
       (page generated 2025-02-24 23:00 UTC)