[HN Gopher] Show HN: Llama2 Embeddings FastAPI Server
       ___________________________________________________________________
        
       Show HN: Llama2 Embeddings FastAPI Server
        
        Author here. I just wanted a quick and easy way to submit
        strings to a REST API and get back the embedding vectors in JSON
        using Llama2 and other similar LLMs, so I put this together over
        the past couple of days. It's quick to set up and totally
        self-contained and self-hosted. You can easily add new models by
        simply adding the HuggingFace URL to the GGML-format model
        weights. Two models are included by default, and these are
        automatically downloaded the first time it's run.

        It lets you not only submit text strings and get back the
        embeddings, but also compare two strings and get back their
        similarity score (i.e., the cosine similarity of their embedding
        vectors). You can also upload a plaintext file or PDF and get
        back the embeddings for every sentence in the file as a zipped
        JSON file (and you can specify the layout of this JSON file).

        Each time an embedding is computed for a given string with a
        given LLM, that vector is stored in the SQLite database and can
        be returned immediately on subsequent requests. You can also
        easily search across all stored vectors using a query string;
        this uses the integrated FAISS index.

        There are lots of nice performance enhancements, including
        parallel inference, a DB write queue, fully async everything,
        and even a RAM-disk feature to speed up model loading. I'm now
        working on additional API endpoints for easily generating
        sentiment scores using presets for different focus areas, but
        that's still a work in progress (the code for this so far is in
        the repo, though).
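
For reference, the similarity score described above (the cosine
similarity of two embedding vectors) can be computed in a few lines.
This is a generic stdlib sketch, not the project's actual code:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors:
    dot(a, b) / (|a| * |b|), in [-1, 1] for real vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Parallel vectors score 1.0 and orthogonal vectors score 0.0, which is
why near-identical strings (whose embeddings nearly coincide) get a
similarity close to 1.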
        
       Author : eigenvalue
       Score  : 145 points
       Date   : 2023-08-15 12:31 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | clbrmbr wrote:
       | I'd love to see some examples of how the token-in-context
       | embeddings stack up against sentence-level. What new use-cases
       | are unlocked?
       | 
       | Perhaps semantic search with word-highlighting?
       | 
       | Any advantage to using the full context window to maximize
       | context around the embedded token?
        
         | eigenvalue wrote:
          | Great idea. I'll see if I can make a new endpoint that adds
          | word-level highlighting annotations, which could be parsed
          | and used to control the brightness or color of each word
          | based on semantic relevance to a query term.
         | 
         | To be honest, I hadn't even thought about token-level
         | embeddings until someone on Reddit asked about it and I
         | realized it was possible to do with llama-cpp, so I just
         | quickly added the functionality without closely examining the
         | best use cases.
         | 
          | It's a LOT more data and compute than the normal
          | sentence-level embeddings, so it would really have to unlock
          | some useful new functionality to be worth it. But I do think
          | the "combined feature vector" concept, which at least makes
          | them fixed-length, is helpful.
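
Token-level output is variable-length because you get one vector per
token; a "combined feature vector" pools them into a single
fixed-length vector. Mean pooling is the most common way to do this
(a generic sketch; the repo's actual combination method may differ):

```python
def mean_pool(token_embeddings):
    """Average a list of per-token vectors (all the same length)
    into one fixed-length vector, independent of token count."""
    num_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [
        sum(vec[i] for vec in token_embeddings) / num_tokens
        for i in range(dim)
    ]
```

However many tokens the input has, the pooled result always has the
model's embedding dimensionality, so it can be compared and indexed
like a sentence-level embedding.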
        
       | Palmik wrote:
        | It seems unlikely that raw Llama2 will perform better than
        | purpose-made encoder models like bge [1], gte [2], e5 [3], or
        | Instructor, despite its much larger size (for the tasks people
        | usually need embeddings for).
       | 
       | You can probably get it to behave well with a fine tuning like
       | this: https://arxiv.org/pdf/2202.08904.pdf
       | 
       | [1] https://huggingface.co/BAAI/bge-large-en
       | 
       | [2] https://huggingface.co/thenlper/gte-large
       | 
       | [3] https://huggingface.co/intfloat/e5-large-v2
        
         | yberreby wrote:
         | I did not know about BGE, learned about it from your comment.
         | Seems to be the new SoTA for semantic embeddings, even for
         | small models. Very cool!
        
         | thejosh wrote:
         | It's exciting/terrifying to see how fast all this moves, it
         | feels like every day there are new discoveries, techniques and
         | models.
         | 
         | When I talk with people about ChatGPT-esque things (ChatGPT is
         | pretty much what most people know now), I say that it's crazy
         | that you can run this on your own consumer-level hardware if
         | you just want to mess with it. You don't need
         | "prosumer"/enthusiast hardware (unless you want to train
         | models, but then I see people are using Google Colab etc).
         | 
         | It's a crazy world we live in.
        
         | jauntbox wrote:
         | Yeah, I'd be surprised if embeddings derived from decoder-only
         | models are competitive in common embedding tasks without some
         | extra training work. There's a good benchmark page on
         | huggingface for the MTEB tasks (Massive Text Embedding
         | Benchmark) that's kept up to date here:
         | https://huggingface.co/spaces/mteb/leaderboard
        
         | jayalammar wrote:
          | This is my sense as well. Text-generation LLMs haven't been
          | the best source of embeddings for other downstream use
          | cases. If you're optimizing for token embeddings (e.g., for
          | NER, span detection, or token-classification tasks), then a
          | token-level training objective is important. If you need
          | text-level embeddings (e.g., for semantic search or text
          | classification), then a matching training objective is
          | required (e.g., what Sentence-BERT did to optimize BERT
          | embeddings for semantic search).
         | 
          | That's a great list of existing embedding models (in
          | addition to the SentenceBERT models:
          | https://www.sbert.net/docs/pretrained_models.html).
        
           | readyplayeremma wrote:
           | The SGPT model is a very high performing text embeddings
           | model adapted from a decoder. Using the same techniques with
           | Llama-2 might perform better than you expect. I think someone
           | will need to try these things before we know for certain. I
           | believe there is still room for significant improvement with
           | embedding models.
        
         | eigenvalue wrote:
         | Thanks for pointing out those models. I see from a quick
         | Huggingface search that the bge model is available in GGML
         | format. You can trivially add new GGML format models to the
         | code by simply adding the direct download link to this line:
         | 
         | https://github.com/Dicklesworthstone/llama_embeddings_fastap...
         | 
         | So to add the base bge model, you could just add this URL to
         | the list:
         | 
         | https://huggingface.co/maikaarda/bge-base-en-ggml/resolve/ma...
         | 
         | I will add that as an additional default.
        
           | jerrygenser wrote:
            | I think it's still overkill for semantic embedding,
            | though: SBERT is on the order of ~250M parameters, while
            | the smallest Llama is 7B.
        
             | eigenvalue wrote:
              | If all you want to do is basic semantic search, that's
              | probably true. But I strongly suspect we are only just
              | starting to scratch the surface of what's possible with
              | embeddings from much more powerful LLMs like Llama2,
              | which manifest much greater demonstrated
              | "understanding" of the sentences they are shown
              | (whatever that means, but intuitively it seems obvious
              | to me). That's partly why I made this tool: to aid my
              | investigations of LLM embeddings in a convenient and
              | performant way.
        
       | villgax wrote:
        | I'd rather stick with InstructEmbedding instead of pandering
        | to the flavour-of-the-month LLM. That way I keep my key
        | components insulated from drastic changes.
        
         | villgax wrote:
          | LLM inputs are the worst candidates for caching. The only
          | place where caching might make sense is if you have a
          | public-facing service and have coupled it with a vector
          | cache instead of a typical word-for-word caching of the
          | prompt.
        
           | eigenvalue wrote:
           | It's useful if you might submit the same document or edited
           | versions of the same document with a lot of overlap.
        
       | lhr0909 wrote:
        | Nice work, and starred! I'm curious how the embeddings are
        | computed, how many dimensions they have, and whether you have
        | run any benchmarks against OpenAI's offering?
        | 
        | Cheers!
        
         | eigenvalue wrote:
         | The embeddings are computed using llama-cpp, but langchain
         | makes a nice convenience wrapper to directly get them, so I use
         | that. The embeddings are 4096 dimensional vectors.
         | 
         | And no, I haven't benchmarked them against OpenAI's embeddings.
         | I should point out that this code will work for any model in
         | GGML format, so if there are fine-tuned Llama2 versions that
         | are optimized for embedding, you could use those instead very
         | easily (or any other model). This project is more about making
         | it easy to go from model to embeddings on demand via an API and
         | then letting you do useful things with those embeddings easily.
        
       | lgvld wrote:
        | Hi,
        | 
        | Looks quite clean, congrats.
        | 
        | Two questions:
        | 
        | 1. Starting from this, what would be the proper way to create
        | embeddings for a complete document (i.e., a long paragraph)?
        | My goal is to directly compare two PDFs according to their
        | contents. It seems that `compute_similarity_between_strings`
        | could be used, but then what is
        | `get_all_embedding_vectors_for_document` useful for?
        | 
        | 2. Using your API, does the inference run directly on the
        | VPS? Does it need special hardware (GPU, TPU, or whatever)?
        | 
        | Sorry if my questions are dumb, but I really appreciate your
        | project's simplicity, and I want to know if it could suit my
        | needs.
        | 
        | Thanks for sharing this piece of work.
        | 
        | ;-)
        
         | eigenvalue wrote:
          | Sure, the difference is that the first endpoint gives you
          | back a single embedding vector for the entire paragraph,
          | while the second gives you a separate embedding vector for
          | each sentence in the paragraph.
          | 
          | And yes, everything in this code is designed to run well on
          | the CPU of a modest machine and is 100% self-hosted, no API
          | keys needed at all. But if you do have a GPU installed and
          | configured, it will automatically be used, since the server
          | is powered by llama-cpp, which now supports CUDA.
        
       | kordlessagain wrote:
        | Here's an example of Instructor embeddings w/ FeatureBase:
        | https://gist.github.com/kordless/aae99946e7e2a5afccc83f3c4ee...
        | 
        | Instructor embeddings rank high on various embedding
        | leaderboards and can be run locally, regardless of how they
        | are stored. It takes about half a second to embed 20 strings
        | and 2.2 seconds to embed 80 strings. I haven't tested this
        | with different batch sizes or GPU acceleration (don't know if
        | that's possible). It is possible to quantize the vectors to
        | 8-bit floats.
        | 
        | I'm using FeatureBase to store these because a) I work there
        | and b) it will store and search the vectors by
        | euclidian_distance and cosine_distance. Right now this is a
        | cloud-only feature, but we'll work on getting it into the
        | community release at some point.
       | 
        | Combined with our current support for sets, set-intersection
        | operations like tanimoto(), and filtering and aggregation
        | (all done using roaring bitmaps for in-memory operations),
        | this presents an interesting offering for storing training
        | data and reporting. Being able to filter the vectors compared
        | by distance makes nearest-neighbor search algorithms almost
        | unnecessary, except for extreme use cases. In that case, it
        | might be better to consider switching to knowledge graphs (to
        | filter the vector space) instead of storing tens of millions
        | of dense vectors and doing approximate search on them.
        
         | eigenvalue wrote:
         | Cool. I was wondering what Tanimoto meant, since I've tried to
         | make myself familiar with all the useful similarity measures,
         | and apparently it's just another name for Jaccard Index.
         | 
         | I do think there is a lot of potential in exploring more
         | sensitive measures of similarity or statistical dependence. It
         | seems like the ML community has basically decided that all the
         | heavy lifting should be done at the model embedding level, and
         | then you can just use cosine similarity for speed and the
         | answers just "fall out". Which is definitely nice because then
         | you can search across millions of records per second.
         | 
         | But there are some lesser-known measures of
         | similarity/dependence that can pick up on more subtle
         | relationships-- the big drawback is that they are slow. I
         | included a couple of these exotic ones in my project,
         | Hoeffding's D and HSIC, mostly out of curiosity.
        
           | kordlessagain wrote:
           | > apparently it's just another name for Jaccard Index
           | 
           | Yes, it produces the same values for set compare:
           | https://www.featurebase.com/blog/tanimoto-similarity-in-
           | feat...
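
Concretely, the Tanimoto coefficient on sets is the Jaccard index:
intersection size over union size. A minimal illustration:

```python
def tanimoto(a, b):
    """Tanimoto coefficient == Jaccard index for sets:
    |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)
```

The two names coincide for set inputs; Tanimoto also has a weighted
generalization for real-valued vectors, which is where the separate
terminology comes from.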
        
       | rvrs wrote:
       | What's wrong with just using Torchserve[1]? We've been using it
       | to serve embedding models in production.
       | 
       | [1] https://pytorch.org/serve/
        
         | eigenvalue wrote:
          | I wanted something that works natively with llama-cpp and
          | langchain, is small and easy to hack on, and caches
          | everything seamlessly in SQLite. I also wanted built-in
          | semantic search with FAISS, string similarity using
          | multiple measures beyond just cosine similarity, and
          | token-level embedding support. There is much more
          | flexibility when you do it yourself to add whatever
          | functionality you want quickly.
        
       | daturkel wrote:
        | This looks really cool. One thing I've wondered about with,
        | e.g., the OpenAI API is whether JSON is really a good format
        | for passing embeddings back and forth. I'd think that passing
        | floats as text over the wire wastes a ton of space that could
        | add up, and might even sacrifice some precision. Would it be
        | better to encode at least the vectors as binary blobs, or
        | else use something like protobuf to handle sending tons of
        | floats around more efficiently?
        
         | eigenvalue wrote:
          | I totally agree when you're talking about a bunch of
          | embeddings at once -- that's why the document-level
          | endpoint (and the token-level embedding endpoint) can
          | optionally return a link to a zip file containing the JSON.
          | For a single embedding, I'm not sure it matters that much,
          | and the extra convenience is nice.
          | 
          | Edit: One other thing is that you can store the JSON in
          | SQLite using the JSON data type and then use the nice
          | querying constructs directly at the database level, which
          | is handy for the token-level and document embeddings. This
          | is built into my project.
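
The database-level JSON querying described here can be sketched with
the standard-library sqlite3 module and SQLite's JSON1 functions.
The table and column names below are made up for illustration, not
the project's actual schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (input_text TEXT, vector JSON)")
conn.execute(
    "INSERT INTO embeddings VALUES (?, ?)",
    ("hello world", json.dumps([0.1, 0.2, 0.3])),
)

# Query the stored JSON directly at the database level, without
# deserializing the whole vector in application code:
dim = conn.execute(
    "SELECT json_array_length(vector) FROM embeddings"
).fetchone()[0]
first_component = conn.execute(
    "SELECT json_extract(vector, '$[0]') FROM embeddings"
).fetchone()[0]
```

This requires SQLite built with the JSON1 extension, which the
Python builds bundled in recent years include by default.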
        
         | zh217 wrote:
         | OpenAI's embedding API has an undocumented flag
         | 'encoding_format': 'base64' which will give you base64-encoded
         | raw bytes of little-endian float32. As it is used by the
         | official python client, it is unlikely to go away.
        
       ___________________________________________________________________
       (page generated 2023-08-15 23:00 UTC)