[HN Gopher] You probably shouldn't use OpenAI's embeddings
       ___________________________________________________________________
        
       You probably shouldn't use OpenAI's embeddings
        
       Author : diego
       Score  : 33 points
       Date   : 2023-03-30 19:49 UTC (3 hours ago)
        
 (HTM) web link (iamnotarobot.substack.com)
 (TXT) w3m dump (iamnotarobot.substack.com)
        
       | nico wrote:
       | Is someone doing embeddings<>embeddings mapping?
       | 
       | For example, mapping embeddings of Llama to GPT-3?
       | 
        | That way you can see how similarly the models "understand the
        | world".
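        | 
        | One rough way to check (just a sketch; the arrays below are
        | random stand-ins for real paired embeddings of the same texts,
        | and the dimensions are only examples) is to fit a least-squares
        | linear map between the two spaces:
        | 
        |     # Sketch: learn a linear map from one embedding space to
        |     # another using paired embeddings of the same texts.
        |     # The matrices here are random placeholders.
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     A = rng.normal(size=(1000, 4096))  # model 1 embeddings
        |     B = rng.normal(size=(1000, 1536))  # model 2, same texts
        | 
        |     # W minimizes ||A @ W - B||; new model-1 vectors map
        |     # over as v @ W.
        |     W, *_ = np.linalg.lstsq(A, B, rcond=None)
        | 
        |     # Check alignment: cosine between mapped and target vectors.
        |     pred = A @ W
        |     cos = (pred * B).sum(axis=1) / (
        |         np.linalg.norm(pred, axis=1) *
        |         np.linalg.norm(B, axis=1))
        |     print("mean cosine:", cos.mean())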
        
         | fzliu wrote:
         | I'm curious about this as well. There are potentially many
         | different versions of embedding models used in production and
         | correlating different versions together could be very
         | important.
        
         | sroussey wrote:
         | I'd be interested to see this as well. I guess you can make a
         | test and see what happens. Ping me if you do!
        
       | breckenedge wrote:
       | It's fine to use their embeddings for a proof of concept, but
       | since you don't own it, you probably shouldn't rely on it because
       | it could go away at any time.
        
       | mustacheemperor wrote:
       | Could anyone point me towards a relatively beginner-friendly
       | guide to do something like
       | 
       | >download all my tweets (about 20k) and build a semantic searcher
       | on top ?
       | 
        | How can I utilize third-party embeddings with OpenAI's LLM API? Am
        | I correct in understanding from this article that this is possible?
        
         | sroussey wrote:
          | I have a system that downloads all my data from Google,
          | Facebook, Twitter, and others. Geo data is fun to look at, but
          | now there's more meaning to glean from the text and images. I'm
          | thinking about going back to it. Not sure how to package a
          | bunch of Python stuff in an app, though.
        
         | diego wrote:
         | That's exactly what I did here.
         | https://github.com/dbasch/semantic-search-tweets
        
           | mustacheemperor wrote:
           | Thank you! Comparing this and the link the other commenter
           | posted, what handles the actual search querying? Does
           | instructor-xl include an LLM in addition to the embeddings?
           | The other commenter's repo uses Pinecone for the embeddings
           | and OpenAI for the LLM.
           | 
            | My apologies if I am completely mangling the vocabulary here
            | - I have, at best, a rudimentary understanding of this stuff
            | and am trying to hack together an education on it.
           | 
           | Edit: If you're at the SF meetup tomorrow, I'd happily buy
           | you a beverage in return for this explanation :)
        
             | eternalban wrote:
             | It's in the repo:
             | 
              | You first create embeddings. What are they? Each tweet is
              | mapped to an n-dimensional vector, so your corpus ends up
              | 'embedded' as points in an n-dimensional vector space. The
              | vectorization is supposed to preserve 'semantic distance':
              | if two pieces of text are close in meaning or related (say,
              | by appearing in similar contexts in the training corpus),
              | their vectors should be 'close' along some of those n
              | dimensions as well. The end result is the '.bin' file, the
              | 'semantic model' of your corpus.
             | 
             | https://github.com/dbasch/semantic-search-
             | tweets/blob/main/e...
             | 
              | For semantic search, you run the same embedding model on
              | the query, then do a similarity search between the query
              | vector and the corpus vectors via matrix ops, which yields
              | a ranked result set with similarity scores. Those results
              | point back to the original source, here the tweets, and you
              | just print the tweet(s) you select from that result set
              | (here the top 10).
             | 
             | https://github.com/dbasch/semantic-search-
             | tweets/blob/main/s...
             | 
              | Experts can chime in here, but there are knobs such as
              | 'batch size' and the similarity function used for scoring
              | (cosine similarity was used here).
             | 
              | So the cost profile of the process should also be clear:
              | there is a one-time, fixed cost to embed your data, plus a
              | per-query cost to embed the query and run the similarity
              | search to produce the result set.
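              | 
              | A minimal sketch of the same flow (not the repo's exact
              | code; it assumes the sentence-transformers package, with
              | all-MiniLM-L6-v2 as an example model):
              | 
              |     # Sketch: embed a corpus once, then answer queries
              |     # with cosine similarity. Assumes sentence-transformers
              |     # is installed; all-MiniLM-L6-v2 is only an example.
              |     import numpy as np
              |     from sentence_transformers import SentenceTransformer
              | 
              |     tweets = ["shipped a new release today",
              |               "coffee is life",
              |               "vector search is neat"]
              |     model = SentenceTransformer("all-MiniLM-L6-v2")
              | 
              |     # Fixed cost: embed the corpus once
              |     # (the '.bin' step above).
              |     corpus = model.encode(tweets,
              |                           normalize_embeddings=True)
              | 
              |     def search(query, top_k=2):
              |         # Per-query cost: embed the query, then score
              |         # it against every tweet in the corpus.
              |         q = model.encode([query],
              |                          normalize_embeddings=True)[0]
              |         # cosine similarity (vectors are normalized)
              |         scores = corpus @ q
              |         best = np.argsort(-scores)[:top_k]
              |         return [(tweets[i], float(scores[i]))
              |                 for i in best]
              | 
              |     print(search("new software version"))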
        
               | mustacheemperor wrote:
               | Thank you for this walkthrough, and for citing the code
               | alongside!
        
         | celestialcheese wrote:
          | langchain and llama-index are two big open-source projects
          | that are great for building this type of thing.
         | 
         | https://github.com/mayooear/gpt4-pdf-chatbot-langchain for
         | example
        
           | mustacheemperor wrote:
           | Cheers, thank you!
        
         | devxpy wrote:
         | https://gooey.ai/doc-search/?example_id=8ls7dpf6
         | 
         | No code needed :)
        
       | nomadiccoder wrote:
        | The availability heat map trends upward: 98.58% (Jan), 99.07%
        | (Feb), and 99.71% (Mar).
        
       | fzliu wrote:
       | I've done some quick-and-dirty testing with OpenAI's embedding
       | API + Zilliz Cloud. The 1st gen embeddings leave something to be
       | desired (https://medium.com/@nils_reimers/openai-gpt-3-text-
       | embedding...), but the 2nd gen embeddings are actually fairly
       | performant relative to many open source models with MLM loss.
       | 
       | I'll have to dig out the notebook that I created for this, but
       | I'll try to post it here once I find it.
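        | 
        | For reference, this is roughly what pulling the 2nd-gen
        | embeddings looks like with the pre-1.0 openai Python package
        | (the input strings are just placeholders):
        | 
        |     # Rough sketch using the pre-1.0 openai package
        |     # (openai.Embedding); expects OPENAI_API_KEY to be set.
        |     import openai
        | 
        |     resp = openai.Embedding.create(
        |         model="text-embedding-ada-002",
        |         input=["the quick brown fox",
        |                "a fast auburn canine"],
        |     )
        |     vectors = [d["embedding"] for d in resp["data"]]
        |     # ada-002 vectors are 1536-dimensional
        |     print(len(vectors), len(vectors[0]))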
        
       | celestialcheese wrote:
        | Very interested in this - I've been using embeddings / semantic
        | search for information retrieval from PDFs with ada-002, and have
        | been impressed by the results in testing.
       | 
        | The reasons the article listed, namely a) lock-in and b) cost,
        | have given me pause about embedding our whole corpus of data. I'd
        | much rather use an open model, but I don't have much experience
        | evaluating these embedding models and their search performance -
        | it's still very new to me.
       | 
        | Like what you did with ada-002 vs Instructor XL, have there been
        | any papers or prior work evaluating the different embedding
        | models?
        
         | VHRanger wrote:
         | You can find some comparisons and evaluation datasets/tasks
         | here: https://www.sbert.net/docs/pretrained_models.html
         | 
         | Generally MiniLM is a good baseline. For faster models you want
         | this library:
         | 
         | https://github.com/oborchers/Fast_Sentence_Embeddings
         | 
         | For higher quality ones, just take the bigger/slower models in
         | the SentenceTransformers library
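          | 
          | If you just want to eyeball a couple of them before running a
          | proper benchmark, something like this works (all-MiniLM-L6-v2
          | and all-mpnet-base-v2 are examples from that pretrained list):
          | 
          |     # Quick-and-dirty comparison of two SentenceTransformers
          |     # models on a toy retrieval check; model names are
          |     # examples from the pretrained list linked above.
          |     from sentence_transformers import (SentenceTransformer,
          |                                        util)
          | 
          |     docs = [
          |         "The invoice is due at the end of the month.",
          |         "Our cat knocked the plant off the windowsill.",
          |         "Payment terms are net 30 from the billing date.",
          |     ]
          |     query = "When do I have to pay this bill?"
          | 
          |     for name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
          |         model = SentenceTransformer(name)
          |         scores = util.cos_sim(model.encode(query),
          |                               model.encode(docs))[0]
          |         ranked = sorted(zip(docs, scores.tolist()),
          |                         key=lambda p: -p[1])
          |         print(name, "->", ranked[0][0])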
        
           | sroussey wrote:
            | Are there performance comparisons for Apple Silicon machines?
        
             | VHRanger wrote:
             | Performance in terms of model quality would be the same.
             | 
              | The fast-se library averages word embeddings with C++ code
              | to generate sentence embeddings, so it should be similarly
              | fast, or faster, on Apple Silicon than on x86.
             | 
              | For the SentenceTransformer library models I'm not sure,
              | but I think they would run on the CPU on an M1/M2 machine.
        
       ___________________________________________________________________
       (page generated 2023-03-30 23:01 UTC)