[HN Gopher] You probably shouldn't use OpenAI's embeddings
___________________________________________________________________
You probably shouldn't use OpenAI's embeddings
Author : diego
Score : 33 points
Date : 2023-03-30 19:49 UTC (3 hours ago)
(HTM) web link (iamnotarobot.substack.com)
(TXT) w3m dump (iamnotarobot.substack.com)
| nico wrote:
| Is someone doing embeddings<>embeddings mapping?
|
| For example, mapping embeddings of Llama to GPT-3?
|
| That way you can see how similarly the models "understand
| the world".
| fzliu wrote:
| I'm curious about this as well. There are potentially many
| different versions of embedding models used in production and
| correlating different versions together could be very
| important.
| sroussey wrote:
| I'd be interested to see this as well. I guess you can make a
| test and see what happens. Ping me if you do!
| breckenedge wrote:
| It's fine to use their embeddings for a proof of concept, but
| since you don't own it, you probably shouldn't rely on it because
| it could go away at any time.
| mustacheemperor wrote:
| Could anyone point me towards a relatively beginner-friendly
| guide to do something like
|
| >download all my tweets (about 20k) and build a semantic
| searcher on top?
|
| How can I utilize 3rd-party embeddings with OpenAI's LLM API?
| Am I correct to understand from this article that this is
| possible?
| sroussey wrote:
| I have a system that downloads all my data from Google,
| Facebook, Twitter, and others. Geo data is fun to look at, but
| now the text and images have some more meaning to glean. I'm
| thinking about going back to it. Not sure how to package a
| bunch of Python stuff in an app though.
| diego wrote:
| That's exactly what I did here.
| https://github.com/dbasch/semantic-search-tweets
| mustacheemperor wrote:
| Thank you! Comparing this and the link the other commenter
| posted, what handles the actual search querying? Does
| instructor-xl include an LLM in addition to the embeddings?
| The other commenter's repo uses Pinecone for the embeddings
| and OpenAI for the LLM.
|
| My apologies if I am completely mangling the vocabulary here
| - I have an, at best, rudimentary understanding of this stuff
| that I am trying to hack my education on.
|
| Edit: If you're at the SF meetup tomorrow, I'd happily buy
| you a beverage in return for this explanation :)
| eternalban wrote:
| It's in the repo:
|
| You first create embeddings. What is this? Your tweets are
| 'embedded' as n-dimensional vectors in a vector space. The
| vectorization is supposed to maintain 'semantic distance':
| if two pieces of text are close in meaning or related (by,
| say, frequently appearing next to each other in the corpus),
| they should be 'close' along some of those n dimensions as
| well. The result at the end is the '.bin' file, the
| 'semantic model' of your corpus.
|
| https://github.com/dbasch/semantic-search-tweets/blob/main/e...
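|
| A minimal sketch of that first step (the model choice and
| file names here are assumptions, not the repo's actual
| code), using the sentence-transformers library:
|
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     # Assumed input: one tweet per line in a plain text file.
|     with open("tweets.txt") as f:
|         tweets = [line.strip() for line in f if line.strip()]
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     embeddings = model.encode(tweets, batch_size=64,
|                               show_progress_bar=True)
|
|     # Persist the 'semantic model' of the corpus, analogous
|     # to the .bin file described above.
|     np.save("tweet_embeddings.npy", embeddings)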
|
| For semantic search, you run the same embedding algorithm
| against the query, take the resulting vector, and do a
| similarity search via matrix ops. That produces a set of
| results with similarity scores. These point back to the
| original source, here the tweets, and you just print the
| tweet(s) that you select from that result set (here the top
| 10).
|
| https://github.com/dbasch/semantic-search-tweets/blob/main/s...
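|
| As a minimal sketch of that query side (again, file names
| and model choice are assumptions, not the repo's actual
| code): embed the query the same way, score it against every
| stored vector with cosine similarity, and print the top 10.
|
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     tweets = [l.strip() for l in open("tweets.txt") if l.strip()]
|     embeddings = np.load("tweet_embeddings.npy")  # from above
|
|     query_vec = model.encode(["what did I say about startups?"])[0]
|
|     # Cosine similarity = dot product of L2-normalized vectors.
|     emb_norm = embeddings / np.linalg.norm(embeddings, axis=1,
|                                            keepdims=True)
|     q_norm = query_vec / np.linalg.norm(query_vec)
|     scores = emb_norm @ q_norm
|
|     # Print the top-10 most similar tweets with their scores.
|     for idx in np.argsort(scores)[::-1][:10]:
|         print(f"{scores[idx]:.3f}  {tweets[idx]}")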
|
| Experts can chime in here, but there are knobs such as
| 'batch size' and the similarity function you use to index
| (cosine was used here).
|
| So the various cost dimensions of the process should also
| be clear: a fixed, one-time cost of embedding your data,
| plus a per-query cost of embedding the query and running
| the similarity search to find the result set.
| mustacheemperor wrote:
| Thank you for this walkthrough, and for citing the code
| alongside!
| celestialcheese wrote:
| langchain and llama-index are two big open-source projects
| which are great for building this type of thing.
|
| https://github.com/mayooear/gpt4-pdf-chatbot-langchain for
| example
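|
| A minimal sketch with the 2023-era langchain API, assuming
| faiss-cpu and sentence-transformers are installed (the texts
| and query are placeholders):
|
|     from langchain.embeddings import HuggingFaceEmbeddings
|     from langchain.vectorstores import FAISS
|
|     # Local open-source embeddings instead of OpenAI's.
|     emb = HuggingFaceEmbeddings(
|         model_name="sentence-transformers/all-MiniLM-L6-v2")
|
|     # Build an in-memory vector index and query it.
|     db = FAISS.from_texts(["tweet one", "tweet two"], emb)
|     print(db.similarity_search("what was tweet one?", k=1))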
| mustacheemperor wrote:
| Cheers, thank you!
| devxpy wrote:
| https://gooey.ai/doc-search/?example_id=8ls7dpf6
|
| No code needed :)
| nomadiccoder wrote:
| The heat map of availability times, 98.58% (Jan), 99.07% (Feb),
| and 99.71% (Mar), trends upwards.
| fzliu wrote:
| I've done some quick-and-dirty testing with OpenAI's embedding
| API + Zilliz Cloud. The 1st gen embeddings leave something to be
| desired (https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...),
| but the 2nd gen embeddings are actually fairly
| performant relative to many open source models with MLM loss.
|
| I'll have to dig out the notebook that I created for this, but
| I'll try to post it here once I find it.
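|
| For reference, a minimal sketch of pulling the 2nd gen
| (ada-002) embeddings with the 2023-era openai Python client
| (the inputs are placeholders):
|
|     import openai  # pip install openai (0.x client)
|
|     openai.api_key = "sk-..."  # your key here
|
|     resp = openai.Embedding.create(
|         model="text-embedding-ada-002",
|         input=["first test sentence", "second test sentence"],
|     )
|     vectors = [d["embedding"] for d in resp["data"]]
|     print(len(vectors), len(vectors[0]))  # 2 x 1536 dims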
| celestialcheese wrote:
| Very interested in this - I've been using embeddings / semantic
| search for information retrieval from PDFs, using ada-002, and
| have been impressed by the results in testing.
|
| The reasons the article listed, namely a) lock-in and b) cost,
| have given me pause with embedding our whole corpus of data. I'd
| much rather use an open model but don't have much experience in
| evaluating these embedding models and search performance - still
| very new to me.
|
| Like what you did with ada-002 vs Instructor XL, have there
| been any papers or prior work evaluating the different
| embedding models?
| VHRanger wrote:
| You can find some comparisons and evaluation datasets/tasks
| here: https://www.sbert.net/docs/pretrained_models.html
|
| Generally MiniLM is a good baseline. For faster models you want
| this library:
|
| https://github.com/oborchers/Fast_Sentence_Embeddings
|
| For higher-quality ones, just take the bigger/slower models in
| the SentenceTransformers library.
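|
| One concrete harness for this kind of evaluation is MTEB
| (not mentioned above, so treat it as a pointer rather than
| a recommendation from this thread); a minimal sketch:
|
|     from mteb import MTEB
|     from sentence_transformers import SentenceTransformer
|
|     # Evaluate the MiniLM baseline on one benchmark task;
|     # results are written to the output folder as JSON.
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     evaluation = MTEB(tasks=["Banking77Classification"])
|     evaluation.run(model, output_folder="results/minilm")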
| sroussey wrote:
| Are there performance comparisons for Apple Silicon machines?
| VHRanger wrote:
| Performance in terms of model quality would be the same.
|
| The fast-se library uses C++ code and averages word
| embeddings to generate sentence embeddings, so it would be
| similarly fast on Apple Silicon, or faster than on x86.
|
| For the SentenceTransformer library models I'm not sure,
| but I think they would run on the CPU on an M1/M2 machine.
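|
| If you want to try the GPU path on Apple Silicon, PyTorch's
| MPS backend can be selected explicitly; whether that is
| actually faster for a given model is something to measure.
| A minimal sketch:
|
|     import torch
|     from sentence_transformers import SentenceTransformer
|
|     # Use the Metal (MPS) backend when available, else CPU.
|     device = "mps" if torch.backends.mps.is_available() else "cpu"
|     model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
|
|     emb = model.encode(["hello from an M1/M2"])
|     print(device, emb.shape)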
___________________________________________________________________
(page generated 2023-03-30 23:01 UTC)