[HN Gopher] Jina AI launches open-source 8k text embedding
       ___________________________________________________________________
        
       Jina AI launches open-source 8k text embedding
        
       Author : artex_xh
       Score  : 511 points
       Date   : 2023-10-26 00:24 UTC (22 hours ago)
        
 (HTM) web link (jina.ai)
 (TXT) w3m dump (jina.ai)
        
       | burcs wrote:
       | This is great news!
       | 
        | It feels like open source is closing the gap with "Open"AI, which
        | is really exciting, and it's accelerating towards parity faster
        | than the closed-source models are advancing. Maybe it's wishful
        | thinking though?
        
         | udev4096 wrote:
         | Is it tho? It's not really open source if they don't give us
         | the information regarding training datasets
        
           | jerpint wrote:
           | It definitely is open source even if they don't disclose all
           | details behind the training
        
             | SOLAR_FIELDS wrote:
             | The very definition of what constitutes open source is
             | being called into question in these kinds of discussions
             | about AI. Without the training details and the weights
             | being made fully open it's hard to really call something
             | truly open, even if it happens to meet some arbitrary
             | definition of "open source".
             | 
             | A good definition of "truly open" is whether the exact same
             | results can be reproduced by someone with no extra
             | information from only what has been made available. If that
             | is not possible, because the reproduction methodology is
             | closed (a common reason, like in this case) then what has
             | been made available is not truly open.
             | 
              | We can sit here and technically argue whether or not the
              | subject matter violates some arbitrary "open source"
              | definition, but it still doesn't change the fact that it's
              | not truly open in spirit.
        
               | okaram wrote:
               | Notice you are creating your own arbitrary definition of
               | 'truly open', which IMHO corresponds more with
               | 'reproducible'.
               | 
               | We already have a definition of open source. I don't see
               | any reason to change it.
        
               | losteric wrote:
               | The inference runtime software is open, the weights are
               | an opaque binary. Publishing the training data,
               | hyperparameters, process, etc - that would make the whole
               | thing "open source".
        
               | magicalhippo wrote:
               | The quake engine is still open source even though it
               | doesn't come with the quake game assets, no?
               | 
               | It seems unreasonable to require the training data just
               | to be called open source, given it has similar copyright
               | challenges as game assets.
               | 
               | Of course, this wouldn't make the model reproducible. But
               | that's different from open source.
        
               | darkwater wrote:
               | Good example. And in fact you are calling the "engine"
                | open source, not the whole Quake game. The "assets" in
                | most "open source" AI models are not available.
        
               | EGreg wrote:
               | Imagine if the Telegram client was open source but not
               | the backend.
               | 
               | Imagine if Facebook open-sourced their front-end
               | libraries like React but not the back-end.
               | 
               | Imagine if Twitter or Google didn't publish its Algorithm
               | for how they rank things to display to different people.
               | 
               | You don't need to imagine. That's exactly what's
               | happening! Would you call them open source because their
               | front end is open source? Could you host your own back
               | end on your choice of computers?
               | 
               | No. That's why I even started https://qbix.com/platform
        
               | darkwater wrote:
                | I completely agree with you (and the examples you mention
               | are singled out in the "antifeatures" list in F-Droid, to
               | name an example)
        
               | torginus wrote:
               | It's a bit different - here most of the value lies in the
               | weights.
               | 
               | A better analogy would be some graphics card drivers
               | which ship a massive proprietary GPU firmware blob, and a
               | small(ish) kernel shim to talk with said blob.
        
               | magicalhippo wrote:
               | Well perhaps we can consider this a kind of short-
                | sightedness of Stallman. His point with GPL and the free
               | software movement, as I understand it, was to ensure the
               | user could continue to use the software regardless of
               | what the software author decided to do.
               | 
               | Sometimes though the software alone can be near useless
               | without additional assets that aren't necessarily covered
               | by the code license.
               | 
               | Like Quake, having the engine without the assets is
               | useless if what you wanted was to play Quake the game.
               | Neural nets are another prime example, as you mention.
               | Simulators that rely on measured material property
               | databases for usable results also fall into this
               | category, and so on.
               | 
               | So perhaps what we need is new open source licenses that
                | include the assets needed for the user to be able to
               | reasonably use the program as a whole.
        
               | TeMPOraL wrote:
               | Problem is, the literal/default definition of "open
               | source" is meaningless/worthless in this context. It's
               | the weights, training data and methodology that matter
               | for those models - NOT the inference shell.
               | 
               | It's basically like giving people a binary program and
               | calling it open source because the compiler and runtime
               | used are open source.
        
               | jerpint wrote:
                | The weights are the result of training, and what you run
                | at inference time. I can give you all the training
                | details and you still might not be able to reproduce what
                | I did (Google does this all the time). As a dev, I'd much
                | rather have an open model than an open recipe without
                | weights. We can all agree having both is the best-case
                | scenario, but having openly licensed weights is, for me,
                | the bare minimum of open source.
        
               | ekianjo wrote:
                | Weights are like binaries. They are not code. It would
                | make more sense to put them under a Creative Commons
                | license.
        
               | rolisz wrote:
               | Then a lot of stuff is not open source. Have you tried
               | reproducing random GitHub repos, especially in machine
               | learning?
        
               | richardw wrote:
               | So if someone includes images in their project they need
               | to tell you every brush stroke that led to the final
               | image?
               | 
               | All sorts of intangibles end up in open source projects.
               | This isn't a science experiment that needs replication.
               | They're not trying to prove how they came up with the
               | image/code/model.
        
               | xnorswap wrote:
               | Those "Brush Strokes" are effectively the source code. To
               | be considered open source, yes source code needs to be
               | provided along side the binaries (the "image").
        
               | EGreg wrote:
               | It's more like someone giving you an open source front
               | end client, but not giving you a way to host your own
               | backend.
               | 
               | Look into Affero GPL. Images are inert static assets.
               | Here we are talking about _the back end engine_. The fact
                | that neural networks and model weights are a non-von-
                | Neumann architecture doesn't negate the fact that they
               | are _executable code_ and not just static assets!
        
               | abriosi wrote:
               | Imagine someone giving you a executable binary without
               | the source code and calling it "open source"
        
               | jyrkesh wrote:
               | I'm actually mostly in your camp here. But it's
               | complicated with AI.
               | 
               | What if someone gave you a binary and the source code,
               | but not a compiler? Maybe not even a language spec?
               | 
               | Or what if they gave you a binary and the source code and
               | a fully documented language spec, and both of 'em all the
               | way down to the compiler? BUT it only runs on special
               | proprietary silicon? Or maybe even the silicon is fully
               | documented, but producing that silicon is effectively out
               | of reach to all but F100 companies?
               | 
               | It's turtles all the way down...
        
               | krageon wrote:
               | There is the binary (the model) and the source (the thing
               | that allows you to recreate the model, the dataset and
               | methodology). Compilers and how art is made quite simply
                | don't factor in here, because nobody is talking about
               | the compiler layer. Art isn't even close to what is
               | present. Trying to make this more complicated than it is
               | is playing into companies' hands by troubling the waters
               | around what constitutes open source.
        
               | r3trohack3r wrote:
               | To be fair, OpenSource troubled the waters around what
               | constitutes free software.
               | 
               | Free(dom Respecting) Software wasn't just about the
               | source code.
               | 
               | https://www.gnu.org/philosophy/open-source-misses-the-
               | point....
        
               | DougBTX wrote:
               | You can pass in any command line arguments you like, so
               | it must be open source
        
               | otikik wrote:
               | Well the other day on this very website there were some
               | very opinionated voices stating that Open Source is
               | "exclusively what OSI defines". I am not on that camp,
               | more like in yours. To me there's open source and OSI-
               | approved open source. But you will encounter people very
               | set on that other opinion, which I found interesting.
               | 
               | Make no mistake, I am super grateful to OSI for their
               | efforts and most of my code out there uses one of their
               | licenses. I just think they are limited by the
               | circumstances. Some things I consider open are not
               | conforming to their licenses and, like here, some things
               | that conform might not be _really open_.
        
               | m3at wrote:
                | To take another example, would you call a game that has
               | its code and all assets (ex. character sprites) freely
               | available open source? Or would the process that was used
               | to create the assets in the first place also be required
               | to be considered open?
               | 
               | The parallel can be made with model weights being static
               | assets delivered in their completed state.
               | 
                | (I favor the full process being released, especially for
                | scientific reproducibility, but that is another point)
        
               | pjc50 wrote:
               | The old Stallman definition used the phrase "preferred
               | form for modification" rather than the more specific
               | "source code". What do you need to effectively modify an
               | AI model?
        
               | kordlessagain wrote:
               | Usually the datasets, not the source code.
        
             | selcuka wrote:
             | How do you define "source", then?
             | 
             | By this logic any freely downloadable executable software
             | (a.k.a. freeware) is also open source, even though they
             | don't disclose all details on how to build it.
        
               | mogwire wrote:
               | Source would be the way the data is produced so that you
               | can replicate it yourself and make changes.
               | 
               | If I hand you a beer for free that's freeware. If I hand
               | you the recipe and instructions to brew the beer that is
               | open source.
               | 
                | We muddy the waters too much lately and call free-to-use
                | things "open source".
        
               | TeMPOraL wrote:
               | > _If I hand you a beer for free that's freeware. If I
               | hand you the recipe and instructions to brew the beer
               | that is open source._
               | 
               | Yeah, but what those "open source" models are is like you
               | handing me a bottle of beer, plus the instructions to
                | _make the glass bottle_. You're open-sourcing something,
               | just not the part that matters. It's not "open source
               | beer", it's "beer in an open-source bottle". In the same
               | fashion, those models aren't open source - they're closed
               | models inside a tiny open-source inference script.
        
               | imranhou wrote:
               | Perhaps one more thing that is missing in context is that
               | I'm also getting the right to alter that beer by adding
               | anything I like to it and redistributing it, without
               | knowing its true recipe.
        
             | szundi wrote:
             | Interesting as the literal source of the result is not open
        
               | EGreg wrote:
               | People need to realize something...
               | 
               | The model weights in eg TensorFlow _are the source code_.
               | 
               | It is not a von-Neumann architecture but a gigabyte of
               | model weights is the executable part, no less than a
               | gigabyte of imperative code.
               | 
               | Now, the training of the model is akin to the process of
               | writing the code. In classical imperative languages that
               | code may be such spaghetti code that each part would be
               | intertwined with 40 others, so you can't _just_ modify
               | something easily.
               | 
               | So the fact that you can't modify the code is Freedom 2
               | or whatever. But at least you have Freedom 0 of hosting
               | the model where _You_ want and not getting charged for it
               | an exorbitant amount or getting cut off, or having the
               | model change out from under you via RLHF for political
                | correctness or whatever.
               | 
                | OpenAI has not even met Freedom Zero of the FSF's or
                | OSI's definition. But others can.
        
               | simonw wrote:
               | That doesn't work for me.
               | 
               | The model weights aren't source code. They are the binary
               | result of compiling that source code.
               | 
               | The source code is the combination of the training data
               | and configuration of model architecture that runs against
               | it.
               | 
               | The model architecture could be considered the compiler.
               | 
               | If you give me gcc and your C code I can compile the
               | binary myself.
               | 
               | If you give me your training data and code that
               | implements your model architecture, I can run those to
               | compile the model weights myself.
        
               | EGreg wrote:
               | No, you would need to spend "eye watering amounts of
               | compute" to do it, similar to hiring a lot of developers
               | to produce the code. The compiling of the code to an
               | executable format is a tiny part of that cost.
        
               | simonw wrote:
               | I still think of millions of dollars of GPU spend
               | crunching away for a month as a compiler.
               | 
               | A very slow, very expensive compiler - but it's still
               | taking the source code (the training material and model
               | architecture) and compiling that into a binary executable
               | (the model).
               | 
               | Maybe it helps to think about this at a much smaller
               | scale. There are plenty of interesting machine learning
               | models which can be trained on a laptop in a few seconds
               | (or a few minutes). That process feels very much like a
               | compiler - takes less time to compile than a lot of large
               | C++ projects.
               | 
               | Running on a GPU cluster for a month is the exact same
               | process, just scaled up.
               | 
               | Huge projects like Microsoft Windows take hours to
               | compile and that process often runs on expensive
               | clusters, but it's still considered compilation.
        
               | EGreg wrote:
               | Actually, the dirty secret is that a lot of human work
               | (at below minimum wage) went into training and refining
               | the AI models:
               | 
               | https://time.com/6247678/openai-chatgpt-kenya-workers/
               | 
               | And billion-dollar companies made their money off it:
               | 
               | https://www.forbes.com/sites/kenrickcai/2023/04/11/how-
               | alexa...
               | 
               | That's the dirty secret of why ChatGPT 4 is better. But
               | they'll tell you it has to do with chaining ChatGPT 3's
               | together, more fine tuning etc. They go to these poor
               | countries and recruit people to work on training the AI.
               | 
               | Not to mention all the uncompensated work of humans
               | around the world who put their content up on the Web.
        
         | Gasp0de wrote:
         | They compare it to OpenAI's ada model though, which is light-
         | years away from ChatGPT.
        
           | infecto wrote:
           | Does that not conflate two different things though? Embedding
           | model != LLM Model ?
        
           | simonw wrote:
            | Don't confuse the current Ada embedding model with the old
            | Ada GPT3 model.
           | 
            | It turns out OpenAI have used the name "Ada" for several very
            | different things: they went through a phase of giving
            | everything Ada/Babbage/Curie/DaVinci names, using the A/B/C/D
            | ordering to indicate which of their models were largest.
        
         | infecto wrote:
          | Wishful thinking? Embeddings to me were never the interesting
          | or bleeding-edge thing at OpenAI. Maybe the various ada models
          | reigned supreme at one point, but there have been open-source
          | models at the top of the leaderboard for a while, and from a
          | cost/performance perspective even the BERT models often did a
          | really fine job.
        
       | omneity wrote:
       | Impressive work.
       | 
       | I wonder what would be the best way to use 8k embeddings. It's a
       | lot of information to keep in a vector, so things like
       | "precision" of the embedding space and its ability to distinguish
       | very similar large documents will be key.
       | 
       | Maybe it can be useful for coarse similarity matching, for
       | example to detect plagiarism?
        
         | sroussey wrote:
          | 8K is the context length. Their vector dimension size is
          | actually much smaller, which is great for a number of use
          | cases, though maybe not the ones you are thinking about.
        
           | omneity wrote:
           | Yes that's also how I understood it. Maybe it was ambiguously
           | expressed, but I mean "8k tokens as input is a lot of
           | information to encode"
        
       | andrewstuart wrote:
       | Anyone got links to examples of text embedding?
        
         | BoorishBears wrote:
         | Easiest example is taking three words: Universe, University,
         | College.
         | 
         | - University and Universe are similar alphabetically.
         | 
         | - University and College are similar in meaning.
         | 
         | Take embeddings for those three words and `University` will be
         | near `College`, while `Universe` will be further away, because
         | embeddings capture meaning:
         | 
         | University<-->College<-------------->Universe
         | 
         | _
         | 
         | With old school search you'd need to handle the special case of
         | treating University and College as similar, but embeddings
         | already handle it.
         | 
         | With embeddings you can do math to find how similar two results
         | are, based on how close their vectors are. The closer the
         | embeddings, the closer the meaning.
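          | 
          | A minimal sketch of that comparison in Python (assuming the
          | sentence-transformers package, with an arbitrary small model;
          | any embedding model works the same way):
          | 
          |     from sentence_transformers import SentenceTransformer, util
          |     
          |     model = SentenceTransformer("all-MiniLM-L6-v2")
          |     emb = model.encode(["University", "College", "Universe"])
          |     
          |     # cosine similarity: closer to 1.0 = closer in meaning
          |     print(util.cos_sim(emb[0], emb[1]))  # University/College: high
          |     print(util.cos_sim(emb[0], emb[2]))  # University/Universe: lower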
        
           | osigurdson wrote:
           | Another interesting point is that math can be performed on
           | embedding vectors: emb("king") - emb("man") + emb("woman") =
           | emb("queen").
        
             | minimaxir wrote:
              | That's a property of Word2Vec _specifically_ due to how
              | it's trained (a shallow network where most of the "logic"
             | would be contained within the embeddings themselves). Using
             | it for embeddings generated from LLMs or Embedding layers
             | will not give as fun results; in practice the only thing
             | you can do is average or cluster them.
        
               | TeMPOraL wrote:
                | > _That's a property of Word2Vec specifically due to how
               | it's trained (a shallow network where most of the "logic"
               | would be contained within the embeddings themselves)._
               | 
               | Is it though? I thought the LLM-based embeddings are
               | _even more fun_ for this, as you have many more
               | interesting directions to move in. I.e. not just:
               | 
               | emb("king") - emb("man") + emb("woman") = emb("queen")
               | 
               | But also e.g.:
               | 
                | emb(<insert a couple paragraph long positive book
                | review>) + a*v(sad) + b*v(short) - c*v(positive) =
                | emb(<a single paragraph, negative and depressing review>)
               | 
               | Where a, b, c are some constants to tweak, and v(X) is a
               | vector for quality X, which you can get by embedding a
               | bunch of texts expressing the quality X and averaging
               | them out (or doing some other dimensional reduction
               | trickery).
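                | 
                | Concretely, v(X) could be built something like this (a
                | rough numpy sketch; embed() stands in for whatever
                | embedding model you use):
                | 
                |     import numpy as np
                |     
                |     def quality_vector(embed, examples):
                |         # average embeddings of texts expressing quality
                |         # X, normalized to get a direction v(X)
                |         vecs = np.array([embed(t) for t in examples])
                |         v = vecs.mean(axis=0)
                |         return v / np.linalg.norm(v)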
               | 
               | I've suggested this on HN some time ago, but only been
               | told that I'm confused and the idea is not even wrong.
               | But then, there was this talk on some AI conference
               | recently[0], where the speaker demonstrated exactly this
               | kind of latent space translations of text in a language
               | model.
               | 
               | --
               | 
               | [0] -
               | https://www.youtube.com/watch?v=veShHxQYPzo&t=13980s -
               | "The Hidden Life of Embeddings", by Linus Lee from
               | Notion.
        
               | simonw wrote:
               | That talk used a novel embeddings model trained by the
               | speaker which does exhibit this kind of property - but
               | that was a new (extremely cool) thing, not something that
               | other embeddings models can do.
        
         | RossBencina wrote:
         | OpenAI have a brief explainer with a bunch of example use cases
         | here:
         | 
         | https://platform.openai.com/docs/guides/embeddings/what-are-...
        
       | Nitrolo wrote:
       | Is there something like oobabooga to easily run this in a click-
       | and-run way? Where I can load up a model, a text, and ask it
       | questions?
        
         | brucethemoose2 wrote:
         | iirc ooba has its own integrated vectordb called superbooga.
         | 
         | I bet you could hack this in.
        
         | simonw wrote:
         | See my comment here:
         | https://news.ycombinator.com/item?id=38020655 for a CLI tool
         | that lets you do this.
         | 
         | Note that embedding models are a different kind of thing from a
         | Large Language Model, so it's not the kind of model you can ask
         | questions.
         | 
         | It's a model which can take text and turn it into an array of
         | floating point numbers, which you can then use to implement
         | things like semantic search and related documents.
         | 
         | More on that here:
         | https://simonwillison.net/2023/Oct/23/embeddings/
        
         | minimaxir wrote:
          | The Hugging Face page for the model has a two-line load-and-
          | encode Python code demo:
          | https://huggingface.co/jinaai/jina-embeddings-v2-base-en
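          | 
          | From memory it's roughly this (treat it as a sketch rather
          | than the exact model card code):
          | 
          |     from transformers import AutoModel
          |     
          |     # trust_remote_code: the repo ships custom model code
          |     model = AutoModel.from_pretrained(
          |         "jinaai/jina-embeddings-v2-base-en",
          |         trust_remote_code=True)
          |     embeddings = model.encode(["How is the weather today?"])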
        
       | sroussey wrote:
       | Does anyone know what they are using for this comparison and
       | ranking? And where does instruct-xl stand in the mix?
        
         | sroussey wrote:
         | Oh duh, it's right in the post and instructor-xl is number 9.
         | And so many new participants now!
        
           | sroussey wrote:
            | The rankings are here:
           | 
           | https://huggingface.co/spaces/mteb/leaderboard
           | 
           | It's amazing how many new and better ones there are since I
           | last looked a few months ago. Instructor-xl was number 1, now
           | it is number 9, and its size is more than 10x the number 2
           | ranked!
           | 
           | Things move fast!
        
       | RossBencina wrote:
       | Some relevant stats from the link:
       | 
       | 8192 token input sequence length
       | 
       | 768 embedding dimensions
       | 
       | 0.27GB model (with 0.07GB model also available)
       | 
       | Tokeniser: BertTokenizer [1], 30528 token vocab [2]
       | 
       | Is an 8K sequence length directly comparable to text-embedding-
       | ada-002 if the vocabulary is much smaller? I seem to remember its
       | tokeniser has a larger vocabulary.
       | 
       | [1] https://huggingface.co/jinaai/jina-embeddings-v2-base-
       | en/blo...
       | 
       | [2] https://huggingface.co/jinaai/jina-embeddings-v2-base-
       | en/blo...
        
         | LoganDark wrote:
         | > Is an 8K sequence length directly comparable to text-
         | embedding-ada-002 if the vocabulary is much smaller? I seem to
         | remember its tokeniser has a larger vocabulary.
         | 
         | Words that aren't in the vocabulary can still be represented by
         | multiple tokens. Some models can input and output valid UTF-8
         | at the byte level (rather than needing a unique token for each
         | codepoint). For example RWKV-World.
        
           | space_fountain wrote:
           | A large vocabulary means less tokens are needed to represent
           | the same information
        
             | HPMOR wrote:
             | *fewer
             | 
             | Less is used for qualitative data like "I love him less".
             | Whereas fewer is used for countable things like "I need
             | fewer tokens."
        
               | scubbo wrote:
               | Username checks out.
        
             | LoganDark wrote:
             | Thanks.
        
         | DavidSJ wrote:
         | A uniform distribution over 30528 tokens is just under 15 bits
         | of information per token, whereas a vocabulary size of ~60000
         | would be just under 16 bits per token. In practice it's not
         | uniform, but this shows that they're in the same ballpark.
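          | 
          | (The arithmetic, if you want to check it:
          | 
          |     import math
          |     print(math.log2(30528))  # ~14.9 bits per token
          |     print(math.log2(60000))  # ~15.9 bits per token
          | 
          | so roughly one bit per token of difference.)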
        
         | rajin112 wrote:
          | Thanks. What size GPU would you need to fine-tune or run
          | inference?
        
       | jncraton wrote:
       | This is great to see. It looks like the size of the embedding
       | vector is half the size of text-embedding-ada-002 (768 vs 1536)
       | while providing competitive performance. This will save space in
       | databases and make lookups somewhat faster.
       | 
       | For those unaware, if 512 tokens of context is sufficient for
       | your use case, there are already many options that outperform
       | text-embedding-ada-002 on common benchmarks:
       | 
       | https://huggingface.co/spaces/mteb/leaderboard
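        | 
        | For scale, a back-of-envelope on storage (float32, one million
        | vectors; sizes are illustrative):
        | 
        |     n = 1_000_000
        |     print(n * 1536 * 4 / 1e9)  # ada-002 dims: ~6.1 GB
        |     print(n * 768 * 4 / 1e9)   # 768 dims:     ~3.1 GB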
        
         | minimaxir wrote:
         | The 768D-sized embeddings compared to OpenAI's 1536D embeddings
         | are actually a feature outside of index size.
         | 
         | In my experience, OpenAI's embeddings are overspecified and do
         | very poorly with cosine similarity out of the box as they match
         | syntax more than semantic meaning (which is important as that's
         | the metric for RAG). Ideally you'd want cosine similarity in
         | the range of [-1, 1] on a variety of data but in my experience
         | the results are [0.6, 0.8].
        
           | karxxm wrote:
           | You wrote ,,out of the box", did you find a way to improve
           | this?
        
             | teaearlgraycold wrote:
             | You can do PCA or some other dimensionality reduction
             | technique. That'll reduce computation and improve
             | signal/noise ratio when comparing vectors.
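              | 
              | E.g. with scikit-learn (a sketch; the target dimension
              | here is arbitrary):
              | 
              |     from sklearn.decomposition import PCA
              |     import numpy as np
              |     
              |     # stand-in for your stored 1536-dim embeddings
              |     X = np.random.randn(10000, 1536)
              |     pca = PCA(n_components=256).fit(X)  # fit once on corpus
              |     reduced = pca.transform(X)  # apply to docs and queries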
        
               | karxxm wrote:
               | Unfortunately this is not feasible with a large amount of
               | words due to the quadratic scaling. But thanks for the
               | response!
        
               | minimaxir wrote:
               | Not sure what you mean by large amount of words. You can
               | fit a PCA on millions of vectors relatively performantly,
               | then inference from it is just a matmul.
        
           | TeMPOraL wrote:
           | Unless I'm missing something, it should be possible to map
           | out in advance which dimensions represent syntactic aspects,
            | and then downweight or remove them for similarity comparisons.
           | And that map should be a function of the model alone, i.e.
           | fully reusable. Are there any efforts to map out the latent
           | space of ada models like that?
        
       | e1g wrote:
       | Their OpenAI benchmark is GPT3 (text-embedding-ada-002), not
       | GPT4.
        
         | simonw wrote:
         | "text-embedding-ada-002" isn't GPT3, it's a different kind of
         | model. Embedding models and Large Language Models aren't the
         | same thing.
        
           | e1g wrote:
           | LLMs and embedding models are certainly different, but it's a
           | useful benchmark to calibrate expectations. OpenAI released
           | text-embedding-ada-002 a year ago, and they describe the ada
           | model as[1] "the original GPT-3 base model [...] capable of
           | very simple tasks, usually the fastest model in the GPT-3
           | series".
           | 
           | It's fair to expect GPT3-level results - not GPT 3.5 and
           | certainly not open-source tiny GPT4 as some might think when
           | they read "rivaling OpenAI".
           | 
           | [1] https://platform.openai.com/docs/models/whisper
        
             | minimaxir wrote:
             | When people talked about GPT-3 they always referred to
             | davinci which is the largest model, not ada.
        
             | simonw wrote:
             | No, you're confusing two things here.
             | 
             | "text-ada-001" is LLM in the GPT3 family, described as
             | "Capable of very simple tasks, usually the fastest model in
             | the GPT-3 series, and lowest cost"
             | 
             | "text-embedding-ada-002" is entirely different - that page
             | describes it as "Our second generation embedding model,
              | text-embedding-ada-002 is designed to replace the
             | previous 16 first-generation embedding models at a fraction
             | of the cost."
        
               | minimaxir wrote:
               | tl;dr OpenAI is bad at product naming.
        
               | e1g wrote:
               | OpenAI doesn't say directly what text-embedding-ada-002
               | is, but in the release blog post they show that
               | performance is comparable to davinci/curie, which places
               | it firmly in the universe of GPT3. I understand it's not
               | a straight line comparison, but to me it's still a useful
               | mental heuristic about what to expect.
               | 
               | [1] https://openai.com/blog/new-and-improved-embedding-
               | model (see "Model improvements")
        
               | helloplanets wrote:
               | Reading through that article, the specific Davinci/Curie
               | models they seem to be referring to are called the
               | following: 'text-search-davinci-001', 'text-search-
               | curie-001', 'text-similarity-davinci-001' and 'text-
               | similarity-curie-001'.
               | 
               | Are you sure these have anything to do with 'text-
               | davinci-003' or 'text-curie-001'?
               | 
               | Will have to agree with everyone here that OpenAI is good
               | at being extremely confusing. It seems like the logic
               | might be something along the lines of the 'text-search'
               | portion being the actual type of the model, while the
               | 'curie-001' / '<name>-<number>' format is just a
               | personalized way of expressing the version of that type
               | of model. And the whole 'GPT<number>' category used to be
               | a sort family of models, but now they've just switched it
               | to the actual name of the newer gargantuan LLMs. Then,
               | because the 'GPT<number>' models are now that different
               | thing altogether these days, the newest 'text-embedding'
               | model is just named 'ada-<number>' because it's on that
               | iteration of the 'text-embedding' type of model, adhering
               | to the older principle of naming their models? Not sure,
               | ha. Definitely feels like doing some detective work.
        
               | simonw wrote:
                | You mean this table here?
                | 
                |     text-embedding-ada-002      53.3
                |     text-search-davinci-*-001   52.8
                |     text-search-curie-*-001     50.9
                |     text-search-babbage-*-001   50.4
                |     text-search-ada-*-001       49.0
               | 
               | That's not comparing it to the davinci/curie/babbage GPT3
               | models, it's comparing to the "search-text-*" family.
               | 
               | Those were introduced in
               | https://openai.com/blog/introducing-text-and-code-
               | embeddings as the first public release of embeddings
               | models from OpenAI.
               | 
               | > We're releasing three families of embedding models,
               | each tuned to perform well on different functionalities:
               | text similarity, text search, and code search. The models
               | take either text or code as input and return an embedding
               | vector.
               | 
               | It's not at all clear to me if there's any relationship
               | between those and the GPT3 davinci/curie/babbage/ada
               | models.
               | 
               | My guess is that OpenAI's naming convention back then was
               | "davinci is the best one, then curie, then babbage, then
               | ada".
        
               | e1g wrote:
               | How interesting. I assumed that a consistent codename
               | such as Ada/Davinci refers to the lineage/DNA of the
               | OpenAI model from which a distinct product was created.
               | But I can see how these codenames could be "just" a
               | revision label of A/B/C/D (Ada/Babbage/Curie/Davinci),
               | similar to "Pro/Max/Ultra". If true, a product named "M2
               | Ultra" could have nothing to do with another product
               | called "Watch Ultra".
        
               | simonw wrote:
               | Wow I genuinely hadn't noticed the A/B/C/D thing!
        
       | tayo42 wrote:
        | You can't fine-tune without using their library tied to their
        | cloud? Did I misunderstand? Do you need to fine-tune?
        
       | simonw wrote:
       | I just shipped a new llm-embed-jina plugin for my LLM tool which
       | provides access to these new Jina models:
       | https://github.com/simonw/llm-embed-jina
       | 
       | Here's how to try it out.
       | 
       | First, install LLM. Use pip or pipx or brew:
        | 
        |     brew install llm
       | 
        | Next install the new plugin:
        | 
        |     llm install llm-embed-jina
       | 
        | You can confirm the new models are now available to LLM by
        | running:
        | 
        |     llm embed-models
       | 
       | You should see a list that includes "jina-embeddings-v2-small-en"
       | and "jina-embeddings-v2-base-en"
       | 
       | To embed a string using the small model, run this:
        | 
        |     llm embed -m jina-embeddings-v2-small-en -c 'Hello world'
       | 
       | That will output a JSON array of 512 floating point numbers (see
       | my explainer here for what those are:
       | https://simonwillison.net/2023/Oct/23/embeddings/#what-are-e...)
       | 
       | Embeddings are only really interesting if you store them and use
       | them for comparisons.
       | 
       | Here's how to use the "llm embed-multi" command to create
       | embeddings for the 30 most recent issues in my LLM GitHub
        | repository:
        | 
        |     curl 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
        |       | jq '[.[] | {id: .id, title: .title}]' \
        |       | llm embed-multi -m jina-embeddings-v2-small-en jina-llm-issues - \
        |       --store
       | 
       | This creates a collection called "jina-llm-issues" in a default
       | SQLite database on your machine (the path to that can be found
       | using "llm collections path").
       | 
       | To search for issues in that collection with titles most similar
       | to the term "bug":                   llm similar jina-llm-issues
       | -c 'bug'
       | 
       | Or for issues most similar to another existing issue by ID:
        | 
        |     llm similar jina-llm-issues 1922688957
       | 
       | Full documentation on what you can do with LLM and embeddings
       | here: https://llm.datasette.io/en/stable/embeddings/index.html
       | 
       | Alternative recipe - this creates embeddings for every single
       | README.md in the current directory and its subdirectories. Run
       | this somewhere with a node_modules folder and you should get a
        | whole lot of interesting stuff:
        | 
        |     llm embed-multi jina-readmes \
        |       -m jina-embeddings-v2-small-en \
        |       --files . '**/README.md' --store
       | 
        | Then search them like this:
        | 
        |     llm similar jina-readmes -c 'backup tools'
        
         | X6S1x6Okd1st wrote:
         | Thank you so much for all the work you've put into llm!
        
         | dazzaji wrote:
         | Excellent! And you were just saying how risky it is to rely
         | long-term on OpenAI text embeddings in your post on the topic.
         | The timing for this open source option worked out nicely.
        
         | bosky101 wrote:
          | The only feedback I had from your embedding post was:
          | 
          |     wish we could create the array of floating points
          |     without openai
          | 
          | Great timely turnaround, good sir. Ht
        
         | simonw wrote:
         | Wrote this up on my blog:
         | https://simonwillison.net/2023/Oct/26/llm-embed-jina/
        
         | mike_ivanov wrote:
         | JFYI, this is what happens on my M1 Macbook:
         | 
          |     $ brew install llm
          |     $ llm
          |     ModuleNotFoundError: No module named 'typing_extensions'
         | 
         | Not sure where to report it.
        
           | simonw wrote:
           | Whoa, that is a weird one. Do you know what version of Python
           | you have from Homebrew?
           | 
           | It looks like that package is correctly listed in the
           | formula: https://github.com/Homebrew/homebrew-
           | core/blob/a0048881ba9a2...
        
             | mike_ivanov wrote:
              |     % python3 --version
              |     Python 3.11.6
              |     % which python3
              |     /opt/homebrew/bin/python3
              |     % brew info python-typing-extensions
              |     ==> python-typing-extensions: stable 4.8.0 (bottled)
        
           | IanCal wrote:
            | Probably not this, but check with `which llm` what that's
            | running. I had weird issues not matching the documentation,
            | but it turned out I had some _other_ random Python CLI tool
            | called llm that I'd put in my home bin and forgotten about.
        
             | mike_ivanov wrote:
              |     % which llm
              |     /opt/homebrew/bin/llm
        
         | jillesvangurp wrote:
          | Thanks, this is wonderfully simple to use. Just managed to
          | package this up using Docker and was able to run it without a
          | lot of drama.
         | 
         | I've dabbled a bit with elasticsearch dense vectors before and
         | this model should work great for that. Basically, I just need
         | to feed it a lot of content and add the vectors and vector
         | search should work great.
        
         | michalmatczuk wrote:
          | FYI it seems that llm install llm-embed-jina is missing a
          | yaml dependency:
          | 
          |     File "/opt/homebrew/Cellar/llm/0.11_1/libexec/lib/python3.12/site-packages/llm/default_plugins/openai_models.py", line 17, in <module>
          |       import yaml
          |     ModuleNotFoundError: No module named 'yaml'
        
           | simonw wrote:
           | Thanks! I wonder if the Python 3.12 upgrade broke something.
           | 
           | The pyyaml package is correctly listed on the formula page
           | though: https://formulae.brew.sh/formula/llm
        
       | neximo64 wrote:
       | Does it match OpenAI on number of params?
        
         | minimaxir wrote:
          | No one knows, since OpenAI has not disclosed the number of
          | parameters their embeddings model uses.
        
       | andy99 wrote:
       | What is the use case for an 8k token embedding? My (somewhat
       | limited) experience with long context models is they aren't great
       | for RAG. I get the impression they are optimized for something
       | else, like writing 8k+ tokens rather than synthesizing responses.
       | 
       | Isn't the normal way of using embedding to find relevant text
       | snippets for a RAG prompt? Where is it better to have coarser
       | retrieval?
        
         | kristopolous wrote:
         | Is this what you mean by RAG?
         | https://www.promptingguide.ai/techniques/rag?
        
           | teaearlgraycold wrote:
           | Yes
        
           | simonw wrote:
           | I have an explanation of RAG in the context of embeddings
           | here: https://simonwillison.net/2023/Oct/23/embeddings/#answe
           | ring-...
        
             | Grimburger wrote:
             | You could just sum it up for us all rather than do a divert
             | to your blog?
             | 
             | It's Retrieval Augmented Generation btw.
             | 
             | To quote:
             | 
             | > The key idea is this: a user asks a question. You search
             | your private documents for content that appears relevant to
             | the question, then paste excerpts of that content into the
             | LLM (respecting its size limit, usually between 3,000 and
             | 6,000 words) along with the original question.
             | 
             | > The LLM can then answer the question based on the
             | additional content you provided.
        
               | simonw wrote:
               | > You could just sum it up for us all rather than do a
               | divert to your blog?
               | 
               | Why? Have links gone out of fashion?
               | 
               | I even linked directly to the relevant section rather
               | than linking to the top of the page.
               | 
               | The paper that coined the term used the hyphen, though I
               | think I prefer it without:
               | https://arxiv.org/abs/2005.11401
        
               | Grimburger wrote:
               | > Have links gone out of fashion?
               | 
               | Yes.
               | 
               | You wrote far more words than needed to answer the
               | comment, I did it for you instead.
        
               | simonw wrote:
               | One of the reasons I write so much stuff is so I can
               | provide links to things I've written to answer relevant
               | questions.
        
               | scubbo wrote:
               | And those of us with the sense to value your insight, and
               | the attention-span to read more than tweet-sized content,
               | thank you for it.
        
               | mhog_hn wrote:
               | Thank you, nice blog.
        
               | discordance wrote:
               | Thanks so much for your writings and for posting the link
               | (and also for Datasette!). I've learned in the past few
               | months from your blog.
        
               | monkeydust wrote:
               | Appreciate it. Your posts in general have been great -
               | accessible to a large audience, quality links to follow
               | up research and catchy analogies even when they don't
               | fully hold true (llm as a calculator for words - which I
               | admit I use with citation!). Keep going.
        
               | gar1t wrote:
               | I liked your link a lot.
        
               | hboon wrote:
                | Just to add: we appreciate that very much.
        
               | gkbrk wrote:
               | "Links have gone out of fashion" is an odd thing to write
               | on a Link Aggregator website.
        
               | kristopolous wrote:
               | You know you're responding to a programmer famous enough
               | to have a Wikipedia page, right?
               | 
               | https://en.m.wikipedia.org/wiki/Simon_Willison
        
         | teaearlgraycold wrote:
          | You could get a facsimile of a summary for a full article or
          | short story. Reducing an 8k-token article to a summary using a
         | completions model would cost far more. So if you need to search
         | through collections of contracts, scientific papers, movie
         | scripts, etc. for recommendations/clustering then bigger input
         | sizes can do that in one shot.
         | 
         | Think of it like skipping the square root step in Euclidean
         | distance. Perfectly valid as long as you don't want a distance
         | so much as a way to compare distances. And doing so skips the
         | most computationally expensive operation.
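          | 
          | I.e. something like this (a sketch):
          | 
          |     import numpy as np
          |     
          |     def squared_distance(a, b):
          |         # no sqrt: same ordering as Euclidean distance, so
          |         # it's fine for "which is nearest" comparisons
          |         d = a - b
          |         return np.dot(d, d)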
        
           | refulgentis wrote:
           | I think I'm missing something: like, yeah, it's vector search
           | for bigger text chunks. But arguably vector search with
           | bigger text chunks is _definitively_ worse -- this isn't
            | doing summarization, just turning about 25 pages of text
            | into 1024 floats, which you can then compare via cosine
            | similarity to measure semantic similarity to other text.
           | 
           | I'd much rather know what paragraph to look in than what 25
           | pages to look in
        
             | simonw wrote:
             | I imagine it's more useful for finding related articles and
             | clustering things than for semantic search, which will work
             | much better against smaller chunks - especially if you're
             | implementing Retrieval Augmented Generation.
        
               | rolisz wrote:
               | I think the point is: if you compress 25 pages of text
               | into 1024 floats, you will lose a ton of information,
               | regardless of what the use case is, so you're probably
                | still better off with chunking.
        
               | simonw wrote:
               | I've been getting great results for related documents by
               | embedding entire blog posts, e.g. here:
               | https://til.simonwillison.net/gis/pmtiles#related
               | 
               | I'm not sure how I would do that after chunking.
        
               | thomasahle wrote:
               | Did you compare with simple baselines like bag-of-words
               | and word vectors?
        
               | simonw wrote:
               | My previous implementation used TF-IDF - I basically took
               | all the words in the post and turned them into a giant
               | "word OR word OR word OR word" search query and piped
               | that through SQLite full-text search.
               | https://til.simonwillison.net/sqlite/related-content
               | 
               | I jumped straight from that to OpenAI embeddings. The
               | results were good enough that I didn't spend time
               | investigating other approaches.
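                | 
                | Roughly like this, from memory (a sketch, not the actual
                | TIL code; the FTS5 table name is made up):
                | 
                |     import re
                |     
                |     def related(db, doc_text, doc_id):
                |         words = set(re.findall(r"\w+", doc_text.lower()))
                |         query = " OR ".join('"%s"' % w for w in words)
                |         return db.execute(
                |             "select rowid from posts_fts"
                |             " where posts_fts match ? and rowid != ?"
                |             " order by rank limit 10",
                |             (query, doc_id)).fetchall()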
        
               | rolisz wrote:
               | That's not quite tfidf though. I agree you can get better
               | results than that with Ada embeddings, but I would argue
               | you can get even better results with embeddings from
               | smaller chunks.
        
               | simonw wrote:
               | I guess technically it's bm25, since it's using the rank
                | mechanism in SQLite FTS5:
                | https://www.sqlite.org/fts5.html#sorting_by_auxiliary_functi...
        
               | thomasahle wrote:
               | > Into a giant "word OR word OR word OR word"
               | 
               | Does that mean you'd return other docs if they share just
               | one word?
               | 
               | The idea of tfidf is that it gives you a vector (maybe
               | combined with pca or a random dimensionality reduction)
               | that you can use just like an Ada embedding. But you
               | still need vector search.
        
               | simonw wrote:
               | My goal for related articles was to first filter to every
               | document that shared at least one word with the target -
               | which is probably EVERY document in the set - but then
               | rank them based on which ones share the MOST words,
               | scoring words that are rare in the corpus more highly.
               | BM25 does that for free.
               | 
               | Then I take the top ten by score and call those the
               | "related articles".
        
               | teaearlgraycold wrote:
               | Ever read the back of a book?
        
               | TeMPOraL wrote:
               | You mean the marketing blurb? Those tend to carry low
               | information value, sometimes even _negative_ - as in, if
                | you didn't know anything else about the book, reading
               | the blurb will make you _even more wrong_ about it than
               | you were. This is a common feature of marketing copy.
        
               | TeMPOraL wrote:
               | > _if you compress 25 pages of text into 1024 floats, you
               | will lose a ton of information_
               | 
               | Sure, but then if you do it one page at a time, or one
                | paragraph at a time, you lose a ton of _meaning_ - after
                | all, individual paragraphs aren't independent of each
               | other. And meaning is kind of the whole point of the
               | exercise.
               | 
               | Or put another way, squashing a ton of text loses you
               | some high-frequency information, while chunking cuts off
               | the low-frequency parts. Ideally you'd want to retain
               | both.
        
               | kordlessagain wrote:
                | I think the loss of (low-frequency) meaning from doing
                | separate chunks is probably smaller than the loss of
                | high-frequency meaning from embedding the whole document
                | at once. As you say, doing both is probably a good
                | strategy, and I think that's why we see a lot of
                | "summarize this text" approaches.
               | 
               | I use a multi-pronged approach to this based on a special
               | type of summarization. I chunk on sentences using
               | punctuation until they are just over 512 characters, then
               | I embed them. After embedding, I ask a foundation model
               | to summarize (or ask a question about the chunk) and then
               | generate keyterms for it. Those keyterms are stored along
               | with the vector in the database. During search, I use the
               | user's input to do a vector search for matching chunks,
               | then pull their keyterms in. Using those keyterms, I do
               | set operations to find related chunks. I then run a
               | vector search against these to the top matches from the
               | vector search to assemble new prompt text.
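                | 
                | The search side, compressed into a sketch (plain in-
                | memory chunks here; the real thing uses a vector
                | database):
                | 
                |     import numpy as np
                |     
                |     def cosine(a, b):
                |         return float(np.dot(a, b) / (np.linalg.norm(a)
                |                      * np.linalg.norm(b)))
                |     
                |     # chunk: {"vec": embedding, "keyterms": set of terms}
                |     def search(query_vec, chunks, top_k=5):
                |         rank = lambda c: cosine(query_vec, c["vec"])
                |         hits = sorted(chunks, key=rank,
                |                       reverse=True)[:top_k]
                |         # expand via shared keyterms (set operations),
                |         # then re-rank the expanded pool
                |         terms = set().union(*(h["keyterms"] for h in hits))
                |         related = [c for c in chunks
                |                    if c["keyterms"] & terms]
                |         related.sort(key=rank, reverse=True)
                |         return related[:top_k]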
               | 
               | This strategy is based on the idea of a "back of the book
               | index". It is entirely plausible to look for "outliers"
               | in the keyterms and consider throwing those chunks with
               | those keyterms in there to see if it nets us
               | understanding of some "hidden" meaning in the document.
               | 
               | There is also a means to continue doing the "keyterm"
               | extraction trick as the system is used. Keyterms from
               | answers as well as user prompts may be added to the
               | existing index over time, thus helping improve the
               | ability to return low frequency information that may be
               | initially hidden.
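               | 
               | A hypothetical sketch of that search step (the "db"
               | layout and all names are made up; chunk vectors are
               | assumed L2-normalized):
               | 
               |   import numpy as np
               | 
               |   def search(query_vec, db, top_k=5):
               |       # 1. vector search for the closest chunks
               |       sims = [np.dot(query_vec, c["vector"]) for c in db]
               |       top = sorted(range(len(db)),
               |                    key=lambda i: sims[i],
               |                    reverse=True)[:top_k]
               |       # 2. union the hits' keyterms, then set operations
               |       #    to find other chunks sharing those keyterms
               |       terms = set().union(*(db[i]["keyterms"] for i in top))
               |       related = [i for i, c in enumerate(db)
               |                  if c["keyterms"] & terms and i not in top]
               |       # 3. re-rank the related chunks against the query
               |       related.sort(key=lambda i: sims[i], reverse=True)
               |       return top + related[:top_k]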
        
               | imranhou wrote:
               | Good point. I wonder how different it is to use a large
               | context here vs. having some other model summarize an 8k
               | article into a small paragraph and embedding that
               | paragraph instead, in which case such a large context
               | wouldn't be necessary.
        
             | antman wrote:
             | you could do both
        
             | scotty79 wrote:
             | Isn't it up to 8k? So you can index your documents by
             | paragraphs if you prefer?
        
         | dragonwriter wrote:
         | > What is the use case for an 8k token embedding?
         | 
          | Calculating embeddings on larger documents than smaller-window
          | embedding models can handle.
         | 
         | > My (somewhat limited) experience with long context models is
         | they aren't great for RAG.
         | 
         | The only reason they wouldn't be great for RAG is that they
         | aren't great at using information in their context window,
         | which is possible (ISTR that some models have a strong recency
          | bias within the window, for instance) but I don't think it is
          | a general problem of long context models.
         | 
         | > Isn't the normal way of using embedding to find relevant text
         | snippets for a RAG prompt?
         | 
         | I would say the usual use is for search and semantic similarity
         | comparisons generally. RAG is itself an application of search,
          | but it's not the only one.
        
           | 3abiton wrote:
            | I wonder how the performance fares when context size is
            | increased. Intuitively it should be higher, but some
            | quantized models I've tested showed noticeably worse
            | performance.
        
             | Kubuxu wrote:
              | Your KV cache size grows linearly with context size, which
              | might leave you tight on memory. There is also the added
              | cost of recalculating the KV cache when the context window
              | has to move, but this is close to being solved with
              | streaming LLMs.
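              | 
              | Back-of-envelope for the memory point (shapes are
              | illustrative, roughly a Llama-2-7B-sized model in fp16):
              | 
              |   layers, ctx, hidden, bytes_el = 32, 8192, 4096, 2
              |   kv = 2 * layers * ctx * hidden * bytes_el  # K and V
              |   print(kv / 2**30, "GiB per sequence")      # ~4.0 GiB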
        
               | woadwarrior01 wrote:
               | BERT style encoder-only models, like the embedding model
               | being discussed here, don't need a KV cache for
               | inference. A KV cache is only needed for efficient
               | inference with encoder-decoder and decoder-only (aka GPT)
               | models.
        
       | moralestapia wrote:
       | Ada is one of the (if not the) worst models offered by OpenAI,
       | though ...
        
         | simonw wrote:
         | You're thinking of the old "ada" GPT-3 model - the one that was
         | a companion to "davinci" and "babbage".
         | 
         | I believe "text-embedding-ada-002" is entirely unrelated to
         | those old GPT-3 models. It's a recent embedding model (released
         | in December 2022 - https://openai.com/blog/new-and-improved-
          | embedding-model ) which OpenAI claim is their best currently
          | available embedding model.
         | 
         | I understand your confusion: OpenAI are notoriously bad at
         | naming things!
        
           | moralestapia wrote:
           | Oh, thanks for clarifying!
           | 
           | Edit: looking at the press release, the improvement over old
           | Ada is ... marginal? And Ada-01 is/was a poor performing
           | model, tbh. I guess I'll have to run some tests, but at first
           | sight it doesn't seem that wow-ey.
        
             | LASR wrote:
             | So just to be super clear, this is an embedding model. It
             | generates no text. It's not outputting words.
             | 
             | Maybe I am assuming incorrectly, but I think the poor
             | performance you are referring to is the old Ada completion
             | model, where the output is text. That was poor indeed.
        
               | itake wrote:
               | This article is not kind to the old ada embeddings model:
               | 
               | https://medium.com/@nils_reimers/openai-gpt-3-text-
               | embedding...
               | 
                | If the new ada model only has marginal improvements, it
                | seems open source is the way to go.
        
       | Zuiii wrote:
        | Color me surprised! It looks like it's actually open source
       | (Apache 2.0) and not the usual false advertising by some two-
       | faced company or institution. Links here:
       | 
        | * https://huggingface.co/jinaai/jina-embeddings-v2-base-en
        | 
        | * https://huggingface.co/jinaai/jina-embeddings-v2-small-en
        
       | nicognaw wrote:
       | Jina AI itself is also a great framework to expose APIs from deep
       | neural net models and deploy them to Kubernetes clusters, which I
        | think is very promising, but they didn't get as much hype as I
        | thought they deserved.
        
       | pknerd wrote:
       | Pardon my ignorance in advance but could it be used to "chat"
        | with PDFs and websites? I am looking for OpenAI alternatives as
        | I am in the learning phase.
        
         | clarkmcc wrote:
         | Check out my little side project for chatting with PDFs. You
         | should be able to load most models including this one.
         | https://github.com/clarkmcc/chitchat
        
           | pknerd wrote:
            | This looks cool! So can it be used to feed website/product
            | data in CSV/JSON format and "chat" with it?
        
             | clarkmcc wrote:
             | Pretty much! Right now it only supports md, pdf, txt, and
             | html, but supporting additional formats is trivial:
             | https://github.com/clarkmcc/chitchat/blob/main/src-
             | tauri/src....
        
         | canadaduane wrote:
         | No, this is an embedding model, not a text completion model.
        
         | lofties wrote:
         | No. "Chatting with PDFs" is (mostly) taking a users chat
         | message, retrieve relevant content via e.g embedding search,
         | then feed that into an LLM with a prompt that's something along
         | the lines of "given this information, can you answer this
         | question".
         | 
         | This tool helps with embedding part.
         | 
         | I've built a bunch of "chat with your PDFs" bots, do reach out
         | if you have any questions me at brian.jp.
        
           | pknerd wrote:
            | Actually I wanna use langchain. OpenAI is not free. I wanted
            | to test two use cases:
            | 
            | - chat with documents (pdf, doc etc)
            | 
            | - chat with a website. Like, if I integrate with an ecommerce
            | site, I can ask questions about the site. What free options
            | do I have, both cloud and local?
        
         | seydor wrote:
          | Using the Bing tab of the Microsoft Edge browser, you can chat
          | with PDFs, and I think they use GPT-4 or equivalent.
        
       | marinhero wrote:
        | How well do LLMs like this work with a non-English language? Or
       | are these open source models limited to English?
        
         | simonw wrote:
         | Quite a few of the top ranked models on this leaderboard are
         | multilingual: https://huggingface.co/spaces/mteb/leaderboard
         | 
         | https://huggingface.co/BAAI/bge-large-en-v1.5 FlagEmbedding for
         | example describes itself as covering Chinese and English.
        
         | ttul wrote:
         | That depends on whether the training data contained languages
         | other than English.
        
         | anigbrowl wrote:
         | Stability has a Japanese port which is getting lots of work
         | https://twitter.com/StabilityAI_JP/status/171699857824440759...
        
           | m3at wrote:
           | This is not an embedding model though. Yes you can always
           | extract some embeddings from somewhere, but for most LLMs
           | those won't perform well for retrieval (which makes sense as
            | it's not what the models are optimizing for).
        
       | backendEngineer wrote:
       | oh thank god I first read Jira...
        
         | eshack94 wrote:
         | You're not the only one... glad I misread that.
        
       | dylanjcastillo wrote:
        | I wonder how much better this is, compared to taking the average
        | (or some other aggregation) of embeddings with a smaller context
        | length. Has anyone done a similar comparison?
        
         | pietro72ohboy wrote:
         | The issue with averaging is that over large inputs, it drowns
         | out small signal. For example, there is a chance that it
         | completely loses a reference to something made only in a single
         | sentence somewhere in a large document.
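          | 
          | A tiny numpy illustration of that dilution (synthetic,
          | unit-norm chunk vectors; the sizes are arbitrary):
          | 
          |   import numpy as np
          | 
          |   chunks = np.random.randn(25, 768)   # 25 chunk embeddings
          |   chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)
          |   doc = chunks.mean(axis=0)           # averaged document vector
          |   doc /= np.linalg.norm(doc)
          | 
          |   query = chunks[7]    # a query matching one chunk exactly
          |   print(query @ doc)             # diluted, roughly 0.2 here
          |   print((chunks @ query).max())  # per-chunk search gives 1.0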
        
       | extasia wrote:
       | Is this a text encoder model, BERT style?
        
       | Kutsuya wrote:
        | This is super cool! I wish there was an easy-to-understand and
        | easy-to-follow guide on how to make your own embedding, for
        | llama2 for example. All I can find are various guides that
        | already assume you know everything there is to know about
        | training an embedding.
        | 
        | I just want to make an embedding of a conversation between me
        | and my friend and simulate talking to them. Is this a hard
        | thing to train to begin with?
       | 
       | If anyone knows or could help me with this, I would be very
       | grateful!
        
         | infecto wrote:
         | I will butcher this so if any experts see this please don't
          | flame me. I think you might be conflating ideas? You could
          | definitely fine-tune existing embedding models or train your
          | own from scratch, but the goals of embedding models are
          | different from those of an LLM conversation. Embedding models
          | are used for things like classifying, search, and image
          | captioning...maybe, at a high level, anything where you have
          | high dimensionality that you need to condense?
          | 
          | What you are asking for sounds like fine-tuning an existing
          | LLM...where the data will be tokenized but the outcomes are
          | different? There are a lot of writeups on how people have done
          | it. You should especially follow some of the work on
          | Huggingface. To replicate talking to your friend, though, I
          | would think you will need a very large dataset to train off
          | of, and it's unclear to me whether you can just fine-tune or
          | would need to train a model from scratch. So a dataset with
          | tens of thousands of examples, and then you need to train it
          | on a GPU.
         | 
         | https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...
        
       | 3cats-in-a-coat wrote:
       | Great company name.
        
         | smcl wrote:
         | I'm gonna try to explain this because I thought the same thing,
          | though you may enjoy it for another reason. Among Czech or
          | other Slavic software people, "jina AI" could read as "another
         | AI" and, to me at least, brings to mind the "yet another
         | {thing}" naming convention (yacc = "yet another compiler
         | compiler" for example).
        
       | itronitron wrote:
       | It's weird to think there are entire companies built around
       | providing access to a pre-computed vector space model.
        
       | pietz wrote:
       | I'm always happy to see OSS contributions but I don't quite
        | understand why this model is so remarkable. As the leaderboard
        | suggests, it ranks lower than OpenAI embeddings, and 14 other
        | contributions do even better, many of them with a dimensionality
        | comparable to or lower than 768.
       | 
       | The 8k context window is new, but isn't the 512 token limitation
       | a soft limit anyway? I'm pretty sure I can stuff bigger documents
       | into BGE for example.
       | 
       | Furthermore, I think that most (all?) benchmarks in the MTEB
       | leaderboard deal with very small documents. So there is nothing
       | here that validates how well this model does on larger documents.
       | If anything, I'd pick a higher ranking model because I put little
       | trust in one that only ranks 17th on small documents. Should I
       | expect it to magically get better when the documents get larger?
       | 
       | Plus, you can expect that this model was designed to perform well
       | on the datasets in MTEB while the OpenAI model probably wasn't.
       | 
        | Many here have also stated that an 8k-context embedding will
        | not be very useful in most situations.
       | 
       | When would anyone use this model?
        
         | infecto wrote:
         | I have been trying to understand the hype as well. Happy to see
         | all the work happening in this space still.
         | 
         | I was pretty curious about the context limit. I am not an
         | expert in this area but I always thought the biggest problem
         | was the length of your original text. So typically you might
         | only encode a sentence or a selection of sentences. You could
          | always stuff more in, but then you are potentially losing
          | specificity, which I would think is a function of the
          | dimensionality. This model is 768; are they saying I can stuff
          | in 8k tokens worth of text and utilize it just as well as I
          | have other models at a 1-3 sentence level?
        
           | infecto wrote:
           | Thinking about it some more as I read through more comments.
           | I guess in the stated case of research papers it can make
           | sense if your task is looking for the common themes and not
           | specific details. If you are embedding a sentence or a
           | paragraph you miss out on the connection between those
            | sentences across the whole paper...or at least it's harder to
           | manage that. By encoding a large number of pages from the
           | paper (or the entire paper) you can hopefully do a better job
           | of capturing the theme of that paper.
           | 
           | This also opens up another question though, how would that
            | compare to using an LLM to summarize that paper and then just
           | embed on top of that summary.
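            | 
            | A sketch of that comparison baseline with the 2023-era
            | openai client (model names are illustrative):
            | 
            |   import openai
            | 
            |   def summarize_then_embed(paper_text):
            |       # compress the paper into a short abstract first
            |       summary = openai.ChatCompletion.create(
            |           model="gpt-3.5-turbo",
            |           messages=[{"role": "user", "content":
            |               "Summarize in one paragraph:\n" + paper_text}],
            |       )["choices"][0]["message"]["content"]
            |       # then embed the summary instead of the full paper
            |       emb = openai.Embedding.create(
            |           model="text-embedding-ada-002", input=summary,
            |       )["data"][0]["embedding"]
            |       return summary, emb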
        
             | stormfather wrote:
             | I would guess that the embedded summary is better, but for
             | many tasks where you use embeddings (like document search),
             | summarizing every document with an LLM is too expensive and
             | slow.
        
         | larodi wrote:
         | Potentially useful for paragraph embedding, where... well,
         | paragraphs can grow a lot. Not sure how this model fares in
         | comparison to other embedding engines (yet), but I can
         | definitely tell you mpnet models fare much better for paragraph
         | embeddings than the leader in HF's leaderboard (being
         | thenlper/gte-large at time of writing).
         | 
          | I can guess that the Davinci and similar embeddings work
          | better for code than MPNET, and that it really matters what
          | you are encoding, not only the context length - i.e., what
          | features are actually being extracted by the embedding engine.
        
         | theptip wrote:
         | > The 8k context window is new
         | 
         | Hasn't Claude had this for many months (before they bumped to
         | 100k)?
         | 
         | Edit: ah, you mean new for OSS maybe?
        
           | simonw wrote:
           | Claude is a large language model, which is a different thing
           | from an embedding model.
        
             | theptip wrote:
             | Aha, that's what I missed, thanks!
        
             | Der_Einzige wrote:
             | Any large language model generates embedding
             | representations at every layer of the model, and these can
             | be trivially extracted. So, large language models are
             | indeed embedding models.
             | 
              | This leaderboard doesn't compare these custom-tailored
              | embedding models against the obvious thing: average
              | pooling layered on top of any traditional LLM, which is
              | easily implemented using sentence-transformers.
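              | 
              | For instance (model name illustrative; plain transformers
              | here rather than sentence-transformers):
              | 
              |   import torch
              |   from transformers import AutoModel, AutoTokenizer
              | 
              |   tok = AutoTokenizer.from_pretrained("gpt2")
              |   model = AutoModel.from_pretrained("gpt2")
              |   inputs = tok("some text to embed", return_tensors="pt")
              |   with torch.no_grad():
              |       out = model(**inputs).last_hidden_state  # (1, seq, dim)
              |   embedding = out.mean(dim=1).squeeze(0)  # average pooling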
        
         | egorfine wrote:
          | I fail to imagine an 8k-token-length piece of text that has
         | one single semantic coordinate and is appropriate for embedding
         | and vector search.
         | 
         | In my experience, any text is better embedded using a sliding
         | window of a few dozen words - this is the approximate size of
          | semantic units in a written document in English; although this
          | will differ wildly for different texts and topics.
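          | 
          | A rough sketch of such a chunker (the window sizes here are
          | arbitrary, not a recommendation):
          | 
          |   def sliding_windows(text, size=36, stride=18):
          |       words = text.split()
          |       for start in range(0, len(words), stride):
          |           yield " ".join(words[start:start + size])
          |           if start + size >= len(words):
          |               break
          | 
          |   # embed each window separately, search over window vectors
          |   chunks = list(sliding_windows(open("doc.txt").read()))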
        
           | simonw wrote:
           | What are you using those embeddings for?
           | 
           | I can see a sliding window working for semantic search and
           | RAG, but not so much for clustering or finding related
           | documents.
        
             | egorfine wrote:
             | Ah yes, clustering is indeed something that would benefit
             | from large context, I agree.
             | 
              | However, even so I would think about the documents
              | themselves and figure out if it is even needed. Let's say
              | we are talking about clustering court proceedings. I'd
              | rather extract the abstracts from these documents, and
              | embed and cluster those instead of the whole text.
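              | 
              | A sketch of that (the embed() stub stands in for a real
              | embedding call and only exists to make this run):
              | 
              |   import numpy as np
              |   from sklearn.cluster import KMeans
              | 
              |   def embed(text):  # placeholder, not a real model
              |       rng = np.random.default_rng(abs(hash(text)) % 2**32)
              |       return rng.standard_normal(768)
              | 
              |   abstracts = ["abstract of case 1 ...",
              |                "abstract of case 2 ...",
              |                "abstract of case 3 ..."]
              |   X = np.array([embed(a) for a in abstracts])
              |   labels = KMeans(n_clusters=2).fit_predict(X)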
        
       | nwhnwh wrote:
       | What does this even do?
        
         | tyingq wrote:
         | See this story from yesterday:
         | https://news.ycombinator.com/item?id=37985489
        
       | woofwoofwoof wrote:
        | Just noticed that they (jina.ai) have offices both in Berlin and
        | China. I am wondering how they will operate in the presence of
        | chip export restrictions and other side effects of USA / China
        | tensions.
        
       | do-me wrote:
        | Just quantized the models for ONNX usage in e.g. transformers.js
        | and got a 4x reduction in file size:
       | 
       | - 28.5 MB jina-embeddings-v2-small-en (https://huggingface.co/do-
       | me/jina-embeddings-v2-small-en)
       | 
       | - 109 MB jina-embeddings-v2-base-en (https://huggingface.co/do-
       | me/jina-embeddings-v2-base-en)
       | 
        | However, I noticed that the base model performs quite poorly on
        | small text chunks (a few words) while the small version seems to
        | be unaffected. Might this be some kind of side effect of the way
        | they deal with large contexts?
       | 
       | If you want to test, you can head over to SemanticFinder
       | (https://do-me.github.io/SemanticFinder/), go to advanced
       | settings, choose the Jina AI base model (at the very bottom) and
       | run with "Find". You'll see that all other models perform just
       | fine and find "food"-related chunks but the base version doesn't.
        
         | Havoc wrote:
         | Why quantize something that is already very small (270mb)?
        
       | luke-stanley wrote:
        | When I go to this leaderboard:
        | https://huggingface.co/spaces/mteb/leaderboard and click on the
        | "Classification" tab, I see "jina-embeddings-v2-base-en" at
        | number 12, with an average score of 73.45. The highest-scoring
        | model there is llmrails/ember-v1 with a 75.99 average score, but
        | it only supports 512 tokens; so if you need 8K tokens to be
        | embedded, I guess jina is the best. Do people need 8K tokens for
        | embedding? Maybe not, but they might need more than 512 often
        | enough. It could save a summary extraction step.
        
         | cztomsik wrote:
          | A small context window means you cannot embed the whole
          | document; you are embedding just a part.
         | 
         | So, if there is some information at the bottom which is
         | dependent on something which is at the top, your embedding
         | could be entirely wrong.
        
       | egorfine wrote:
        | One thing that is missing in the comparison: OpenAI's model is
        | multilingual.
       | 
        | And not only does it support and embed a variety of languages,
       | also computes the same coordinates for the same semantics in
       | different languages. I.e. if you embed "russia is a terrorist
       | state" and "rossiia - strana-terrorist", both of these embeddings
       | will have almost the same coordinates.
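        | 
        | For example, with the 2023-era openai client (a quick check,
        | not a rigorous test):
        | 
        |   import numpy as np
        |   import openai
        | 
        |   def embed(text):
        |       return np.array(openai.Embedding.create(
        |           model="text-embedding-ada-002",
        |           input=text)["data"][0]["embedding"])
        | 
        |   a = embed("russia is a terrorist state")
        |   b = embed("rossiia - strana-terrorist")
        |   # cosine similarity; close to 1.0 if the claim above holds
        |   print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))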
        
         | m3kw9 wrote:
         | I don't really know what that means but it seems useful
        
       ___________________________________________________________________
       (page generated 2023-10-26 23:02 UTC)