[HN Gopher] Jina AI launches open-source 8k text embedding
___________________________________________________________________
Jina AI launches open-source 8k text embedding
Author : artex_xh
Score : 511 points
Date : 2023-10-26 00:24 UTC (22 hours ago)
(HTM) web link (jina.ai)
(TXT) w3m dump (jina.ai)
| burcs wrote:
| This is great news!
|
| It feels like open source is closing the gap with "Open"AI, which
| is really exciting, and the gap seems to be closing faster than
| the closed-source models are advancing. Maybe it's wishful
| thinking though?
| udev4096 wrote:
| Is it tho? It's not really open source if they don't give us
| the information regarding training datasets
| jerpint wrote:
| It definitely is open source even if they don't disclose all
| details behind the training
| SOLAR_FIELDS wrote:
| The very definition of what constitutes open source is
| being called into question in these kinds of discussions
| about AI. Without the training details and the weights
| being made fully open it's hard to really call something
| truly open, even if it happens to meet some arbitrary
| definition of "open source".
|
| A good definition of "truly open" is whether the exact same
| results can be reproduced by someone with no extra
| information from only what has been made available. If that
| is not possible, because the reproduction methodology is
| closed (a common reason, like in this case) then what has
| been made available is not truly open.
|
| We can sit here and technically argue whether or not the release
| violates some arbitrary "open source" definition, but it still
| doesn't change the fact that it's not truly open in spirit.
| okaram wrote:
| Notice you are creating your own arbitrary definition of
| 'truly open', which IMHO corresponds more with
| 'reproducible'.
|
| We already have a definition of open source. I don't see
| any reason to change it.
| losteric wrote:
| The inference runtime software is open, the weights are
| an opaque binary. Publishing the training data,
| hyperparameters, process, etc - that would make the whole
| thing "open source".
| magicalhippo wrote:
| The Quake engine is still open source even though it
| doesn't come with the Quake game assets, no?
|
| It seems unreasonable to require the training data just
| to call something open source, given it faces copyright
| challenges similar to those of game assets.
|
| Of course, this wouldn't make the model reproducible. But
| that's different from open source.
| darkwater wrote:
| Good example. And in fact you are calling the "engine"
| open source, not the whole Quake game. The "assets" in
| most "open source" AI models are not available.
| EGreg wrote:
| Imagine if the Telegram client was open source but not
| the backend.
|
| Imagine if Facebook open-sourced their front-end
| libraries like React but not the back-end.
|
| Imagine if Twitter or Google didn't publish the algorithms
| they use to rank things to display to different people.
|
| You don't need to imagine. That's exactly what's
| happening! Would you call them open source because their
| front end is open source? Could you host your own back
| end on your choice of computers?
|
| No. That's why I even started https://qbix.com/platform
| darkwater wrote:
| I completely agree with you (and the examples you mention
| are singled out in the "antifeatures" list in F-Droid, to
| name an example).
| torginus wrote:
| It's a bit different - here most of the value lies in the
| weights.
|
| A better analogy would be some graphics card drivers
| which ship a massive proprietary GPU firmware blob, and a
| small(ish) kernel shim to talk with said blob.
| magicalhippo wrote:
| Well, perhaps we can consider this a kind of short-
| sightedness of Stallman. His point with the GPL and the free
| software movement, as I understand it, was to ensure the
| user could continue to use the software regardless of
| what the software author decided to do.
|
| Sometimes though the software alone can be near useless
| without additional assets that aren't necessarily covered
| by the code license.
|
| Like Quake, having the engine without the assets is
| useless if what you wanted was to play Quake the game.
| Neural nets are another prime example, as you mention.
| Simulators that rely on measured material property
| databases for usable results also fall into this
| category, and so on.
|
| So perhaps what we need is new open source licenses that
| include the assets needed for the user to be able to
| reasonably use the program as a whole.
| TeMPOraL wrote:
| Problem is, the literal/default definition of "open
| source" is meaningless/worthless in this context. It's
| the weights, training data and methodology that matter
| for those models - NOT the inference shell.
|
| It's basically like giving people a binary program and
| calling it open source because the compiler and runtime
| used are open source.
| jerpint wrote:
| The weights are what you run at inference time and the result
| of training. I can give you all the training details and you
| might not be able to reproduce what I did (Google does this
| all the time). As a dev, I'd much rather have an open model
| than an open recipe without weights. We can all agree having
| both is the best-case scenario, but having openly licensed
| weights is for me the bare minimum of open source.
| ekianjo wrote:
| Weights are like binaries. They are not code. It would
| make more sense to put them under a Creative Commons
| license.
| rolisz wrote:
| Then a lot of stuff is not open source. Have you tried
| reproducing random GitHub repos, especially in machine
| learning?
| richardw wrote:
| So if someone includes images in their project they need
| to tell you every brush stroke that led to the final
| image?
|
| All sorts of intangibles end up in open source projects.
| This isn't a science experiment that needs replication.
| They're not trying to prove how they came up with the
| image/code/model.
| xnorswap wrote:
| Those "Brush Strokes" are effectively the source code. To
| be considered open source, yes source code needs to be
| provided along side the binaries (the "image").
| EGreg wrote:
| It's more like someone giving you an open source front
| end client, but not giving you a way to host your own
| backend.
|
| Look into Affero GPL. Images are inert static assets.
| Here we are talking about _the back end engine_. The fact
| that neural networks and model weights are a non-von-
| Neumann architecture doesn't negate the fact that they
| are _executable code_ and not just static assets!
| abriosi wrote:
| Imagine someone giving you an executable binary without
| the source code and calling it "open source"
| jyrkesh wrote:
| I'm actually mostly in your camp here. But it's
| complicated with AI.
|
| What if someone gave you a binary and the source code,
| but not a compiler? Maybe not even a language spec?
|
| Or what if they gave you a binary and the source code and
| a fully documented language spec, and both of 'em all the
| way down to the compiler? BUT it only runs on special
| proprietary silicon? Or maybe even the silicon is fully
| documented, but producing that silicon is effectively out
| of reach to all but F100 companies?
|
| It's turtles all the way down...
| krageon wrote:
| There is the binary (the model) and the source (the thing
| that allows you to recreate the model, the dataset and
| methodology). Compilers and how art is made quite simply
| don't factor in here, because nobody is talking about
| the compiler layer. Art isn't even close to what is
| present. Trying to make this more complicated than it is
| is playing into companies' hands by troubling the waters
| around what constitutes open source.
| r3trohack3r wrote:
| To be fair, OpenSource troubled the waters around what
| constitutes free software.
|
| Free(dom Respecting) Software wasn't just about the
| source code.
|
| https://www.gnu.org/philosophy/open-source-misses-the-
| point....
| DougBTX wrote:
| You can pass in any command line arguments you like, so
| it must be open source
| otikik wrote:
| Well the other day on this very website there were some
| very opinionated voices stating that Open Source is
| "exclusively what OSI defines". I am not on that camp,
| more like in yours. To me there's open source and OSI-
| approved open source. But you will encounter people very
| set on that other opinion, which I found interesting.
|
| Make no mistake, I am super grateful to OSI for their
| efforts and most of my code out there uses one of their
| licenses. I just think they are limited by the
| circumstances. Some things I consider open are not
| conforming to their licenses and, like here, some things
| that conform might not be _really open_.
| m3at wrote:
| To take another example, would you call a game that has
| its code and all assets (ex. character sprites) freely
| available open source? Or would the process that was used
| to create the assets in the first place also be required
| to be considered open?
|
| The parallel can be made with model weights being static
| assets delivered in their completed state.
|
| (I favor the full process being released especially for
| scientific reproducibility, but this is another point)
| pjc50 wrote:
| The old Stallman definition used the phrase "preferred
| form for modification" rather than the more specific
| "source code". What do you need to effectively modify an
| AI model?
| kordlessagain wrote:
| Usually the datasets, not the source code.
| selcuka wrote:
| How do you define "source", then?
|
| By this logic any freely downloadable executable software
| (a.k.a. freeware) is also open source, even though they
| don't disclose all details on how to build it.
| mogwire wrote:
| Source would be the way the data is produced so that you
| can replicate it yourself and make changes.
|
| If I hand you a beer for free that's freeware. If I hand
| you the recipe and instructions to brew the beer that is
| open source.
|
| We muddy the waters too much lately and call "free to
| use" things "open source".
| TeMPOraL wrote:
| > _If I hand you a beer for free that's freeware. If I
| hand you the recipe and instructions to brew the beer
| that is open source._
|
| Yeah, but what those "open source" models are is like you
| handing me a bottle of beer, plus the instructions to
| _make the glass bottle_. You're open-sourcing something,
| just not the part that matters. It's not "open source
| beer", it's "beer in an open-source bottle". In the same
| fashion, those models aren't open source - they're closed
| models inside a tiny open-source inference script.
| imranhou wrote:
| Perhaps one more thing that is missing in context is that
| I'm also getting the right to alter that beer by adding
| anything I like to it and redistributing it, without
| knowing its true recipe.
| szundi wrote:
| Interesting as the literal source of the result is not open
| EGreg wrote:
| People need to realize something...
|
| The model weights in eg TensorFlow _are the source code_.
|
| It is not a von-Neumann architecture but a gigabyte of
| model weights is the executable part, no less than a
| gigabyte of imperative code.
|
| Now, the training of the model is akin to the process of
| writing the code. In classical imperative languages that
| code may be such spaghetti code that each part would be
| intertwined with 40 others, so you can't _just_ modify
| something easily.
|
| So the fact that you can't modify the code is Freedom 2
| or whatever. But at least you have Freedom 0 of hosting
| the model where _You_ want and not getting charged for it
| an exorbitant amount or getting cut off, or having the
| model change out from under you via RLHF for political
| correctness or whatever.
|
| OpenAI has not even met Freedom Zero of FSR or OSI's
| definition. But others can.
| simonw wrote:
| That doesn't work for me.
|
| The model weights aren't source code. They are the binary
| result of compiling that source code.
|
| The source code is the combination of the training data
| and configuration of model architecture that runs against
| it.
|
| The model architecture could be considered the compiler.
|
| If you give me gcc and your C code I can compile the
| binary myself.
|
| If you give me your training data and code that
| implements your model architecture, I can run those to
| compile the model weights myself.
| EGreg wrote:
| No, you would need to spend "eye watering amounts of
| compute" to do it, similar to hiring a lot of developers
| to produce the code. The compiling of the code to an
| executable format is a tiny part of that cost.
| simonw wrote:
| I still think of millions of dollars of GPU spend
| crunching away for a month as a compiler.
|
| A very slow, very expensive compiler - but it's still
| taking the source code (the training material and model
| architecture) and compiling that into a binary executable
| (the model).
|
| Maybe it helps to think about this at a much smaller
| scale. There are plenty of interesting machine learning
| models which can be trained on a laptop in a few seconds
| (or a few minutes). That process feels very much like a
| compiler - takes less time to compile than a lot of large
| C++ projects.
|
| Running on a GPU cluster for a month is the exact same
| process, just scaled up.
|
| Huge projects like Microsoft Windows take hours to
| compile and that process often runs on expensive
| clusters, but it's still considered compilation.
| EGreg wrote:
| Actually, the dirty secret is that a lot of human work
| (at below minimum wage) went into training and refining
| the AI models:
|
| https://time.com/6247678/openai-chatgpt-kenya-workers/
|
| And billion-dollar companies made their money off it:
|
| https://www.forbes.com/sites/kenrickcai/2023/04/11/how-
| alexa...
|
| That's the dirty secret of why ChatGPT 4 is better. But
| they'll tell you it has to do with chaining ChatGPT 3's
| together, more fine tuning etc. They go to these poor
| countries and recruit people to work on training the AI.
|
| Not to mention all the uncompensated work of humans
| around the world who put their content up on the Web.
| Gasp0de wrote:
| They compare it to OpenAI's ada model though, which is light-
| years away from ChatGPT.
| infecto wrote:
| Does that not conflate two different things though? Embedding
| model != LLM Model ?
| simonw wrote:
| Don't confuse the current Ada embedding model with the old
| Ada GPT-3 model.
|
| It turns out OpenAI have used the name "Ada" for several very
| different things, purely because they went through a phase of
| giving everything Ada/Babbage/Curie/DaVinci names, using the
| A/B/C/D ordering to indicate which of their models were
| largest.
| infecto wrote:
| Wishful thinking? Embeddings to me were never the interesting
| or bleeding-edge thing at OpenAI. Maybe the various Ada models
| at one point reigned supreme, but there have been open-source
| models at the top of the leaderboard for a while, and from a
| cost/performance perspective, often even the BERT models did a
| really fine job.
| omneity wrote:
| Impressive work.
|
| I wonder what would be the best way to use 8k embeddings. It's a
| lot of information to keep in a vector, so things like
| "precision" of the embedding space and its ability to distinguish
| very similar large documents will be key.
|
| Maybe it can be useful for coarse similarity matching, for
| example to detect plagiarism?
| sroussey wrote:
| 8K is the context length. Their vector dimension size is actually
| much smaller, which is great for a number of use cases, though
| maybe not the ones you are thinking about.
| omneity wrote:
| Yes that's also how I understood it. Maybe it was ambiguously
| expressed, but I mean "8k tokens as input is a lot of
| information to encode"
| andrewstuart wrote:
| Anyone got links to examples of text embedding?
| BoorishBears wrote:
| Easiest example is taking three words: Universe, University,
| College.
|
| - University and Universe are similar alphabetically.
|
| - University and College are similar in meaning.
|
| Take embeddings for those three words and `University` will be
| near `College`, while `Universe` will be further away, because
| embeddings capture meaning:
|
| University<-->College<-------------->Universe
|
| _
|
| With old school search you'd need to handle the special case of
| treating University and College as similar, but embeddings
| already handle it.
|
| With embeddings you can do math to find how similar two results
| are, based on how close their vectors are. The closer the
| embeddings, the closer the meaning.
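|
| Here's a minimal sketch of that comparison in code, assuming the
| sentence-transformers library and its all-MiniLM-L6-v2 model (any
| embedding model illustrates the same idea):
|
|     # Cosine similarity between word embeddings.
|     from sentence_transformers import SentenceTransformer, util
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     vecs = model.encode(["University", "College", "Universe"])
|
|     # 1.0 = same direction, ~0 = unrelated.
|     print(util.cos_sim(vecs[0], vecs[1]))  # University vs College: high
|     print(util.cos_sim(vecs[0], vecs[2]))  # University vs Universe: lower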
| osigurdson wrote:
| Another interesting point is that math can be performed on
| embedding vectors: emb("king") - emb("man") + emb("woman") =
| emb("queen").
| minimaxir wrote:
| That's a property of Word2Vec _specifically_ due to how it's
| trained (a shallow network where most of the "logic"
| would be contained within the embeddings themselves). Using
| it for embeddings generated from LLMs or Embedding layers
| will not give as fun results; in practice the only thing
| you can do is average or cluster them.
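|
| For the classic word-vector arithmetic, a sketch assuming the
| gensim library and one of its downloadable pretrained vector sets
| (not an LLM embedding model):
|
|     import gensim.downloader as api
|
|     wv = api.load("glove-wiki-gigaword-100")  # pretrained word vectors
|     # king - man + woman ~= queen, for word vectors specifically
|     print(wv.most_similar(positive=["king", "woman"],
|                           negative=["man"], topn=1))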
| TeMPOraL wrote:
| > _That's a property of Word2Vec specifically due to how
| it's trained (a shallow network where most of the "logic"
| would be contained within the embeddings themselves)._
|
| Is it though? I thought the LLM-based embeddings are
| _even more fun_ for this, as you have many more
| interesting directions to move in. I.e. not just:
|
| emb("king") - emb("man") + emb("woman") = emb("queen")
|
| But also e.g.:
|
| emb(<insert a couple-paragraph-long positive book
| review>) + a*v(sad) + b*v(short) - c*v(positive) =
| emb(<a single paragraph, negative and depressing review>)
|
| Where a, b, c are some constants to tweak, and v(X) is a
| vector for quality X, which you can get by embedding a
| bunch of texts expressing the quality X and averaging
| them out (or doing some other dimensional reduction
| trickery).
|
| I've suggested this on HN some time ago, but only been
| told that I'm confused and the idea is not even wrong.
| But then, there was this talk on some AI conference
| recently[0], where the speaker demonstrated exactly this
| kind of latent space translations of text in a language
| model.
|
| --
|
| [0] -
| https://www.youtube.com/watch?v=veShHxQYPzo&t=13980s -
| "The Hidden Life of Embeddings", by Linus Lee from
| Notion.
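|
| A purely illustrative sketch of that direction-vector arithmetic,
| assuming numpy and some embed() function from whichever model you
| use (this is only the vector math; it says nothing about decoding
| the result back into text):
|
|     import numpy as np
|
|     def direction(texts, embed):
|         # Average embeddings of texts sharing a quality, e.g. "sad".
|         return np.mean([embed(t) for t in texts], axis=0)
|
|     # v_sad = direction(["a heartbreaking loss", "a mournful end"], embed)
|     # nudged = embed(positive_review) + 0.5 * v_sad
|     # nudged /= np.linalg.norm(nudged)  # renormalize before cosine search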
| simonw wrote:
| That talk used a novel embeddings model trained by the
| speaker which does exhibit this kind of property - but
| that was a new (extremely cool) thing, not something that
| other embeddings models can do.
| RossBencina wrote:
| OpenAI have a brief explainer with a bunch of example use cases
| here:
|
| https://platform.openai.com/docs/guides/embeddings/what-are-...
| Nitrolo wrote:
| Is there something like oobabooga to easily run this in a click-
| and-run way? Where I can load up a model, a text, and ask it
| questions?
| brucethemoose2 wrote:
| iirc ooba has its own integrated vectordb called superbooga.
|
| I bet you could hack this in.
| simonw wrote:
| See my comment here:
| https://news.ycombinator.com/item?id=38020655 for a CLI tool
| that lets you do this.
|
| Note that embedding models are a different kind of thing from a
| Large Language Model, so it's not the kind of model you can ask
| questions of.
|
| It's a model which can take text and turn it into an array of
| floating point numbers, which you can then use to implement
| things like semantic search and related documents.
|
| More on that here:
| https://simonwillison.net/2023/Oct/23/embeddings/
| minimaxir wrote:
| The Hugging Face page for the model has a two line load-and-
| encode Python code demo: https://huggingface.co/jinaai/jina-
| embeddings-v2-base-en
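|
| If memory serves, the snippet on that page is roughly this
| (trust_remote_code pulls in Jina's custom pooling code; check the
| linked page for the exact version):
|
|     from transformers import AutoModel
|
|     model = AutoModel.from_pretrained(
|         "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
|     )
|     embeddings = model.encode(["How is the weather today?",
|                                "What is the current weather like today?"])
|     print(embeddings.shape)  # (2, 768)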
| sroussey wrote:
| Does anyone know what they are using for this comparison and
| ranking? And where does instruct-xl stand in the mix?
| sroussey wrote:
| Oh duh, it's right in the post and instructor-xl is number 9.
| And so many new participants now!
| sroussey wrote:
| The rankings are here:
|
| https://huggingface.co/spaces/mteb/leaderboard
|
| It's amazing how many new and better ones there are since I
| last looked a few months ago. Instructor-xl was number 1, now
| it is number 9, and its size is more than 10x that of the
| number 2 ranked model!
|
| Things move fast!
| RossBencina wrote:
| Some relevant stats from the link:
|
| 8192 token input sequence length
|
| 768 embedding dimensions
|
| 0.27GB model (with 0.07GB model also available)
|
| Tokeniser: BertTokenizer [1], 30528 token vocab [2]
|
| Is an 8K sequence length directly comparable to text-embedding-
| ada-002 if the vocabulary is much smaller? I seem to remember its
| tokeniser has a larger vocabulary.
|
| [1] https://huggingface.co/jinaai/jina-embeddings-v2-base-
| en/blo...
|
| [2] https://huggingface.co/jinaai/jina-embeddings-v2-base-
| en/blo...
| LoganDark wrote:
| > Is an 8K sequence length directly comparable to text-
| embedding-ada-002 if the vocabulary is much smaller? I seem to
| remember its tokeniser has a larger vocabulary.
|
| Words that aren't in the vocabulary can still be represented by
| multiple tokens. Some models can input and output valid UTF-8
| at the byte level (rather than needing a unique token for each
| codepoint). For example RWKV-World.
| space_fountain wrote:
| A large vocabulary means less tokens are needed to represent
| the same information
| HPMOR wrote:
| *fewer
|
| Less is used for qualitative data like "I love him less".
| Whereas fewer is used for countable things like "I need
| fewer tokens."
| scubbo wrote:
| Username checks out.
| LoganDark wrote:
| Thanks.
| DavidSJ wrote:
| A uniform distribution over 30528 tokens is just under 15 bits
| of information per token, whereas a vocabulary size of ~60000
| would be just under 16 bits per token. In practice it's not
| uniform, but this shows that they're in the same ballpark.
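|
| A quick check of those figures:
|
|     import math
|
|     print(math.log2(30528))  # ~14.9 bits per token
|     print(math.log2(60000))  # ~15.9 bits per token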
| rajin112 wrote:
| Thanks. What size GPU would you need to fine-tune or run
| inference?
| jncraton wrote:
| This is great to see. It looks like the size of the embedding
| vector is half the size of text-embedding-ada-002 (768 vs 1536)
| while providing competitive performance. This will save space in
| databases and make lookups somewhat faster.
|
| For those unaware, if 512 tokens of context is sufficient for
| your use case, there are already many options that outperform
| text-embedding-ada-002 on common benchmarks:
|
| https://huggingface.co/spaces/mteb/leaderboard
| minimaxir wrote:
| The 768D-sized embeddings compared to OpenAI's 1536D embeddings
| are actually a feature outside of index size.
|
| In my experience, OpenAI's embeddings are overspecified and do
| very poorly with cosine similarity out of the box as they match
| syntax more than semantic meaning (which is important as that's
| the metric for RAG). Ideally you'd want cosine similarity in
| the range of [-1, 1] on a variety of data but in my experience
| the results are [0.6, 0.8].
| karxxm wrote:
| You wrote ,,out of the box", did you find a way to improve
| this?
| teaearlgraycold wrote:
| You can do PCA or some other dimensionality reduction
| technique. That'll reduce computation and improve
| signal/noise ratio when comparing vectors.
| karxxm wrote:
| Unfortunately this is not feasible with a large amount of
| words due to the quadratic scaling. But thanks for the
| response!
| minimaxir wrote:
| Not sure what you mean by large amount of words. You can
| fit a PCA on millions of vectors relatively performantly,
| then inference from it is just a matmul.
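|
| A minimal sketch of that, assuming scikit-learn (the 1536 stands
| in for ada-002's dimensionality):
|
|     import numpy as np
|     from sklearn.decomposition import PCA
|
|     stored = np.random.rand(100_000, 1536)  # stand-in for your embeddings
|     pca = PCA(n_components=256).fit(stored)
|
|     reduced = pca.transform(stored)          # (100000, 256)
|     query = pca.transform(np.random.rand(1, 1536))  # matmul at query time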
| TeMPOraL wrote:
| Unless I'm missing something, it should be possible to map
| out in advance which dimensions represent syntactic aspects,
| and then downweigh or remove them for similarity comparisons.
| And that map should be a function of the model alone, i.e.
| fully reusable. Are there any efforts to map out the latent
| space of ada models like that?
| e1g wrote:
| Their OpenAI benchmark is GPT3 (text-embedding-ada-002), not
| GPT4.
| simonw wrote:
| "text-embedding-ada-002" isn't GPT3, it's a different kind of
| model. Embedding models and Large Language Models aren't the
| same thing.
| e1g wrote:
| LLMs and embedding models are certainly different, but it's a
| useful benchmark to calibrate expectations. OpenAI released
| text-embedding-ada-002 a year ago, and they describe the ada
| model as[1] "the original GPT-3 base model [...] capable of
| very simple tasks, usually the fastest model in the GPT-3
| series".
|
| It's fair to expect GPT-3-level results - not GPT-3.5, and
| certainly not a tiny open-source GPT-4, as some might think
| when they read "rivaling OpenAI".
|
| [1] https://platform.openai.com/docs/models/whisper
| minimaxir wrote:
| When people talked about GPT-3 they always referred to
| davinci which is the largest model, not ada.
| simonw wrote:
| No, you're confusing two things here.
|
| "text-ada-001" is LLM in the GPT3 family, described as
| "Capable of very simple tasks, usually the fastest model in
| the GPT-3 series, and lowest cost"
|
| "text-embedding-ada-002" is entirely different - that page
| describes it as "Our second generation embedding model,
| text-embedding-ada-002 is designed to replace the
| previous 16 first-generation embedding models at a fraction
| of the cost."
| minimaxir wrote:
| tl;dr OpenAI is bad at product naming.
| e1g wrote:
| OpenAI doesn't say directly what text-embedding-ada-002
| is, but in the release blog post they show that
| performance is comparable to davinci/curie, which places
| it firmly in the universe of GPT3. I understand it's not
| a straight line comparison, but to me it's still a useful
| mental heuristic about what to expect.
|
| [1] https://openai.com/blog/new-and-improved-embedding-
| model (see "Model improvements")
| helloplanets wrote:
| Reading through that article, the specific Davinci/Curie
| models they seem to be referring to are called the
| following: 'text-search-davinci-001', 'text-search-
| curie-001', 'text-similarity-davinci-001' and 'text-
| similarity-curie-001'.
|
| Are you sure these have anything to do with 'text-
| davinci-003' or 'text-curie-001'?
|
| Will have to agree with everyone here that OpenAI is good
| at being extremely confusing. It seems like the logic
| might be something along the lines of the 'text-search'
| portion being the actual type of the model, while the
| 'curie-001' / '<name>-<number>' format is just a
| personalized way of expressing the version of that type
| of model. And the whole 'GPT<number>' category used to be
| a sort of family of models, but now they've just switched it
| to the actual name of the newer gargantuan LLMs. Then,
| because the 'GPT<number>' models are now that different
| thing altogether these days, the newest 'text-embedding'
| model is just named 'ada-<number>' because it's on that
| iteration of the 'text-embedding' type of model, adhering
| to the older principle of naming their models? Not sure,
| ha. Definitely feels like doing some detective work.
| simonw wrote:
| You mean this table here?
|
|     text-embedding-ada-002        53.3
|     text-search-davinci-*-001     52.8
|     text-search-curie-*-001       50.9
|     text-search-babbage-*-001     50.4
|     text-search-ada-*-001         49.0
|
| That's not comparing it to the davinci/curie/babbage GPT3
| models, it's comparing to the "text-search-*" family.
|
| Those were introduced in
| https://openai.com/blog/introducing-text-and-code-
| embeddings as the first public release of embeddings
| models from OpenAI.
|
| > We're releasing three families of embedding models,
| each tuned to perform well on different functionalities:
| text similarity, text search, and code search. The models
| take either text or code as input and return an embedding
| vector.
|
| It's not at all clear to me if there's any relationship
| between those and the GPT3 davinci/curie/babbage/ada
| models.
|
| My guess is that OpenAI's naming convention back then was
| "davinci is the best one, then curie, then babbage, then
| ada".
| e1g wrote:
| How interesting. I assumed that a consistent codename
| such as Ada/Davinci refers to the lineage/DNA of the
| OpenAI model from which a distinct product was created.
| But I can see how these codenames could be "just" a
| revision label of A/B/C/D (Ada/Babbage/Curie/Davinci),
| similar to "Pro/Max/Ultra". If true, a product named "M2
| Ultra" could have nothing to do with another product
| called "Watch Ultra".
| simonw wrote:
| Wow I genuinely hadn't noticed the A/B/C/D thing!
| tayo42 wrote:
| You can't fine-tune without using their library tied to their
| cloud? Did I misunderstand? Do you need to fine-tune?
| simonw wrote:
| I just shipped a new llm-embed-jina plugin for my LLM tool which
| provides access to these new Jina models:
| https://github.com/simonw/llm-embed-jina
|
| Here's how to try it out.
|
| First, install LLM. Use pip or pipx or brew:
|
|     brew install llm
|
| Next install the new plugin:
|
|     llm install llm-embed-jina
|
| You can confirm the new models are now available to LLM by
| running:
|
|     llm embed-models
|
| You should see a list that includes "jina-embeddings-v2-small-en"
| and "jina-embeddings-v2-base-en"
|
| To embed a string using the small model, run this:
|
|     llm embed -m jina-embeddings-v2-small-en -c 'Hello world'
|
| That will output a JSON array of 512 floating point numbers (see
| my explainer here for what those are:
| https://simonwillison.net/2023/Oct/23/embeddings/#what-are-e...)
|
| Embeddings are only really interesting if you store them and use
| them for comparisons.
|
| Here's how to use the "llm embed-multi" command to create
| embeddings for the 30 most recent issues in my LLM GitHub
| repository:
|
|     curl 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
|       | jq '[.[] | {id: .id, title: .title}]' \
|       | llm embed-multi -m jina-embeddings-v2-small-en jina-llm-issues - \
|         --store
|
| This creates a collection called "jina-llm-issues" in a default
| SQLite database on your machine (the path to that can be found
| using "llm collections path").
|
| To search for issues in that collection with titles most similar
| to the term "bug":
|
|     llm similar jina-llm-issues -c 'bug'
|
| Or for issues most similar to another existing issue by ID:
|
|     llm similar jina-llm-issues 1922688957
|
| Full documentation on what you can do with LLM and embeddings
| here: https://llm.datasette.io/en/stable/embeddings/index.html
|
| Alternative recipe - this creates embeddings for every single
| README.md in the current directory and its subdirectories. Run
| this somewhere with a node_modules folder and you should get a
| whole lot of interesting stuff:
|
|     llm embed-multi jina-readmes \
|       -m jina-embeddings-v2-small-en \
|       --files . '**/README.md' --store
|
| Then search them like this:
|
|     llm similar jina-readmes -c 'backup tools'
| X6S1x6Okd1st wrote:
| Thank you so much for all the work you've put into llm!
| dazzaji wrote:
| Excellent! And you were just saying how risky it is to rely
| long-term on OpenAI text embeddings in your post on the topic.
| The timing for this open source option worked out nicely.
| bosky101 wrote:
| The only feedback I had from your embedding post was
| wishing we could create the array of floating points without
| OpenAI.
|
| Great timely turnaround time, good sir. Ht
| simonw wrote:
| Wrote this up on my blog:
| https://simonwillison.net/2023/Oct/26/llm-embed-jina/
| mike_ivanov wrote:
| JFYI, this is what happens on my M1 MacBook:
|
|     $ brew install llm
|     $ llm
|     ModuleNotFoundError: No module named 'typing_extensions'
|
| Not sure where to report it.
| simonw wrote:
| Whoa, that is a weird one. Do you know what version of Python
| you have from Homebrew?
|
| It looks like that package is correctly listed in the
| formula: https://github.com/Homebrew/homebrew-
| core/blob/a0048881ba9a2...
| mike_ivanov wrote:
|     % python3 --version
|     Python 3.11.6
|     % which python3
|     /opt/homebrew/bin/python3
|     % brew info python-typing-extensions
|     ==> python-typing-extensions: stable 4.8.0 (bottled)
| IanCal wrote:
| Probably not this, but check with `which llm` what that's
| running. I had weird issues not matching the documentation,
| but it turned out I just had some _other_ random Python CLI
| tool called llm that I'd put in my home bin and forgotten
| about.
| mike_ivanov wrote:
|     % which llm
|     /opt/homebrew/bin/llm
| jillesvangurp wrote:
| Thanks, this is wonderfully simple to use. Just managed to
| package this up using Docker and was able to get it running
| without a lot of drama.
|
| I've dabbled a bit with elasticsearch dense vectors before and
| this model should work great for that. Basically, I just need
| to feed it a lot of content and add the vectors and vector
| search should work great.
| michalmatczuk wrote:
| FYI it seems that llm install llm-embed-jina is missing the
| yaml dependency:
|
|     File "/opt/homebrew/Cellar/llm/0.11_1/libexec/lib/python3.12/site-packages/llm/default_plugins/openai_models.py", line 17, in <module>
|       import yaml
|
|     ModuleNotFoundError: No module named 'yaml'
| simonw wrote:
| Thanks! I wonder if the Python 3.12 upgrade broke something.
|
| The pyyaml package is correctly listed on the formula page
| though: https://formulae.brew.sh/formula/llm
| neximo64 wrote:
| Does it match OpenAI on number of params?
| minimaxir wrote:
| No one knows, since OpenAI has not disclosed the number of
| parameters their embeddings model uses.
| andy99 wrote:
| What is the use case for an 8k token embedding? My (somewhat
| limited) experience with long context models is they aren't great
| for RAG. I get the impression they are optimized for something
| else, like writing 8k+ tokens rather than synthesizing responses.
|
| Isn't the normal way of using embedding to find relevant text
| snippets for a RAG prompt? Where is it better to have coarser
| retrieval?
| kristopolous wrote:
| Is this what you mean by RAG?
| https://www.promptingguide.ai/techniques/rag?
| teaearlgraycold wrote:
| Yes
| simonw wrote:
| I have an explanation of RAG in the context of embeddings
| here: https://simonwillison.net/2023/Oct/23/embeddings/#answe
| ring-...
| Grimburger wrote:
| You could just sum it up for us all rather than diverting
| to your blog?
|
| It's Retrieval Augmented Generation btw.
|
| To quote:
|
| > The key idea is this: a user asks a question. You search
| your private documents for content that appears relevant to
| the question, then paste excerpts of that content into the
| LLM (respecting its size limit, usually between 3,000 and
| 6,000 words) along with the original question.
|
| > The LLM can then answer the question based on the
| additional content you provided.
| simonw wrote:
| > You could just sum it up for us all rather than do a
| divert to your blog?
|
| Why? Have links gone out of fashion?
|
| I even linked directly to the relevant section rather
| than linking to the top of the page.
|
| The paper that coined the term used the hyphen, though I
| think I prefer it without:
| https://arxiv.org/abs/2005.11401
| Grimburger wrote:
| > Have links gone out of fashion?
|
| Yes.
|
| You wrote far more words than needed to answer the
| comment, I did it for you instead.
| simonw wrote:
| One of the reasons I write so much stuff is so I can
| provide links to things I've written to answer relevant
| questions.
| scubbo wrote:
| And those of us with the sense to value your insight, and
| the attention-span to read more than tweet-sized content,
| thank you for it.
| mhog_hn wrote:
| Thank you, nice blog.
| discordance wrote:
| Thanks so much for your writings and for posting the link
| (and also for Datasette!). I've learned in the past few
| months from your blog.
| monkeydust wrote:
| Appreciate it. Your posts in general have been great -
| accessible to a large audience, quality links to follow
| up research and catchy analogies even when they don't
| fully hold true (llm as a calculator for words - which I
| admit I use with citation!). Keep going.
| gar1t wrote:
| I liked your link a lot.
| hboon wrote:
| Just to add that, we appreciate that very much.
| gkbrk wrote:
| "Links have gone out of fashion" is an odd thing to write
| on a Link Aggregator website.
| kristopolous wrote:
| You know you're responding to a programmer famous enough
| to have a Wikipedia page, right?
|
| https://en.m.wikipedia.org/wiki/Simon_Willison
| teaearlgraycold wrote:
| You could get a facsimile of a summary for a full article or
| short story. Reducing an 8k token article to a summary using a
| completions model would cost far more. So if you need to search
| through collections of contracts, scientific papers, movie
| scripts, etc. for recommendations/clustering then bigger input
| sizes can do that in one shot.
|
| Think of it like skipping the square root step in Euclidean
| distance. Perfectly valid as long as you don't want a distance
| so much as a way to compare distances. And doing so skips the
| most computationally expensive operation.
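|
| For the curious, that trick in code:
|
|     import numpy as np
|
|     def squared_distance(a: np.ndarray, b: np.ndarray) -> float:
|         d = a - b
|         # Monotonic in the true Euclidean distance, so ordering of
|         # neighbours is preserved while the sqrt is skipped.
|         return float(d @ d)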
| refulgentis wrote:
| I think I'm missing something: like, yeah, it's vector search
| for bigger text chunks. But arguably vector search with
| bigger text chunks is _definitively_ worse -- this isn't
| doing summarization, just turning about 25 pages of text into
| 1024 floats, which you can then compare via cosine similarity
| to measure semantic similarity to other text.
|
| I'd much rather know what paragraph to look in than what 25
| pages to look in
| simonw wrote:
| I imagine it's more useful for finding related articles and
| clustering things than for semantic search, which will work
| much better against smaller chunks - especially if you're
| implementing Retrieval Augmented Generation.
| rolisz wrote:
| I think the point is: if you compress 25 pages of text
| into 1024 floats, you will lose a ton of information,
| regardless of what the use case is, so you're probably
| still better off with chunking.
| simonw wrote:
| I've been getting great results for related documents by
| embedding entire blog posts, e.g. here:
| https://til.simonwillison.net/gis/pmtiles#related
|
| I'm not sure how I would do that after chunking.
| thomasahle wrote:
| Did you compare with simple baselines like bag-of-words
| and word vectors?
| simonw wrote:
| My previous implementation used TF-IDF - I basically took
| all the words in the post and turned them into a giant
| "word OR word OR word OR word" search query and piped
| that through SQLite full-text search.
| https://til.simonwillison.net/sqlite/related-content
|
| I jumped straight from that to OpenAI embeddings. The
| results were good enough that I didn't spend time
| investigating other approaches.
| rolisz wrote:
| That's not quite tfidf though. I agree you can get better
| results than that with Ada embeddings, but I would argue
| you can get even better results with embeddings from
| smaller chunks.
| simonw wrote:
| I guess technically it's bm25, since it's using the rank
| mechanism in SQLite FTS5: https://www.sqlite.org/fts5.htm
| l#sorting_by_auxiliary_functi...
| thomasahle wrote:
| > Into a giant "word OR word OR word OR word"
|
| Does that mean you'd return other docs if they share just
| one word?
|
| The idea of tfidf is that it gives you a vector (maybe
| combined with pca or a random dimensionality reduction)
| that you can use just like an Ada embedding. But you
| still need vector search.
| simonw wrote:
| My goal for related articles was to first filter to every
| document that shared at least one word with the target -
| which is probably EVERY document in the set - but then
| rank them based on which ones share the MOST words,
| scoring words that are rare in the corpus more highly.
| BM25 does that for free.
|
| Then I take the top ten by score and call those the
| "related articles".
| teaearlgraycold wrote:
| Ever read the back of a book?
| TeMPOraL wrote:
| You mean the marketing blurb? Those tend to carry low
| information value, sometimes even _negative_ - as in, if
| you didn't know anything else about the book, reading
| the blurb will make you _even more wrong_ about it than
| you were. This is a common feature of marketing copy.
| TeMPOraL wrote:
| > _if you compress 25 pages of text into 1024 floats, you
| will lose a ton of information_
|
| Sure, but then if you do it one page at a time, or one
| paragraph at a time, you lose a ton of _meaning_ - after
| all, individual paragraphs aren't independent of each
| other. And meaning is kind of the whole point of the
| exercise.
|
| Or put another way, squashing a ton of text loses you
| some high-frequency information, while chunking cuts off
| the low-frequency parts. Ideally you'd want to retain
| both.
| kordlessagain wrote:
| I think the loss of (low-frequency) meaning from doing
| separate chunks is probably smaller than the loss of
| high-frequency meaning from doing the whole
| document at once. As you
| say, doing both is probably a good strategy, and I think
| that's why we see a lot of "summarize this text"
| approaches.
|
| I use a multi-pronged approach to this based on a special
| type of summarization. I chunk on sentences using
| punctuation until they are just over 512 characters, then
| I embed them. After embedding, I ask a foundation model
| to summarize (or ask a question about the chunk) and then
| generate keyterms for it. Those keyterms are stored along
| with the vector in the database. During search, I use the
| user's input to do a vector search for matching chunks,
| then pull their keyterms in. Using those keyterms, I do
| set operations to find related chunks. I then run a
| vector search against these to the top matches from the
| vector search to assemble new prompt text.
|
| This strategy is based on the idea of a "back of the book
| index". It is entirely plausible to look for "outliers"
| in the keyterms and consider throwing those chunks with
| those keyterms in there to see if it nets us
| understanding of some "hidden" meaning in the document.
|
| There is also a means to continue doing the "keyterm"
| extraction trick as the system is used. Keyterms from
| answers as well as user prompts may be added to the
| existing index over time, thus helping improve the
| ability to return low frequency information that may be
| initially hidden.
| imranhou wrote:
| Good point, I wonder how different it is to use a large
| context here vs having some other model summarize an 8k
| article into a small paragraph and using embedding from
| the paragraph instead where such a large context wouldn't
| be necessary.
| antman wrote:
| you could do both
| scotty79 wrote:
| Isn't it up to 8k? So you can index your documents by
| paragraphs if you prefer?
| dragonwriter wrote:
| > What is the use case for an 8k token embedding?
|
| Calculating embeddings on larger documents than smaller-window
| embedding models can handle.
|
| > My (somewhat limited) experience with long context models is
| they aren't great for RAG.
|
| The only reason they wouldn't be great for RAG is that they
| aren't great at using information in their context window,
| which is possible (ISTR that some models have a strong recency
| bias within the window, for instance) but I don't think is a
| general problem of long context models.
|
| > Isn't the normal way of using embedding to find relevant text
| snippets for a RAG prompt?
|
| I would say the usual use is for search and semantic similarity
| comparisons generally. RAG is itself an application of search,
| but it's not the only one.
| 3abiton wrote:
| I wonder how the performance fares when context size is
| increased. Intuitively this should be higher, but some
| quantized models I've tested showed noticeably worse
| performance.
| Kubuxu wrote:
| Your KV cache size is linear with context size which might
| put you tight on memory. There is also increased cost of
| recalculating KV cache of context window when the window
| has to move but this is close to being solved with
| streaming LLMs.
| woadwarrior01 wrote:
| BERT style encoder-only models, like the embedding model
| being discussed here, don't need a KV cache for
| inference. A KV cache is only needed for efficient
| inference with encoder-decoder and decoder-only (aka GPT)
| models.
| moralestapia wrote:
| Ada is one of the worst models (if not the worst) offered by
| OpenAI, though ...
| simonw wrote:
| You're thinking of the old "ada" GPT-3 model - the one that was
| a companion to "davinci" and "babbage".
|
| I believe "text-embedding-ada-002" is entirely unrelated to
| those old GPT-3 models. It's a recent embedding model (released
| in December 2022 - https://openai.com/blog/new-and-improved-
| embedding-model ) which OpenAI claim is their current best
| available embedding model.
|
| I understand your confusion: OpenAI are notoriously bad at
| naming things!
| moralestapia wrote:
| Oh, thanks for clarifying!
|
| Edit: looking at the press release, the improvement over old
| Ada is ... marginal? And Ada-01 is/was a poor performing
| model, tbh. I guess I'll have to run some tests, but at first
| sight it doesn't seem that wow-ey.
| LASR wrote:
| So just to be super clear, this is an embedding model. It
| generates no text. It's not outputting words.
|
| Maybe I am assuming incorrectly, but I think the poor
| performance you are referring to is the old Ada completion
| model, where the output is text. That was poor indeed.
| itake wrote:
| This article is not kind to the old ada embeddings model:
|
| https://medium.com/@nils_reimers/openai-gpt-3-text-
| embedding...
|
| If the new ada model only has marginal improvements, it
| seems open source is way to go.
| Zuiii wrote:
| Color me surprised! It looks like it's actually open source
| (Apache 2.0) and not the usual false advertising by some two-
| faced company or institution. Links here:
|
| * https://huggingface.co/jinaai/jina-embeddings-v2-base-en
| * https://huggingface.co/jinaai/jina-embeddings-v2-small-en
| nicognaw wrote:
| Jina AI itself is also a great framework to expose APIs from deep
| neural net models and deploy them to Kubernetes clusters, which I
| think is very promising, but they didn't get as much hype as I
| thought they deserved.
| pknerd wrote:
| Pardon my ignorance in advance but could it be used to "chat"
| with PDFs and websites? I am looking for OpenAI alternatives as I
| am in the learning phase.
| clarkmcc wrote:
| Check out my little side project for chatting with PDFs. You
| should be able to load most models including this one.
| https://github.com/clarkmcc/chitchat
| pknerd wrote:
| This looks cool so can it be used to feed Website/Products
| data in CSV/JSON format and "chat" with it?
| clarkmcc wrote:
| Pretty much! Right now it only supports md, pdf, txt, and
| html, but supporting additional formats is trivial:
| https://github.com/clarkmcc/chitchat/blob/main/src-
| tauri/src....
| canadaduane wrote:
| No, this is an embedding model, not a text completion model.
| lofties wrote:
| No. "Chatting with PDFs" is (mostly) taking a users chat
| message, retrieve relevant content via e.g embedding search,
| then feed that into an LLM with a prompt that's something along
| the lines of "given this information, can you answer this
| question".
|
| This tool helps with the embedding part.
|
| I've built a bunch of "chat with your PDFs" bots, do reach out
| if you have any questions: me at brian.jp.
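|
| A rough sketch of that flow, where embed() and complete() are
| hypothetical stand-ins for whatever embedding model and LLM you
| plug in:
|
|     import numpy as np
|
|     def answer(question, chunks, embed, complete, k=5):
|         # Rank stored document chunks by cosine similarity to the question.
|         q = embed(question)
|         q = q / np.linalg.norm(q)
|         scored = []
|         for chunk in chunks:
|             v = embed(chunk)
|             scored.append((float(q @ (v / np.linalg.norm(v))), chunk))
|         scored.sort(key=lambda pair: pair[0], reverse=True)
|
|         context = "\n\n".join(chunk for _, chunk in scored[:k])
|         prompt = (f"Given this information:\n\n{context}\n\n"
|                   f"Can you answer this question? {question}")
|         return complete(prompt)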
| pknerd wrote:
| Actually I wanna use LangChain. OpenAI is not free. I wanted
| to test two use cases:
|
| - chat with documents(pdf, doc etc)
|
| - chat with website. Like, if I integrate with an ecommerce
| site, I can ask questions about the website. What free options
| do I have, both cloud and local?
| seydor wrote:
| Using the Bing tab of the Microsoft Edge browser, you can chat
| with PDFs, and I think they use GPT-4 or equivalent.
| marinhero wrote:
| How well do LLMs like this work with a non-English language? Or
| are these open source models limited to English?
| simonw wrote:
| Quite a few of the top ranked models on this leaderboard are
| multilingual: https://huggingface.co/spaces/mteb/leaderboard
|
| https://huggingface.co/BAAI/bge-large-en-v1.5 FlagEmbedding for
| example describes itself as covering Chinese and English.
| ttul wrote:
| That depends on whether the training data contained languages
| other than English.
| anigbrowl wrote:
| Stability has a Japanese port which is getting lots of work
| https://twitter.com/StabilityAI_JP/status/171699857824440759...
| m3at wrote:
| This is not an embedding model though. Yes you can always
| extract some embeddings from somewhere, but for most LLMs
| those won't perform well for retrieval (which makes sense as
| it's not what the models are optimizing for)
| backendEngineer wrote:
| oh thank god I first read Jira...
| eshack94 wrote:
| You're not the only one... glad I misread that.
| dylanjcastillo wrote:
| I wonder how much better this is, compared to taking the average
| (or some other aggregation) of embeddings with a smaller context
| length. Has anyone done a similar comparison?
| pietro72ohboy wrote:
| The issue with averaging is that over large inputs, it drowns
| out small signal. For example, there is a chance that it
| completely loses a reference to something made only in a single
| sentence somewhere in a large document.
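|
| For reference, a sketch of the averaging approach being discussed,
| assuming numpy and some embed() function returning vectors (one
| important sentence barely moves the mean, hence the dilution):
|
|     import numpy as np
|
|     def document_vector(chunks, embed):
|         vecs = np.stack([embed(c) for c in chunks])
|         doc = vecs.mean(axis=0)
|         return doc / np.linalg.norm(doc)  # normalize for cosine use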
| extasia wrote:
| Is this a text encoder model, BERT style?
| Kutsuya wrote:
| This is super cool! I wish there was an easy-to-follow guide on
| how to make your own embedding, for Llama 2 for example. All I
| can find are various guides that already assume you know
| everything there is to know about training an embedding.
|
| I just want to make an embedding between a conversation of me and
| my friend and simulate talking to them. Is this a hard thing to
| train to begin with?
|
| If anyone knows or could help me with this, I would be very
| grateful!
| infecto wrote:
| I will butcher this so if any experts see this please don't
| flame me. I think you might be conflating ideas? You could
| definitely fine-tune existing embedding models or train your
| own from scratch but the goals of embeddings models are
| different than a LLM conversation. Embedding models are used
| for things like, classifying, search, image captioning...maybe
| at a high level anything where you have high dimensionality
| that you need to condense?
|
| What you are asking for sounds like fine tuning an existing
| LLM...where the data will be tokenized but the outcomes are
| different? There are a lot of writeups on how people have done
| it. You should especially follow some of the work on
| Hugging Face. To replicate talking to your friend though, you
| will need a very large dataset to train off of, I would think,
| and it's unclear to me if you can just fine-tune or you would
| need to train a model from scratch. So a dataset with 10s of
| thousands of examples and then you need to train it on a GPU.
|
| https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...
| 3cats-in-a-coat wrote:
| Great company name.
| smcl wrote:
| I'm gonna try to explain this because I thought the same thing,
| though you may enjoy it for another reason. Among Czech or
| other slavic software people - "jina AI" could be like "another
| AI" and, to me at least, brings to mind the "yet another
| {thing}" naming convention (yacc = "yet another compiler
| compiler" for example).
| itronitron wrote:
| It's weird to think there are entire companies built around
| providing access to a pre-computed vector space model.
| pietz wrote:
| I'm always happy to see OSS contributions but I don't quite
| understand why this model is so remarkable. As the leaderboard
| suggests it's ranking lower than OpenAI embeddings, while 14
| other contributions are even better than that. Many of which
| feature a comparable or lower dimensionality than 768.
|
| The 8k context window is new, but isn't the 512 token limitation
| a soft limit anyway? I'm pretty sure I can stuff bigger documents
| into BGE for example.
|
| Furthermore, I think that most (all?) benchmarks in the MTEB
| leaderboard deal with very small documents. So there is nothing
| here that validates how well this model does on larger documents.
| If anything, I'd pick a higher ranking model because I put little
| trust in one that only ranks 17th on small documents. Should I
| expect it to magically get better when the documents get larger?
|
| Plus, you can expect that this model was designed to perform well
| on the datasets in MTEB while the OpenAI model probably wasn't.
|
| Many also stated that 8k-context embeddings will not be very
| useful in most situations.
|
| When would anyone use this model?
| infecto wrote:
| I have been trying to understand the hype as well. Happy to see
| all the work happening in this space still.
|
| I was pretty curious about the context limit. I am not an
| expert in this area but I always thought the biggest problem
| was the length of your original text. So typically you might
| only encode a sentence or a selection of sentences. You could
| always stuff more in but then you are potentially losing the
| specificity, I would think that is a function of the
| dimensionality. This model is 768, are they saying I can stuff
| 8k tokens worth of text and can utilize it just as well as I
| have with other models on a per 1-3 sentence level?
| infecto wrote:
| Thinking about it some more as I read through more comments.
| I guess in the stated case of research papers it can make
| sense if your task is looking for the common themes and not
| specific details. If you are embedding a sentence or a
| paragraph you miss out on the connection between those
| sentences across the whole paper... or at least it's harder to
| manage that. By encoding a large number of pages from the
| paper (or the entire paper) you can hopefully do a better job
| of capturing the theme of that paper.
|
| This also opens up another question though, how would that
| compare to using a LLM to summarize that paper and then just
| embed on top of that summary.
| stormfather wrote:
| I would guess that the embedded summary is better, but for
| many tasks where you use embeddings (like document search),
| summarizing every document with an LLM is too expensive and
| slow.
| larodi wrote:
| Potentially useful for paragraph embedding, where... well,
| paragraphs can grow a lot. Not sure how this model fares in
| comparison to other embedding engines (yet), but I can
| definitely tell you mpnet models fare much better for paragraph
| embeddings than the leader in HF's leaderboard (being
| thenlper/gte-large at time of writing).
|
| I can guess the Davinci and similar embeddings work better for
| code than MPNET, and it really matters what you are encoding,
| not only the context length - what features are actually being
| extracted by the embedding engine.
| theptip wrote:
| > The 8k context window is new
|
| Hasn't Claude had this for many months (before they bumped to
| 100k)?
|
| Edit: ah, you mean new for OSS maybe?
| simonw wrote:
| Claude is a large language model, which is a different thing
| from an embedding model.
| theptip wrote:
| Aha, that's what I missed, thanks!
| Der_Einzige wrote:
| Any large language model generates embedding
| representations at every layer of the model, and these can
| be trivially extracted. So, large language models are
| indeed embedding models.
|
| This leaderboard doesn't compare these custom tailored
| embedding models vs the obvious thing of average pooling
| layered with any traditional LLM, which is easily
| implemented using sentence transformers.
| egorfine wrote:
| I fail to imagine an 8k-token-length piece of text that has just
| one single semantic coordinate and is appropriate for embedding
| and vector search.
|
| In my experience, any text is better embedded using a sliding
| window of a few dozen words - this is the approximate size of
| semantic units in a written document in English; although this
| will wildly differ for different texts and topics.
| simonw wrote:
| What are you using those embeddings for?
|
| I can see a sliding window working for semantic search and
| RAG, but not so much for clustering or finding related
| documents.
| egorfine wrote:
| Ah yes, clustering is indeed something that would benefit
| from large context, I agree.
|
| However even so I would think about the documents
| themselves and figure out if it is even needed. Let's say we
| are talking about clustering court proceedings. I'd rather
| extract the abstracts from these documents, embed and cluster
| those instead of the whole text.
| nwhnwh wrote:
| What does this even do?
| tyingq wrote:
| See this story from yesterday:
| https://news.ycombinator.com/item?id=37985489
| woofwoofwoof wrote:
| Just noticed that they (jina.ai) have offices both in Berlin and
| China. I am wondering how they will operate in the
| presence of chip export restrictions and other side effects of
| USA / China tensions.
| do-me wrote:
| Just quantized the models for onnx usage in e.g. transformers.js
| and got 4x reduced file size:
|
| - 28.5 MB jina-embeddings-v2-small-en (https://huggingface.co/do-
| me/jina-embeddings-v2-small-en)
|
| - 109 MB jina-embeddings-v2-base-en (https://huggingface.co/do-
| me/jina-embeddings-v2-base-en)
|
| However, I noted that the base model is performing quite poorly
| on small text chunks (a few words) while the small version seems
| to be unaffected. Might this be some kind of side effect due to
| the way they deal with large contexts?
|
| If you want to test, you can head over to SemanticFinder
| (https://do-me.github.io/SemanticFinder/), go to advanced
| settings, choose the Jina AI base model (at the very bottom) and
| run with "Find". You'll see that all other models perform just
| fine and find "food"-related chunks but the base version doesn't.
| Havoc wrote:
| Why quantize something that is already very small (270mb)?
| luke-stanley wrote:
| When I go to this leaderboard:
| https://huggingface.co/spaces/mteb/leaderboard I click on the
| "Classification" tab, then I see "jina-embeddings-v2-base-en" at
| number 12, with an average score of 73.45. But the highest
| scoring model there is llmrails/ember-v1 with a 75.99 average
| score, but it only supports 512 tokens, so if you need 8K tokens
| to be embedded, I guess Jina is the best. Do people need 8K
| tokens for embedding? Maybe not, but they might need more than
| 512 often enough. It could save a summary-extraction step.
| cztomsik wrote:
| Small context window means you cannot embed the whole document,
| you are embedding just a part.
|
| So, if there is some information at the bottom which is
| dependent on something which is at the top, your embedding
| could be entirely wrong.
| egorfine wrote:
| One thing that is missing in comparison: OpenAI's model is
| multilingual.
|
| And not only does it support and embed a variety of languages, it
| also computes the same coordinates for the same semantics in
| different languages. I.e. if you embed "russia is a terrorist
| state" and "rossiia - strana-terrorist", both of these embeddings
| will have almost the same coordinates.
| m3kw9 wrote:
| I don't really know what that means but it seems useful
___________________________________________________________________
(page generated 2023-10-26 23:02 UTC)