[HN Gopher] All-in-one embedding model for interleaved text, images, and screenshots
___________________________________________________________________
All-in-one embedding model for interleaved text, images, and
screenshots
Author : fzliu
Score : 214 points
Date : 2024-11-17 07:42 UTC (15 hours ago)
(HTM) web link (blog.voyageai.com)
(TXT) w3m dump (blog.voyageai.com)
| carschno wrote:
| This reads as very impressive. Any critical perspectives on the
| presented evaluation? What about non-English text?
|
| I understand the model is, like other commercial ones, available
| exclusively through their API, right?
| stephantul wrote:
| Yes, voyage models are API only.
|
| There was a part here about multilingualism but that was wrong!
| Sorry!
|
| FWIW: Voyage also has separate `law`, `code`, and `finance`
| models. See [1]
|
| Really cool results, anyway.
|
| [1]: https://docs.voyageai.com/docs/embeddings
| fzliu wrote:
| Glad you liked the results! We do have multilingual models
| (and rerankers) -- voyage-3, in particular, is multilingual:
| https://blog.voyageai.com/2024/09/18/voyage-3/
|
| voyage-multimodal-3 is multilingual as well, supporting the
| same set of languages as voyage-3.
| stephantul wrote:
| Sorry for spreading false information. I edited the post
| above.
|
| It is interesting that you're not as upfront about
| multilingualism as Cohere. They seem to mention it a lot, which
| led to my confusion.
| fzliu wrote:
| No worries at all. That's great feedback and an area of
| improvement for us when it comes to future posts -- we'll
| be more explicit about multilingualism in blogs and in
| our docs.
| unit149 wrote:
| In the traditional Python API, the Voyage engine tokenizes
| blocks of text into sequences of tokens before embedding them.
| This model seems to extend that by vectorizing images in the
| same embedding space.
|
| Words like 'you' and 'apple' map to a single token. More complex
| terms like 'pikachu' may be divided into pik-a-chu.
|
| [1]: https://docs.voyageai.com/docs/tokenization
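|
| A minimal sketch of checking this, assuming the public voyageai
| Python client and its count_tokens helper (untested):
|
|     import voyageai  # pip install voyageai
|
|     vo = voyageai.Client()  # reads VOYAGE_API_KEY from the env
|
|     # Common words map to one token; rarer words are split into
|     # subword pieces (e.g. "pikachu" -> "pik", "a", "chu").
|     for text in ["you", "apple", "pikachu"]:
|         n_tokens = vo.count_tokens([text], model="voyage-3")
|         print(f"{text!r}: {n_tokens} token(s)")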
| FergusArgyll wrote:
| I'm missing something. Shouldn't any LLM that's 'natively
| multimodal' somehow include embeddings which are multimodal? For
| example, here's Google's blog post on Gemini:
|
|     Until now, the standard approach to creating multimodal
|     models involved training separate components for different
|     modalities and then stitching them together to roughly mimic
|     some of this functionality. These models can sometimes be
|     good at performing certain tasks, like describing images,
|     but struggle with more conceptual and complex reasoning.
|
|     We designed Gemini to be natively multimodal, pre-trained
|     from the start on different modalities. Then we fine-tuned
|     it with additional multimodal data to further refine its
|     effectiveness. This helps Gemini seamlessly understand and
|     reason about all kinds of inputs from the ground up, far
|     better than existing multimodal models -- and its
|     capabilities are state of the art in nearly every domain.
| aabhay wrote:
| LLM embeddings contain superpositions of many concepts, so
| while they might predict the next token, they don't actually
| outperform contrastively pretrained embedding models.
| fzliu wrote:
| Because LLMs such as Gemini -- and other causal language models
| more broadly -- are trained on next token prediction, the
| vectors that you get from pooling the output token embeddings
| aren't that useful for RAG or semantic search compared to what
| you get from actual embedding models.
|
| One distinction to make here is that _token embeddings_ and the
| embeddings/vectors that are output from _embedding models_ are
| related but separate concepts. There are numerous token
| embeddings (one per token) which become contextualized as they
| propagate through the transformer, while there is a single
| vector/embedding output by an embedding model (one per input,
| such as a long text, photo, or document screenshot).
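|
| A small, untested sketch of that distinction, with the shapes
| faked in numpy (the client call in the comments follows our
| docs):
|
|     import numpy as np
|
|     # Token embeddings: one d-dim vector *per token*, which the
|     # transformer contextualizes layer by layer (faked here).
|     num_tokens, d = 12, 1024
|     token_embs = np.random.randn(num_tokens, d)
|
|     # Naive pooling of a causal LM's token embeddings: average
|     # them into one vector. This "works" but isn't trained for
|     # retrieval, so it underperforms real embedding models.
|     pooled = token_embs.mean(axis=0)  # shape: (d,)
|
|     # An embedding model returns one vector *per input*, e.g.:
|     #   vo = voyageai.Client()
|     #   vec = vo.multimodal_embed(
|     #       inputs=[["full document text", screenshot_pil_image]],
|     #       model="voyage-multimodal-3",
|     #   ).embeddings[0]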
| refulgentis wrote:
| Fwiw if the other replies aren't clear: change "embeddings" to
| "List<double> that some layer of my AI model produces" (that's
| not exactly correct, it's slightly more specific than that, but
| in this context it's correct)
|
| LLMs, including multimodal LLMs, do have embeddings, but
| they're embeddings learned by generating text rather than by
| finding similar documents.
| greatgib wrote:
| Indeed, it's sad that their models are both commercially
| proprietary and API-only.
| doug_durham wrote:
| Sad that people have to pay their employees?
| skeptrune wrote:
| No, but it serves everyone in the "AI retrieval" space better
| if we continue to make rapid improvements. New models are
| great, but not the ultimate solution.
| mech4lunch wrote:
| The colab measures dot product values 0.428 and 0.498, describing
| them as "...similarity value is quite high." Is that high? Can
| you design a system that confidently labels data with a 0.4
| threshold?
| brokensegue wrote:
| The raw output value is generally irrelevant. What matters is
| its position in the distribution of outputs
| fzliu wrote:
| While the raw similarity score does matter, what typically
| matters more is the score relative to other documents. In the
| case of the examples in the notebook, those values were the
| highest in relative terms.
|
| I can see why this may be unclear/confusing -- we will correct
| it. Thank you for the feedback!
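|
| As a sketch of what "relative" means here (assuming unit-
| normalized embeddings, where a dot product equals cosine
| similarity):
|
|     import numpy as np
|
|     def top_k(query_vec, doc_vecs, k=3):
|         # Rank documents by similarity to the query; which one
|         # wins matters more than whether its raw score clears
|         # some fixed threshold like 0.4.
|         sims = doc_vecs @ query_vec
|         order = np.argsort(-sims)[:k]
|         return order, sims[order]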
| minimaxir wrote:
| A 0.4 with cosine similarity is not the same as a 0.4 with
| sigmoid thresholding.
|
| 0.4 cosine similarity is pretty good for real-world data that
| isn't a near-identical duplicate.
| djoldman wrote:
| This is a cool way to look at multimodal embeddings. They look
| at performance as the percentage of inputs slides from one
| modality to another:
|
| https://i0.wp.com/blog.voyageai.com/wp-content/uploads/2024/...
| djoldman wrote:
| This is a key observation that is simple and intuitive:
|
| >All CLIP-like models perform poorly on mixed-modality search due
| to a phenomenon known as the modality gap. As illustrated in the
| figure below, the closest vector to the snippet "I address you,
| members of the Seventy-Seventh Congress..." is not its
| screenshot, but other texts. This leads to search results that
| are skewed towards items of the same modality; in other words,
| _text vectors will be closer to irrelevant texts than relevant
| images in the embedding space_.
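|
| A rough numpy sketch of measuring this gap, given embeddings of
| matched text/screenshot pairs from any CLIP-like model:
|
|     import numpy as np
|
|     def modality_gap(text_vecs, image_vecs):
|         # text_vecs[i] and image_vecs[i] embed the same content.
|         t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
|         v = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
|         cross = (t * v).sum(axis=1).mean()  # text -> own screenshot
|         tt = t @ t.T
|         np.fill_diagonal(tt, np.nan)
|         intra = np.nanmean(tt)              # text -> other texts
|         # The modality gap shows up as intra > cross: texts sit
|         # closer to unrelated texts than to their own screenshots.
|         return cross, intra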
| Zopieux wrote:
| API-only model. No thanks but congrats anyway.
| A4ET8a8uTh0 wrote:
| Agreed on both parts of the statement. Granted, there are
| obvious considerations for an exclusive API focus beyond just
| trying to get money from people, but I personally would not
| consider it, given that they don't offer other options.
| fzliu wrote:
| I understand the sentiment. We are starting to open source
| some tools, mostly around embedding model evaluation (i.e.
| https://github.com/voyage-ai/voyage-evaluation-public), with
| other stuff coming up.
|
| FWIW, there are other deployment options besides the API as
| well:
|
|   AWS: https://docs.voyageai.com/docs/aws-marketplace-model-package
|   Azure: https://docs.voyageai.com/docs/azure-marketplace-managed-app...
|   Snowflake: https://docs.voyageai.com/docs/snowflake
|
| and vector database integrations:
|
|   https://docs.voyageai.com/docs/integrations-and-other-librar...
|   https://milvus.io/docs/integrate_with_voyageai.md
|   https://docs.pinecone.io/integrations/voyage
|   https://weaviate.io/developers/weaviate/model-providers/voya...
|   https://qdrant.tech/documentation/embeddings/voyage/
| jonathan-adly wrote:
| If you are interested in that space, I'd throw our project into
| the mix; it uses ColPali under the hood, transparently.
|
| https://github.com/tjmlabs/ColiVara
|
| The main benchmark for this is the Vidore leaderboard, where we
| would love to see how VoyageAI performs compared to the more
| open-source implementations.
| skeptrune wrote:
| I wish people would take the time to bring in real datasets and
| do a qualitative analysis of when and why "foo new solution" is
| better.
|
| Quantitative benchmarks are great, but sparse.
| tinyhouse wrote:
| Funny, all those big-name Stanford advisors for a company that
| builds embeddings... A couple of strong MLEs could deliver
| everything they are doing. This shouldn't be a company, but
| OK... I'm sure some clueless VCs in SV gave them money.
|
| And just to be clear: I don't think that delivering strong
| embeddings for different domains is an easy task. However, it's
| 2024, not 2016.
___________________________________________________________________
(page generated 2024-11-17 23:00 UTC)