[HN Gopher] All-in-one embedding model for interleaved text, ima...
       ___________________________________________________________________
        
       All-in-one embedding model for interleaved text, images, and
       screenshots
        
       Author : fzliu
       Score  : 214 points
       Date   : 2024-11-17 07:42 UTC (15 hours ago)
        
 (HTM) web link (blog.voyageai.com)
 (TXT) w3m dump (blog.voyageai.com)
        
       | carschno wrote:
        | This does read as very impressive. Any critical perspectives
        | on the presented evaluation? What about non-English text?
        | 
        | I understand the model is, like other commercial ones,
        | available exclusively through their API, right?
        
         | stephantul wrote:
          | Yes, Voyage models are API-only.
         | 
         | There was a part here about multilingualism but that was wrong!
         | Sorry!
         | 
         | FWIW: Voyage also has separate `law`, `code`, and `finance`
         | models. See [1]
         | 
         | Really cool results, anyway.
         | 
         | [1]: https://docs.voyageai.com/docs/embeddings
        
           | fzliu wrote:
           | Glad you liked the results! We do have multilingual models
           | (and rerankers) -- voyage-3, in particular, is multilingual:
           | https://blog.voyageai.com/2024/09/18/voyage-3/
           | 
           | voyage-multimodal-3 is multilingual as well, supporting the
           | same set of languages as voyage-3.
        
             | stephantul wrote:
             | Sorry for spreading false information. I edited the post
             | above.
             | 
              | It is interesting that you're not as upfront about
              | multilingualism as Cohere. They seem to mention it a
              | lot, which led to my confusion.
        
               | fzliu wrote:
               | No worries at all. That's great feedback and an area of
               | improvement for us when it comes to future posts -- we'll
               | be more explicit about multilingualism in blogs and in
               | our docs.
        
       | unit149 wrote:
        | In the traditional Python API, the Voyage engine will tokenize
        | blocks of text and output a sequence of tokens. This model
        | seems to extend that by vectorizing images in the same
        | embedding space.
        | 
        | Words like 'you' and 'apple' will each be a single token. More
        | complex terms like 'pikachu' may be divided into pik-a-chu.
       | 
       | [1]: https://docs.voyageai.com/docs/tokenization
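        | 
        | A rough illustration of that subword behavior with a generic
        | BPE tokenizer (this uses tiktoken's cl100k_base vocabulary as
        | a stand-in, not Voyage's actual tokenizer):
        | 
        |     # pip install tiktoken
        |     import tiktoken
        | 
        |     enc = tiktoken.get_encoding("cl100k_base")
        | 
        |     # Common words often map to a single token id...
        |     print(enc.encode("apple"))
        | 
        |     # ...while rarer words get split into subword pieces.
        |     ids = enc.encode("pikachu")
        |     print([enc.decode([i]) for i in ids])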
        
       | FergusArgyll wrote:
        | I'm missing something. Shouldn't any LLM that's "natively
        | multimodal" somehow include embeddings which are multimodal?
        | For example, here's Google's blog post on Gemini:
        | 
        | >Until now, the standard approach to creating multimodal
        | models involved training separate components for different
        | modalities and then stitching them together to roughly mimic
        | some of this functionality. These models can sometimes be good
        | at performing certain tasks, like describing images, but
        | struggle with more conceptual and complex reasoning.
        | 
        | >We designed Gemini to be natively multimodal, pre-trained
        | from the start on different modalities. Then we fine-tuned it
        | with additional multimodal data to further refine its
        | effectiveness. This helps Gemini seamlessly understand and
        | reason about all kinds of inputs from the ground up, far
        | better than existing multimodal models -- and its capabilities
        | are state of the art in nearly every domain.
        
         | aabhay wrote:
          | LLM embeddings contain superpositions of many concepts, so
          | while they might predict the next token well, they don't
          | actually outperform contrastively pretrained embedding
          | models.
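          | 
          | For reference, a minimal sketch of the contrastive
          | (InfoNCE-style) objective such embedding models are
          | typically trained with -- the function and batch layout are
          | illustrative, not any particular model's recipe:
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     def info_nce(q_emb, d_emb, temperature=0.05):
          |         # q_emb, d_emb: [batch, dim]; row i of d_emb is
          |         # the positive match for row i of q_emb.
          |         q = F.normalize(q_emb, dim=-1)
          |         d = F.normalize(d_emb, dim=-1)
          |         # All-pairs similarity matrix: [batch, batch].
          |         logits = q @ d.T / temperature
          |         # Matching pairs sit on the diagonal; the rest of
          |         # the batch acts as in-batch negatives.
          |         labels = torch.arange(q.shape[0])
          |         return F.cross_entropy(logits, labels)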
        
         | fzliu wrote:
         | Because LLMs such as Gemini -- and other causal language models
         | more broadly -- are trained on next token prediction, the
         | vectors that you get from pooling the output token embeddings
         | aren't that useful for RAG or semantic search compared to what
         | you get from actual embedding models.
         | 
          | One distinction to make here is that _token embeddings_ and
          | the embeddings/vectors that are output from _embedding
          | models_ are related but separate concepts. There are
          | numerous token embeddings (one per token) which become
          | contextualized as they propagate through the transformer,
          | while an embedding model outputs a single vector/embedding
          | per input (such as a long text, a photo, or a document
          | screenshot).
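          | 
          | To make that concrete, a minimal sketch using GPT-2 as a
          | stand-in causal LM (the model choice and the naive mean
          | pooling are illustrative only):
          | 
          |     # pip install torch transformers
          |     import torch
          |     from transformers import AutoModel, AutoTokenizer
          | 
          |     tok = AutoTokenizer.from_pretrained("gpt2")
          |     model = AutoModel.from_pretrained("gpt2")
          | 
          |     inputs = tok("a photo of a cat", return_tensors="pt")
          |     with torch.no_grad():
          |         # One contextualized embedding per token:
          |         # shape [1, seq_len, hidden].
          |         token_embs = model(**inputs).last_hidden_state
          | 
          |     # Mean pooling yields one vector per input,
          |     # shape [1, hidden] -- but for a next-token-trained LM
          |     # this is a weak retrieval representation compared to a
          |     # purpose-trained embedding model's output.
          |     pooled = token_embs.mean(dim=1)
          |     print(token_embs.shape, pooled.shape)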
        
         | refulgentis wrote:
          | FWIW, if the other replies aren't clear: read "embeddings"
          | as "the List<double> that some layer of my AI model
          | produces" (that's not exactly correct -- it's slightly more
          | specific than that -- but in this context it works).
          | 
          | LLMs, including multimodal LLMs, do have embeddings, but
          | they're embeddings learned by generating text rather than by
          | finding similar documents.
        
       | greatgib wrote:
        | Indeed, it's sad that their models are commercial,
        | proprietary, and API-only.
        
         | doug_durham wrote:
         | Sad that people have to pay their employees?
        
           | skeptrune wrote:
           | No, but it serves everyone in the "AI retrieval" space better
           | if we continue to make rapid improvements. New models are
           | great, but not the ultimate solution.
        
       | mech4lunch wrote:
        | The Colab measures dot product values of 0.428 and 0.498,
        | describing them as "...similarity value is quite high." Is
        | that high? Can you design a system that confidently labels
        | data with a 0.4 threshold?
        
         | brokensegue wrote:
          | The raw output value is generally irrelevant. What matters
          | is its position in the distribution of outputs.
        
         | fzliu wrote:
         | While the raw similarity score does matter, what typically
         | matters more is the score relative to other documents. In the
         | case of the examples in the notebook, those values were the
         | highest in relative terms.
         | 
         | I can see why this may be unclear/confusing -- we will correct
         | it. Thank you for the feedback!
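          | 
          | A toy illustration of "relative, not absolute" -- the
          | vectors here are random; the point is only the ranking
          | logic:
          | 
          |     import numpy as np
          | 
          |     def cosine(a, b):
          |         na, nb = np.linalg.norm(a), np.linalg.norm(b)
          |         return a @ b / (na * nb)
          | 
          |     rng = np.random.default_rng(0)
          |     query = rng.normal(size=8)
          |     docs = rng.normal(size=(5, 8))
          | 
          |     scores = sorted((cosine(query, d) for d in docs),
          |                     reverse=True)
          |     # A top score near 0.4 can still be a confident hit
          |     # if the runner-up scores are well below it.
          |     print(scores)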
        
         | minimaxir wrote:
         | A 0.4 with cosine similarity is not the same as a 0.4 with
         | sigmoid thresholding.
         | 
          | 0.4 cosine similarity is pretty good for real-world data
          | that isn't a near-identical duplicate.
        
       | djoldman wrote:
        | This is a cool way to look at multimodal embeddings. They look
        | at performance as the percentage of inputs slides from one
        | modality to another:
       | 
       | https://i0.wp.com/blog.voyageai.com/wp-content/uploads/2024/...
        
       | djoldman wrote:
       | This is a key observation that is simple and intuitive:
       | 
       | >All CLIP-like models perform poorly on mixed-modality search due
       | to a phenomenon known as the modality gap. As illustrated in the
       | figure below, the closest vector to the snippet "I address you,
       | members of the Seventy-Seventh Congress..." is not its
       | screenshot, but other texts. This leads to search results that
       | are skewed towards items of the same modality; in other words,
       | _text vectors will be closer to irrelevant texts than relevant
       | images in the embedding space_.
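        | 
        | A sketch of how one could observe this gap with an off-the-
        | shelf CLIP model (the placeholder image makes these numbers
        | meaningless; with real text/screenshot pairs, the text-text
        | similarity typically comes out higher than text-image):
        | 
        |     # pip install torch transformers pillow
        |     import torch
        |     import torch.nn.functional as F
        |     from PIL import Image
        |     from transformers import CLIPModel, CLIPProcessor
        | 
        |     name = "openai/clip-vit-base-patch32"
        |     model = CLIPModel.from_pretrained(name)
        |     proc = CLIPProcessor.from_pretrained(name)
        | 
        |     texts = ["I address you, members of the Congress...",
        |              "an unrelated sentence about cooking pasta"]
        |     img = Image.new("RGB", (224, 224))  # placeholder image
        | 
        |     inputs = proc(text=texts, images=img,
        |                   return_tensors="pt", padding=True)
        |     with torch.no_grad():
        |         out = model(**inputs)
        | 
        |     t = F.normalize(out.text_embeds, dim=-1)   # [2, dim]
        |     v = F.normalize(out.image_embeds, dim=-1)  # [1, dim]
        |     print("text-text: ", float(t[0] @ t[1]))
        |     print("text-image:", float(t[0] @ v[0]))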
        
       | Zopieux wrote:
        | API-only model. No thanks, but congrats anyway.
        
         | A4ET8a8uTh0 wrote:
          | Agreed on both parts of the statement. Granted, there are
          | obvious reasons for an API-only focus beyond just trying to
          | get money from people, but I personally would not consider
          | it, given that they don't offer other options.
        
           | fzliu wrote:
            | I understand the sentiment. We are starting to open source
            | some tools, mostly around embedding model evaluation (e.g.
            | https://github.com/voyage-ai/voyage-evaluation-public),
            | with more on the way.
            | 
            | FWIW, there are other deployment options besides the API
            | as well: AWS
            | (https://docs.voyageai.com/docs/aws-marketplace-model-package),
            | Azure
            | (https://docs.voyageai.com/docs/azure-marketplace-managed-app...),
            | Snowflake (https://docs.voyageai.com/docs/snowflake), and
            | vector database integrations
            | (https://docs.voyageai.com/docs/integrations-and-other-librar...,
            | https://milvus.io/docs/integrate_with_voyageai.md,
            | https://docs.pinecone.io/integrations/voyage,
            | https://weaviate.io/developers/weaviate/model-providers/voya...,
            | https://qdrant.tech/documentation/embeddings/voyage/, etc).
        
       | jonathan-adly wrote:
        | If you are interested in this space, I'd throw our project
        | into the mix; it uses ColPali under the hood, transparently.
        | 
        | https://github.com/tjmlabs/ColiVara
        | 
        | The main benchmark for this is the Vidore leaderboard, where
        | we would love to see how VoyageAI performs compared to the
        | more open-source implementations.
        
       | skeptrune wrote:
        | I wish people would take the time to put in real datasets and
        | do a qualitative analysis of when and why "foo new solution"
        | is better.
       | 
       | Quantitative benchmarks are great, but sparse.
        
       | tinyhouse wrote:
        | Funny, all those big-name Stanford advisors for a company that
        | builds embeddings... A couple of strong MLEs could deliver
        | everything they are doing. This shouldn't be a company, but
        | OK... I'm sure some clueless VCs in SV gave them money.
        | 
        | And just to be clear: I don't think that delivering strong
        | embeddings for different domains is an easy task. However,
        | it's 2024, not 2016.
        
       ___________________________________________________________________
       (page generated 2024-11-17 23:00 UTC)