[HN Gopher] Embeddings are underrated
       ___________________________________________________________________
        
       Embeddings are underrated
        
       Author : misonic
       Score  : 289 points
       Date   : 2024-11-01 03:24 UTC (19 hours ago)
        
 (HTM) web link (technicalwriting.dev)
 (TXT) w3m dump (technicalwriting.dev)
        
       | kaycebasques wrote:
       | Cool, first time I've seen one of my posts trend without me
       | submitting it myself. Hopefully it's clear from the domain name
       | and intro that I'm suggesting technical writers are underrating
       | how useful embeddings can be in our work. I know ML practitioners
       | do not underrate them.
        
         | dartos wrote:
         | Yeah embeddings are the unsung killer feature of LLMs
        
         | donavanm wrote:
         | You might want to highlight chunking and how embeddings
         | can/should represent subsections of your document as well. It
         | seems relevant to me for cases like similarity or semantics
         | search, getting the reader to the relevant portion of the
         | document or page.
         | 
          | There are probably some interesting ideas around tokenization and
         | metadata as well. For example, if you're processing the raw
         | file I expect you want to strip out a lot of markup before
         | tokenization of the content. Conversely, some markup like code
         | blocks or examples would be meaningful for tokenization and
         | embedding anyways.
         | 
         | I wonder if both of those ideas can be combined for something
         | like automated footnotes and annotations. Linking or mouseover
         | relevant content from elsewhere in the documentation.
        
           | MrGreenTea wrote:
           | Do you have any resources you recommend for representing sub
           | sections? I'm currently prototyping a note/thoughts editor
           | where one feature is suggesting related documents/thoughts
           | (think linked notes in Obsidian) for which I would like to
           | suggest sub sections and not only full documents.
        
             | donavanm wrote:
             | Sorry, no good references off hand. I've had to help write
             | & generate public docs in DocBook in the past. But no
             | expert on either editors, nlp, or embeddings besides
             | hacking around some tools for my own note taking. My
              | assumption is you'll want to use your existing markup
             | structure, if you have it. Or naively split on paragraphs
             | with a tool like spacy. Or get real fancy and use dynamic
             | ranges; something like an accumulation window that
             | aggregates adjacent sentences based on individual
             | similarity, break on total size or dissimilarity, and then
             | treat that aggregate as the range to "chunk."
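The accumulation-window idea described above can be sketched roughly like this. A crude word-overlap score stands in for real embedding similarity, and the function names and thresholds are illustrative, not from any particular library:

```python
def word_overlap(a, b):
    # Stand-in for real embedding cosine similarity: Jaccard word overlap.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def chunk_sentences(sentences, min_sim=0.2, max_chars=200):
    # Accumulate adjacent sentences into a window; break the chunk when
    # the next sentence is too dissimilar or the window gets too big.
    chunks, window = [], []
    for sent in sentences:
        if window:
            too_big = sum(len(s) for s in window) + len(sent) > max_chars
            too_different = word_overlap(window[-1], sent) < min_sim
            if too_big or too_different:
                chunks.append(" ".join(window))
                window = []
        window.append(sent)
    if window:
        chunks.append(" ".join(window))
    return chunks

sentences = ["the cat sat on the mat",
             "the cat slept on the mat",
             "stock prices fell sharply today"]
chunks = chunk_sentences(sentences)   # two chunks: cats vs. stocks
```

In a real pipeline you would replace `word_overlap` with cosine similarity of per-sentence embeddings, which is where the "dynamic ranges" get interesting.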
        
               | MrGreenTea wrote:
               | Thanks for the elaborate and helpful response. I'm also
               | hacking on this as a personal note taking project and
               | already started playing around with your ideas. Thanks!
        
         | enjeyw wrote:
         | Haha yeah I was about to comment that I recall a period just
         | after Word2Vec came out where embeddings were most definitely
         | not underrated but rather the most hyped ML thing out there!
        
       | rahimnathwani wrote:
       | I'm not sure why the voyage-3 models aren't on the MTEB
       | leaderboard. The code for the leaderboard suggests they should be
       | there:
       | https://huggingface.co/spaces/mteb/leaderboard/commit/b7faae...
       | 
       | But I don't see them when I filter the list for 'voyage'.
        
         | newrotik wrote:
         | It is unclear this model should be on that leaderboard because
         | we don't know whether it has been trained on mteb test data.
         | 
          | It is worth noting that their own published material [0] does
          | not include any score from any dataset in the MTEB benchmark.
         | 
          | This may sound nit-picky, but considering transformers'
         | parroting capabilities, having seen test data during training
         | should be expected to completely invalidate those scores.
         | 
         | [0] see excel spreadsheet linked here
         | https://blog.voyageai.com/2024/09/18/voyage-3/
        
           | jdthedisciple wrote:
           | I'm critical of the low number of embedding dims.
           | 
           | Could hurt performance in niche applications, in my
           | estimation.
           | 
            | Looking forward to trying the announced large models, though.
        
         | fzliu wrote:
         | (I work at Voyage)
         | 
         | Many of the top-performing models that you see on the MTEB
         | retrieval for English and Chinese tend to overfit to the
         | benchmark nowadays. voyage-3 and voyage-3-lite are also pretty
         | small in size compared to a lot of the 7B models that take the
         | top spots, and we don't want to hurt performance on other real-
         | world tasks just to do well on MTEB.
        
           | jdthedisciple wrote:
            | It would still be great to know how it compares.
           | 
           | Why should I pick voyage-3 if for all I know it sucks when it
           | comes to retrieval accuracy (my personally most important
           | metric)?
        
             | fzliu wrote:
             | We provide retrieval metrics for a variety of datasets and
             | languages: https://blog.voyageai.com/2024/09/18/voyage-3/.
             | I also personally encourage folks to either test on their
             | own data or to find an open source dataset that closely
             | resembles the documents they are trying to search (we
              | provide a ton of free tokens for evaluating our
             | models).
        
           | kkielhofner wrote:
           | > we don't want to hurt performance on other real-world tasks
           | just to do well on MTEB
           | 
           | Nice!
           | 
           | Fortunately MTEB lets you sort by model parameter size
           | because using 7B parameter LLMs for embeddings is just...
           | Yuck.
        
       | quantadev wrote:
        | That was a good post. Vector embeddings are in some sense a
        | unique summary of a doc, similar to a hash code. It
       | makes me think it would be cool if there were some universal
       | standard for generating embeddings, but I guess they'll be
       | different for each AI model, so they can't have the same kind of
       | "permanence" hash codes have.
       | 
       | It definitely also seems like there should be lots of ways to
       | utilize "Cosine Similarity" (or other closeness algos) in
       | databases and other information processing apps that we haven't
       | really exploited yet. For example you could almost build a new
       | kind of Job Search Service that matches job descriptions to job
       | candidates based on nothing but a vector similarity between
       | resume and job description. That's probably so obvious it's being
       | done, already.
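The resume-matching idea above boils down to cosine similarity between two vectors. A minimal sketch with made-up 3-dimensional embeddings (real models return hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of a job description and two resumes.
job = [0.8, 0.1, 0.3]
resumes = {
    "alice": [0.7, 0.2, 0.4],   # profile close to the job description
    "bob":   [0.1, 0.9, 0.0],   # very different profile
}
ranked = sorted(resumes,
                key=lambda name: cosine_similarity(job, resumes[name]),
                reverse=True)   # "alice" ranks first
```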
        
         | kqr wrote:
         | For one point of inspiration, see
         | https://entropicthoughts.com/determining-tag-quality
         | 
         | I really like the picture you are drawing with "semantic
         | hashes"!
        
           | quantadev wrote:
           | Yeah for "Semantic Hashes" (that's a good word for them!)
           | we'd need some sort of "Canonical LLM" model that isn't
           | necessarily used for inference, nor does it need to even be
           | all that smart, but it just needs to be public for the world.
           | It would need to be updated like every 2 to 5 years tho to
           | account for new words or words changing meaning? ...but maybe
           | could be updated in such a way as to not "invalidate" prior
           | vectors, if that makes sense? For example "ride a bicycle"
           | would still point in the same direction even after a refresh
           | of the canonical model? It seems like feeding the same
           | training set could replicate the same model values, but there
           | are nonlinear instabilities which could make it disintegrate.
        
         | helloplanets wrote:
         | I guess it might be possible to retroactively create an
         | embeddings model which could take several different models'
         | embeddings, and translate them into the same format.
        
           | genuinelydang wrote:
           | No. That's like saying you can transplant a person's neuronal
           | action potentials into another person's brain and have it
           | make sense to them.
        
             | helloplanets wrote:
             | That metaphor is skipping the most important part in
             | between! You wouldn't be transplanting anything directly,
             | you'd have a separate step in between, which would attempt
             | to translate these action potentials.
             | 
             | The point of the translating model in between would be that
              | it would re-weight each and every one of the values of the
             | embedding, after being trained on a massive dataset of
             | original text -> vector embedding for model A + vector
             | embedding for model B. If you have billions of parameters
             | trained to do this translation between just two specific
             | models to start with, wouldn't this be in the realm of
             | possible?
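This kind of translation between embedding spaces has a classical analogue: fit a linear map from paired (model A, model B) vectors, much like early cross-lingual word-embedding alignment. A toy sketch with synthetic data, under the assumption that the two spaces differ only by a rotation plus noise:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_true = np.linalg.qr(rng.normal(size=(d, d)))[0]    # hidden rotation
A = rng.normal(size=(1000, d))                       # "model A" embeddings
B = A @ W_true + 0.01 * rng.normal(size=(1000, d))   # paired "model B" embeddings

# Fit a translation matrix W by least squares: minimize ||A W - B||^2.
W, *_ = np.linalg.lstsq(A, B, rcond=None)

# Translate a held-out model-A vector into model-B space.
x = rng.normal(size=d)
error = float(np.linalg.norm(x @ W - x @ W_true))    # small residual
```

Real embedding spaces differ nonlinearly, which is why the comment proposes a trained neural translator rather than a single matrix, but the linear case shows the alignment idea is not absurd.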
        
               | quantadev wrote:
               | A translation between models doesn't seem possible
               | because there are actually no "common dimensions" at all
               | between models. That is, each dimension has a completely
               | different semantic meaning, in different models, but also
               | it's the combination of dimension values that begin to
               | impart real "meaning".
               | 
                | For example, the number of different unit vector
                | combinations in a 1500-dimensional space is like the
                | number of different ways of "ordering" the components,
                | which is 1500! (roughly 5 x 10^4114).
               | 
               | EDIT: And the point of that factorial is that even if the
               | dimensions were "identical" across two different LLMs but
               | merely "scrambled" (in ordering) there would be that
               | large number to contend with to "unscramble".
        
             | tempusalaria wrote:
             | This is very similar to how LLMs are taught to understand
             | images in llava style models (the image embeddings are
             | encoded into the existing language token stream)
        
           | batch12 wrote:
           | This is definitely possible. I made something like this. It
           | worked pretty well for cosine similarity in my testing.
        
           | nostrebored wrote:
           | This is done with two models in most standard biencoder
           | approaches. This is how multimodal embedding search works. We
           | want to train a model such that the location of the text
           | embeddings that represent an item and the image embeddings
           | for that item are colocated.
        
         | genuinelydang wrote:
         | "you could almost build a new kind of Job Search Service that
         | matches job descriptions to job candidates"
         | 
         | The key word being "almost". Yes, you can get similarity
         | matches between job requirements and candidate resumes, but
         | those matches are not useful for the task of finding an optimal
         | candidate for a job.
         | 
         | For example, say a job requires A and B.
         | 
         | Candidate 1 is a junior who has done some work with A, B and C.
         | 
         | Candidate 2 is a senior and knows A, B, C, D, E and F by heart.
         | All are relevant to the job and would make 2 the optimal
         | candidate, even though C-F are not explicitly stated in the job
         | requirements.
         | 
         | Candidate 1 would seem a much better candidate than 2, because
         | 1's embedding vector is closer to the job embedding vector.
        
           | coldtea wrote:
           | Even that is just static information.
           | 
            | We don't know if Candidate 2 really "knows A, B, C, D, E and
            | F by heart", just that they claim to. They could be adding
            | whatever they like to their skill list, even though they
            | hardly used it, just because it's a buzzword.
           | 
           | So Candidate 1 could still blow them out of the water in
           | performance, and even be able to trivially learn D, and E in
           | a short while on the job if needed.
           | 
            | The skill vector won't tell much by itself, and may even
            | prevent finding the better candidate if it's used for
            | screening.
        
             | quantadev wrote:
             | So your point is that LLMs can't tell when job candidates
             | are lying on their resume? Well that's true, but neither
             | can humans. lol.
        
           | nostrebored wrote:
           | That's not accurate. You can explicitly bake in these types
           | of search behaviors with model training.
           | 
           | People do this in ecommerce with the concept of user
           | embeddings and product embeddings, where the result of
           | personalized recommendations is just a user embedding search.
        
           | quantadev wrote:
           | > not useful for the task of finding an optimal candidate
           | 
            | That statement is just flat-out incorrect on its face.
            | However, it did make me think of something I hadn't thought
            | of before, which is this:
           | 
           | Embedding vectors can be made to have a "scale" (multiplier)
           | on specific terms which represent the amount of "weight" to
           | add to that term. For example if I have 10 years experience
           | in Java Web Development, then we can take the actual
           | components of that vector embedding (i.e. for string "Java
           | Web Development") and multiply them by some proportionality
            | of 10, and that results in a vector that is "further" in
            | that direction. This represents an "amount" of pull in
            | the Java Web direction.
           | 
           | So this means even with vector embeddings we can scale out to
           | specific amounts of experience. Now here's the cool part. You
           | can then take all THOSE scaled vectors (one for each
           | individual job candidate skill) and average them to get a
           | single point in space which CAN be compared as a single
           | scalar distance from what the Job Requirements specify.
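A minimal sketch of the scaling-and-averaging idea above, using a hypothetical 2-D "skill space" (all names and numbers are illustrative):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def profile_vector(skills):
    # skills: list of (embedding, years) pairs. Scale each unit vector
    # by years of experience, then average into one profile point that
    # can be compared (by distance) to a job-requirements vector.
    scaled = [[x * years for x in normalize(emb)] for emb, years in skills]
    return [sum(v[i] for v in scaled) / len(scaled)
            for i in range(len(scaled[0]))]

# Hypothetical orthogonal skill directions for illustration.
java_web  = [1.0, 0.0]
databases = [0.0, 1.0]
candidate = profile_vector([(java_web, 10), (databases, 4)])   # [5.0, 2.0]
```

The averaged point leans toward the skills with more years, which is exactly the "scaled direction" effect described above.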
        
           | OutOfHere wrote:
            | The trick is to evaluate the score for each skill, also weighing
           | it by the years of experience with the skill, then sum the
           | evaluations. This will address your problem 100%.
           | 
           | Also, what a candidate claims as a skill is totally
           | irrelevant and can be a lie. It is the work experience that
           | matters, and skills can be extracted from it.
        
         | rasulkireev wrote:
         | I tried doing something like that: https://gettjalerts.com/
         | 
          | I added semantic search, but I'm working on adding resume
         | upload/parsing to do automatic matching.
        
         | SCUSKU wrote:
         | It does exist! I built this for the monthly Who's Hiring
         | threads: https://hnresumetojobs.com/
         | 
         | It just does cosine similarity with OpenAI embeddings +
         | pgVector. It's not perfect by any means, but it's useful. It
         | could probably stand to be improved with a re-ranker, but I
         | just never got around to it.
        
           | quantadev wrote:
           | Very cool. I knew it was too obvious an idea to be missed!
           | Did you read my comments below about how you can maybe "scale
           | up" a vector based on number of years of experience. I think
           | that will work. It makes somebody with 10 yrs Java Experience
           | closer to the target than someone with only 5yrs, if the
           | target is 10 years! -- but the problem is someone with 20yrs
           | looks even worse when they should look better! My problem in
           | my life. hahaha. Too much experience.
           | 
            | I think the best "matching" factor is to minimize total
            | distance, where each distance is measured from the
            | time-multiplied vector for a specific skill.
        
       | Aeolun wrote:
       | Is there some way to compare different embeddings for different
       | use cases?
        
         | jdthedisciple wrote:
         | Search for MTEB Leaderboard on huggingface
        
       | fzliu wrote:
       | Great post!
       | 
       | One quick minor note is that the resulting embeddings for the
       | same text string could be different, depending on what you
       | specify the input type as for retrieval tasks (i.e. query or
       | document) -- check out the `input_type` parameter here:
       | https://docs.voyageai.com/reference/embeddings-api.
        
       | thund wrote:
        | Doesn't OpenAI's embedding model support 8191/8192 tokens? That
        | aside, declaring a winner by token size is misleading. There are
        | more important factors, like cross-language support and
        | precision, for example.
        
         | jdthedisciple wrote:
         | Yep, voyage-3 is not even anywhere in the top of the MTEB
         | leaderboard if you order by `retrieval score` desc.
         | 
         | stella_en_1.5B_v5 seems to be an unsung hero model in that
         | regard
         | 
         | plus you may not even want such large token sizes if you just
         | need accurate retrieval of snippets of text (like 1-2
         | sentences)
        
           | kaycebasques wrote:
           | Thanks thund and jdthedisciple for these points and
           | corrections. I'll update the section today.
        
             | kaycebasques wrote:
             | Updated the section to refer to the "Retrieval Average"
             | column of the MTEB leaderboard. Is that the right column to
             | refer to? Can someone link me to an explanation of how that
             | benchmark works? Couldn't find a good link on it
        
         | OutOfHere wrote:
         | And that's not all because token encodings of different models
         | can be very different.
        
       | nerdright wrote:
       | Great post indeed! I totally agree that embeddings are
       | underrated. I feel like the "information retrieval/discovery"
       | world is stuck using spears (i.e., term/keyword-based discovery)
       | instead of embracing the modern tools (i.e., semantic-based
       | discovery).
       | 
       | The other day I found myself trying to figure out some common
       | themes across a bunch of comments I was looking at. I felt lazy
       | to go through all of them so I turned my attention to the
       | "Sentence Transformers" lib. I converted each comment into a
       | vector embedding, applied k-means clustering on these embeddings,
       | then gave each cluster to ChatGPT to summarize the corresponding
       | comments. I have to admit, it was fun doing this and saved me
       | lots of time!
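The embed-cluster-summarize pipeline above can be sketched like this. A minimal hand-rolled k-means stands in for sklearn's KMeans, and the toy 2-D points stand in for real Sentence Transformers embeddings:

```python
import math, random

def kmeans(points, k, iters=20, seed=0):
    # Minimal k-means; in practice you'd run sklearn's KMeans on the
    # vectors returned by a sentence-transformers model.
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        centers = [[sum(vals) / len(g) for vals in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

# Toy 2-D "comment embeddings" with two obvious themes.
comments = [(0.1, 0.0), (0.2, 0.1), (0.0, 0.2),    # theme 1
            (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]    # theme 2
clusters = kmeans(comments, k=2)
# Each cluster's comments would then go to an LLM for summarization.
```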
        
         | Gooblebrai wrote:
         | Interesting approach. Did you tell GPT to summarise the
         | comments of each cluster after grouping them?
        
       | empiko wrote:
       | My hot take: embeddings are overrated. They are overfitted on
       | word overlap, leading to both many false positives and false
       | negatives. If you identify a specific problem with them ("I
       | really want to match items like these, but it does not work"), it
       | is almost impossible to fix them. I often see them being used
       | inappropriately, by people who read about their magical
       | properties, but didn't really care about evaluating their
       | results.
        
         | cheevly wrote:
         | You can easily fix this using embedding arithmetic to build
         | embedding classifiers.
        
           | mbanerjeepalmer wrote:
           | Are there good examples of this working in the wild? Before I
           | comb through all ten blue links... https://www.google.com/sea
           | rch?q=embedding%20arithmetic%20emb...
        
         | nostrebored wrote:
         | "I really want to match items like these, but it does not work"
         | is just a fine tuning problem.
        
           | empiko wrote:
            | Yes, in the sense that you'd need an infinite appropriate
            | dataset and compute. No, in the sense of what is practically
            | achievable.
        
             | nostrebored wrote:
             | You don't need infinite data. You need ~100k samples. It's
             | also not particularly expensive.
        
         | deepsquirrelnet wrote:
         | I think there is a deeper technical truth to this that hints at
         | how much space there is to be gained in optimization.
         | 
         | 1) that matryoshka representations work so well, and as few as
         | 64 dimensions account for a large majority of the performance
         | 
          | 2) that dimensional collapse is observed. Look at your cosine
          | similarity scores and be amazed that everything is pretty
          | similar: despite being a -1 to 1 scale, almost nothing is
          | ever less than 0.8 for most models
         | 
         | I think we're at the infancy in this technology, even with all
         | of the advances in recent years.
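Point 1 above, truncating a matryoshka embedding, is mechanically simple: keep the leading dimensions and re-normalize. A sketch, assuming the model was trained with matryoshka representation learning so the leading dims carry most of the signal:

```python
import math

def truncate_matryoshka(embedding, dims):
    # Keep only the leading dims and re-normalize so cosine similarity
    # still behaves. Matryoshka-trained models pack the most signal
    # into the earliest dimensions, so the short vector stays usable.
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]            # pretend 4-dim unit embedding
short = truncate_matryoshka(full, 2)   # 2-dim unit vector
```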
        
       | mrob wrote:
       | Embeddings are the only aspect of modern AI I'm excited about
       | because they're the only one that gives more power to humans
       | instead of taking it away. They're the "bicycle for our minds" of
       | Steve Jobs fame; intelligence amplification not intelligence
       | replacement. IMO, the biggest improvement in computer usability
       | in my lifetime was the introduction of fast and ubiquitous local
       | search. I use Firefox's "Find in Page" feature probably 10 or
       | more times per day. I use find and grep probably every day. When
       | I read man pages or logs, I navigate by search. Git would be
       | vastly less useful without git grep. Embeddings have the
       | potential to solve the biggest weakness of search by giving us
       | fuzzy search that's actually useful.
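That fuzzy-search idea can be sketched as a "semantic grep": embed the query and every line, then rank lines by similarity. Here a bag-of-words counter stands in for a real embedding model (which is what would actually make the matching fuzzy rather than literal):

```python
import math
from collections import Counter

def bag_embed(text):
    # Stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_grep(query, lines, top=3):
    q = bag_embed(query)
    return sorted(lines, key=lambda ln: cosine(q, bag_embed(ln)),
                  reverse=True)[:top]

lines = ["error: connection refused",
         "user logged in successfully",
         "network connection timed out"]
hits = semantic_grep("connection problems", lines, top=2)
```

With neural embeddings in place of `bag_embed`, a query like "connection problems" would also surface lines that share no words with it at all.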
        
         | gwervc wrote:
          | I agree with this view. Generative AI robs us of something
          | (thinking, practicing), which is the long-term ability to
          | practice a skill and improve oneself, in exchange for an
          | immediate (often crappy) result. Embeddings are a tech that
          | can help us solve problems, but we still have to do most of
          | the work.
        
           | wussboy wrote:
           | I'm not sure it robs us. It makes it possible, but many
           | people including myself find the artistic products of AI to
           | be utterly without value for the reasons you list. I will
           | always cherish the product of lifelong dedication and human
            | skill.
        
             | jacobr1 wrote:
             | It doesn't diminish - but I do find it interesting how it
             | influences. Realism became less important, less
             | interesting, though still valued to a lesser degree, with
             | the ubiquity of photography. Where will human creativity
             | move towards when certain task become trivially machine
             | replicable? Where will human ingenuity _enabled_ by new
             | technology make new art possible?
        
           | larve wrote:
           | I ask LLMs to give me exercises, tutorials then write up my
           | experience into "course notes", along with flashcards. I ask
           | it to simulate a teacher, I ask it to simulate students that
           | I have to teach, etc...
           | 
           | I haven't found a tool that is more effective in helping me
           | learn.
        
             | greentxt wrote:
              | Great for learning for learning's sake. Learning with the
             | intention of pursuing a career requires the economic/job
             | model too, which is the problem.
        
           | stocknoob wrote:
           | Does a player piano rob you of playing music yourself? A car
           | from walking? A wheelbarrow from working out? It's up to you
           | if you want to stop practicing!
           | 
           | Chess has become even more popular despite computers that can
           | "rob us" of the joy. They're even better practice partners.
        
             | crashabr wrote:
             | An individual car doesn't stop you from walking but a
             | culture that centers cars leads to cities where walking is
             | outright dangerous.
             | 
             | Most car owners would never say outright "I want a car-
             | centric culture". But car manufacturers lobbied for it, and
             | step by step, we got both the deployment of useful car
             | infrastructure, and the destruction or ignoring of all
             | amenities useful for people walking or cycling.
             | 
             | Now let's go back to the period where cars start to become
             | enormously popular, and cities start to build neighborhoods
             | without sidewalks. There was probably someone at the time
             | complaining about the risk of cars overtaking walking and
             | leading to stores being more far away etc. And in front of
             | them was probably someone like you calling them a luddite
             | and being oblivious of second order effects.
        
         | TeMPOraL wrote:
         | So you're saying, embeddings are fine, as long as we refrain
         | from making full use of their capabilities? We've hit on a
         | mathematical construct that seems to be able to _capture
         | understanding_ , and you're saying that the biggest models are
         | too big, we need to scale down, only use embeddings for
         | surface-level basic similarities?
         | 
         | I too think embeddings are vastly underutilized, and chat
         | interface is not the be-all, end-all (not to mention, "chat
         | with your program/PDF/documentation" just sounds plain stupid).
         | However, whether current AI tools are replacing or amplifying
         | your intelligence, is entirely down to how you use them.
         | 
         | As for search, yes, that was a huge breakthrough and powerful
         | amplifier. 2+ decades ago. At this point it's computer use 101
         | - which makes it sad when dealing with programs or websites
         | that are opaque to search, and "ubiquitous local search" is
         | still not here. Embeddings can and hopefully will give us
         | better fuzzy/semantic searching, but if you push this far
         | enough, you'll have to stop and ask - if the search tool is now
          | capable of understanding some aspects of my data, why not surface
         | this understanding as a different view into data, instead of
         | just invoking it in the background when user makes a search
         | query?
        
           | autokad wrote:
            | I have found that embeddings + LLM is very successful. I'm
            | going to make the words up so as not to reveal my work
            | publicly, but I had to classify something into 3 categories.
            | I asked a simple LLM to label it; it was 95% accurate.
            | Taking the min distance from the word embeddings to the mean
            | category embeddings was about 96%. When I gave the LLM the
            | embedding prediction, the LLM was 98% accurate.
           | 
            | There were cases an embedding model might not do well on,
            | whereas the LLM could handle them. For example: these were
            | camel-case words, like WoodPecker, AquafinaBottle, and
            | WoodStock (I changed the words to not reveal private data).
            | WoodPecker and WoodStock would end up with close embedding
            | values because the word Wood dominated, but these were
            | supposed to go into 2 different categories.
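The min-distance-to-category-mean classifier described above, sketched with hypothetical 2-D embeddings:

```python
import math

def mean_vector(vectors):
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]

def classify(embedding, category_means):
    # Pick the category whose mean embedding is nearest; the predicted
    # label can then be passed to an LLM as a hint, as described above.
    return min(category_means,
               key=lambda name: math.dist(embedding, category_means[name]))

# Hypothetical 2-D embeddings of labelled examples per category.
category_means = {
    "bird":   mean_vector([[0.9, 0.1], [1.0, 0.0]]),
    "bottle": mean_vector([[0.0, 1.0], [0.1, 0.9]]),
}
label = classify([0.8, 0.2], category_means)   # nearest mean is "bird"
```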
        
             | bravura wrote:
             | Some of the best performing embedding models
             | (https://huggingface.co/spaces/mteb/leaderboard) are LLMs.
             | Have you tried them?
        
             | kkielhofner wrote:
             | > word Wood dominated the embedding values, but these were
             | supposed to go into 2 different categories
             | 
             | When faced with a similar challenge we developed a custom
             | tokenizer, pretrained BERT base model[0], and finally a
             | SPLADE-esque sparse embedding model[1] on top of that.
             | 
             | [0] - https://huggingface.co/atomic-canyon/fermi-bert-1024
             | 
             | [1] - https://huggingface.co/atomic-canyon/fermi-1024
        
               | bravura wrote:
               | Do you mind sharing why you chose SPLADE-esque sparse
               | embeddings?
               | 
               | I have been working on embeddings for a while.
               | 
               | For different reasons I have recently become very
               | interested in learned sparse embeddings. So I am curious
               | what led you to choose them for your application, and
               | why?
        
               | kkielhofner wrote:
               | > Do you mind sharing why you chose SPLADE-esque sparse
               | embeddings?
               | 
               | I can provide what I can provide publicly. The first
               | thing we ever do is develop benchmarks given the
               | uniqueness of the nuclear energy space and our
               | application. In this case it's FermiBench[0].
               | 
               | When working with operating nuclear power plants there
               | are some fairly unique challenges:
               | 
               | 1. Document collections tend to be in the billions of
               | pages. When you have regulatory requirements to
               | extensively document EVERYTHING and plants that have been
               | operating for several decades you end up with a lot of
               | data...
               | 
               | 2. There are very strict security requirements -
               | generally speaking everything is on-prem and hard air-
               | gapped. We don't have the luxury of cloud elasticity.
               | Sparse embeddings are very efficient especially in terms
               | of RAM and storage. Especially important when factoring
               | in budgetary requirements. We're already dropping in
               | eight H100s (minimum) so it starts to creep up fast...
               | 
               | 3. Existing document/record management systems in the
               | nuclear space are keyword search based if they have
               | search at all. This has led to substantial user
               | conditioning - they're not exactly used to what we'd call
               | "semantic search". Sparse embeddings in combination with
               | other techniques bridge that well.
               | 
               | 4. Interpretability. It's nice to be able to peek at the
               | embedding and be able to get something out of it at a
               | glance.
               | 
               | So it's basically a combination of efficiency,
               | performance, and meeting users where they are. Our Fermi
               | model series is still v1 but we've found performance (in
               | every sense of the word) to be very good based on
               | benchmarking and initial user testing.
               | 
                | I should also add that some aspects of this (like
                | pretraining BERT) are fairly compute-intensive.
                | Fortunately we work with the Department of Energy's Oak
                | Ridge National Laboratory and developed all of this on
                | Frontier[1] (for free).
               | 
               | [0] - https://huggingface.co/datasets/atomic-
               | canyon/FermiBench
               | 
               | [1] -
               | https://en.wikipedia.org/wiki/Frontier_(supercomputer)
        
         | inbread wrote:
          | I've been experimenting with using embeddings for finding
          | relevant git commits, as I often don't know or remember the
         | exact word that was used. So I created my own little tool for
         | embedding and finding commits by commit messages. Maybe you'll
         | also find it useful: https://github.com/adrianmfi/git-semantic-
         | similarity
        
           | chamomeal wrote:
           | Very cool, I'll try this out!
        
         | mgraczyk wrote:
         | All modern AI technology can give more power to humans, you
         | just have to use the right tools. Every AI tool I can think of
         | has made me more productive.
         | 
         | LLMs help me write code faster and understand new libraries,
         | image generation helps me build sites and emails faster, etc
        
         | attentive wrote:
         | there is fzf, depending on your definition of "useful"
        
       | imgabe wrote:
       | Is there any benefit to fine-tuning a model on your corpus before
       | using it to generate embeddings? Would that improve the quality
       | of the matches?
        
         | gunalx wrote:
          | Yes, especially if you work in a poorly supported language
          | and/or have specific data pairs you want to match that might
          | be out-of-the-ordinary text.
          | 
          | Training your own fine-tune takes very little time and GPU
          | resources, and you can easily outperform even SOTA models on
          | your specific problem with a smaller model/vector space.
          | 
          | Then again, for general English text and basic fuzzy search,
          | I would not really expect high performance gains.
        
       | tomthe wrote:
       | Nice introduction, but I think that ranking the models purely by
       | their input token limits is not a useful exercise. Looking at the
       | MTEB leaderboard is better (although a lot of the models are
       | probably overfitting to their test set).
       | 
        | This is a good time to shill for my visualization of 5 million
        | embeddings of HN posts, users and comments:
        | https://tomthe.github.io/hackmap/
        
         | kaycebasques wrote:
         | Thanks, a couple other people gave me this same feedback in
         | another comment thread and it definitely makes sense not to
         | overindex on input token size. Will update that section in a
         | bit.
        
       | l5870uoo9y wrote:
       | Are there any visualization libraries that visualize embeddings
       | in a vector space?
        
         | f_devd wrote:
         | UMAP: https://umap-learn.readthedocs.io/en/latest/
         | 
         | scikit-learn also has options: https://scikit-
         | learn.org/stable/auto_examples/manifold/plot_...
        
         | sk11001 wrote:
          | There are attempts, but you can only do so much in
          | hundreds/thousands of dimensions. Most of the time the
          | visualization doesn't really provide anything meaningful.
        
         | beejiu wrote:
         | My instinct would be a principal component analysis (which
         | someone has demonstrated here:
         | https://www.youtube.com/watch?app=desktop&v=brt88wwoZtI). Not
         | sure it would tell you much though, but it looks nice.
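For a sense of what these suggestions do under the hood, here is a minimal PCA projection in plain NumPy (a toy sketch, not the linked libraries; real embedding clouds are far less Gaussian than this random example):

```python
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project (n, d) embeddings onto their top two principal
    components, suitable for a 2D scatter plot."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # shape (n, 2)

# Toy example: 100 random 384-dimensional "embeddings".
rng = np.random.default_rng(0)
points = pca_2d(rng.normal(size=(100, 384)))
print(points.shape)
```

The same `points` array can then be fed to any plotting library; UMAP and t-SNE differ mainly in that they preserve local neighborhoods rather than global variance.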
        
         | OutOfHere wrote:
         | If you need them visualized, you're already on the wrong track.
        
       | adamgordonbell wrote:
       | I was using embeddings to group articles by topic, and hit a
       | specific issue. Say I had 10 articles about 3 topics, and
       | articles are either dry or very casual in tone.
       | 
        | I found clustering by topic was hard, because tone dimensions
        | (whatever they were) seemed to dominate.
       | 
       | How can you pull apart the embeddings? Maybe use an LLM to
       | extract a topic, and then cluster by extracted topic?
       | 
       | In the end I found it easier to just ask an LLM to group articles
       | by topic.
        
         | eamag wrote:
         | I agree, I tried several methods during my pet project [1], and
          | all of them have their pros and cons. It looks like creating
          | topics first and predicting them with an LLM works best.
         | 
         | [1] https://eamag.me/2024/Automated-Paper-Classification
        
         | coredog64 wrote:
         | Allegedly, the new hotness in RAG is exactly that. Use a
         | smaller LLM to summarize the article and include that summary
         | alongside the article when generating the embedding.
         | 
         | Potentially solves your issue, but it is also handy when you
         | have to chunk a larger document and would lose context from
         | calculating the embedding just on the chunk.
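A rough sketch of that indexing pattern (both `summarize` and `embed` below are hypothetical stubs standing in for an LLM call and an embedding model):

```python
# Summary-augmented chunk indexing, sketched with stubs.
def summarize(document: str) -> str:
    return document[:100]  # stub for an LLM-generated summary

def embed(text: str) -> list[float]:
    return [float(ord(c)) for c in text[:8]]  # stub for a dense vector

def index_chunks(document: str, chunk_size: int = 500) -> list[tuple[str, list[float]]]:
    summary = summarize(document)
    entries = []
    for i in range(0, len(document), chunk_size):
        chunk = document[i:i + chunk_size]
        # Prepend the document-level summary so the chunk's embedding
        # keeps context that chunking would otherwise throw away.
        entries.append((chunk, embed(summary + "\n\n" + chunk)))
    return entries

entries = index_chunks("some long document " * 100)
print(len(entries))
```

At query time you search against the augmented embeddings but return the original chunk text.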
        
       | joerick wrote:
       | The thing that puzzles me about embeddings is that they're so
       | untargeted, they represent everything about the input string.
       | 
       | Is there a method for dimensionality reduction of embeddings for
       | different applications? Let's say I'm building a system to find
       | similar tech support conversations and I am only interested in
       | the content of the discussion, not the tone of it.
       | 
       | How could I derive an embedding that represents only content and
       | not tone?
        
         | adamgordonbell wrote:
          | Agreed - this is the biggest problem I hit with off-the-shelf
          | embeddings. We need a way to decompose embeddings.
        
         | johndough wrote:
         | You can do math with word embeddings. A famous example (which I
         | now see has also been mentioned in the article) is to compute
         | the "woman vector" by subtracting "man" from "woman". You can
         | then add the "woman vector" to e.g. the "king" vector to obtain
         | a vector which is somewhat close to "queen".
         | 
         | To adapt this to your problem of ignoring writing style in
         | queries, you could collect a few text samples with different
         | writing styles but same content to compute a "style direction".
         | Then when you do a query for some specific content, subtract
         | the projection of your query embedding onto the style direction
          | to eliminate the style:
          | 
          |     query_without_style = query -
          |         dot(query, style_direction) * style_direction
         | 
         | I suspect this also works with text embeddings, but you might
         | have to train the embedding network in some special way to
         | maximize the effectiveness of embedding arithmetic. Vector
         | normalization might also be important, or maybe not. Probably
         | depends on the training.
         | 
         | Another approach would be to compute a "content direction"
         | instead of a "style direction" and eliminate every aspect of a
         | query that is not content. Depending on what kind of texts you
         | are working with, data collection for one or the other
         | direction might be easier or have more/fewer biases.
         | 
         | And if you feel especially lazy when collecting data to compute
         | embedding directions, you can generate texts with different
         | styles using e.g. ChatGPT. This will probably not work as well
         | as carefully handpicked texts, but you can make up for it with
         | volume to some degree.
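A minimal NumPy sketch of this projection trick (the "style direction" here is estimated from toy paired vectors; real ones would come from an embedding model run on style-varied texts with the same content):

```python
import numpy as np

def remove_direction(query: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Subtract the projection of `query` onto a direction,
    leaving a vector orthogonal to it."""
    direction = direction / np.linalg.norm(direction)
    return query - np.dot(query, direction) * direction

# Toy "style direction": paired embeddings of the same content in two
# styles, averaged over the pairs (here the style shifts dimension 0).
rng = np.random.default_rng(1)
casual = rng.normal(size=(5, 16))
formal = casual + np.array([2.0] + [0.0] * 15)
style_direction = (formal - casual).mean(axis=0)

q = rng.normal(size=16)
q_no_style = remove_direction(q, style_direction)
# The de-styled query no longer varies along the style direction.
print(round(float(np.dot(q_no_style, style_direction)), 6))
```

The result is exactly orthogonal to the estimated direction; whether that actually removes "style" depends entirely on how well the paired samples isolate it.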
        
           | joerick wrote:
           | Interesting, but your hypothesis assumes that 'tone' is one-
           | dimensional, that there is a single axis you can remove. I
           | think tone is very multidimensional, I'd expect to be
           | removing multiple 'directions' from the embedding.
        
             | johndough wrote:
             | You could of course compute multiple "tone" directions for
             | every "tone" you can identify and subtract all of them. It
             | might work better, but it will definitely be more work.
        
             | jerf wrote:
             | I would say rather that the "standard example" is
             | simplified, but it does capture an essential truth about
             | the vectors. The surprise is not that the real world is
             | complicated and nothing is simply expressible as a vector
             | and that treating it as such doesn't 100% work in every way
             | in every circumstance all of the time. That's obvious.
             | Everyone who might work with embeddings gets it, and if
             | they don't, they soon will. The surprise is that it _does_
             | work as well as it does and does seem to be capturing more
             | than a naive skepticism would expect.
        
             | mattnewton wrote:
             | No, I don't think the author is saying one dimensional -
             | the vectors are represented by magnitudes in almost all of
             | the embedding dimensions.
             | 
             | They are still a "direction" in the way that [0.5, 0.5] in
             | x,y space is a 45 degree angle, and in that direction it
             | has a magnitude of around 0.7
             | 
             | So of course you could probably define some other vector
             | space where many of the different labeled vectors are
             | translated to magnitudes in the original embedding space,
             | letting you do things like have a "tone" slider.
        
               | TeMPOraL wrote:
               | I think GP is saying that GGP assumes "tone" is one
               | direction, in the sense there exists a vector V
               | representing "tone direction", and you can scale "tone"
               | independently by multiplying that vector with a scalar -
               | hence, 1 dimension.
               | 
               | I'd say this assumption is both right and wrong. Wrong,
               | because it's unlikely there's a direction in embedding
               | space corresponding to a platonic ideal of "tone". Right,
               | because I suspect that, for sufficiently large embedding
               | space (on the order of what goes into current LLMs), any
               | continuous concept we can articulate will have a
               | corresponding direction in the embedding space, that's
               | roughly as sharp as our ability to precisely define the
               | concept.
        
         | loa_in_ wrote:
          | They don't represent everything. In theory they do, but in
          | practice the choice of dimensions is a function of the model
          | itself; it's unique to each model.
        
           | joerick wrote:
           | Yeah, 'everything' as in 'everything that the model cares
           | about' :)
        
         | macNchz wrote:
         | Depends on the nature of the content you're working with, but
         | I've had some good results using an LLM during indexing to
         | generate a search document by rephrasing the original text in a
         | standardized way. Then you can search against the embeddings of
         | that document, and perhaps boost based on keyword similarity to
         | the original text.
        
           | joerick wrote:
           | Nice workaround. I just wish there was a less 'lossy' way to
           | go about it!
        
             | jacobr1 wrote:
              | Could you explicitly train a set of embeddings that
              | performed that step in the process? For example, when
              | computing the loss, you compare the difference against
              | the normalized text rather than the original. Or
              | alternatively do this as a fine-tuning. Then you would
              | have embeddings optimized for the characteristics you
              | care about.
        
             | hobs wrote:
              | Normal full-text search stuff helps reduce the search
              | space - e.g. lemmatization, stemming, and query
              | simplification were all around way before LLMs.
        
           | mrshu wrote:
           | This is also often referred to as Hypothetical Document
           | Embeddings (https://arxiv.org/abs/2212.10496).
        
             | adamgordonbell wrote:
             | Do you have examples of this? Please say more!
        
         | mrshu wrote:
         | Though not exactly what you are after, Contextual Document
         | Embeddings (https://huggingface.co/jxm/cde-small-v1), which
         | generate embeddings based on "surrounding context" might be of
         | some interest.
         | 
         | With 281M params it's also relatively small (at least for an
         | embedding model) so one can play with it relatively easily.
        
         | nostrebored wrote:
         | There are a few things you can do. If these access patterns are
         | well known ahead of time, you can train subdomain behavior into
         | the embedding models by using prefixing. E.g. content: fixing a
         | broken printer, tone: frustration about broken printer, and
         | "fixing a broken printer" can all be served by a single model.
         | 
         | We have customers doing this in production in other contexts.
         | 
         | If you have fundamentally different access patterns (e.g. doc
         | -> doc retrieval instead of query -> doc retrieval) then it's
         | often time to just maintain another embedding index with a
         | different model.
        
         | _pastel wrote:
          | You could fine-tune the embedding model to reduce cosine
          | distance on a more specific objective.
        
       | NameError wrote:
       | This article really resonates with me - I've heard people (and
       | vector database companies) describe transformer embeddings +
       | vector databases as primarily a solution for "memory/context for
       | your chatbot, to mitigate hallucinations", which seems like a
       | really specific (and kinda dubious, in my experience) use case
       | for a really general tool.
       | 
       | I've found all of the RAG applications I've tried to be pretty
       | underwhelming, but semantic search itself (especially combined
       | with full-text search) is very cool.
        
         | moffkalast wrote:
         | I dare say RAG with vector DBs is underwhelming because
         | embeddings are not underrated but appropriately rated, and will
         | not give you relevant info in every case. In fact, the way LLMs
         | retrieve info internally [0] already works along the same
         | principle and is a large factor in their unreliability.
         | 
         | [0] https://nonint.com/2023/10/18/is-the-reversal-curse-a-
         | genera...
        
       | dmezzetti wrote:
       | Author of txtai (https://github.com/neuml/txtai) here. I've been
       | in the embeddings space since 2020 before the world of
       | LLMs/GenAI.
       | 
       | In principle, I agree with much of the sentiment here. Embeddings
       | can get you pretty far. If the goal is to find information and
       | citations/links, you can accomplish most of that with a simple
       | embeddings/vector search.
       | 
       | GenAI does have an upside in that it can distill and process
       | those results into something more refined. One of the main
       | production use cases is retrieval augmented generation (RAG). The
       | "R" is usually a vector search but doesn't have to be.
       | 
       | As we see with things like ChatGPT search and Perplexity, there
       | is a push towards using LLMs to summarize the results but also
       | linking to the results to increase user confidence. Even Google
       | Search now has that GenAI section at the top. In general, users
       | just aren't going to accept LLM responses without source
        | citations at this point. The question is whether the summary
        | provides value or whether the citations provide the most value.
        | If it's the latter, then embeddings will get the job done.
        
       | esafak wrote:
        | Underrated by people who are unfamiliar with machine learning,
        | maybe.
        
         | vindex10 wrote:
          | I actually tend to agree. In the article, I didn't see a
          | strong argument highlighting exactly what powerful feature
          | people were missing in relation to embeddings. Those who work
          | in ML probably know these basics.
         | 
         | It is a nice read though - explaining the basics of vector
         | spaces, similarity and how it is used in modern ML
         | applications.
        
           | kaycebasques wrote:
           | > Hopefully it's clear from the domain name and intro that
           | I'm suggesting technical writers are underrating how useful
           | embeddings can be in our work. I know ML practitioners do not
           | underrate them.
           | 
           | https://news.ycombinator.com/item?id=42014036
           | 
           | > I didn't see the strong argument highlighting what powerful
           | feature exactly people were missing in relation to embeddings
           | 
           | I had to leave out specific applications as "an exercise for
           | the reader" for various reasons. Long story short, embeddings
           | provide a path to make progress on some of the fundamental
           | problems of technical writing.
        
         | lokar wrote:
          | Even by ML people from 25 years ago. It's a black box
          | function that maps from a ~30k space to a ~1k space. It's a
          | better function than things like PCA, but does the same
          | thing.
        
         | kkielhofner wrote:
         | LLMs have nearly completely sucked the oxygen out of the room
         | when it comes to machine learning or "AI".
         | 
          | I'm shocked at the number of startups, etc. you see trying to
          | do RAG that basically have no idea what these things are or
          | how they actually work.
         | 
         | The "R" in RAG stands for retrieval - as in the entire field of
         | information retrieval. But let's ignore that and skip right to
         | the "G" (generative)...
         | 
         | Garbage in, garbage out people!
        
       | jonathanrmumm wrote:
       | Embeddings are a new jump to universality, like the alphabet or
       | numbers.
       | https://thebeginningofinfinity.xyz/Jump%20to%20Universality
        
         | OutOfHere wrote:
         | Mind-blowing. In effect, among humans, what separates the
         | civilized from the crude is the quest for universality among
         | the civilized. To say it differently, thinking in terms of
         | attaining universality is the mark of a civilized mind.
         | 
         | I made an episode to appreciate the book:
         | https://podcasters.spotify.com/pod/show/podgenai/episodes/Th...
        
       | freediver wrote:
        | What would be really cool is if somebody figured out how to do
        | embeddings -> text.
        
         | kabla wrote:
         | Is it not possible? I'm not that familiar with the topic. Doing
         | some sort of averaging over a large corpus of separate texts
         | could be interesting and probably would also have a lot of
         | applications. Let's say that you are gathering feedback from a
         | large group of people and want to summarize it in an anonymized
         | way. I imagine you'd need embeddings with a somewhat large
         | dimensionality though?
        
         | cubefox wrote:
         | I wonder if someone has already tried to do that. Though this
         | might go in a similar direction:
         | https://arxiv.org/abs/1711.00043
        
         | 0x1ceb00da wrote:
          | That's ChatGPT.
        
         | kaibee wrote:
         | Hmm as a very stupid first pass...
         | 
         | 0. Generate an embedding of some text, so that you have a known
         | good embedding, this will be your target.
         | 
         | 1. Generate an array of random tokens the length of the
         | response you want.
         | 
         | 2. Compute the embedding of this response.
         | 
         | 3. Pick a random sub-section of the response and randomize the
         | tokens in it again.
         | 
         | 4. Compute the embedding of your new response.
         | 
         | 5. If the embeddings are closer together, keep your random
         | changes, otherwise discard them, go back to step 2.
         | 
         | 6. Repeat this process until going back to step 2 stops
         | improving your score. Also you'll probably want to shrink the
         | size of the sub-section you're randomizing the closer your
         | computed embedding is to your target embedding. Also you might
         | be able to be cleverer by doing some kind of masking strategy?
         | Like let's say the first half of your response text already was
         | actually the true text of the target embedding. An ideal
         | randomizer would see that randomizing the first half almost
         | always makes the result worse, and so would target the 2nd half
         | more often (I'm hoping that embeddings work like this?).
         | 
         | 7. Do this N times and use an LLM to score and discard the
         | worst N-1 results. I expect that 99.9% of the time you're
         | basically producing adversarial examples w/ this strategy.
         | 
         | 8. Feed this last result into an LLM and ask it to clean it up.
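A toy demonstration that this hill-climbing loop can work, using a deliberately simple stand-in embedding (an L2-normalized bag-of-tokens histogram) instead of a real model:

```python
import math
import random

VOCAB = 50  # toy vocabulary size

def embed(tokens: list[int]) -> list[float]:
    """Toy embedding: an L2-normalized bag-of-tokens histogram."""
    v = [0.0] * VOCAB
    for t in tokens:
        v[t] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def invert(target: list[float], length: int, steps: int = 5000) -> list[int]:
    rng = random.Random(0)
    guess = [rng.randrange(VOCAB) for _ in range(length)]  # step 1
    best = cosine(embed(guess), target)
    for _ in range(steps):
        i = rng.randrange(length)                # step 3: mutate one spot
        old = guess[i]
        guess[i] = rng.randrange(VOCAB)
        score = cosine(embed(guess), target)     # step 4: re-embed
        if score >= best:                        # step 5: keep if closer
            best = score
        else:
            guess[i] = old                       # ...otherwise discard
    return guess

seed_rng = random.Random(42)
secret = [seed_rng.randrange(VOCAB) for _ in range(20)]
recovered = invert(embed(secret), length=20)
print(round(cosine(embed(recovered), embed(secret)), 3))
```

With a real embedding model each step is far more expensive, and the loss surface is much less friendly than this histogram, so the concern about mostly producing adversarial examples (step 7) seems right.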
        
         | OutOfHere wrote:
         | Reconstruct text from SONAR embeddings:
         | https://github.com/facebookresearch/SONAR?tab=readme-ov-file...
        
       | ericholscher wrote:
        | This is a great post. I've also been having a lot of fun
        | working with embeddings, with lots of those pages being
        | documentation. We wrote up a quick post on how we are using
        | them in prod, if you want to go from having an embedding to
        | actually using them in a web app:
       | 
       | https://www.ethicalads.io/blog/2024/04/using-embeddings-in-p...
        
         | kaycebasques wrote:
         | Thanks, Eric. So what you're really telling me is that you
         | might make an exception to the "no tools talks" general policy
         | for Write The Docs conference talks and let me nerd out on
         | embeddings for 30 mins?? ;P
        
           | ericholscher wrote:
           | Haha. I think they are definitely relevant, and I'd call them
           | a technology more than a tool.
           | 
           | That is mostly just that we don't want folks going up and
           | doing a 30 minute demo of Sphinx or something :-)
        
       | huijzer wrote:
       | > Is it terrible for the environment?
       | 
       | > I don't know. After the model has been created (trained), I'm
       | pretty sure that generating embeddings is much less
       | computationally intensive than generating text. But it also seems
       | to be the case that embedding models are trained in similar ways
       | as text generation models2, with all the energy usage that
       | implies. I'll update this section when I find out more.
       | 
        | Although I do care about the environment, this question is
        | completely the wrong one if you ask me. There is in public
        | opinion (mainstream media?) some kind of idea that we should
        | use less AI and that this would somehow solve our climate
        | problems.
       | 
       | As a counterexample, let's go to the extreme. Let's ban Google
       | Maps because it does take computational resources from the phone.
       | As a result more people will take wrong routes, and thus use more
       | petrol. Say you use one gallon of petrol extra, that then wastes
       | 34 kWh. This is of course the equivalent of running 34 powerful
       | vacuum cleaners on full power for an hour. In contrast, say you
        | downloaded your map, then the total "cost" is only the energy
        | used by the phone. A mobile phone has a battery of about 4 Ah
        | (4,000 mAh), so 4 Ah * 4.2 V = 16.8 Wh, or about 0.017 kWh per
        | full charge. This means that the phone is roughly 2,000 times
        | as efficient! And then we didn't even consider the time-saving
        | for the human.
       | 
       | It's the same with running embeddings for doc generation. An
       | Nvidia H100 consumes about 700 W, so say 1 kWh after an hour of
       | running. 1 kWh should be enough to do a bunch of embedding runs.
       | If this then saves, for example, one workday including the
       | driving back and forth to the office, then again the tradeoff is
       | highly in favor of the compute.
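A quick sanity check of the gallon-vs-battery comparison (approximate figures: a US gallon of petrol holds about 33.7 kWh, and a 4 Ah phone battery at 4.2 V holds about 0.017 kWh):

```python
# Back-of-envelope comparison (approximate figures, illustration only).
PETROL_GALLON_KWH = 33.7               # energy in a US gallon of petrol
PHONE_BATTERY_KWH = 4.0 * 4.2 / 1000   # 4 Ah * 4.2 V = 16.8 Wh per charge

ratio = PETROL_GALLON_KWH / PHONE_BATTERY_KWH
print(round(ratio))  # on the order of 2,000
```

The exact multiplier depends on battery size and how much of a charge the navigation actually uses, but the conclusion (compute is vastly cheaper than the petrol it can save) holds either way.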
        
         | archerx wrote:
         | If people really cared about the environment then they would
         | ban residential air conditioning, it's a luxury.
         | 
         | I lived somewhere that would get to 40c in the summers and an
         | oscillating fan was good enough to keep cool, the AC was nice
         | to have but it wasn't necessary.
         | 
         | I find it very hypocritical when people tell you to change your
         | lifestyle for climate change but they have the A/C blasting all
         | day long, everyday.
        
           | wussboy wrote:
           | "Reduce" was never going to work. Only the deep
           | electrification of our economy will save us.
        
             | TeMPOraL wrote:
             | Indeed, and A/C is kind of a prime example of why it's
             | beneficial, given how much more energy-efficient heat pumps
             | are for cooling and heating than just about anything else.
        
           | ausbah wrote:
           | in a somewhat ironic twist, ac is crucial survival
           | infrastructure in some parts of the world when heat comes
           | during hot seasons. phoenix usa, parts of india, etc
        
           | coredog64 wrote:
           | Prepare to have your mind blown: Heat pumps are very energy
           | efficient at moving heat around. Significantly more so than
           | the oil-fired boiler frequently found in the basement of a
           | big-city apartment building.
        
           | BeetleB wrote:
           | > I lived somewhere that would get to 40c in the summers and
           | an oscillating fan was good enough to keep cool
           | 
           | If it's not too humid.
           | 
           | I've lived in places with relatively high humidity and
           | 35-40C, and have had the misfortune of not having an AC. Fans
           | are not enough. I mean, sure, you can _survive_ , but it
           | really, really sucks.
        
           | petesergeant wrote:
           | > that would get to 40c in the summers and an oscillating fan
           | was good enough to keep cool
           | 
           | This is entirely meaningless without providing the humidity.
           | At higher than 70% relative humidity 40C is potentially
           | fatal.
        
           | TeMPOraL wrote:
           | > _I lived somewhere that would get to 40c in the summers and
           | an oscillating fan was good enough to keep cool, the AC was
           | nice to have but it wasn't necessary._
           | 
           | As others have said, that works if you lived in a very dry
           | area. And perhaps it was a house or a building optimized for
           | airflow. And you didn't have much to do during the day. And
           | I'm guessing you're young, and healthy.
           | 
            | Here in central Europe, sustained 40°C in the summer would
            | rack up a significant body count. Anything above 30°C sucks
            | really bad if you have any work to do. Or if, say, you're
           | pregnant or have small children. And you live in a city.
           | 
           | Residential A/C isn't a luxury anymore, it's becoming a
           | necessity very fast. Fortunately, heat pumps are one of the
           | _most efficient_ inventions of humankind. In particular,
           | "A/C blasting all day long" beats anything else you could do
           | to mitigate the heat if it involved getting into a car. And
           | then it also beats whatever else you're doing to heat your
           | place during the winter.
        
         | dleeftink wrote:
          | Long-term, it's not about barring progress, but about having
          | progress _and_ more energy-efficient models. The sum-games we
          | play in regard to energy usage don't necessarily stack up in
          | dynamic systems; the increased energy usage of generative
          | models may very well lead to fewer compute hours spent behind
          | the desk drafting, revising, redrafting and doing it all over
          | again once the next project comes around.
          | 
          | What remains, though, is that increased productivity has
          | rarely led to a decrease in energy usage. Whether energy
          | scarcity will drive model optimisation is anyone's guess, but
          | it would be a differentiating feature in a market saturated
          | with similarly capable offerings.
        
       | ggnore7452 wrote:
        | If anything, I would consider embeddings a bit overrated, or at
        | least it is safer to underrate them.
        | 
        | They're not the silver bullet many initially hoped for, and
        | they're not a complete replacement for simpler methods like
        | BM25. They have only very limited "semantic understanding" (and
        | as people throw increasingly large chunks into embedding
        | models, the meanings get even fuzzier).
        | 
        | Overly high expectations lead people to believe that embeddings
        | will retrieve exactly what they mean, and with larger top-k
        | values and LLMs that are exceptionally good at rationalizing
        | responses, it can be difficult to notice mismatches unless you
        | examine the results closely.
        
         | nostrebored wrote:
          | Off-the-shelf embedding models definitely overpromise and
          | underdeliver. In ten years I'd be very surprised if companies
          | weren't fine-tuning embedding models for search on their own
          | data in any competitive domain.
        
           | kkielhofner wrote:
           | My startup (Atomic Canyon) developed embedding models for the
           | nuclear energy space[0].
           | 
           | Let's just say that if you think off-the-shelf embedding
           | models are going to work well with this kind of highly
           | specialized content you're going to have a rough time.
           | 
           | [0] - https://huggingface.co/atomic-canyon/fermi-1024
        
         | kkielhofner wrote:
         | > they're not a complete replacement for simpler methods like
         | BM25
         | 
         | There are embedding approaches that balance "semantic
         | understanding" with BM25-ish.
         | 
         | They're still pretty obscure outside of the information
         | retrieval space but sparse embeddings[0] are the "most" widely
         | used.
         | 
         | [0] - https://zilliz.com/learn/sparse-and-dense-embeddings
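To make the sparse-embedding idea concrete, here is a toy sketch (all terms and weights are made up; real learned-sparse models like the one linked above produce such term weights from a neural model): documents become {term: weight} maps, so scoring is a BM25-style sparse dot product, but the weights can encode learned term expansion and reweighting.

```python
# Toy sparse embeddings: each text is a {term: weight} map. Scoring is
# a sparse dot product, like BM25, but the weights could come from a
# learned model rather than raw term statistics. Values here are
# invented for illustration only.
def sparse_dot(query_vec: dict, doc_vec: dict) -> float:
    # Only shared terms contribute, which keeps scoring cheap and the
    # matches interpretable (you can see *which* terms fired).
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

query = {"reactor": 1.2, "coolant": 0.8}  # hypothetical learned weights
doc = {"reactor": 0.9, "coolant": 0.5, "pump": 0.3}
print(round(sparse_dot(query, doc), 2))  # -> 1.48
```

Unlike dense vectors, a zero score here genuinely means "no overlapping terms", which is part of why sparse methods remain a useful complement.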
        
         | deepsquirrelnet wrote:
         | Absolutely. Embeddings have been around for a while, and most
         | people don't realize that it wasn't until the e5 series of
         | models from Microsoft that they even benchmarked as well as
         | BM25 on retrieval scores, while being significantly more
         | costly to compute.
         | 
         | I think sparse retrieval with cross encoders doing reranking
         | is still significantly better than embeddings. Embedding
         | indexes are also difficult to scale, since HNSW consumes too
         | much memory above a few million vectors and IVF-PQ has issues
         | with recall.
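A rough back-of-the-envelope sketch of the memory point. All numbers are assumptions (float32 vectors, M=16 neighbors per node at the base layer, upper layers and allocator overhead ignored), so treat this as an order-of-magnitude estimate, not a vendor's sizing guide.

```python
# Rough memory estimate for an HNSW index, to show why it gets
# expensive past a few million vectors. Assumptions: float32 vectors,
# M=16 neighbor links per node at the base layer, everything else
# (upper layers, metadata, allocator overhead) ignored.
def hnsw_memory_gb(n_vectors: int, dim: int, m_links: int = 16) -> float:
    vector_bytes = n_vectors * dim * 4         # float32 storage
    graph_bytes = n_vectors * 2 * m_links * 4  # ~2*M int32 neighbor ids
    return (vector_bytes + graph_bytes) / 1e9

# 10M vectors at 1024 dims: ~42 GB of RAM before storing any text.
print(round(hnsw_memory_gb(10_000_000, 1024), 1))  # -> 42.2
```

The vector storage dominates; the graph links add only a few percent, which is why quantization schemes like IVF-PQ (with their recall tradeoffs) exist.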
        
       | mlinksva wrote:
       | https://technicalwriting.dev/data/embeddings.html#let-a-thou...
       | 
       | > As docs site owners, I wonder if we should start freely
       | providing embeddings for our content to anyone who wants them,
       | via REST APIs or well-known URIs. Who knows what kinds of cool
       | stuff our communities can build with this extra type of data
       | about our docs?
       | 
       | Interesting idea. You'd have to specify the exact embedding
       | model used to generate an embedding, right? Is there a well-
       | understood convention for such identification, like say
       | model_name:model_version:model_hash or something? For technical
       | docs, obviously a very broad field, is there an embedding model
       | (or a small number of them) so widely used or obviously
       | suitable that a site owner could choose it with some reasonable
       | expectation that publishing embeddings for their docs generated
       | using that model would be useful to others? (Naive questions, I
       | am not embedded in the field.)
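As far as I know there is no established convention. A hypothetical manifest served at a well-known URI might look like the sketch below; every field name here is made up, and the point is only that consumers can reuse published vectors exclusively when the model (and hence the vector space) is pinned exactly.

```python
# Hypothetical manifest a docs site could serve at something like
# /.well-known/embeddings.json. No such convention exists today; all
# field names are invented for illustration. The dimension shown
# matches OpenAI's text-embedding-3-large output size.
import json

manifest = {
    "model": "text-embedding-3-large",  # exact model identifier
    "provider": "openai",
    "dimensions": 3072,                 # output vector size
    "embeddings_url": "https://example.com/docs/embeddings.ndjson",
}
print(json.dumps(manifest, indent=2))
```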
        
         | skybrian wrote:
         | It seems like sharing the text itself would be a better API,
         | since it lets API users calculate their own embeddings easily.
         | This is what the crawlers for search engines do. If they use
         | embeddings internally, that's up to them, and it doesn't need
         | to be baked into the protocol.
        
         | treefarmer wrote:
         | Yeah, this is the main issue with the suggestion. Embeddings
         | can only be compared to each other if they are in the same
         | space (e.g., generated by the same model). Providing embeddings
         | of a specific kind would require users to use the same model,
         | which can quickly become problematic if you're using a closed-
         | source embedding model (like OpenAI's or Cohere's).
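A toy illustration of the same-space requirement (vectors are made up): cosine similarity is only defined between vectors of the same dimensionality, and even matching dimensions don't help if the models differ, since the axes mean different things.

```python
# Cosine similarity is only meaningful when both vectors come from the
# same embedding model: same space, same dimensionality. Vectors below
# are invented for illustration.
import math

def cosine(a: list, b: list) -> float:
    if len(a) != len(b):
        raise ValueError("vectors of different dimensions are not comparable")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Same hypothetical model, same space: comparison is meaningful.
doc_a = [0.1, 0.3, 0.5]
doc_b = [0.2, 0.1, 0.4]
print(round(cosine(doc_a, doc_b), 3))  # -> 0.922

# A 1536-dim vector vs a 3072-dim vector: raises ValueError. And even
# two 1536-dim vectors from *different* models give a number that means
# nothing, which is the sneakier failure mode.
```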
        
       | OutOfHere wrote:
       | This article shows an incorrect value for the OpenAI text-
       | embedding-3-large input limit, giving 3072, which is actually
       | its output dimension [1]. The correct input limit is 8191
       | tokens [2].
       | 
       | Edit: This value has now been fixed in the article.
       | 
       | [1]
       | https://platform.openai.com/docs/models/embeddings#embedding...
       | 
       | [2]
       | https://platform.openai.com/docs/guides/embeddings/#embeddin...
       | 
       | Also, what each model means by a token can be very different due
       | to the use of different model-specific encodings, so ultimately
       | one must compare the number of characters, not tokens.
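A toy illustration of why token counts don't transfer between models while character counts do. The two fake tokenizers below stand in for different model-specific encodings (real models use BPE variants); they are not real tokenizers.

```python
# "Token" counts are tokenizer-specific; character counts are not.
# These two fake tokenizers stand in for different model-specific
# encodings (real embedding models use BPE-style schemes).
def whitespace_tokens(text: str) -> list:
    return text.split()

def char_bigram_tokens(text: str) -> list:
    return [text[i:i + 2] for i in range(0, len(text), 2)]

text = "embeddings are underrated"
print(len(text))                       # 25 characters: model-agnostic
print(len(whitespace_tokens(text)))    # 3 "tokens" under one scheme
print(len(char_bigram_tokens(text)))   # 13 under another
```

So "8191 tokens" means a different amount of text depending on the encoding, which is why comparing character budgets across models is the safer apples-to-apples measure.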
        
         | kaycebasques wrote:
         | A couple other issues with that section surfaced here:
         | 
         | * https://news.ycombinator.com/item?id=42014683
         | 
         | * https://news.ycombinator.com/item?id=42015282
         | 
         | Updating that section now
        
         | abound wrote:
         | "Reckless" seems a bit aggressive for what is likely an honest
         | mistake in an otherwise very nice article.
        
           | OutOfHere wrote:
           | Edited.
        
       | tootie wrote:
       | Is it accurate to say that any data that can be tokenized can be
       | turned into embeddings?
        
       | eproxus wrote:
       | I wonder if this can be used to detect code similarity between,
       | e.g., functions or files? Or are the existing algorithms overly
       | trained on written prose?
        
         | OutOfHere wrote:
         | Yes, of course it can be used in that way, but the quality of
         | the result depends on whether the model was also trained on
         | such code.
        
       | hambandit wrote:
       | Embeddings from things like one-hot encoding, count
       | vectorization, tf-idf, etc., fed into dimensionality reduction
       | techniques like SVD and PCA, have been around for a long time
       | and also provided the ability to compare any two pieces of text
       | to each other. Yes, neural networks and LLMs have made it
       | possible for the context of each word to affect the whole
       | document's embedding and to capture more meaning, potentially
       | even that pesky "semantic" sort; but they are still
       | fundamentally a dimensionality reduction technique.
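A minimal sketch of that classic pre-neural pipeline, using a tiny made-up corpus: tf-idf turns each document into a sparse term-weight vector, and any two documents can then be compared with cosine similarity. (A real setup would add the SVD/PCA reduction step on top; it is omitted here for brevity.)

```python
# Minimal tf-idf sketch: each document becomes a sparse {term: weight}
# vector, comparable to any other via cosine similarity. The corpus is
# invented; the SVD/PCA dimensionality-reduction step that classic LSA
# pipelines apply afterwards is omitted.
import math
from collections import Counter

docs = [
    "embeddings compare documents",
    "embeddings compare text",
    "cats chase mice",
]

def tfidf(doc: str, corpus: list) -> dict:
    tf = Counter(doc.split())
    n = len(corpus)
    # Weight = term frequency * log(inverse document frequency).
    return {
        t: count * math.log(n / sum(t in d.split() for d in corpus))
        for t, count in tf.items()
    }

def cosine(a: dict, b: dict) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = [tfidf(d, docs) for d in docs]
# The two embedding-related docs score higher with each other than
# either does with the unrelated one.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # -> True
```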
        
       ___________________________________________________________________
       (page generated 2024-11-01 23:01 UTC)