[HN Gopher] Embeddings are underrated
___________________________________________________________________
Embeddings are underrated
Author : misonic
Score : 289 points
Date : 2024-11-01 03:24 UTC (19 hours ago)
(HTM) web link (technicalwriting.dev)
(TXT) w3m dump (technicalwriting.dev)
| kaycebasques wrote:
| Cool, first time I've seen one of my posts trend without me
| submitting it myself. Hopefully it's clear from the domain name
| and intro that I'm suggesting technical writers are underrating
| how useful embeddings can be in our work. I know ML practitioners
| do not underrate them.
| dartos wrote:
| Yeah embeddings are the unsung killer feature of LLMs
| donavanm wrote:
| You might want to highlight chunking and how embeddings
| can/should represent subsections of your document as well. It
| seems relevant to me for cases like similarity or semantics
| search, getting the reader to the relevant portion of the
| document or page.
|
| There are probably some interesting ideas around tokenization and
| metadata as well. For example, if you're processing the raw
| file I expect you want to strip out a lot of markup before
| tokenization of the content. Conversely, some markup like code
| blocks or examples would be meaningful for tokenization and
| embedding anyways.
|
| I wonder if both of those ideas can be combined for something
| like automated footnotes and annotations. Linking or mouseover
| relevant content from elsewhere in the documentation.
| MrGreenTea wrote:
| Do you have any resources you recommend for representing sub
| sections? I'm currently prototyping a note/thoughts editor
| where one feature is suggesting related documents/thoughts
| (think linked notes in Obsidian) for which I would like to
| suggest sub sections and not only full documents.
| donavanm wrote:
| Sorry, no good references off hand. I've had to help write
| & generate public docs in DocBook in the past, but I'm no
| expert in editors, NLP, or embeddings beyond hacking around
| some tools for my own note taking. My assumption is you'll
| want to use your existing markup structure, if you have it.
| Or naively split on paragraphs with a tool like spaCy. Or
| get real fancy and use dynamic ranges: something like an
| accumulation window that aggregates adjacent sentences based
| on individual similarity, breaks on total size or
| dissimilarity, and then treats that aggregate as the range
| to "chunk."
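| A rough sketch of that accumulation-window idea, with a
| bag-of-words cosine standing in for real sentence embeddings
| (the min_sim threshold and max_len cap are made-up values
| for illustration):

```python
# Accumulation-window chunking: grow a chunk sentence by sentence,
# and start a new chunk when the next sentence drifts too far from
# the running chunk or the chunk hits a size cap. A bag-of-words
# cosine stands in for real sentence embeddings here.
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def chunk(sentences, min_sim=0.2, max_len=4):
    chunks, current = [], []
    for sent in sentences:
        vec = Counter(sent.lower().split())
        ctx = Counter(w for s in current for w in s.lower().split())
        if current and (cosine(ctx, vec) < min_sim or len(current) >= max_len):
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = [
    "Embeddings map text to vectors.",
    "Similar text yields nearby vectors.",
    "Our cat sleeps all day.",
]
chunks = chunk(sentences)
print(chunks)
```

| In practice you'd swap the bag-of-words vectors for model
| embeddings and tune min_sim/max_len on your own documents.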
| MrGreenTea wrote:
| Thanks for the elaborate and helpful response. I'm also
| hacking on this as a personal note taking project and
| already started playing around with your ideas. Thanks!
| enjeyw wrote:
| Haha yeah I was about to comment that I recall a period just
| after Word2Vec came out where embeddings were most definitely
| not underrated but rather the most hyped ML thing out there!
| rahimnathwani wrote:
| I'm not sure why the voyage-3 models aren't on the MTEB
| leaderboard. The code for the leaderboard suggests they should be
| there:
| https://huggingface.co/spaces/mteb/leaderboard/commit/b7faae...
|
| But I don't see them when I filter the list for 'voyage'.
| newrotik wrote:
| It is unclear whether this model should be on that
| leaderboard, because we don't know whether it has been
| trained on MTEB test data.
|
| It is worth noting that their own published material [0]
| does not include scores from any dataset in the MTEB
| benchmark.
|
| This may sound nitpicky, but considering transformers'
| parroting capabilities, having seen test data during
| training should be expected to completely invalidate those
| scores.
|
| [0] see excel spreadsheet linked here
| https://blog.voyageai.com/2024/09/18/voyage-3/
| jdthedisciple wrote:
| I'm critical of the low number of embedding dims.
|
| Could hurt performance in niche applications, in my
| estimation.
|
| Looking forward to trying the announced large models though.
| fzliu wrote:
| (I work at Voyage)
|
| Many of the top-performing models that you see on the MTEB
| retrieval leaderboards for English and Chinese tend to
| overfit to the benchmark nowadays. voyage-3 and
| voyage-3-lite are also pretty small in size compared to a
| lot of the 7B models that take the top spots, and we don't
| want to hurt performance on other real-world tasks just to
| do well on MTEB.
| jdthedisciple wrote:
| It would still be great to know how it compares?
|
| Why should I pick voyage-3 if for all I know it sucks when it
| comes to retrieval accuracy (my personally most important
| metric)?
| fzliu wrote:
| We provide retrieval metrics for a variety of datasets and
| languages: https://blog.voyageai.com/2024/09/18/voyage-3/.
| I also personally encourage folks to either test on their
| own data or to find an open source dataset that closely
| resembles the documents they are trying to search (we
| provide a ton of free tokens for evaluating our models).
| kkielhofner wrote:
| > we don't want to hurt performance on other real-world tasks
| just to do well on MTEB
|
| Nice!
|
| Fortunately MTEB lets you sort by model parameter size
| because using 7B parameter LLMs for embeddings is just...
| Yuck.
| quantadev wrote:
| That was a good post. Vector embeddings are in some sense a
| unique summary of a doc, similar to a hashcode. It makes me
| think it would be cool if there were some universal standard
| for generating embeddings, but I guess they'll be different
| for each AI model, so they can't have the same kind of
| "permanence" hash codes have.
|
| It definitely also seems like there should be lots of ways to
| utilize "Cosine Similarity" (or other closeness algos) in
| databases and other information processing apps that we haven't
| really exploited yet. For example you could almost build a new
| kind of Job Search Service that matches job descriptions to job
| candidates based on nothing but a vector similarity between
| resume and job description. That's probably so obvious it's being
| done, already.
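| The core of that job-matching idea is a one-liner once you
| have embeddings. A toy sketch, where the 3-d vectors are
| hand-made stand-ins for real embedding-model output (which
| would be hundreds of dimensions):

```python
# Rank resumes against a job description by cosine similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

job = [0.9, 0.8, 0.1]                # hypothetical "backend Java" direction
resumes = {
    "candidate_1": [0.8, 0.9, 0.2],  # mostly backend Java
    "candidate_2": [0.1, 0.2, 0.9],  # mostly frontend
}

ranked = sorted(resumes, key=lambda name: cosine(job, resumes[name]),
                reverse=True)
print(ranked)
```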
| kqr wrote:
| For one point of inspiration, see
| https://entropicthoughts.com/determining-tag-quality
|
| I really like the picture you are drawing with "semantic
| hashes"!
| quantadev wrote:
| Yeah for "Semantic Hashes" (that's a good word for them!)
| we'd need some sort of "Canonical LLM" model that isn't
| necessarily used for inference, nor does it need to even be
| all that smart, but it just needs to be public for the world.
| It would need to be updated like every 2 to 5 years though to
| account for new words or words changing meaning? ...but maybe
| could be updated in such a way as to not "invalidate" prior
| vectors, if that makes sense? For example "ride a bicycle"
| would still point in the same direction even after a refresh
| of the canonical model? It seems like feeding the same
| training set could replicate the same model values, but there
| are nonlinear instabilities which could make it disintegrate.
| helloplanets wrote:
| I guess it might be possible to retroactively create an
| embeddings model which could take several different models'
| embeddings, and translate them into the same format.
| genuinelydang wrote:
| No. That's like saying you can transplant a person's neuronal
| action potentials into another person's brain and have it
| make sense to them.
| helloplanets wrote:
| That metaphor is skipping the most important part in
| between! You wouldn't be transplanting anything directly,
| you'd have a separate step in between, which would attempt
| to translate these action potentials.
|
| The point of the translating model in between would be that
| it would re-weight each and every one of the values of the
| embedding, after being trained on a massive dataset of
| original text -> vector embedding for model A + vector
| embedding for model B. If you have billions of parameters
| trained to do this translation between just two specific
| models to start with, wouldn't this be in the realm of
| possible?
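| As a toy existence proof of such a translator: if model B's
| space were an exact linear transform of model A's (here, a
| 90-degree rotation), the map is recoverable from a couple of
| paired embeddings by solving M = B A^-1. Real spaces are
| only approximately related, so in practice M would be fit
| with least squares or a small network over many pairs:

```python
# Recover a linear map between two toy 2-D "embedding spaces"
# from paired embeddings of the same texts.

def inv_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(m, n):
    return [[sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

# Columns are paired embeddings of the same two texts:
# model A gives (1, 0) and (1, 2); model B gives (0, 1) and (-2, 1).
A = [[1.0, 1.0],
     [0.0, 2.0]]
B = [[0.0, -2.0],
     [1.0, 1.0]]

M = matmul(B, inv_2x2(A))          # the learned "translator"
translated = apply(M, [3.0, 4.0])  # unseen model-A vector into B-space
print(translated)
```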
| quantadev wrote:
| A translation between models doesn't seem possible
| because there are actually no "common dimensions" at all
| between models. That is, each dimension has a completely
| different semantic meaning, in different models, but also
| it's the combination of dimension values that begin to
| impart real "meaning".
|
| For example, the number of different unit vector
| combinations in a 1500-dimensional space is like the number
| of different ways of "ordering" the components, which is
| 1500! (a number with over 4,000 digits).
|
| EDIT: And the point of that factorial is that even if the
| dimensions were "identical" across two different LLMs but
| merely "scrambled" (in ordering), there would be that huge
| number to contend with to "unscramble".
| tempusalaria wrote:
| This is very similar to how LLMs are taught to understand
| images in llava style models (the image embeddings are
| encoded into the existing language token stream)
| batch12 wrote:
| This is definitely possible. I made something like this. It
| worked pretty well for cosine similarity in my testing.
| nostrebored wrote:
| This is done with two models in most standard biencoder
| approaches. This is how multimodal embedding search works. We
| want to train a model such that the location of the text
| embeddings that represent an item and the image embeddings
| for that item are colocated.
| genuinelydang wrote:
| "you could almost build a new kind of Job Search Service that
| matches job descriptions to job candidates"
|
| The key word being "almost". Yes, you can get similarity
| matches between job requirements and candidate resumes, but
| those matches are not useful for the task of finding an optimal
| candidate for a job.
|
| For example, say a job requires A and B.
|
| Candidate 1 is a junior who has done some work with A, B and C.
|
| Candidate 2 is a senior and knows A, B, C, D, E and F by heart.
| All are relevant to the job and would make 2 the optimal
| candidate, even though C-F are not explicitly stated in the job
| requirements.
|
| Candidate 1 would seem a much better candidate than 2, because
| 1's embedding vector is closer to the job embedding vector.
| coldtea wrote:
| Even that is just static information.
|
| We don't know if Candidate 2 really "knows A, B, C, D, E and
| F by heart", just that they claim to. They could be adding
| anything to their skill list, even though they hardly used
| it, just because it's a buzzword.
|
| So Candidate 1 could still blow them out of the water in
| performance, and even be able to trivially learn D, and E in
| a short while on the job if needed.
|
| The skill vector won't tell much by itself, and can even
| prevent finding the better candidate if it's used for
| screening.
| quantadev wrote:
| So your point is that LLMs can't tell when job candidates
| are lying on their resume? Well that's true, but neither
| can humans. lol.
| nostrebored wrote:
| That's not accurate. You can explicitly bake in these types
| of search behaviors with model training.
|
| People do this in ecommerce with the concept of user
| embeddings and product embeddings, where the result of
| personalized recommendations is just a user embedding search.
| quantadev wrote:
| > not useful for the task of finding an optimal candidate
|
| That statement is just flat out incorrect on its face.
| However, it did make me think of something I hadn't thought
| of before, which is this:
|
| Embedding vectors can be made to have a "scale" (multiplier)
| on specific terms which represent the amount of "weight" to
| add to that term. For example if I have 10 years experience
| in Java Web Development, then we can take the actual
| components of that vector embedding (i.e. for string "Java
| Web Development") and multiply them by some proportion of
| 10, and that results in a vector that is "further" in that
| direction. This represents an "amount" of experience in the
| Java Web direction.
|
| So this means even with vector embeddings we can scale out to
| specific amounts of experience. Now here's the cool part. You
| can then take all THOSE scaled vectors (one for each
| individual job candidate skill) and average them to get a
| single point in space which CAN be compared as a single
| scalar distance from what the Job Requirements specify.
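| A sketch of that scale-and-average idea. The 3-d "skill
| embeddings" are hypothetical stand-ins for model output, and
| whether this weighting actually helps would need evaluation:

```python
# Scale each skill's embedding by years of experience, average the
# scaled vectors into one vector per side, then compare with a
# single Euclidean distance.

def scale(vec, factor):
    return [x * factor for x in vec]

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

skills = {"java": [1.0, 0.0, 0.0], "sql": [0.0, 1.0, 0.0]}

candidate_years = {"java": 10, "sql": 2}
candidate_vec = mean([scale(skills[s], y) for s, y in candidate_years.items()])

job_years = {"java": 10, "sql": 5}
job_vec = mean([scale(skills[s], y) for s, y in job_years.items()])

# One scalar "fit" score: distance between the averaged vectors.
dist = sum((a - b) ** 2 for a, b in zip(candidate_vec, job_vec)) ** 0.5
print(dist)
```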
| OutOfHere wrote:
| The trick is to evaluate the score for each skill, also weighing
| it by the years of experience with the skill, then sum the
| evaluations. This will address your problem 100%.
|
| Also, what a candidate claims as a skill is totally
| irrelevant and can be a lie. It is the work experience that
| matters, and skills can be extracted from it.
| rasulkireev wrote:
| I tried doing something like that: https://gettjalerts.com/
|
| I added semantic search, but I'm working on adding resume
| upload/parsing to do automatic matching.
| SCUSKU wrote:
| It does exist! I built this for the monthly Who's Hiring
| threads: https://hnresumetojobs.com/
|
| It just does cosine similarity with OpenAI embeddings +
| pgVector. It's not perfect by any means, but it's useful. It
| could probably stand to be improved with a re-ranker, but I
| just never got around to it.
| quantadev wrote:
| Very cool. I knew it was too obvious an idea to be missed!
| Did you read my comments below about how you can maybe "scale
| up" a vector based on number of years of experience. I think
| that will work. It makes somebody with 10 yrs Java Experience
| closer to the target than someone with only 5yrs, if the
| target is 10 years! -- but the problem is someone with 20yrs
| looks even worse when they should look better! My problem in
| my life. hahaha. Too much experience.
|
| I think the best "matching" factor is to minimize total
| distance where each distance is the time-multiplied vector
| for a specific skill.
| Aeolun wrote:
| Is there some way to compare different embeddings for different
| use cases?
| jdthedisciple wrote:
| Search for MTEB Leaderboard on huggingface
| fzliu wrote:
| Great post!
|
| One quick minor note is that the resulting embeddings for the
| same text string could be different, depending on what you
| specify the input type as for retrieval tasks (i.e. query or
| document) -- check out the `input_type` parameter here:
| https://docs.voyageai.com/reference/embeddings-api.
| thund wrote:
| Doesn't OpenAI's embedding model support 8191/8192 tokens?
| That aside, declaring a winner by token size is misleading.
| There are more important factors, like cross-language
| support and precision, for example.
| jdthedisciple wrote:
| Yep, voyage-3 is not even anywhere in the top of the MTEB
| leaderboard if you order by `retrieval score` desc.
|
| stella_en_1.5B_v5 seems to be an unsung hero model in that
| regard
|
| plus you may not even want such large token sizes if you just
| need accurate retrieval of snippets of text (like 1-2
| sentences)
| kaycebasques wrote:
| Thanks thund and jdthedisciple for these points and
| corrections. I'll update the section today.
| kaycebasques wrote:
| Updated the section to refer to the "Retrieval Average"
| column of the MTEB leaderboard. Is that the right column to
| refer to? Can someone link me to an explanation of how that
| benchmark works? Couldn't find a good link on it
| OutOfHere wrote:
| And that's not all because token encodings of different models
| can be very different.
| nerdright wrote:
| Great post indeed! I totally agree that embeddings are
| underrated. I feel like the "information retrieval/discovery"
| world is stuck using spears (i.e., term/keyword-based discovery)
| instead of embracing the modern tools (i.e., semantic-based
| discovery).
|
| The other day I found myself trying to figure out some common
| themes across a bunch of comments I was looking at. I felt
| too lazy to go through all of them, so I turned my attention
| to the
| "Sentence Transformers" lib. I converted each comment into a
| vector embedding, applied k-means clustering on these embeddings,
| then gave each cluster to ChatGPT to summarize the corresponding
| comments. I have to admit, it was fun doing this and saved me
| lots of time!
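| The pipeline described above can be sketched like this. The
| comment embeddings are tiny stand-in vectors (real ones
| would come from Sentence Transformers), and the LLM summary
| step is stubbed out:

```python
# Cluster comment embeddings with k-means, then summarize each
# cluster (summarization stubbed; real code would call a chat API).

def kmeans(points, k, iters=10):
    dim = len(points[0])
    centers = [list(p) for p in points[:k]]   # naive init; use k-means++ in practice
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            groups[nearest].append(p)
        centers = [[sum(p[d] for p in g) / len(g) for d in range(dim)]
                   if g else centers[i] for i, g in enumerate(groups)]
    return groups

def summarize_cluster(cluster):
    # Stand-in for the ChatGPT call that summarizes each cluster.
    return f"a theme shared by {len(cluster)} comments"

# Six "comment embeddings" around two obvious themes.
embeddings = [(0.1, 0.0), (0.2, 0.1), (0.0, 0.2),
              (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
clusters = kmeans(embeddings, k=2)
summaries = [summarize_cluster(c) for c in clusters]
print(summaries)
```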
| Gooblebrai wrote:
| Interesting approach. Did you tell GPT to summarise the
| comments of each cluster after grouping them?
| empiko wrote:
| My hot take: embeddings are overrated. They are overfitted on
| word overlap, leading to both many false positives and false
| negatives. If you identify a specific problem with them ("I
| really want to match items like these, but it does not work"), it
| is almost impossible to fix them. I often see them being used
| inappropriately, by people who read about their magical
| properties, but didn't really care about evaluating their
| results.
| cheevly wrote:
| You can easily fix this using embedding arithmetic to build
| embedding classifiers.
| mbanerjeepalmer wrote:
| Are there good examples of this working in the wild? Before I
| comb through all ten blue links...
| https://www.google.com/search?q=embedding%20arithmetic%20emb...
| nostrebored wrote:
| "I really want to match items like these, but it does not work"
| is just a fine tuning problem.
| empiko wrote:
| Yes, in the sense that it is possible if you have an
| infinite appropriate dataset and compute. No, in the sense
| of what is practically achievable.
| nostrebored wrote:
| You don't need infinite data. You need ~100k samples. It's
| also not particularly expensive.
| deepsquirrelnet wrote:
| I think there is a deeper technical truth to this that hints at
| how much space there is to be gained in optimization.
|
| 1) that matryoshka representations work so well, and as few as
| 64 dimensions account for a large majority of the performance
|
| 2) that dimensional collapse is observed. Look at your cosine
| similarity scores and be amazed that everything is pretty
| similar and despite being a -1 to 1 scale, almost nothing is
| ever less than 0.8 for most models
|
| I think we're at the infancy in this technology, even with all
| of the advances in recent years.
| mrob wrote:
| Embeddings are the only aspect of modern AI I'm excited about
| because they're the only one that gives more power to humans
| instead of taking it away. They're the "bicycle for our minds" of
| Steve Jobs fame; intelligence amplification not intelligence
| replacement. IMO, the biggest improvement in computer usability
| in my lifetime was the introduction of fast and ubiquitous local
| search. I use Firefox's "Find in Page" feature probably 10 or
| more times per day. I use find and grep probably every day. When
| I read man pages or logs, I navigate by search. Git would be
| vastly less useful without git grep. Embeddings have the
| potential to solve the biggest weakness of search by giving us
| fuzzy search that's actually useful.
| gwervc wrote:
| I agree with this view. Generative AI robs us of something
| (thinking, practicing), the long-term ability to practice a
| skill and improve oneself, in exchange for an immediate
| (often crappy) result. Embeddings are a tech that can help
| us solve problems, but we still have to do most of the work.
| wussboy wrote:
| I'm not sure it robs us. It makes it possible, but many
| people including myself find the artistic products of AI to
| be utterly without value for the reasons you list. I will
| always cherish the product of lifelong dedication and human
| skill
| jacobr1 wrote:
| It doesn't diminish - but I do find it interesting how it
| influences. Realism became less important, less
| interesting, though still valued to a lesser degree, with
| the ubiquity of photography. Where will human creativity
| move when certain tasks become trivially machine
| replicable? Where will human ingenuity _enabled_ by new
| technology make new art possible?
| larve wrote:
| I ask LLMs to give me exercises, tutorials then write up my
| experience into "course notes", along with flashcards. I ask
| it to simulate a teacher, I ask it to simulate students that
| I have to teach, etc...
|
| I haven't found a tool that is more effective in helping me
| learn.
| greentxt wrote:
| Great for learning for learning's sake. Learning with the
| intention of pursuing a career also requires the
| economic/job model, which is the problem.
| stocknoob wrote:
| Does a player piano rob you of playing music yourself? A car
| from walking? A wheelbarrow from working out? It's up to you
| if you want to stop practicing!
|
| Chess has become even more popular despite computers that can
| "rob us" of the joy. They're even better practice partners.
| crashabr wrote:
| An individual car doesn't stop you from walking but a
| culture that centers cars leads to cities where walking is
| outright dangerous.
|
| Most car owners would never say outright "I want a car-
| centric culture". But car manufacturers lobbied for it, and
| step by step, we got both the deployment of useful car
| infrastructure, and the destruction or ignoring of all
| amenities useful for people walking or cycling.
|
| Now let's go back to the period where cars start to become
| enormously popular, and cities start to build neighborhoods
| without sidewalks. There was probably someone at the time
| complaining about the risk of cars overtaking walking and
| leading to stores being farther away, etc. And in front of
| them was probably someone like you calling them a luddite
| and being oblivious of second order effects.
| TeMPOraL wrote:
| So you're saying, embeddings are fine, as long as we refrain
| from making full use of their capabilities? We've hit on a
| mathematical construct that seems to be able to _capture
| understanding_ , and you're saying that the biggest models are
| too big, we need to scale down, only use embeddings for
| surface-level basic similarities?
|
| I too think embeddings are vastly underutilized, and chat
| interface is not the be-all, end-all (not to mention, "chat
| with your program/PDF/documentation" just sounds plain stupid).
| However, whether current AI tools are replacing or amplifying
| your intelligence, is entirely down to how you use them.
|
| As for search, yes, that was a huge breakthrough and powerful
| amplifier. 2+ decades ago. At this point it's computer use 101
| - which makes it sad when dealing with programs or websites
| that are opaque to search, and "ubiquitous local search" is
| still not here. Embeddings can and hopefully will give us
| better fuzzy/semantic searching, but if you push this far
| enough, you'll have to stop and ask - if the search tool is now
| capable of understanding some aspects of my data, why not surface
| this understanding as a different view into data, instead of
| just invoking it in the background when user makes a search
| query?
| autokad wrote:
| I have found that embeddings + LLM is very successful. I'm
| going to make the words up so as not to reveal my work
| publicly, but I had to classify something into 3 categories.
| I asked a simple LLM to label it; it was 95% accurate.
| Taking the min distance from the word embeddings to the mean
| category embeddings was about 96%. When I gave the LLM the
| embedding prediction, the LLM was 98% accurate.
|
| There were cases an embedding model might not do well on,
| whereas the LLM could handle them. For example: these were
| camel-case words, like WoodPecker, AquafinaBottle, and
| WoodStock (I changed the words to not reveal private data).
| WoodPecker and WoodStock would end up with close embedding
| values because the word Wood dominated the embedding values,
| but these were supposed to go into 2 different categories.
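| The min-distance-to-category-mean step described above is
| essentially nearest-centroid classification. A toy sketch,
| where the vectors and labels are invented, not the
| commenter's data:

```python
# Nearest-centroid classification: average the embeddings of labeled
# examples per category, then assign new items to the closest mean.
import math

def mean(vectors):
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

labeled = {
    "bird":   [[0.9, 0.1], [0.8, 0.2]],
    "bottle": [[0.1, 0.9], [0.2, 0.8]],
}
centroids = {cat: mean(vs) for cat, vs in labeled.items()}

def classify(vec):
    return min(centroids, key=lambda c: dist(vec, centroids[c]))

print(classify([0.85, 0.15]))
```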
| bravura wrote:
| Some of the best performing embedding models
| (https://huggingface.co/spaces/mteb/leaderboard) are LLMs.
| Have you tried them?
| kkielhofner wrote:
| > word Wood dominated the embedding values, but these were
| supposed to go into 2 different categories
|
| When faced with a similar challenge we developed a custom
| tokenizer, pretrained BERT base model[0], and finally a
| SPLADE-esque sparse embedding model[1] on top of that.
|
| [0] - https://huggingface.co/atomic-canyon/fermi-bert-1024
|
| [1] - https://huggingface.co/atomic-canyon/fermi-1024
| bravura wrote:
| Do you mind sharing why you chose SPLADE-esque sparse
| embeddings?
|
| I have been working on embeddings for a while.
|
| For different reasons I have recently become very
| interested in learned sparse embeddings. So I am curious
| what led you to choose them for your application, and
| why?
| kkielhofner wrote:
| > Do you mind sharing why you chose SPLADE-esque sparse
| embeddings?
|
| I can provide what I can provide publicly. The first
| thing we ever do is develop benchmarks given the
| uniqueness of the nuclear energy space and our
| application. In this case it's FermiBench[0].
|
| When working with operating nuclear power plants there
| are some fairly unique challenges:
|
| 1. Document collections tend to be in the billions of
| pages. When you have regulatory requirements to
| extensively document EVERYTHING and plants that have been
| operating for several decades you end up with a lot of
| data...
|
| 2. There are very strict security requirements - generally
| speaking everything is on-prem and hard air-gapped. We
| don't have the luxury of cloud elasticity.
| Sparse embeddings are very efficient especially in terms
| of RAM and storage. Especially important when factoring
| in budgetary requirements. We're already dropping in
| eight H100s (minimum) so it starts to creep up fast...
|
| 3. Existing document/record management systems in the
| nuclear space are keyword search based if they have
| search at all. This has led to substantial user
| conditioning - they're not exactly used to what we'd call
| "semantic search". Sparse embeddings in combination with
| other techniques bridge that well.
|
| 4. Interpretability. It's nice to be able to peek at the
| embedding and be able to get something out of it at a
| glance.
|
| So it's basically a combination of efficiency,
| performance, and meeting users where they are. Our Fermi
| model series is still v1 but we've found performance (in
| every sense of the word) to be very good based on
| benchmarking and initial user testing.
|
| I should also add that some aspects of this (like
| pretrained BERT) are fairly compute-intense to train.
| Fortunately we work with the Department of Energy Oak
| Ridge National Laboratory and developed all of this on
| Frontier[1] (for free).
|
| [0] - https://huggingface.co/datasets/atomic-canyon/FermiBench
|
| [1] -
| https://en.wikipedia.org/wiki/Frontier_(supercomputer)
| inbread wrote:
| I've been experimenting with using embeddings for finding the
| relevant git commits, as I often don't know or remember the
| exact word that was used. So I created my own little tool for
| embedding and finding commits by commit messages. Maybe
| you'll also find it useful:
| https://github.com/adrianmfi/git-semantic-similarity
| chamomeal wrote:
| Very cool, I'll try this out!
| mgraczyk wrote:
| All modern AI technology can give more power to humans, you
| just have to use the right tools. Every AI tool I can think of
| has made me more productive.
|
| LLMs help me write code faster and understand new libraries,
| image generation helps me build sites and emails faster, etc
| attentive wrote:
| there is fzf, depending on your definition of "useful"
| imgabe wrote:
| Is there any benefit to fine-tuning a model on your corpus before
| using it to generate embeddings? Would that improve the quality
| of the matches?
| gunalx wrote:
| Yes. Especially if you work in a poorly supported language
| and/or have specific data pairs you want to match that might
| be out-of-the-ordinary text.
|
| Training your own fine-tune takes very little time and few
| GPU resources, and you can easily outperform even SOTA
| models on your specific problem with a smaller model/vector
| space.
|
| Then again, on general English text with basic fuzzy search,
| I would not really expect big performance gains.
| tomthe wrote:
| Nice introduction, but I think that ranking the models purely by
| their input token limits is not a useful exercise. Looking at the
| MTEB leaderboard is better (although a lot of the models are
| probably overfitting to their test set).
|
| This is a good time to shill for my visualization of 5
| million embeddings of HN posts, users and comments:
| https://tomthe.github.io/hackmap/
| kaycebasques wrote:
| Thanks, a couple other people gave me this same feedback in
| another comment thread and it definitely makes sense not to
| overindex on input token size. Will update that section in a
| bit.
| l5870uoo9y wrote:
| Are there any visualization libraries that visualize embeddings
| in a vector space?
| f_devd wrote:
| UMAP: https://umap-learn.readthedocs.io/en/latest/
|
| scikit-learn also has options:
| https://scikit-learn.org/stable/auto_examples/manifold/plot_...
| sk11001 wrote:
| There are attempts, but you can only do so much in
| hundreds/thousands of dimensions. Most of the time the
| visualization doesn't really provide anything meaningful.
| beejiu wrote:
| My instinct would be a principal component analysis (which
| someone has demonstrated here:
| https://www.youtube.com/watch?app=desktop&v=brt88wwoZtI). Not
| sure it would tell you much though, but it looks nice.
| OutOfHere wrote:
| If you need them visualized, you're already on the wrong track.
| adamgordonbell wrote:
| I was using embeddings to group articles by topic, and hit a
| specific issue. Say I had 10 articles about 3 topics, and
| articles are either dry or very casual in tone.
|
| I found clustering by topic was hard, because tone
| dimensions (whatever they were) seemed to dominate.
|
| How can you pull apart the embeddings? Maybe use an LLM to
| extract a topic, and then cluster by extracted topic?
|
| In the end I found it easier to just ask an LLM to group articles
| by topic.
| eamag wrote:
| I agree, I tried several methods during my pet project [1], and
| all of them have their pros and cons. Looks like creating
| topics first and predicting them using an LLM works best.
|
| [1] https://eamag.me/2024/Automated-Paper-Classification
| coredog64 wrote:
| Allegedly, the new hotness in RAG is exactly that. Use a
| smaller LLM to summarize the article and include that summary
| alongside the article when generating the embedding.
|
| Potentially solves your issue, but it is also handy when you
| have to chunk a larger document and would lose context from
| calculating the embedding just on the chunk.
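| A sketch of that summarize-then-embed pattern, with the LLM
| summarizer and the embedder both stubbed out (real code
| would call a small chat model and an embeddings API):

```python
# Prepend a document-level summary to each chunk before embedding,
# so chunk vectors keep global context.

def summarize(document):
    # Stand-in for an LLM call; real code would prompt a small model.
    return document.split(".")[0] + "."

def embed(text):
    # Stand-in embedding: character-frequency vector over a-z.
    vec = [0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec

document = "Embeddings capture meaning. They power semantic search."
summary = summarize(document)

chunks = [s.strip() + "." for s in document.split(".") if s.strip()]
vectors = [embed(summary + " " + chunk) for chunk in chunks]
print(len(vectors))
```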
| joerick wrote:
| The thing that puzzles me about embeddings is that they're so
| untargeted, they represent everything about the input string.
|
| Is there a method for dimensionality reduction of embeddings for
| different applications? Let's say I'm building a system to find
| similar tech support conversations and I am only interested in
| the content of the discussion, not the tone of it.
|
| How could I derive an embedding that represents only content and
| not tone?
| adamgordonbell wrote:
| Agreed.. biggest problem with off the shelf embeddings I hit.
| Need a way to decompose embeddings.
| johndough wrote:
| You can do math with word embeddings. A famous example (which I
| now see has also been mentioned in the article) is to compute
| the "woman vector" by subtracting "man" from "woman". You can
| then add the "woman vector" to e.g. the "king" vector to obtain
| a vector which is somewhat close to "queen".
|
| To adapt this to your problem of ignoring writing style in
| queries, you could collect a few text samples with different
| writing styles but same content to compute a "style direction".
| Then when you do a query for some specific content, subtract
| the projection of your query embedding onto the style direction
| to eliminate the style: query_without_style = query -
| dot(query, style_direction) * style_direction
|
| I suspect this also works with text embeddings, but you might
| have to train the embedding network in some special way to
| maximize the effectiveness of embedding arithmetic. Vector
| normalization might also be important, or maybe not. Probably
| depends on the training.
|
| Another approach would be to compute a "content direction"
| instead of a "style direction" and eliminate every aspect of a
| query that is not content. Depending on what kind of texts you
| are working with, data collection for one or the other
| direction might be easier or have more/fewer biases.
|
| And if you feel especially lazy when collecting data to compute
| embedding directions, you can generate texts with different
| styles using e.g. ChatGPT. This will probably not work as well
| as carefully handpicked texts, but you can make up for it with
| volume to some degree.
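| The projection step described above can be sketched with NumPy.
| The vectors here are made up for illustration; a real `style`
| direction would come from averaging embedding differences of
| same-content, different-style text pairs:

```python
import numpy as np

def remove_component(embedding, direction):
    """Subtract the projection of `embedding` onto `direction`,
    keeping only the part orthogonal to it."""
    d = direction / np.linalg.norm(direction)
    return embedding - np.dot(embedding, d) * d
```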
| joerick wrote:
| Interesting, but your hypothesis assumes that 'tone' is one-
| dimensional, that there is a single axis you can remove. I
| think tone is very multidimensional; I'd expect to be
| removing multiple 'directions' from the embedding.
| johndough wrote:
| You could of course compute multiple "tone" directions for
| every "tone" you can identify and subtract all of them. It
| might work better, but it will definitely be more work.
| jerf wrote:
| I would say rather that the "standard example" is
| simplified, but it does capture an essential truth about
| the vectors. The surprise is not that the real world is
| complicated and nothing is simply expressible as a vector
| and that treating it as such doesn't 100% work in every way
| in every circumstance all of the time. That's obvious.
| Everyone who might work with embeddings gets it, and if
| they don't, they soon will. The surprise is that it _does_
| work as well as it does and does seem to be capturing more
| than a naive skepticism would expect.
| mattnewton wrote:
| No, I don't think the author is saying one dimensional -
| the vectors are represented by magnitudes in almost all of
| the embedding dimensions.
|
| They are still a "direction" in the way that [0.5, 0.5] in
| x,y space is a 45 degree angle, and in that direction it
| has a magnitude of around 0.7
|
| So of course you could probably define some other vector
| space where many of the different labeled vectors are
| translated to magnitudes in the original embedding space,
| letting you do things like have a "tone" slider.
| TeMPOraL wrote:
| I think GP is saying that GGP assumes "tone" is one
| direction, in the sense there exists a vector V
| representing "tone direction", and you can scale "tone"
| independently by multiplying that vector with a scalar -
| hence, 1 dimension.
|
| I'd say this assumption is both right and wrong. Wrong,
| because it's unlikely there's a direction in embedding
| space corresponding to a platonic ideal of "tone". Right,
| because I suspect that, for sufficiently large embedding
| space (on the order of what goes into current LLMs), any
| continuous concept we can articulate will have a
| corresponding direction in the embedding space, that's
| roughly as sharp as our ability to precisely define the
| concept.
| loa_in_ wrote:
| They don't represent everything. In theory they do but in
| reality the choice of dimensions is a function of the model
| itself. It's unique to each model.
| joerick wrote:
| Yeah, 'everything' as in 'everything that the model cares
| about' :)
| macNchz wrote:
| Depends on the nature of the content you're working with, but
| I've had some good results using an LLM during indexing to
| generate a search document by rephrasing the original text in a
| standardized way. Then you can search against the embeddings of
| that document, and perhaps boost based on keyword similarity to
| the original text.
| joerick wrote:
| Nice workaround. I just wish there was a less 'lossy' way to
| go about it!
| jacobr1 wrote:
| Could you explicitly train a set of embeddings that
| performed that step in the process? For example which
| computing the loss, you compare the difference against the
| normalized text rather than the original. Or alternatively
| do this as a fine-tuning. Then you would have embedding
| that optimized for the characteristics you care about.
| hobs wrote:
| Normal full text search stuff helps reduce the search space
| - e.g. lemmatization, stemming, query simplification - all of
| which were around way before LLMs.
| mrshu wrote:
| This is also often referred to as Hypothetical Document
| Embeddings (https://arxiv.org/abs/2212.10496).
| adamgordonbell wrote:
| Do you have examples of this? Please say more!
| mrshu wrote:
| Though not exactly what you are after, Contextual Document
| Embeddings (https://huggingface.co/jxm/cde-small-v1), which
| generate embeddings based on "surrounding context" might be of
| some interest.
|
| With 281M params it's also relatively small (at least for an
| embedding model) so one can play with it relatively easily.
| nostrebored wrote:
| There are a few things you can do. If these access patterns are
| well known ahead of time, you can train subdomain behavior into
| the embedding models by using prefixing. E.g. "content: fixing a
| broken printer", "tone: frustration about broken printer", and
| "fixing a broken printer" can all be served by a single model.
|
| We have customers doing this in production in other contexts.
|
| If you have fundamentally different access patterns (e.g. doc
| -> doc retrieval instead of query -> doc retrieval) then it's
| often time to just maintain another embedding index with a
| different model.
| _pastel wrote:
| You could fine-tune the embedding model to reduce cosine
| distance on a more specific function.
| NameError wrote:
| This article really resonates with me - I've heard people (and
| vector database companies) describe transformer embeddings +
| vector databases as primarily a solution for "memory/context for
| your chatbot, to mitigate hallucinations", which seems like a
| really specific (and kinda dubious, in my experience) use case
| for a really general tool.
|
| I've found all of the RAG applications I've tried to be pretty
| underwhelming, but semantic search itself (especially combined
| with full-text search) is very cool.
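| One common way to combine the two is reciprocal rank fusion over
| the ranked lists each system returns; a sketch, not tied to any
| particular search engine:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one list,
    scoring each doc by the sum of 1/(k + rank) over the
    lists it appears in (rank is 1-based)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

| Documents ranked highly by both semantic and full-text search
| float to the top, without having to calibrate the two systems'
| raw scores against each other.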
| moffkalast wrote:
| I dare say RAG with vector DBs is underwhelming because
| embeddings are not underrated but appropriately rated, and will
| not give you relevant info in every case. In fact, the way LLMs
| retrieve info internally [0] already works along the same
| principle and is a large factor in their unreliability.
|
| [0] https://nonint.com/2023/10/18/is-the-reversal-curse-a-
| genera...
| dmezzetti wrote:
| Author of txtai (https://github.com/neuml/txtai) here. I've been
| in the embeddings space since 2020 before the world of
| LLMs/GenAI.
|
| In principle, I agree with much of the sentiment here. Embeddings
| can get you pretty far. If the goal is to find information and
| citations/links, you can accomplish most of that with a simple
| embeddings/vector search.
|
| GenAI does have an upside in that it can distill and process
| those results into something more refined. One of the main
| production use cases is retrieval augmented generation (RAG). The
| "R" is usually a vector search but doesn't have to be.
|
| As we see with things like ChatGPT search and Perplexity, there
| is a push towards using LLMs to summarize the results but also
| linking to the results to increase user confidence. Even Google
| Search now has that GenAI section at the top. In general, users
| just aren't going to accept LLM responses without source
| citations at this point. The question is if the summary provides
| value or if the citations really provide the most value. If it's
| the latter, then embeddings will get the job done.
| esafak wrote:
| Underrated by people who are unfamiliar with machine learning, maybe.
| vindex10 wrote:
| I actually tend to agree. In the article, I didn't see a
| strong argument highlighting exactly what powerful feature
| people were missing in relation to embeddings. Those who work
| in ML probably know these basics.
|
| It is a nice read though - explaining the basics of vector
| spaces, similarity and how it is used in modern ML
| applications.
| kaycebasques wrote:
| > Hopefully it's clear from the domain name and intro that
| I'm suggesting technical writers are underrating how useful
| embeddings can be in our work. I know ML practitioners do not
| underrate them.
|
| https://news.ycombinator.com/item?id=42014036
|
| > I didn't see the strong argument highlighting what powerful
| feature exactly people were missing in relation to embeddings
|
| I had to leave out specific applications as "an exercise for
| the reader" for various reasons. Long story short, embeddings
| provide a path to make progress on some of the fundamental
| problems of technical writing.
| lokar wrote:
| Even by ML people from 25 years ago. It's a black box function
| that maps from a ~30k space to a ~1k space. It's a better
| function than things like PCA, but does the same thing.
| kkielhofner wrote:
| LLMs have nearly completely sucked the oxygen out of the room
| when it comes to machine learning or "AI".
|
| I'm shocked at the number of startups, etc you see trying to do
| RAG, etc that basically have no idea what they are, how they
| actually work, etc.
|
| The "R" in RAG stands for retrieval - as in the entire field of
| information retrieval. But let's ignore that and skip right to
| the "G" (generative)...
|
| Garbage in, garbage out people!
| jonathanrmumm wrote:
| Embeddings are a new jump to universality, like the alphabet or
| numbers.
| https://thebeginningofinfinity.xyz/Jump%20to%20Universality
| OutOfHere wrote:
| Mind-blowing. In effect, among humans, what separates the
| civilized from the crude is the quest for universality among
| the civilized. To say it differently, thinking in terms of
| attaining universality is the mark of a civilized mind.
|
| I made an episode to appreciate the book:
| https://podcasters.spotify.com/pod/show/podgenai/episodes/Th...
| freediver wrote:
| What would be really cool if somebody figured out how to do
| embeddings -> text.
| kabla wrote:
| Is it not possible? I'm not that familiar with the topic. Doing
| some sort of averaging over a large corpus of separate texts
| could be interesting and probably would also have a lot of
| applications. Let's say that you are gathering feedback from a
| large group of people and want to summarize it in an anonymized
| way. I imagine you'd need embeddings with a somewhat large
| dimensionality though?
| cubefox wrote:
| I wonder if someone has already tried to do that. Though this
| might go in a similar direction:
| https://arxiv.org/abs/1711.00043
| 0x1ceb00da wrote:
| That's ChatGPT.
| kaibee wrote:
| Hmm as a very stupid first pass...
|
| 0. Generate an embedding of some text, so that you have a known
| good embedding, this will be your target.
|
| 1. Generate an array of random tokens the length of the
| response you want.
|
| 2. Compute the embedding of this response.
|
| 3. Pick a random sub-section of the response and randomize the
| tokens in it again.
|
| 4. Compute the embedding of your new response.
|
| 5. If the embeddings are closer together, keep your random
| changes, otherwise discard them, go back to step 2.
|
| 6. Repeat this process until going back to step 2 stops
| improving your score. Also you'll probably want to shrink the
| size of the sub-section you're randomizing the closer your
| computed embedding is to your target embedding. Also you might
| be able to be cleverer by doing some kind of masking strategy?
| Like let's say the first half of your response text already was
| actually the true text of the target embedding. An ideal
| randomizer would see that randomizing the first half almost
| always makes the result worse, and so would target the 2nd half
| more often (I'm hoping that embeddings work like this?).
|
| 7. Do this N times and use an LLM to score and discard the
| worst N-1 results. I expect that 99.9% of the time you're
| basically producing adversarial examples w/ this strategy.
|
| 8. Feed this last result into an LLM and ask it to clean it up.
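| The loop above can be sketched as follows, with a toy
| token-count "embedding" standing in for a real model (steps 7-8,
| the LLM scoring and cleanup, are left out):

```python
import random

def toy_embed(tokens):
    # Stand-in for a real embedding model: per-token counts.
    vocab = sorted(set("abcde"))
    return [tokens.count(t) for t in vocab]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hill_climb(target_emb, length, vocab="abcde", steps=2000, seed=0):
    """Randomize one token at a time, keeping changes that move
    the candidate's embedding closer to the target (steps 1-6)."""
    rng = random.Random(seed)
    text = [rng.choice(vocab) for _ in range(length)]
    best = distance(toy_embed(text), target_emb)
    for _ in range(steps):
        i = rng.randrange(length)
        old = text[i]
        text[i] = rng.choice(vocab)
        d = distance(toy_embed(text), target_emb)
        if d < best:
            best = d          # improvement: keep the change
        else:
            text[i] = old     # otherwise revert
    return text, best
```

| With a real embedding model each step is a forward pass, so this
| brute-force search gets expensive fast, which is why gradient-based
| inversion methods exist.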
| OutOfHere wrote:
| Reconstruct text from SONAR embeddings:
| https://github.com/facebookresearch/SONAR?tab=readme-ov-file...
| ericholscher wrote:
| This is a great post. I've also been having a lot of fun working
| with embeddings, with lots of those pages being documentation. We
| wrote up a quick post on how we are using them in prod, if you want
| to go from having an embedding to actually using them in a web
| app:
|
| https://www.ethicalads.io/blog/2024/04/using-embeddings-in-p...
| kaycebasques wrote:
| Thanks, Eric. So what you're really telling me is that you
| might make an exception to the "no tools talks" general policy
| for Write The Docs conference talks and let me nerd out on
| embeddings for 30 mins?? ;P
| ericholscher wrote:
| Haha. I think they are definitely relevant, and I'd call them
| a technology more than a tool.
|
| That is mostly just that we don't want folks going up and
| doing a 30 minute demo of Sphinx or something :-)
| huijzer wrote:
| > Is it terrible for the environment?
|
| > I don't know. After the model has been created (trained), I'm
| pretty sure that generating embeddings is much less
| computationally intensive than generating text. But it also seems
| to be the case that embedding models are trained in similar ways
| as text generation models, with all the energy usage that
| implies. I'll update this section when I find out more.
|
| Although I do care about the environment, this question is
| completely the wrong one if you ask me. There is in public
| opinion (mainstream media?) some kind of idea that we should use
| less AI and that somehow this would solve our climate problems.
|
| As a counterexample, let's go to the extreme. Let's ban Google
| Maps because it does take computational resources from the phone.
| As a result more people will take wrong routes, and thus use more
| petrol. Say you use one gallon of petrol extra, that then wastes
| 34 kWh. This is of course the equivalent of running 34 powerful
| vacuum cleaners on full power for an hour. In contrast, say you
| downloaded your map, then the total "cost" is only the power used
| by the phone. A mobile phone has a battery of about 4,000 mAh, so
| 4 Ah * 4.2 V = 16.8 Wh, or about 0.017 kWh. This means that the
| phone is about 2,000 times as efficient! And then we didn't
| even consider the time-saving for the human.
|
| It's the same with running embeddings for doc generation. An
| Nvidia H100 consumes about 700 W, so say 1 kWh after an hour of
| running. 1 kWh should be enough to do a bunch of embedding runs.
| If this then saves, for example, one workday including the
| driving back and forth to the office, then again the tradeoff is
| highly in favor of the compute.
| archerx wrote:
| If people really cared about the environment then they would
| ban residential air conditioning, it's a luxury.
|
| I lived somewhere that would get to 40c in the summers and an
| oscillating fan was good enough to keep cool, the AC was nice
| to have but it wasn't necessary.
|
| I find it very hypocritical when people tell you to change your
| lifestyle for climate change but they have the A/C blasting all
| day long, everyday.
| wussboy wrote:
| "Reduce" was never going to work. Only the deep
| electrification of our economy will save us.
| TeMPOraL wrote:
| Indeed, and A/C is kind of a prime example of why it's
| beneficial, given how much more energy-efficient heat pumps
| are for cooling and heating than just about anything else.
| ausbah wrote:
| in a somewhat ironic twist, ac is crucial survival
| infrastructure in some parts of the world when heat comes
| during hot seasons. phoenix usa, parts of india, etc
| coredog64 wrote:
| Prepare to have your mind blown: Heat pumps are very energy
| efficient at moving heat around. Significantly more so than
| the oil-fired boiler frequently found in the basement of a
| big-city apartment building.
| BeetleB wrote:
| > I lived somewhere that would get to 40c in the summers and
| an oscillating fan was good enough to keep cool
|
| If it's not too humid.
|
| I've lived in places with relatively high humidity and
| 35-40C, and have had the misfortune of not having an AC. Fans
| are not enough. I mean, sure, you can _survive_, but it
| really, really sucks.
| petesergeant wrote:
| > that would get to 40c in the summers and an oscillating fan
| was good enough to keep cool
|
| This is entirely meaningless without providing the humidity.
| At higher than 70% relative humidity 40C is potentially
| fatal.
| TeMPOraL wrote:
| > _I lived somewhere that would get to 40c in the summers and
| an oscillating fan was good enough to keep cool, the AC was
| nice to have but it wasn't necessary._
|
| As others have said, that works if you lived in a very dry
| area. And perhaps it was a house or a building optimized for
| airflow. And you didn't have much to do during the day. And
| I'm guessing you're young, and healthy.
|
| Here in central Europe, sustained 40degC in the summer would
| rack up a significant body count. Anything above 30degC sucks
| really bad if you have any work to do. Or if, say, you're
| pregnant or have small children. And you live in a city.
|
| Residential A/C isn't a luxury anymore, it's becoming a
| necessity very fast. Fortunately, heat pumps are one of the
| _most efficient_ inventions of humankind. In particular,
| "A/C blasting all day long" beats anything else you could do
| to mitigate the heat if it involved getting into a car. And
| then it also beats whatever else you're doing to heat your
| place during the winter.
| dleeftink wrote:
| Long-term, it's not about barring progress, but about having
| progress _and_ more energy-efficient models. The zero-sum games
| we play in regard to energy usage don't necessarily stack up in
| dynamic systems; the increased energy usage of generative models
| may very well lead to fewer compute hours spent behind the desk
| drafting, revising, redrafting and doing it all over again once
| the next project comes around.
|
| What remains, though, is that increased productivity has rarely
| led to a decrease in energy usage. Whether energy scarcity
| will drive model optimisation is anyone's guess, but it would
| be a differentiating feature on a market saturated with
| similarly capable offerings.
| ggnore7452 wrote:
| If anything, I would consider embeddings a bit overrated, or at
| least safer to underrate.
|
| They're not the silver bullet many initially hoped for, nor a
| complete replacement for simpler methods like BM25. They only
| have very limited "semantic understanding" (and as people throw
| increasingly large chunks into embedding models, the meanings
| can get even fuzzier).
|
| Overly high expectations lead people to believe that embeddings
| will retrieve exactly what they mean, and with larger top-k
| values and LLMs that are exceptionally good at rationalizing
| responses, it can be difficult to notice mismatches unless you
| examine the results closely.
| nostrebored wrote:
| Off-the-shelf embedding models definitely overpromise and
| underdeliver. In ten years I'd be very surprised if companies
| weren't fine-tuning embedding models for search based on their
| data in any competitive domains.
| kkielhofner wrote:
| My startup (Atomic Canyon) developed embedding models for the
| nuclear energy space[0].
|
| Let's just say that if you think off-the-shelf embedding
| models are going to work well with this kind of highly
| specialized content you're going to have a rough time.
|
| [0] - https://huggingface.co/atomic-canyon/fermi-1024
| kkielhofner wrote:
| > they're not a complete replacement for simpler methods like
| BM25
|
| There are embedding approaches that balance "semantic
| understanding" with BM25-ish.
|
| They're still pretty obscure outside of the information
| retrieval space but sparse embeddings[0] are the "most" widely
| used.
|
| [0] - https://zilliz.com/learn/sparse-and-dense-embeddings
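| For a sense of the data structure: a sparse embedding is a very
| high-dimensional vector that is mostly zeros, so it is typically
| stored as a term-to-weight map and scored with a dot product
| over the shared keys. A sketch with made-up weights:

```python
def sparse_dot(query, doc):
    """Dot product of two sparse vectors stored as {term: weight}
    maps; only terms present in both contribute, BM25-style."""
    if len(doc) < len(query):
        query, doc = doc, query  # iterate over the smaller map
    return sum(weight * doc.get(term, 0.0) for term, weight in query.items())
```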
| deepsquirrelnet wrote:
| Absolutely. Embeddings have been around a while and most people
| don't realize it wasn't until the e5 series of models from
| Microsoft that they even benchmarked as well as BM25 in
| retrieval scores, while being significantly more costly to
| compute.
|
| I think sparse retrieval with cross encoders doing reranking is
| still significantly better than embeddings. Embedding indexes
| are also difficult to scale since hnsw consumes too much memory
| above a few million vectors and ivfpq has issues with recall.
| mlinksva wrote:
| https://technicalwriting.dev/data/embeddings.html#let-a-thou...
|
| > As docs site owners, I wonder if we should start freely
| providing embeddings for our content to anyone who wants them,
| via REST APIs or well-known URIs. Who knows what kinds of cool
| stuff our communities can build with this extra type of data
| about our docs?
|
| Interesting idea. You'd have to specify the exact embedding model
| used to generate an embedding, right? Is there a well understood
| convention for such identification like say
| model_name:model_version:model_hash or something? For technical
| docs, obviously very broad field, is there an embedding model (or
| small number) widely used or obviously highly suitable that a
| site owner could choose one and have some reasonable expectation
| that publishing embeddings for their docs generated using that
| model would be useful to others? (Naive questions, I am not
| embedded in the field.)
| skybrian wrote:
| It seems like sharing the text itself would be a better API,
| since it lets API users calculate their own embeddings easily.
| This is what the crawlers for search engines do. If they use
| embeddings internally, that's up to them, and it doesn't need
| to be baked into the protocol.
| treefarmer wrote:
| Yeah, this is the main issue with the suggestion. Embeddings
| can only be compared to each other if they are in the same
| space (e.g., generated by the same model). Providing embeddings
| of a specific kind would require users to use the same model,
| which can quickly become problematic if you're using a closed-
| source embedding model (like OpenAI's or Cohere's).
| OutOfHere wrote:
| This article shows an incorrect value for the OpenAI text-
| embedding-3-large input limit, 3072, which is actually its
| output dimension [1]. The correct value is 8191 tokens [2].
|
| Edit: This value has now been fixed in the article.
|
| [1]
| https://platform.openai.com/docs/models/embeddings#embedding...
|
| [2]
| https://platform.openai.com/docs/guides/embeddings/#embeddin...
|
| Also, what each model means by a token can be very different due
| to the use of different model-specific encodings, so ultimately
| one must compare the number of characters, not tokens.
| kaycebasques wrote:
| A couple other issues with that section surfaced here:
|
| * https://news.ycombinator.com/item?id=42014683
|
| * https://news.ycombinator.com/item?id=42015282
|
| Updating that section now
| abound wrote:
| "Reckless" seems a bit aggressive for what is likely an honest
| mistake in an otherwise very nice article.
| OutOfHere wrote:
| Edited.
| tootie wrote:
| Is it accurate to say that any data that can be tokenized can be
| turned into embeddings?
| eproxus wrote:
| I wonder if this can be used to detect code similarity between
| e.g. function or files etc.? Or are the existing algorithms
| overly trained on written prose?
| OutOfHere wrote:
| Yes, of course it can be used in that way, but the quality of
| the result depends on whether the model was also trained on
| such code or not.
| hambandit wrote:
| Embeddings built from things like one-hot encoding, count
| vectorization, tf-idf, etc., fed into dimensionality reduction
| techniques like SVD and PCA, have been around for a long time
| and also provided the ability to compare any two pieces of text
| to each other. Yes, neural networks and LLMs have made it
| possible for the context of each word to affect the whole
| document's embedding and capture more meaning, potentially even
| that pesky "semantic" sort; but they are still fundamentally a
| dimensionality reduction technique.
___________________________________________________________________
(page generated 2024-11-01 23:01 UTC)