[HN Gopher] Show HN: Wordllama - Things you can do with the toke...
       ___________________________________________________________________
        
       Show HN: Wordllama - Things you can do with the token embeddings of
       an LLM
        
       After working with LLMs for long enough, I found myself wanting a
       lightweight utility for doing various small tasks to prepare
       inputs, locate information and create evaluators. This library is
       two things: a very simple model and utilities that run inference
       with it (e.g. fuzzy deduplication). The target platform is CPU,
       and it's intended to be light, fast and pip installable -- a
       library that lowers the barrier to working with strings
       _semantically_. You don't need to install pytorch to use it, or
       any deep learning runtimes.

       How can this be accomplished? The model is simply token
       embeddings that are average pooled. To create this model, I
       extracted token embedding (nn.Embedding) vectors from LLMs,
       concatenated them along the embedding dimension, added a
       learnable weight parameter, and projected them to a smaller
       dimension. Using the sentence transformers framework and
       datasets, I trained the pooled embeddings with multiple negatives
       ranking loss and matryoshka representation learning so they can
       be truncated. After training, the weights and projections are no
       longer needed, because there are no contextual calculations. I
       run inference over the entire token vocabulary and save the new
       token embeddings to be loaded with numpy. While the results are
       not impressive compared to transformer models, they perform well
       on MTEB benchmarks compared to word embedding models (which they
       are most similar to), while being much smaller in size (the
       smallest model, with a 32k vocab and 64 dimensions, is only 4MB).
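
       As a rough sketch of what that inference path looks like
       (illustrative names only, not necessarily the library's exact
       API): look up the static token vectors in a numpy matrix and
       average pool them.

           import numpy as np

           # embeddings: (vocab_size, dim) matrix of trained token vectors
           # token_ids: integer ids from any tokenizer for the input text
           def embed(embeddings, token_ids):
               vectors = embeddings[token_ids]         # gather static vectors
               pooled = vectors.mean(axis=0)           # average pool over tokens
               return pooled / np.linalg.norm(pooled)  # normalize for cosine use
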
       On the utility side, I've been adding some tools that I think it
       will be useful for. In addition to general embedding, there are
       algorithms for ranking, filtering, clustering, deduplicating and
       similarity. Some of them have a cython implementation, and I'm
       continuing to work on benchmarking and improving them as I have
       time. In addition to "standard" models that use cosine similarity
       for some algorithms, there are binarized models that use hamming
       distance. This is a slightly faster similarity computation, with
       significantly less memory per embedding (float32 -> 1 bit).
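
       A minimal sketch of that binarized comparison (again illustrative,
       not the packaged implementation): sign-binarize the pooled vector,
       pack it into bits, and compare with XOR plus a popcount.

           import numpy as np

           def binarize(vec):
               # 1 bit per dimension: positive -> 1, else 0, packed into uint8
               return np.packbits(vec > 0)

           def hamming(a_bits, b_bits):
               # XOR the packed bytes, then count the differing bits
               return int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())

       Lower hamming distance means more similar, and each dimension
       costs 1 bit instead of 4 bytes.
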
       Hope you enjoy it, and find it useful. PS I haven't figured out
       Windows builds yet, but Linux and Mac are supported.
        
       Author : deepsquirrelnet
       Score  : 319 points
       Date   : 2024-09-15 03:25 UTC (19 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | dspoka wrote:
       | Looks cool! Any advantages over the mini-lm model? It seems
       | better on most MTEB tasks, but I'm wondering if maybe inference
       | or something else is better.
        
         | lennxa wrote:
         | looks like it's the size of the model itself, more lightweight
         | and faster. mini-lm is 80mb while the smallest one here is
         | 16mb.
        
           | authorfly wrote:
           | Mini-lm isn't optimized to be as small as possible though,
           | and is kind of dated. It was trained on a tiny number of
           | similarity pairs compared to what we have available today.
           | 
           | As of the last time I did it in 2022, Mini-lm can be
           | distilled down to 40mb with only limited loss in accuracy,
           | as can paraphrase-MiniLM-L3-v1 (down to 21mb), by reducing
           | the dimensions by half or more and projecting through a
           | custom optimized matrix (optionally including domain-
           | specific or more recent training pairs). I imagine today
           | you could get it down to 32mb (= project to ~156 dim)
           | without accuracy loss.
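           | 
           | The dimension-reduction part of that can be sketched with a
           | plain PCA projection over precomputed sentence embeddings
           | (a toy illustration of the idea, not the custom-trained
           | projection described above):
           | 
           |   import numpy as np
           |   from sklearn.decomposition import PCA
           | 
           |   # embs: (n_sentences, 384) float array from an existing encoder
           |   embs = np.random.randn(1000, 384).astype(np.float32)  # stand-in
           | 
           |   pca = PCA(n_components=156)       # project 384 -> ~156 dims
           |   reduced = pca.fit_transform(embs)
           |   # later: reduced_query = pca.transform(query_emb[None, :])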
        
             | byefruit wrote:
             | What are some recent sources for high quality similarity
             | pairs?
        
         | deepsquirrelnet wrote:
         | Mini-lm is a better embedding model. This model does not
         | perform attention calculations, or use a deep learning
         | framework after training. You won't get the contextual benefits
         | of transformer models in this one.
         | 
         | It's not meant to be a state of the art model though. I've put
         | in pretty limiting constraints in order to keep dependencies,
         | size and hardware requirements low, and speed high.
         | 
         | Even for a word embedding model it's quite lightweight, as
         | those have much larger vocabularies and are typically a few
         | gigabytes.
        
           | ryeguy_24 wrote:
           | Which do use attention? Any recommendations?
        
             | deepsquirrelnet wrote:
             | Most current models are transformer encoders that use
             | attention. I like most of the options that ollama provides.
             | 
             | I think this one is currently at the top of the MTEB
             | leaderboard, but it has large-dimension vectors and is a
             | multi-billion parameter model:
             | https://huggingface.co/nvidia/NV-Embed-v1
        
             | nostrebored wrote:
             | Depends immensely on use case -- what are your compute
             | limitations? are you fine with remote code? are you doing
             | symmetric or asymmetric retrieval? do you need support in
             | one language or many languages? do you need to work on just
             | text or (audio, video, image)? are you working in a
             | specific domain?
             | 
             | A lot of people wind up choosing models based purely on
             | one or two benchmarks and then viewing embedding-based
             | projects as a failure.
             | 
             | If you do answer some of those I'd be happy to give my
             | anecdotal feedback :)
        
               | ryeguy_24 wrote:
               | Sorry, I wasn't clear. I was speaking about utility
               | models/libraries to compute things like meaning
               | similarity with not just token embeddings but with
               | attention too. I'm really interested in finding a good
               | utility that leverages the transformer to compute
               | "meaning similarity" between two texts.
        
       | authorfly wrote:
       | Nice. I like the tiny size a lot, that's already an advantage
       | over SBERTs smallest models.
       | 
       | But it seems quite dated technically - which I understand is a
       | tradeoff for performance. Can you provide a way to toggle
       | between different types of similarity (e.g. semantic, NLI,
       | noun-abstract)?
       | 
       | E.g. I sometimes want "Freezing" and "Burning" to be very
       | similar (1), as in, say, grouping/clustering articles in a
       | newspaper into categories like "Extreme environmental events",
       | like on MTEB/Sentence-Similarity, as classic Word2Vec/GloVe
       | would do. But if this was a chemistry article, I want them to
       | be opposite, like ChatGPT embeddings would be. And sometimes I
       | want to use NLI embeddings to work out the causal link between
       | two things. Because the latter two embedding types are more
       | recent (2019+), they are where the technical opportunity is,
       | not the older MTEB/semantic-similarity ones, which have been
       | performant enough for many use cases since 2014 and got a big
       | boost in 2019 with mini-lm-v2 etc.
       | 
       | For the above 3 embedding types I can use SBERT, but the
       | dimensions are large, the models are quite large, and having
       | to load multiple models for different similarity types is
       | straining on resources. It often takes about 6GB, because
       | generative embedding models (or E5 etc) are large, as are NLI
       | models.
        
         | deepsquirrelnet wrote:
         | Great ideas - I'll run some experiments and see how feasible it
         | is. I'd want to see how performance is if I train on a single
         | type of similarity. Without any contextual computation, I am
         | not sure there are other options for doing it. It may require
         | switching between models, but that's not much of an issue.
        
         | refulgentis wrote:
         | It's a 17 MB model that benchmarks obviously worse than
         | MiniLM v2 (which is SBERT). I run V3 on ONNX on every
         | platform you can think of with a 23 MB model.
         | 
         | I don't intend for that to be read as dismissive; it's just
         | important to understand work like this in context. Here, the
         | context is a cool trick: if you get to an advanced
         | understanding of LLMs, you notice they have embeddings too,
         | and if that is your lens, it's much more straightforward to
         | take a step forward and mess with those than to take a step
         | back and survey the state of embeddings.
        
         | curl-up wrote:
         | I assume that by "ChatGPT embeddings" you mean OpenAI embedding
         | models. In that case, "burning" and "freezing" are not opposite
         | at all, with a cosine similarity of 0.46 (running on text-
         | embedding-large-3 with 1024 dimensions). "Perfectly opposite"
         | embeddings would have a similarity of -1.
         | 
         | It's a common mistake people make, thinking that words that
         | have the opposite meaning will have opposite embeddings.
         | Instead, words with opposite meanings have _a lot_ in
         | common: both "burning" and "freezing" are related to
         | temperature and physics, they're both English words, they're
         | both words that can be a verb, a noun and an adjective (not
         | that many such words), they're both spelled correctly, etc.
         | All these features end up being part of the embedding.
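         | 
         | For anyone who wants to reproduce a number like the 0.46
         | above on their own vectors, cosine similarity over two
         | precomputed embeddings (a and b here are just placeholder
         | arrays) is:
         | 
         |   import numpy as np
         | 
         |   def cosine(a, b):
         |       # -1 = opposite direction, 1 = identical direction
         |       return float(np.dot(a, b) /
         |                    (np.linalg.norm(a) * np.linalg.norm(b)))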
        
           | magicalhippo wrote:
           | This might be a dumb question but... if I get the embeddings
           | of words with a common theme like "burning", "warm", "cool",
           | "freezing", would I be able to relatively well fit an arc (or
           | line) between them? So that if I interpolate along that
           | arc/line, I get vectors close to "hot" and "cold"?
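           | 
           | One way to test that with any embedding model (vec() below
           | is a placeholder for whatever embedding function you use):
           | linearly interpolate between the endpoint vectors and check
           | which words land nearest the midpoints.
           | 
           |   def lerp(a, b, t):
           |       # point a fraction t of the way from a to b
           |       return (1 - t) * a + t * b
           | 
           |   # e.g. cosine(lerp(vec("freezing"), vec("burning"), 0.5),
           |   #             vec("warm"))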
        
       | anonymousfilter wrote:
       | Has anyone thought of using embeddings to solve Little Alchemy?
       | #sample-use
        
         | batch12 wrote:
         | Looks like someone remade https://neal.fun/infinite-craft/
        
           | wakaru44 wrote:
           | I thought it was the other way around. First little alchemy,
           | then they used an LLM to create a better version of it.
        
             | batch12 wrote:
             | That's possible- I may have assumed incorrectly
        
       | jcmeyrignac wrote:
       | Any plans for languages other than English? This would be a
       | perfect tool for the French language.
        
         | deepsquirrelnet wrote:
         | It's certainly feasible. I'd need to put together a corpus
         | for training, and I'm not terribly familiar with what's
         | available for the French language.
         | 
         | I have done some training with the Mistral family of models,
         | and that's probably what I'd think to try first on a French
         | corpus.
         | 
         | Feel free to open an issue and I'll work on it as I find time.
        
       | ttpphd wrote:
       | This is great for game making! Thank you!
        
       | warangal wrote:
       | Embeddings capture a lot of semantic information based on the
       | training data and objective function, and can be used
       | independently for a lot of useful tasks.
       | 
       | I used to use embeddings from the text encoder of the CLIP
       | model to augment the prompt to better match corresponding
       | images. For example, given a word like "building" in the
       | prompt, I would find the nearest neighbors in the embedding
       | matrix, like "concrete", "underground" etc., and
       | substitute/append those after the corresponding word. This led
       | to higher recall for most of the queries in my limited
       | experiments!
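       | 
       | That nearest-neighbor lookup sketches out to something like
       | the following (illustrative names; any encoder's word/token
       | embedding matrix would do):
       | 
       |   import numpy as np
       | 
       |   # emb_matrix: (vocab, dim) unit-normalized word embeddings
       |   # vocab: list of words aligned with the rows of emb_matrix
       |   def nearest_neighbors(word_vec, emb_matrix, vocab, k=3):
       |       word_vec = word_vec / np.linalg.norm(word_vec)
       |       sims = emb_matrix @ word_vec   # cosine, rows are normed
       |       top = np.argsort(-sims)[:k]
       |       return [vocab[i] for i in top]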
        
         | nostrebored wrote:
         | Yup, and you can train these in-domain contextual relationships
         | into the embedding models.
         | 
         | https://www.marqo.ai/blog/generalized-contrastive-learning-f...
        
         | deepsquirrelnet wrote:
         | That's a really cool idea. I'll think about it some more,
         | because it sounds like a feasible implementation for this. I
         | think if you take the magnitude of any token embedding in
         | wordllama, it might also help identify important tokens to
         | augment. But it might work a lot better if trained on data
         | selected for this task.
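         | 
         | The magnitude idea sketches out to something like this,
         | assuming direct access to the raw token embedding matrix
         | (not an existing wordllama function):
         | 
         |   import numpy as np
         | 
         |   # embeddings: (vocab_size, dim) token embedding matrix
         |   # token_ids: tokenized input
         |   def important_tokens(embeddings, token_ids, k=3):
         |       norms = np.linalg.norm(embeddings[token_ids], axis=1)
         |       order = np.argsort(-norms)[:k]   # largest magnitude first
         |       return [token_ids[i] for i in order]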
        
       | visarga wrote:
       | This shows just how much semantic content is embedded in the
       | tokens themselves.
        
       | Der_Einzige wrote:
       | I wrote a set of "language games" which used a similar set of
       | functions years ago:
       | https://github.com/Hellisotherpeople/Language-games
        
       ___________________________________________________________________
       (page generated 2024-09-15 23:01 UTC)