https://github.com/dleemiller/WordLlama
dleemiller / WordLlama

Things you can do with the token embeddings of an LLM

WordLlama

WordLlama is a fast, lightweight NLP toolkit that handles tasks like fuzzy deduplication, similarity and ranking. It has minimal inference-time dependencies and is optimized for CPU hardware.

Table of Contents

* Quick Start
* What is it?
* MTEB Results
* Embed Text
* Training Notes
* Roadmap
* Extracting Token Embeddings
* Citations
* License

Quick Start

Install:

    pip install wordllama

Load the 256-dim model.

    from wordllama import WordLlama

    # Load the default WordLlama model
    wl = WordLlama.load()

    # Calculate similarity between two sentences
    similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
    print(similarity_score)  # Output: 0.06641249096796882

    # Rank documents based on their similarity to a query
    query = "i went to the car"
    candidates = ["i went to the park", "i went to the shop", "i went to the truck", "i went to the vehicle"]
    ranked_docs = wl.rank(query, candidates)
    print(ranked_docs)
    # Output:
    # [
    #   ('i went to the vehicle', 0.7441646856486314),
    #   ('i went to the truck', 0.2832691551894259),
    #   ('i went to the shop', 0.19732814982305436),
    #   ('i went to the park', 0.15101404519322253)
    # ]

    # Additional inference methods
    wl.deduplicate(candidates, threshold=0.8)  # fuzzy deduplication
    # `docs` is not defined in this snippet; supply your own list of strings to cluster
    wl.cluster(docs, k=5, max_iterations=100, tolerance=1e-4)  # labels using kmeans/kmeans++ init
    wl.filter(query, candidates, threshold=0.3)  # filter candidates based on query
    wl.topk(query, candidates, k=3)  # return topk strings based on query

What is it?

WordLlama is an NLP utility and word embedding model that recycles components from large language models (LLMs) to create efficient and compact word representations, in the same family as GloVe, Word2Vec or FastText. WordLlama begins by extracting the token embedding codebook from a state-of-the-art LLM (e.g., Llama3 70B), and then trains a small context-less model in a general purpose embedding framework.

WordLlama improves on word models like GloVe 300d across all MTEB benchmarks, while being substantially smaller in size (16MB for the default 256-dim model vs >2GB).

Features of WordLlama include:

1. Matryoshka Representations: Truncate the embedding dimension as needed.
2. Low Resource Requirements: A simple token lookup with average pooling lets it run fast on CPU.
3. Binarization: Models trained using the straight-through estimator can be packed to small integer arrays for even faster Hamming distance calculations. (coming soon)
4. Numpy-only inference: Lightweight and simple.

For flexibility, WordLlama employs the Matryoshka representation learning training technique. The largest model (1024-dim) can be truncated to 64, 128, 256 or 512. For binary embedding models, we implement straight-through estimators during training. For dense embeddings, 256 dimensions capture most of the performance, while for binary embeddings validation accuracy is close to saturation at 512 dimensions (64 bytes packed).

The final weights are saved after weighting, projection and truncation of the entire tokenizer vocabulary. Thus, WordLlama becomes a single embedding matrix (nn.Embedding) that is considerably smaller than the gigabyte-sized LLM codebooks we start with. The original tokenizer is still used to preprocess the text into tokens, and the reduced-size token embeddings are average pooled. There is very little computation required, and the resulting model sizes range from 16MB to 250MB for the 128k Llama3 vocabulary.

It's a good option for some NLP-lite tasks. You can train sklearn classifiers on it, perform basic semantic matching, fuzzy deduplication, ranking and clustering. I think it should work well for creating LLM output evaluators, or other preparatory tasks involved in multi-hop or agentic workflows. You can perform your own LLM surgery and train your own model on consumer GPUs in a few hours. Because it is fast and portable, it makes a good "Swiss-Army Knife" utility for exploratory analysis and utility applications.
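
The inference path described above (token lookup, average pooling, Matryoshka truncation and optional binarization) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not WordLlama's actual implementation: the toy table `TABLE`, the helper names `mean_pool`, `truncate`, `cosine`, `binarize_pack` and `hamming_similarity`, the sign threshold at zero, and the similarity definition (1 minus the fraction of differing bits) are all hypothetical. It uses only the standard library to stay self-contained, whereas the library itself is described as NumPy-based.

```python
import math

# Hypothetical toy embedding table: token id -> 8-dim vector.
# A real table would hold one row per tokenizer vocabulary entry.
TABLE = {
    0: [1.0, -1.0, 0.5, -0.5, 2.0, -2.0, 0.0, 1.0],
    1: [3.0, 1.0, -0.5, 0.5, 0.0, -2.0, 2.0, -1.0],
    2: [-1.0, 2.0, 1.5, -0.5, 1.0, 0.5, -2.0, 0.0],
}


def mean_pool(table, token_ids):
    """Look up each token's row and average them column by column."""
    rows = [table[t] for t in token_ids]
    return [sum(col) / len(rows) for col in zip(*rows)]


def truncate(vec, dim):
    """Matryoshka-style truncation: keep only the first `dim` components."""
    return vec[:dim]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def binarize_pack(vec):
    """Threshold at zero (an assumption) and pack the bits into one int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming_similarity(a, b, dim):
    """1 minus the fraction of differing bits (assumed definition)."""
    return 1.0 - bin(a ^ b).count("1") / dim


if __name__ == "__main__":
    # Token ids would normally come from the original tokenizer.
    doc_a = mean_pool(TABLE, [0, 1])
    doc_b = mean_pool(TABLE, [1, 2])
    print(cosine(truncate(doc_a, 4), truncate(doc_b, 4)))
    print(hamming_similarity(binarize_pack(doc_a), binarize_pack(doc_b), 8))
```

Packing one bit per dimension is what makes a 512-dim binary embedding fit in the 64 bytes mentioned above (512 / 8 = 64).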

MTEB Results (l2_supercat)

    Metric               WL64   WL128  WL256 (X)  WL512  WL1024  GloVe 300d  Komninos  all-MiniLM-L6-v2
    Clustering           30.27  32.20  33.25      33.40  33.62   27.73       26.57     42.35
    Reranking            50.38  51.52  52.03      52.32  52.39   43.29       44.75     58.04
    Classification       53.14  56.25  58.21      59.13  59.50   57.29       57.65     63.05
    Pair Classification  75.80  77.59  78.22      78.50  78.60   70.92       72.94     82.37
    STS                  66.24  67.53  67.91      68.22  68.27   61.85       62.46     78.90
    CQA DupStack         18.76  22.54  24.12      24.59  24.83   15.47       16.79     41.32
    SummEval             30.79  29.99  30.99      29.56  29.39   28.87       30.49     30.81

The l2_supercat is a Llama2-vocabulary model. To train this model, I concatenated codebooks from several models, including Llama2 70B and phi3 medium (after removing additional special tokens). Because several models have used the Llama2 tokenizer, their codebooks can be concatenated and trained together. Performance of the resulting model is comparable to training the Llama3 70B codebook, while being 4x smaller (32k vs 128k vocabulary).

Other Models

Results

Llama3-based: l3_supercat

Embed Text

Here's how you can load pre-trained embeddings and use them to embed text:

    from wordllama import WordLlama

    # Load pre-trained embeddings
    # truncate dimension to 64
    wl = WordLlama.load(trunc_dim=64)

    # Embed text
    embeddings = wl.embed(["the quick brown fox jumps over the lazy dog", "and all that jazz"])
    print(embeddings.shape)  # (2, 64)

Binary embedding models can be used like this:

    # Binary embeddings are packed into uint64
    # 64-dims => array of 1x uint64
    wl = WordLlama.load(trunc_dim=64, binary=True)  # this will download the binary model from huggingface
    wl.embed("I went to the car")  # Output: array([[3029168427562626]], dtype=uint64)

    # Load a binary model trained with the straight-through estimator
    wl = WordLlama.load(dim=1024, binary=True)

    # Binary models use Hamming similarity
    similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
    print(similarity_score)  # Output: 0.57421875

    ranked_docs = wl.rank("i went to the car", ["van", "truck"])

    wl.binary = False  # turn off hamming and use cosine

    # Load a different model class
    wl = WordLlama.load(config="l3_supercat", dim=1024)  # downloads model from HF

Training Notes

Binary embedding models showed more pronounced improvement at higher dimensions, and either 512 or 1024 is recommended for binary embedding.

L2 Supercat was trained using a batch size of 512 on a single A100 for 12 hours.

Roadmap

* Working on adding inference features:
  + Semantic text splitting
* Add example notebooks
  + DSPy evaluators
  + RAG pipelines

Extracting Token Embeddings

To extract token embeddings from a model, ensure you have agreed to the user agreement and logged in using the Hugging Face CLI (for llama3 models). You can then use the following snippet:

    from wordllama.extract import extract_safetensors

    # Extract embeddings for the specified configuration
    extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

HINT: Embeddings are usually in the first safetensors file, but not always. Sometimes there is a manifest; sometimes you have to snoop around and figure it out.

For training, use the scripts in the GitHub repo. You have to add a configuration file (copy and modify an existing one in the folder).

    $ pip install wordllama[train]
    $ python train.py train --config your_new_config
    (training stuff happens)
    $ python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/
    (saves 1 model per matryoshka dim)

Citations

If you use WordLlama in your research or project, please consider citing it as follows:

    @software{miller2024wordllama,
      author = {Miller, D. Lee},
      title = {WordLlama: Recycled Token Embeddings from Large Language Models},
      year = {2024},
      url = {https://github.com/dleemiller/wordllama},
      version = {0.2.6}
    }

License

This project is licensed under the MIT License.