https://github.com/mukel/llama3.java
# Llama3.java

Practical Llama 3 inference implemented in a single Java file.

This project is the successor of llama2.java, based on llama2.c by Andrej Karpathy and his excellent educational videos.

Besides its educational value, this project will be used to test and tune compiler optimizations and features on the JVM, particularly for the Graal compiler.

## Features

 * Single file, no dependencies
 * GGUF format parser
 * Llama 3 tokenizer based on minbpe
 * Llama 3 inference with Grouped-Query Attention
 * Support for Q8_0 and Q4_0 quantizations
 * Fast matrix-vector multiplication routines for quantized tensors using Java's Vector API
 * Simple CLI with `--chat` and `--instruct` modes
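To illustrate the kind of parsing the GGUF feature involves: a GGUF file opens with a fixed little-endian header (the 4-byte magic `GGUF`, a `uint32` version, then `uint64` tensor and metadata counts) before the metadata key-value pairs and tensor data. The sketch below is illustrative, not the project's actual code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal sketch of GGUF header parsing (illustrative, not Llama3.java's actual code).
// A GGUF file starts with the 4-byte magic "GGUF", then a uint32 version,
// then uint64 tensor and metadata key-value counts, all little-endian.
public class GGUFHeader {
    static final int GGUF_MAGIC = 0x46554747; // bytes "GGUF" read as a little-endian uint32

    final int version;
    final long tensorCount;
    final long metadataKvCount;

    GGUFHeader(int version, long tensorCount, long metadataKvCount) {
        this.version = version;
        this.tensorCount = tensorCount;
        this.metadataKvCount = metadataKvCount;
    }

    static GGUFHeader parse(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        int magic = buf.getInt();
        if (magic != GGUF_MAGIC) {
            throw new IllegalArgumentException("Not a GGUF file");
        }
        return new GGUFHeader(buf.getInt(), buf.getLong(), buf.getLong());
    }

    public static void main(String[] args) {
        // Build a tiny in-memory header: magic, version=3, 2 tensors, 5 metadata entries.
        ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(GGUF_MAGIC).putInt(3).putLong(2).putLong(5).flip();
        GGUFHeader h = parse(buf);
        System.out.println(h.version + " " + h.tensorCount + " " + h.metadataKvCount);
    }
}
```

In the real project the header would be read from an mmap-ed `MemorySegment` rather than a heap `ByteBuffer`, but the byte layout is the same.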
Here's the interactive `--chat` mode in action:

*(screenshot: interactive `--chat` session)*

## Setup

Download pure `Q4_0` and (optionally) `Q8_0` quantized .gguf files from:
https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF

The ~4.3GB pure `Q4_0` quantized model is recommended; please be gentle with the huggingface.co servers:

```
curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf

# Optionally download the Q8_0 quantized model (~8GB)
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf
```

### Optional: quantize to pure `Q4_0` manually

In the wild, `Q8_0` quantizations are fine, but `Q4_0` quantizations are rarely pure, e.g. the `output.weights` tensor is quantized with `Q6_K` instead of `Q4_0`. A pure `Q4_0` quantization can be generated from a high-precision (F32, F16, BFLOAT16) .gguf source with the `quantize` utility from llama.cpp as follows:

```
./quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
```

## Build and run

Java 21+ is required, in particular the `MemorySegment` mmap-ing feature. jbang is a perfect fit for this use case; just run:

```
jbang Llama3.java --help
```

Or execute directly, also via jbang:

```
chmod +x Llama3.java
./Llama3.java --help
```

### Optional: Makefile + manually build and run

A simple Makefile is provided; run `make` to produce `llama3.jar`, or build manually:

```
javac -g --enable-preview -source 21 --add-modules jdk.incubator.vector -d target/classes Llama3.java
jar -cvfe llama3.jar com.llama4j.Llama3 LICENSE -C target/classes .
```

Run the resulting `llama3.jar` as follows:

```
java --enable-preview --add-modules jdk.incubator.vector -jar llama3.jar --help
```

## Performance

**Important note:** on GraalVM, the Graal compiler doesn't support the Vector API yet; run with `-Dllama.VectorAPI=false`, but expect sub-optimal performance. Vanilla OpenJDK 21+, which supports the Vector API, is recommended for now.
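For context on what "pure `Q4_0`" means at the byte level: each `Q4_0` block packs 32 weights into 18 bytes, a float16 scale followed by 16 bytes of two 4-bit quants each, with each weight recovered as `(q - 8) * scale`. The following is an illustrative sketch of that layout (not Llama3.java's actual routine), using `Float.float16ToFloat` from Java 20+:

```java
// Illustrative sketch of Q4_0 dequantization (not Llama3.java's actual routine).
// A Q4_0 block packs 32 weights into 18 bytes: a float16 scale `d` followed by
// 16 bytes holding two 4-bit quants each. Each weight decodes as (q - 8) * d.
public class Q4_0 {
    static final int BLOCK_SIZE = 32;       // weights per block
    static final int BYTES_PER_BLOCK = 18;  // 2 (fp16 scale) + 16 (packed nibbles)

    static float[] dequantizeBlock(byte[] block) {
        // fp16 scale, little-endian
        short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float d = Float.float16ToFloat(bits);
        float[] out = new float[BLOCK_SIZE];
        for (int i = 0; i < 16; i++) {
            int b = block[2 + i] & 0xFF;
            // llama.cpp layout: low nibbles hold weights 0..15, high nibbles 16..31
            out[i]      = ((b & 0x0F) - 8) * d;
            out[i + 16] = ((b >>> 4)  - 8) * d;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] block = new byte[BYTES_PER_BLOCK];
        block[0] = 0x00; block[1] = 0x3C;  // scale d = 1.0f encoded as fp16 (0x3C00)
        block[2] = (byte) 0xF0;            // low nibble 0 -> -8.0, high nibble 15 -> +7.0
        float[] w = dequantizeBlock(block);
        System.out.println(w[0] + " " + w[16]);
    }
}
```

This also explains the ~4.3GB file size: 18 bytes per 32 weights is ~4.5 bits per weight across 8B parameters, plus the non-`Q4_0` tensors and metadata.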
### llama.cpp

Vanilla llama.cpp built with `make -j 20`.

```
./main --version
version: 2879 (4f026363)
built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
```

Executed as follows:

```
./main -m ../Meta-Llama-3-8B-Instruct-Q4_0.gguf \
  -n 512 \
  -s 42 \
  -p "<|start_header_id|>user<|end_header_id|>Why is the sky blue?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" \
  --interactive-specials
```

The "eval time" metric was collected, in tokens/s.

### Llama3.java

Running on OpenJDK 21.0.2.

```
jbang Llama3.java \
  --model ./Meta-Llama-3-8B-Instruct-Q4_0.gguf \
  --max-tokens 512 \
  --seed 42 \
  --stream false \
  --prompt "Why is the sky blue?"
```

### Results

**Notebook: Intel 13900H 6pC+8eC/20T, 64GB (5200), Linux 6.6.26**

| Model | tokens/s | Implementation |
|-------|----------|----------------|
| Llama-3-8B-Instruct-Q4_0.gguf | 7.53 | llama.cpp |
| Llama-3-8B-Instruct-Q4_0.gguf | 6.95 | llama3.java |
| Llama-3-8B-Instruct-Q8_0.gguf | 5.16 | llama.cpp |
| Llama-3-8B-Instruct-Q8_0.gguf | 4.02 | llama3.java |

**Workstation: AMD 3950X 16C/32T, 64GB (3200), Linux 6.6.25**

Note: runs were pinned to a single CCD, e.g. `taskset -c 0-15 jbang Llama3.java ...`, since inference is constrained by memory bandwidth.

| Model | tokens/s | Implementation |
|-------|----------|----------------|
| Llama-3-8B-Instruct-Q4_0.gguf | 9.26 | llama.cpp |
| Llama-3-8B-Instruct-Q4_0.gguf | 8.03 | llama3.java |
| Llama-3-8B-Instruct-Q8_0.gguf | 5.79 | llama.cpp |
| Llama-3-8B-Instruct-Q8_0.gguf | 4.92 | llama3.java |

## License

MIT
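As a closing aside on why single-CCD pinning matters in the benchmarks: for a memory-bound decoder, every weight is streamed from RAM once per generated token, so tokens/s is roughly bounded by memory bandwidth divided by model size. A back-of-the-envelope sketch, where both numbers are illustrative assumptions (a ~4.3GB `Q4_0` model on dual-channel DDR4-3200) rather than measurements:

```java
// Back-of-the-envelope upper bound for memory-bound token generation:
// every parameter is streamed from RAM once per token, so
// tokens/s <= bandwidth / bytes-of-weights. Both figures are assumptions.
public class BandwidthBound {
    public static void main(String[] args) {
        double modelBytes = 4.3e9;   // ~4.3GB pure Q4_0 Llama-3-8B-Instruct
        double bandwidth  = 51.2e9;  // dual-channel DDR4-3200: 2 * 3200 MT/s * 8 bytes
        double maxTokensPerSecond = bandwidth / modelBytes;
        System.out.printf("upper bound: %.1f tokens/s%n", maxTokensPerSecond);
    }
}
```

Under these assumptions the bound is about 12 tokens/s, which is consistent with the ~8-9 tokens/s measured for `Q4_0` above: adding more cores beyond one CCD cannot help once the memory bus is saturated.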