https://github.com/mukel/llama3.java
# Llama3.java

Practical Llama 3 inference implemented in a single Java file.

This project is the successor of llama2.java, based on llama2.c by Andrej Karpathy and his excellent educational videos.

Besides its educational value, this project will be used to test and tune compiler optimizations and features on the JVM, particularly for the Graal compiler.

## Features

 * Single file, no dependencies
 * GGUF format parser
 * Llama 3 tokenizer based on minbpe
 * Llama 3 inference with Grouped-Query Attention
 * Support for Q8_0 and Q4_0 quantizations
 * Fast matrix-vector multiplication routines for quantized tensors using Java's Vector API
 * Simple CLI with `--chat` and `--instruct` modes
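To illustrate the kind of parsing the GGUF feature involves: a GGUF file opens with a fixed little-endian header (the 4-byte magic `GGUF`, a `uint32` version, then `uint64` tensor and metadata counts) before the metadata key-value pairs and tensor data. The sketch below is illustrative, not the project's actual code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal sketch of GGUF header parsing (illustrative, not Llama3.java's actual code).
// A GGUF file starts with the 4-byte magic "GGUF", then a uint32 version,
// then uint64 tensor and metadata key-value counts, all little-endian.
public class GGUFHeader {
    static final int GGUF_MAGIC = 0x46554747; // bytes "GGUF" read as a little-endian uint32

    final int version;
    final long tensorCount;
    final long metadataKvCount;

    GGUFHeader(int version, long tensorCount, long metadataKvCount) {
        this.version = version;
        this.tensorCount = tensorCount;
        this.metadataKvCount = metadataKvCount;
    }

    static GGUFHeader parse(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        int magic = buf.getInt();
        if (magic != GGUF_MAGIC) {
            throw new IllegalArgumentException("Not a GGUF file");
        }
        return new GGUFHeader(buf.getInt(), buf.getLong(), buf.getLong());
    }

    public static void main(String[] args) {
        // Build a tiny in-memory header: magic, version=3, 2 tensors, 5 metadata entries.
        ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(GGUF_MAGIC).putInt(3).putLong(2).putLong(5).flip();
        GGUFHeader h = parse(buf);
        System.out.println(h.version + " " + h.tensorCount + " " + h.metadataKvCount);
    }
}
```

In the real project the header would be read from an mmap-ed `MemorySegment` rather than a heap `ByteBuffer`, but the byte layout is the same.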
Here's the interactive `--chat` mode in action:

*(screenshot: interactive `--chat` session)*

## Setup

Download pure `Q4_0` and (optionally) `Q8_0` quantized .gguf files from:
https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF

The ~4.3GB pure `Q4_0` quantized model is recommended; please be gentle with the huggingface.co servers:

```
curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf

# Optionally download the Q8_0 quantized model (~8GB)
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf
```

### Optional: quantize to pure `Q4_0` manually

In the wild, `Q8_0` quantizations are fine, but `Q4_0` quantizations are rarely pure, e.g. the `output.weights` tensor is quantized with `Q6_K` instead of `Q4_0`. A pure `Q4_0` quantization can be generated from a high-precision (F32, F16, BFLOAT16) .gguf source with the `quantize` utility from llama.cpp as follows:

```
./quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
```

## Build and run

Java 21+ is required, in particular the `MemorySegment` mmap-ing feature. jbang is a perfect fit for this use case; just run:

```
jbang Llama3.java --help
```

Or execute directly, also via jbang:

```
chmod +x Llama3.java
./Llama3.java --help
```

### Optional: Makefile + manually build and run

A simple Makefile is provided; run `make` to produce `llama3.jar`, or build manually:

```
javac -g --enable-preview -source 21 --add-modules jdk.incubator.vector -d target/classes Llama3.java
jar -cvfe llama3.jar com.llama4j.Llama3 LICENSE -C target/classes .
```

Run the resulting `llama3.jar` as follows:

```
java --enable-preview --add-modules jdk.incubator.vector -jar llama3.jar --help
```

## Performance

**Important note:** on GraalVM, the Graal compiler doesn't support the Vector API yet; run with `-Dllama.VectorAPI=false`, but expect sub-optimal performance. Vanilla OpenJDK 21+, which supports the Vector API, is recommended for now.
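For context on what "pure `Q4_0`" means at the byte level: each `Q4_0` block packs 32 weights into 18 bytes, a float16 scale followed by 16 bytes of two 4-bit quants each, with each weight recovered as `(q - 8) * scale`. The following is an illustrative sketch of that layout (not Llama3.java's actual routine), using `Float.float16ToFloat` from Java 20+:

```java
// Illustrative sketch of Q4_0 dequantization (not Llama3.java's actual routine).
// A Q4_0 block packs 32 weights into 18 bytes: a float16 scale `d` followed by
// 16 bytes holding two 4-bit quants each. Each weight decodes as (q - 8) * d.
public class Q4_0 {
    static final int BLOCK_SIZE = 32;       // weights per block
    static final int BYTES_PER_BLOCK = 18;  // 2 (fp16 scale) + 16 (packed nibbles)

    static float[] dequantizeBlock(byte[] block) {
        // fp16 scale, little-endian
        short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float d = Float.float16ToFloat(bits);
        float[] out = new float[BLOCK_SIZE];
        for (int i = 0; i < 16; i++) {
            int b = block[2 + i] & 0xFF;
            // llama.cpp layout: low nibbles hold weights 0..15, high nibbles 16..31
            out[i]      = ((b & 0x0F) - 8) * d;
            out[i + 16] = ((b >>> 4)  - 8) * d;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] block = new byte[BYTES_PER_BLOCK];
        block[0] = 0x00; block[1] = 0x3C;  // scale d = 1.0f encoded as fp16 (0x3C00)
        block[2] = (byte) 0xF0;            // low nibble 0 -> -8.0, high nibble 15 -> +7.0
        float[] w = dequantizeBlock(block);
        System.out.println(w[0] + " " + w[16]);
    }
}
```

This also explains the ~4.3GB file size: 18 bytes per 32 weights is ~4.5 bits per weight across 8B parameters, plus the non-`Q4_0` tensors and metadata.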
### llama.cpp

Vanilla llama.cpp built with `make -j 20`.

```
./main --version
version: 2879 (4f026363)
built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
```

Executed as follows:

```
./main -m ../Meta-Llama-3-8B-Instruct-Q4_0.gguf \
  -n 512 \
  -s 42 \
  -p "<|start_header_id|>user<|end_header_id|>Why is the sky blue?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" \
  --interactive-specials
```

The "eval time" metric was collected, in tokens/s.

### Llama3.java

Running on OpenJDK 21.0.2.

```
jbang Llama3.java \
  --model ./Meta-Llama-3-8B-Instruct-Q4_0.gguf \
  --max-tokens 512 \
  --seed 42 \
  --stream false \
  --prompt "Why is the sky blue?"
```

### Results

**Notebook: Intel 13900H 6pC+8eC/20T, 64GB (5200), Linux 6.6.26**

| Model | tokens/s | Implementation |
|-------|----------|----------------|
| Llama-3-8B-Instruct-Q4_0.gguf | 7.53 | llama.cpp |
| Llama-3-8B-Instruct-Q4_0.gguf | 6.95 | llama3.java |
| Llama-3-8B-Instruct-Q8_0.gguf | 5.16 | llama.cpp |
| Llama-3-8B-Instruct-Q8_0.gguf | 4.02 | llama3.java |

**Workstation: AMD 3950X 16C/32T, 64GB (3200), Linux 6.6.25**

Note: runs were pinned to a single CCD, e.g. `taskset -c 0-15 jbang Llama3.java ...`, since inference is constrained by memory bandwidth.

| Model | tokens/s | Implementation |
|-------|----------|----------------|
| Llama-3-8B-Instruct-Q4_0.gguf | 9.26 | llama.cpp |
| Llama-3-8B-Instruct-Q4_0.gguf | 8.03 | llama3.java |
| Llama-3-8B-Instruct-Q8_0.gguf | 5.79 | llama.cpp |
| Llama-3-8B-Instruct-Q8_0.gguf | 4.92 | llama3.java |

## License

MIT
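As a closing aside on why single-CCD pinning matters in the benchmarks: for a memory-bound decoder, every weight is streamed from RAM once per generated token, so tokens/s is roughly bounded by memory bandwidth divided by model size. A back-of-the-envelope sketch, where both numbers are illustrative assumptions (a ~4.3GB `Q4_0` model on dual-channel DDR4-3200) rather than measurements:

```java
// Back-of-the-envelope upper bound for memory-bound token generation:
// every parameter is streamed from RAM once per token, so
// tokens/s <= bandwidth / bytes-of-weights. Both figures are assumptions.
public class BandwidthBound {
    public static void main(String[] args) {
        double modelBytes = 4.3e9;   // ~4.3GB pure Q4_0 Llama-3-8B-Instruct
        double bandwidth  = 51.2e9;  // dual-channel DDR4-3200: 2 * 3200 MT/s * 8 bytes
        double maxTokensPerSecond = bandwidth / modelBytes;
        System.out.printf("upper bound: %.1f tokens/s%n", maxTokensPerSecond);
    }
}
```

Under these assumptions the bound is about 12 tokens/s, which is consistent with the ~8-9 tokens/s measured for `Q4_0` above: adding more cores beyond one CCD cannot help once the memory bus is saturated.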