[HN Gopher] Show HN: We made our own inference engine for Apple ...
___________________________________________________________________
Show HN: We made our own inference engine for Apple Silicon
We wrote our inference engine in Rust; it is faster than llama.cpp
in all of the use cases. Your feedback is very welcome. Written
from scratch with the idea that you can add support for any kernel
and platform.
Author : darkolorin
Score : 136 points
Date : 2025-07-15 11:29 UTC (11 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| sharifulin wrote:
| Wow! Sounds super interesting
| slavasmirnov wrote:
| that's exactly what we are looking for, so as not to waste money on
| APIs. Wonder how significant the trade-offs are
| TheMagicHorsey wrote:
| Amazing!
|
| How was your experience using Rust on this project? I'm
| considering a project in an adjacent space and I'm trying to
| decide between Rust, C, and Zig. Rust seems a bit burdensome with
| its complexity compared to C and Zig. Reminds me of C++ in its
| complexity (although not as bad). I find it difficult to walk
| through and understand a complicated Rust repository. I don't
| have that problem with C and Zig for the most part.
|
| But I'm wondering if I just need to invest more time in Rust. How
| was your learning curve with the language?
| adastra22 wrote:
| You are confusing familiarity with intrinsic complexity. I had
| 20 years of experience with C/C++ before switching to Rust a few
| years ago. After the initial hurdle, it is way easier and very
| simple to follow.
| ednevsky wrote:
| nice
| ewuhic wrote:
| >faster than llama.cpp in all of the use cases
|
| What's your deliberate, well-thought-out roadmap for achieving
| adoption similar to llama.cpp's?
| pants2 wrote:
| Probably getting acquired by Apple :)
| khurs wrote:
| Ollama is the leader, isn't it?
|
| Brew stats (downloads last 30 days):
|
| Ollama - 28,232
| llama.cpp - 7,826
| mintflow wrote:
| Just curious, will this be supported on iOS? It would be great to
| build a local LLM app with this project.
| AlekseiSavin wrote:
| already) https://github.com/trymirai/uzu-swift
| cwlcwlcwlingg wrote:
| Wondering why they used Rust rather than C++
| adastra22 wrote:
| Why use C++?
| khurs wrote:
| So C++ users don't need to learn something new.
| bee_rider wrote:
| I wonder why they didn't use Fortran.
| giancarlostoro wrote:
| ...or D? Or Go? Or Java? C#? Zig? They chose what they were
| most comfortable with. Rust is fine; it's clearly not for everyone,
| but those who use it produce high-quality software. I would argue
| the same for Go, without all the unnecessary mental overhead of C
| or C++.
| outworlder wrote:
| Why use C++ for greenfield projects?
| khurs wrote:
| The recommendation from the security agencies is to prefer Rust
| over C++, as there is less risk of exploits.
|
| Checked, and llama.cpp uses C++ (obviously) while Ollama uses Go.
| greggh wrote:
| "trymirai", every time I hear the word Mirai I think of the large
| IOT DDoS botnet. Maybe it's just me though.
| fnord77 wrote:
| I think of the goofy Toyota fuel-cell car. I think a grand
| total of about 6 have been sold (leased) in California.
| rnxrx wrote:
| I'm curious why the performance gains mentioned were so
| substantial for Qwen vs. Llama?
| AlekseiSavin wrote:
| it looks like llama.cpp has some performance issues with bf16
| homarp wrote:
| Can you explain the type of quantization you support?
|
| would https://docs.unsloth.ai/basics/kimi-k2-how-to-run-locally
| be faster with mirai?
| AlekseiSavin wrote:
| Right now we support AWQ, but we are working on various
| quantization methods in https://github.com/trymirai/lalamo
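|
| For readers unfamiliar with the term: AWQ-style formats store
| weights as low-bit integers plus a per-group scale. The sketch
| below is a generic group-wise int4 quantizer in Rust, an
| illustration of the idea only (the group size, symmetric rounding,
| and the [-7, 7] range are assumptions, not lalamo's actual layout):
|
|     // Generic group-wise 4-bit weight quantization sketch.
|     // Assumptions for illustration: one f32 scale per group and
|     // symmetric rounding into the int4 range [-7, 7].
|     fn quantize_group(weights: &[f32]) -> (Vec<i8>, f32) {
|         let max_abs = weights.iter().fold(0f32, |m, w| m.max(w.abs()));
|         let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
|         let q = weights
|             .iter()
|             .map(|w| (w / scale).round().clamp(-7.0, 7.0) as i8)
|             .collect();
|         (q, scale)
|     }
|
|     fn dequantize_group(q: &[i8], scale: f32) -> Vec<f32> {
|         q.iter().map(|&v| v as f32 * scale).collect()
|     }
|
|     fn main() {
|         let group: Vec<f32> = (0..32).map(|i| (i as f32 - 16.0) * 0.01).collect();
|         let (q, scale) = quantize_group(&group);
|         let deq = dequantize_group(&q, scale);
|         println!("scale = {scale}, w[0]: {} -> {}", group[0], deq[0]);
|     }
|
| Real AWQ additionally rescales salient channels based on
| activation statistics before quantizing, which is what the "A"
| (activation-aware) refers to.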
| smpanaro wrote:
| In practice, how often do the models use the ANE? It sounds like
| you are optimizing for speed, which in my experience always favors
| the GPU.
| AlekseiSavin wrote:
| You're right, modern edge devices are powerful enough to run
| small models, so the real bottleneck for a forward pass is
| usually memory bandwidth, which defines the upper theoretical
| limit for inference speed. Right now, we've figured out how to
| run computations in a granular way on specific processing
| units, but we expect the real benefits to come later when we
| add support for VLMs and advanced speculative decoding, where
| you process more than one token at a time.
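|
| A back-of-the-envelope way to see that bandwidth limit: for
| single-token decoding, every weight has to be streamed from memory
| roughly once per token, so tokens/s is capped near
| bandwidth / model size. A tiny Rust sketch with illustrative
| numbers (the bandwidth and model-size figures are assumptions, not
| measurements of uzu or any particular Mac):
|
|     // Rough ceiling on single-stream decode speed: each generated
|     // token reads (approximately) all model weights once.
|     fn max_tokens_per_second(bandwidth_gb_s: f64, model_size_gb: f64) -> f64 {
|         bandwidth_gb_s / model_size_gb
|     }
|
|     fn main() {
|         // Assumptions: ~100 GB/s usable bandwidth, a 7B-parameter
|         // model stored in bf16 (~14 GB of weights).
|         let ceiling = max_tokens_per_second(100.0, 14.0);
|         println!("theoretical ceiling: ~{ceiling:.1} tokens/s");
|     }
|
| Speculative decoding and batching help precisely because they
| amortize that one pass over the weights across several tokens.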
| J_Shelby_J wrote:
| VLMs = very large models?
| mmorse1217 wrote:
| Probably vision language models.
| skybrian wrote:
| What are the units on the benchmark results? I'm guessing higher
| is better?
| AlekseiSavin wrote:
| yeah, tokens per second
| dcreater wrote:
| Somewhat faster on small models. Requires a new format.
|
| Not sure what the goal is for this project. Not seeing how it
| presents enough of a benefit to get adopted by the community.
| koakuma-chan wrote:
| Written in Rust is a big one for me.
| worldsavior wrote:
| It's utilizing the Apple ANE and probably other optimizations
| provided by Apple's frameworks. Not sure if llama.cpp uses them,
| but if it doesn't, then the benchmark on GitHub says it all.
| zdw wrote:
| How does this bench compared to MLX?
| jasonjmcghee wrote:
| I use MLX in LM Studio and it doesn't have whatever issues
| llama.cpp is showing here.
|
| Qwen3-0.6B at 5 t/s doesn't make any sense. Something is
| clearly wrong for that specific model.
| giancarlostoro wrote:
| Hoping the author can answer; I'm still learning how this
| all works. My understanding is that inference is "using the
| model", so to speak. How is this faster than established inference
| engines specifically on Mac? Are models generic enough that if
| you built, e.g., an inference engine focused on AMD GPUs or even
| Intel GPUs, they would achieve reasonable performance? I always
| assumed that because Nvidia is king of AI you had to suck it up,
| or is it just that most inference engines being used are married
| to Nvidia?
|
| I would love to understand how universal these models can become.
| darkolorin wrote:
| Basically "faster" means better performance e.g. tokens/s
| without loosing quality (benchmarks scores for models). So when
| we say faster we provide more tokens per second than llama cpp.
| That means we effectively utilize hardware API available (for
| example we wrote our own kernels) to perform better.
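|
| For anyone unsure how that number is usually produced: it is just
| generated tokens divided by wall-clock decode time. A minimal
| measurement sketch in Rust (generate_next_token is a hypothetical
| stand-in for a call into whichever engine is being benchmarked):
|
|     use std::time::{Duration, Instant};
|
|     // Hypothetical stand-in for one decode step of the engine
|     // under test; the sleep just simulates some work.
|     fn generate_next_token() -> u32 {
|         std::thread::sleep(Duration::from_millis(10));
|         0
|     }
|
|     fn main() {
|         let n_tokens = 128;
|         let start = Instant::now();
|         for _ in 0..n_tokens {
|             let _token = generate_next_token();
|         }
|         let secs = start.elapsed().as_secs_f64();
|         println!("{:.1} tokens/s", n_tokens as f64 / secs);
|     }
|
| Quality is then checked separately by running the usual model
| benchmarks on the engine's output and comparing scores against a
| reference implementation.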
| nodesocket wrote:
| I just spun up an AWS EC2 g6.xlarge instance to do some LLM work.
| The GPU is an NVIDIA L4 24GB and it costs $0.8048 per hour. Starting
| to think about switching to an Apple mac2-m2.metal instance at
| $0.878 per hour. The big question is that the Mac instance only has
| 24GB of unified memory.
| khurs wrote:
| Unified memory doesn't compare to an Nvidia GPU; the latter is
| much better.
|
| It just depends on what performance level you need.
| floam wrote:
| How does this compare to https://github.com/Anemll/Anemll?
| zackangelo wrote:
| We also wrote our inference engine in Rust for mixlayer, happy to
| answer any questions from those trying to do the same.
|
| Looks like this uses ndarray and mpsgraph (which I did not know
| about!); we opted to use candle instead.
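|
| For anyone weighing the same choice: candle exposes a small,
| Torch-like tensor API in Rust. A minimal sketch of what that looks
| like (the Metal-device call and fallback are from memory, so treat
| this as an approximation and check candle's docs and feature
| flags):
|
|     use candle_core::{Device, Result, Tensor};
|
|     fn main() -> Result<()> {
|         // Prefer the Metal backend on Apple Silicon, fall back to CPU.
|         let device = Device::new_metal(0).unwrap_or(Device::Cpu);
|
|         // A tiny matmul, just to show the tensor API.
|         let a = Tensor::new(&[[1f32, 2.], [3., 4.]], &device)?;
|         let b = a.matmul(&a)?;
|         println!("{:?}", b.to_vec2::<f32>()?);
|         Ok(())
|     }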
| khurs wrote:
| Have you added it to Homebrew and other package managers yet?
|
| Also, any app deployed to prod but developed on a Mac needs to be
| consistent, i.e. work on Linux/in a container.
___________________________________________________________________
(page generated 2025-07-15 23:01 UTC)