[HN Gopher] Show HN: Real-time AI (audio/video in, voice out) on...
___________________________________________________________________
Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with
Gemma E2B
Related: https://news.ycombinator.com/item?id=47653752
Author : karimf
Score : 258 points
Date : 2026-04-05 17:53 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dvt wrote:
| Solid work and great showcase, I've done a bunch of stuff with
| Kokoro and the latency is incredible. So crazy how badly Apple
| dropped the ball... feels like your demo should be a Siri demo (I
| mean that in the most complimentary way possible).
| karimf wrote:
| Thank you. This reminds me of a paragraph from the LatentSpace
| newsletter [0]
|
| > The excellent on device capabilities makes one wonder if
| these are the basis for the models that will be deployed in New
| Siri under the deal with Apple....
|
| [0] https://www.latent.space/p/ainews-gemma-4-the-best-small-mul...
| k-almuraee wrote:
| Amazing, love your work.
| zerop wrote:
| I have been looking forward to building something like this using
| open models: a voice assistant I can talk to while I am driving, as
| I have a long commute. I use ChatGPT voice mode and it works
| great for querying information or for discussions. But I want it to
| do tasks like browsing the web, acting as a social media manager for
| my business, etc.
| jwr wrote:
| That is very, very interesting. I've been hoping to have an
| assistant in the workshop (hands-free!) that I could talk to and
| have it help me with simple tasks: timers, calculating, digging
| up notes, etc. -- basically, what the phone assistants were
| supposed to be, but aren't.
|
| "You will have to unlock your iphone first" is kind of a deal-
| breaker when you are in the middle of mixing polyurethane resin
| and have gloves and a mask on.
|
| More and more I find that we have the technology, but the
| supposedly "tech" companies are the gatekeepers, preventing us
| from using the technological advances and holding us back years
| behind the state of the art.
|
| I'll be trying this out on my Macbook, looks very promising!
| mentalgear wrote:
| You might be interested in the open-source
| https://www.home-assistant.io/voice-pe/ .
| QuercusMax wrote:
| I've been replacing my Google Homes and Chromecasts with
| Snapcast streamers, and this is the next thing I've been
| planning to look into.
|
| It's truly absurd how the Google voice assistant USED to work
| properly for setting timers, playing music, etc, and then
| they had to break it 15 times and finally replace it with
| much slower AI that only kinda does what you want. I'm done.
|
| Selfhosted is the way to go if you want to keep your sanity.
| My wife has basically given up on any Google/Apple voice
| assistants being able to do anything useful above "set a 10
| minute timer".
| huijzer wrote:
| > More and more I find that we have the technology, but the
| supposedly "tech" companies are the gatekeepers
|
| Yes, same with RSS readers being dropped by large companies.
| Worked too well, I guess!
| gtowey wrote:
| The computing power we all have in our pockets is staggering.
| It could be a tool that truly makes our lives easier, but instead
| it's mostly a device that is frustrating to use. Companies have
| decided to make it simply another conduit for advertising. It's
| a tool for them to sell us more stuff. Basic usability be
| damned.
| jamilton wrote:
| Siri does have a setting that'll activate it if you say "hey
| siri" while the phone is locked. Obvious privacy and battery
| usage concerns though, and it's still Siri, so it's a little
| clunky.
| jwr wrote:
| Mhm. I think I use that. But then I say "call my wife" and it
| says "you'll need to unlock your iphone first".
|
| It's clear Tim Cook doesn't ever try to use Siri wearing
| gloves. Or ever, for that matter :-)
| mft_ wrote:
| Siri (on iOS 18, at least) will call people for me without
| unlocking, in response to a voice command only - I just
| double-checked...
| divan wrote:
| Can someone quickly vibe code a native macOS app for that so it
| doesn't require running terminal commands and searching for that
| browser tab? (: (also for iOS, pls)
| duartefdias wrote:
| Would you pay $2 for that native macOS desktop app?
| est wrote:
| I am making something similar. Also been using Kokoro for TTS.
| Very cool project!
|
| Gemma 4 is kinda too heavyweight even with E2B. I am sticking
| with qwen 0.8B at the moment.
| logicallee wrote:
| It might interest people to know you can also easily fine-tune
| the text portion of this specific model (E2B) to behave however
| you want! I fine-tuned it to talk like a pirate but you can get
| it to do anything you have (or can generate) training data for.
| (This wouldn't make it to the text to speech portion though.) So
| you can easily train it to act a certain way or give certain
| types of responses.
|
| Video: https://www.youtube.com/live/WuCxWJhrkIM
|
| Generated writeup:
| https://taonexus.com/publicfiles/apr2026/pirate-gemma-journa...
| magzter wrote:
| This is so cool, I'm always speaking to people about how the
| advancement in the SOTA hosted AIs is also happening in the
| local model space, i.e. the SOTA hosted AI models 6-12 months ago
| are what we're seeing now being able to run locally on average
| hardware - this is such an amazing way to actually demo it.
| an0n-elem wrote:
| Cool work buddy:)
| myultidevhq wrote:
| This is really impressive for running locally on an M3 Pro. The
| latency looks surprisingly good for real-time audio and video
| input.
|
| Curious about one thing though, how does it handle switching
| between languages? I work with both Greek and English daily and
| local models usually struggle with that.
|
| Great work, bookmarking this.
| karimf wrote:
| During my limited testing, it works better than I expected at
| handling multiple languages in a single session. Perhaps I just
| had low expectations since I've mostly worked with English-
| only STT models.
| crsAbtEvrthng wrote:
| If I run this without internet connection it says "loading..." at
| the bottom of the localhost site and won't work.
|
| If I run this with internet connected it works flawlessly. Even
| if I disconnect my internet afterwards it still goes on working
| fine.
|
| Why does there have to be an internet connection established at the
| time I open the localhost site when all of this should be working
| purely on device?
|
| Despite this, I am really impressed that this actually works
| so fast with video input on my M4 Pro 48 GB.
| karimf wrote:
| Huh that's weird. I just tried it and it works on my machine.
| Could you perhaps create a GitHub issue and share the
| reproduction steps and any relevant logs?
| crsAbtEvrthng wrote:
| Don't have the time right now but will play around with it
| next weekend for sure and will give you more feedback with
| logs when I see that I can reproduce it.
|
| For now what I did was:
|
| - Tested in Chrome/Safari/Firefox on Tahoe.
|
| - Followed the quick start install instructions from github
| repo
|
| - Everything worked
|
| - Closed terminal
|
| - Disconnected internet (Wifi off)
|
| - Opened terminal
|
| - Started server again (uv run server.py)
|
| - Opened localhost in browser, it asked for camera/mic
| normally, granted access, saw camera live feed but
| "loading..." at bottom center of the site and AI did not
| listen/respond
|
| - Reproduced this about 3 times with switching between wifi
| on/off before starting the server, always the same (working
| with internet; not working without)
|
| - Figured it also works fine if I start the server with
| internet connected and disconnect it afterwards
| rubicon33 wrote:
| Is there anything unique here happening for the video aspect or
| is it just taking snapshots over and over?
|
| I've been looking for a good video summarizing / understanding
| model!
| karimf wrote:
| Nothing unique, it's just taking a snapshot when it's
| processing the input. Even processing a single image will
| increase the TTFT (time to first token) by ~0.5s on my machine, so
| for now, it seems impossible to feed in live video and expect a
| real-time response.
|
| In regards to the video capability, I haven't tested it myself,
| but here's a benchmark/comparison from Google [0]
|
| [0] https://huggingface.co/blog/gemma4#video-understanding
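
The snapshot-per-turn approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `Frame`, `LatestFrameHolder`, and `build_turn` are hypothetical names, and the image bytes stand in for a real encoded camera frame. The idea is that the camera side keeps overwriting a single "latest frame" slot, and only one image is attached when a user turn is processed, which keeps the per-turn image cost to a single frame.

```python
import threading
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Frame:
    data: bytes         # encoded image bytes (placeholder for a real JPEG)
    captured_at: float  # seconds, from time.monotonic() unless supplied


class LatestFrameHolder:
    """Keeps only the most recent camera frame; older frames are dropped."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._frame: Optional[Frame] = None

    def publish(self, data: bytes, captured_at: Optional[float] = None) -> None:
        # Called by the capture side for every new frame; overwrites the slot.
        ts = time.monotonic() if captured_at is None else captured_at
        with self._lock:
            self._frame = Frame(data, ts)

    def snapshot(self) -> Optional[Frame]:
        # Called once per user turn; returns the latest frame, if any.
        with self._lock:
            return self._frame


def build_turn(audio: bytes, holder: LatestFrameHolder) -> dict:
    """Bundle one user turn: the utterance audio plus at most one image."""
    frame = holder.snapshot()
    return {"audio": audio, "images": [frame.data] if frame else []}
```

Because intermediate frames are simply discarded, the model never sees motion between turns, which is the limitation discussed in the replies.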
| rubicon33 wrote:
| I totally get these are very hard problems to solve and that
| we're on the bleeding edge of what's possible, but I can't
| help but wonder when someone is going to crack real video
| understanding.
|
| Sure, maybe it's still frame-by-frame, but so fast and so
| often that the model retains a rolling context of what's
| going on and can cleanly answer temporal questions.
|
| "How many packages were delivered over the last hour?", etc.
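
The "rolling context" idea in this comment could look something like the sketch below. It only covers the bookkeeping, under the loud assumption that some vision model has already turned each sampled frame into a short text caption; `Observation`, `RollingContext`, and `count_matching` are hypothetical names, and keyword counting is a crude stand-in for the model answering the temporal question itself.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class Observation:
    timestamp: float  # seconds since an arbitrary epoch
    caption: str      # per-frame description (assumed to come from a vision model)


class RollingContext:
    """Time-bounded buffer of frame captions; assumes non-decreasing timestamps."""

    def __init__(self, max_age_s: float) -> None:
        self.max_age_s = max_age_s
        self._items: Deque[Observation] = deque()

    def add(self, obs: Observation) -> None:
        self._items.append(obs)
        # Drop observations older than max_age_s relative to the newest one.
        while self._items and obs.timestamp - self._items[0].timestamp > self.max_age_s:
            self._items.popleft()

    def since(self, now: float, window_s: float) -> List[Observation]:
        # Observations captured within the last window_s seconds.
        return [o for o in self._items if 0 <= now - o.timestamp <= window_s]


def count_matching(ctx: RollingContext, now: float, window_s: float, keyword: str) -> int:
    """Count captions in the window that mention keyword (case-insensitive)."""
    return sum(keyword.lower() in o.caption.lower() for o in ctx.since(now, window_s))
```

In practice the windowed captions would more likely be handed back to the language model as context rather than keyword-matched, and consecutive frames of the same event would need de-duplication before any count is meaningful.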
| noodlebreak wrote:
| I have to try it out on my idle laptops. I've been meaning to run
| some models on them for low cost tasks that need AI - like
| sorting and filtering photos from 100s of thousands that I have
| amassed over the years. And applying general size reduction
| compression to the filtered ones.
|
| Btw if anyone has already created such a pipeline/workflow using
| such models, please lmk!
| spwa4 wrote:
| I've been trying to do this, but I can't get voice recognition to
| work fast enough (meaning live) with Gemma E2B, on either an M1
| Max (64GB), a 5060 Ti (16GB) or a Snapdragon 8 Gen 2.
|
| Any pointers?
| inzlab wrote:
| Real-time AI sounds like the future.
___________________________________________________________________
(page generated 2026-04-06 23:01 UTC)