[HN Gopher] Show HN: Real-time AI (audio/video in, voice out) on...
       ___________________________________________________________________
        
       Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with
       Gemma E2B
        
       Related: https://news.ycombinator.com/item?id=47653752
        
       Author : karimf
       Score  : 258 points
       Date   : 2026-04-05 17:53 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | dvt wrote:
       | Solid work and great showcase, I've done a bunch of stuff with
       | Kokoro and the latency is incredible. So crazy how badly Apple
       | dropped the ball... feels like your demo should be a Siri demo (I
       | mean that in the most complimentary way possible).
        
         | karimf wrote:
         | Thank you. This reminds me of a paragraph from the LatentSpace
         | newsletter [0]
         | 
         | > The excellent on device capabilities makes one wonder if
         | these are the basis for the models that will be deployed in New
         | Siri under the deal with Apple....
         | 
         | [0] https://www.latent.space/p/ainews-gemma-4-the-best-small-mul...
        
       | k-almuraee wrote:
       | Amazing, love your work.
        
       | zerop wrote:
       | I have been looking forward to building something like this using
       | open models: a voice assistant I can talk to while driving, since
       | I have a long commute. I do use ChatGPT voice mode, and it works
       | great for looking up information or for discussions. But I want it
       | to do tasks like browsing the web, acting as a social media manager
       | for my business, etc.
        
       | jwr wrote:
       | That is very, very interesting. I've been hoping to have an
       | assistant in the workshop (hands-free!) that I could talk to and
       | have it help me with simple tasks: timers, calculating, digging
       | up notes, etc. -- basically, what the phone assistants were
       | supposed to be, but aren't.
       | 
       | "You will have to unlock your iPhone first" is kind of a
       | deal-breaker when you are in the middle of mixing polyurethane
       | resin and have gloves and a mask on.
       | 
       | More and more I find that we have the technology, but the
       | supposedly "tech" companies are the gatekeepers, preventing us
       | from using the technological advances and holding us back years
       | behind the state of the art.
       | 
       | I'll be trying this out on my MacBook, looks very promising!
        
         | mentalgear wrote:
         | You might be interested in the open-source
         | https://www.home-assistant.io/voice-pe/
        
           | QuercusMax wrote:
           | I've been replacing my Google Homes and Chromecasts with
           | Snapcast streamers, and this is the next thing I've been
           | planning to look into.
           | 
           | It's truly absurd how the Google voice assistant USED to work
           | properly for setting timers, playing music, etc, and then
           | they had to break it 15 times and finally replace it with
           | much slower AI that only kinda does what you want. I'm done.
           | 
           | Selfhosted is the way to go if you want to keep your sanity.
           | My wife has basically given up on any Google/Apple voice
           | assistants being able to do anything useful above "set a 10
           | minute timer".
        
         | huijzer wrote:
         | > More and more I find that we have the technology, but the
         | supposedly "tech" companies are the gatekeepers
         | 
         | Yes, same with RSS readers being dropped by large companies.
         | Worked too well, I guess!
        
         | gtowey wrote:
         | The computing power we all have in our pockets is staggering.
         | It could be a tool that truly makes our lives easier, but instead
         | it's mostly a device that is frustrating to use. Companies have
         | decided to make it simply another conduit for advertising. It's
         | a tool for them to sell us more stuff. Basic usability be
         | damned.
        
         | jamilton wrote:
         | Siri does have a setting that'll activate it if you say "hey
         | siri" while the phone is locked. Obvious privacy and battery
         | usage concerns though, and it's still Siri, so it's a little
         | clunky.
        
           | jwr wrote:
           | Mhm. I think I use that. But then I say "call my wife" and it
           | says "you'll need to unlock your iPhone first".
           | 
           | It's clear Tim Cook doesn't ever try to use Siri wearing
           | gloves. Or ever, for that matter :-)
        
             | mft_ wrote:
             | Siri (on iOS 18, at least) will call people for me without
             | unlocking, in response to a voice command only - I just
             | double-checked...
        
       | divan wrote:
       | Can someone quickly vibe code a native macOS app for that so it
       | doesn't require running terminal commands and searching for that
       | browser tab? (: (also for iOS, pls)
        
         | duartefdias wrote:
         | Would you pay $2 for that native macOS desktop app?
        
       | est wrote:
       | I am making something similar. Also been using Kokoro for TTS.
       | Very cool project!
       | 
       | Gemma 4 is kinda too heavyweight even with E2B. I am sticking
       | with qwen 0.8B at the moment.
        
       | logicallee wrote:
       | It might interest people to know you can also easily fine-tune
       | the text portion of this specific model (E2B) to behave however
       | you want! I fine-tuned it to talk like a pirate but you can get
       | it to do anything you have (or can generate) training data for.
       | (This wouldn't make it to the text to speech portion though.) So
       | you can easily train it to act a certain way or give certain
       | types of responses.
       | 
       | Video: https://www.youtube.com/live/WuCxWJhrkIM
       | 
       | Generated writeup:
       | https://taonexus.com/publicfiles/apr2026/pirate-gemma-journa...
        
       | magzter wrote:
       | This is so cool. I'm always telling people that the advances in
       | the SOTA hosted AIs are also happening in the local model space,
       | i.e. the SOTA hosted models of 6-12 months ago are what we now see
       | running locally on average hardware. This is such an amazing way
       | to actually demo it.
        
       | an0n-elem wrote:
       | Cool work buddy:)
        
       | myultidevhq wrote:
       | This is really impressive for running locally on an M3 Pro. The
       | latency looks surprisingly good for real-time audio and video
       | input.
       | 
       | Curious about one thing though, how does it handle switching
       | between languages? I work with both Greek and English daily and
       | local models usually struggle with that.
       | 
       | Great work, bookmarking this.
        
         | karimf wrote:
         | During my limited testing, it works better than I expected at
         | handling multiple languages in a single session. Perhaps I just
         | had low expectations since I've mostly worked with
         | English-only STT models.
        
       | crsAbtEvrthng wrote:
       | If I run this without internet connection it says "loading..." at
       | the bottom of the localhost site and won't work.
       | 
       | If I run this with internet connected it works flawlessly. Even
       | if I disconnect my internet afterwards it still goes on working
       | fine.
       | 
       | Why does there have to be an internet connection at the time I
       | open the localhost site when all of this should be running purely
       | on device?
       | 
       | Despite this, I am really impressed that this actually works so
       | fast with video input on my M4 Pro 48 GB.
        
         | karimf wrote:
         | Huh that's weird. I just tried it and it works on my machine.
         | Could you perhaps create a GitHub issue and share the
         | reproduction steps and any relevant logs?
        
           | crsAbtEvrthng wrote:
           | Don't have the time right now but will play around with it
           | next weekend for sure and will give you more feedback with
           | logs when I see that I can reproduce it.
           | 
           | For now what I did was:
           | 
           | - Tested in Chrome/Safari/Firefox on Tahoe.
           | 
           | - Followed the quick start install instructions from the
           | GitHub repo
           | 
           | - Everything worked
           | 
           | - Closed terminal
           | 
           | - Disconnected internet (Wifi off)
           | 
           | - Opened terminal
           | 
           | - Started server again (uv run server.py)
           | 
           | - Opened localhost in browser, it asked for camera/mic
           | normally, granted access, saw camera live feed but
           | "loading..." at bottom center of the site and AI did not
           | listen/respond
           | 
           | - Reproduced this about 3 times with switching between wifi
           | on/off before starting the server, always the same (working
           | with internet; not working without)
           | 
           | - Figured it also works fine if I start the server with
           | internet connected and disconnect it afterwards
        
       | rubicon33 wrote:
       | Is there anything unique here happening for the video aspect or
       | is it just taking snapshots over and over?
       | 
       | I've been looking for a good video summarizing / understanding
       | model!
        
         | karimf wrote:
         | Nothing unique, it just takes a snapshot when it processes the
         | input. Even processing a single image increases the TTFT by
         | ~0.5s on my machine, so for now it seems impossible to feed in
         | live video and expect a real-time response.
         | 
         | Regarding the video capability, I haven't tested it myself,
         | but here's a benchmark/comparison from Google [0]
         | 
         | [0] https://huggingface.co/blog/gemma4#video-understanding
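
Editor's sketch of the "one snapshot per processed input" approach described above. This is not the project's code: `LatestFrame`, `Frame`, and `build_turn` are hypothetical names, and the frame bytes stand in for whatever encoded image the camera pipeline produces. The idea is that the camera overwrites a single slot, and an image is read only when an utterance is about to be sent to the model, so at most one image is added per turn.

```python
import threading
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    data: bytes          # encoded image bytes (placeholder for a real JPEG/PNG)
    captured_at: float   # time.monotonic() value when the frame arrived


class LatestFrame:
    """Holds only the most recent camera frame; older frames are overwritten."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._frame: Optional[Frame] = None

    def update(self, data: bytes, now: Optional[float] = None) -> None:
        # Called by the capture loop for every incoming frame.
        ts = time.monotonic() if now is None else now
        with self._lock:
            self._frame = Frame(data, ts)

    def snapshot(self) -> Optional[Frame]:
        # Called once per user turn, right before the model is invoked.
        with self._lock:
            return self._frame


def build_turn(audio: bytes, camera: LatestFrame) -> dict:
    """Bundle one utterance with at most one image (hypothetical payload shape)."""
    frame = camera.snapshot()
    return {"audio": audio, "image": frame.data if frame else None}
```

Because only the latest frame is kept, the per-turn cost stays at a single image no matter how fast the camera runs, which matches the trade-off described in the comment above.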
        
           | rubicon33 wrote:
           | I totally get that these are very hard problems to solve and
           | that we're on the bleeding edge of what's possible, but I can't
           | help but wonder when someone is going to crack real video
           | understanding.
           | 
           | Sure, maybe it's still frame-by-frame, but so fast and so
           | often that the model retains a rolling context of what's
           | going on and can cleanly answer temporal questions.
           | 
           | "How many packages were delivered over the last hour", etc.
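
Editor's sketch of the "rolling context" idea from the comment above, covering only the bookkeeping side: a time-bounded buffer of subsampled frames that can be queried for a recent window. It says nothing about how a model would consume the frames. `RollingFrames`, `window_s`, and `min_interval_s` are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple


class RollingFrames:
    """Keeps subsampled (timestamp, frame) pairs covering the last `window_s` seconds."""

    def __init__(self, window_s: float, min_interval_s: float) -> None:
        self.window_s = window_s
        self.min_interval_s = min_interval_s
        self._frames: Deque[Tuple[float, bytes]] = deque()

    def add(self, ts: float, data: bytes) -> bool:
        # Subsample: skip frames that arrive too soon after the last kept one.
        if self._frames and ts - self._frames[-1][0] < self.min_interval_s:
            return False
        self._frames.append((ts, data))
        # Evict frames that have fallen out of the rolling window.
        while self._frames and ts - self._frames[0][0] > self.window_s:
            self._frames.popleft()
        return True

    def since(self, now: float, seconds: float) -> List[Tuple[float, bytes]]:
        # Frames captured within the last `seconds`, oldest first.
        return [(t, d) for t, d in self._frames if now - t <= seconds]
```

A question like the "last hour" example would then be answered over `buf.since(now, 3600)`. How many frames a given model can attend to at once remains the real constraint, so `min_interval_s` would have to be tuned to that budget.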
        
       | noodlebreak wrote:
       | I have to try it out on my idle laptops. I've been meaning to run
       | some models on them for low-cost tasks that need AI, like
       | sorting and filtering the hundreds of thousands of photos I have
       | amassed over the years, and applying general size-reducing
       | compression to the filtered ones.
       | 
       | Btw if anyone has already created such a pipeline/workflow using
       | such models, please lmk!
        
       | spwa4 wrote:
       | I've been trying to do this, but I can't get voice recognition to
       | work fast enough (meaning live) with Gemma E2B on an M1 Max
       | (64 GB), a 5060 Ti (16 GB), or a Snapdragon 8 Gen 2.
       | 
       | Any pointers?
        
       | inzlab wrote:
       | Real-time AI sounds like the future.
        
       ___________________________________________________________________
       (page generated 2026-04-06 23:01 UTC)