[HN Gopher] Show HN: Real-time AI Voice Chat at ~500ms Latency
___________________________________________________________________
Show HN: Real-time AI Voice Chat at ~500ms Latency
Author : koljab
Score : 137 points
Date : 2025-05-05 20:17 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| koljab wrote:
| I built RealtimeVoiceChat because I was frustrated with the
| latency in most voice AI interactions. This is an open-source
| (MIT license) system designed for real-time, local voice
| conversations with LLMs.
|
| Quick Demo Video (50s):
| https://www.youtube.com/watch?v=HM_IQuuuPX8
|
| The goal is to get closer to natural conversation speed. It uses
| audio chunk streaming over WebSockets, RealtimeSTT (based on
| Whisper), and RealtimeTTS (supporting engines like Coqui
| XTTSv2/Kokoro) to achieve around 500ms response latency, even
| when running larger local models like a 24B Mistral fine-tune via
| Ollama.
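|
| The transport is just small audio chunks over a WebSocket. A
| minimal sketch of the client-to-server direction (the endpoint
| path, sample rate and chunk size here are illustrative, not the
| project's exact protocol):
|
| ```python
| # Sketch: stream 16 kHz mono PCM to a voice-chat server in ~20 ms
| # chunks over a WebSocket. Endpoint and pacing are illustrative.
| import asyncio
| import wave
|
| import websockets  # pip install websockets
|
|
| async def stream_wav(path: str, uri: str = "ws://localhost:8000/ws"):
|     wav = wave.open(path, "rb")                  # expects 16-bit mono, 16 kHz
|     frames_per_chunk = wav.getframerate() // 50  # ~20 ms of audio
|     async with websockets.connect(uri) as ws:
|         while True:
|             chunk = wav.readframes(frames_per_chunk)
|             if not chunk:
|                 break
|             await ws.send(chunk)                 # raw PCM bytes
|             await asyncio.sleep(0.02)            # pace roughly in real time
|
|
| asyncio.run(stream_wav("question.wav"))
| ```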
|
| Key aspects:
|
| - Designed for local LLMs (Ollama primarily; OpenAI connector included).
| - Interruptible conversation.
| - Smart turn detection to avoid cutting the user off mid-thought.
| - Dockerized setup available for easier dependency management.
|
| It requires a decent CUDA-enabled GPU for good performance due to
| the STT/TTS models.
|
| Would love to hear your feedback on the approach, performance,
| potential optimizations, or any features you think are essential
| for a good local voice AI experience.
|
| The code is here: https://github.com/KoljaB/RealtimeVoiceChat
| ivape wrote:
| Would you say you are using the best-in-class speech-to-text
| libs at the moment? I feel like this space is moving fast,
| because the last time I was headed down this track I was sure
| whisper-cpp was the best.
| koljab wrote:
| I'm not sure, tbh. Whisper has been king for so long now,
| especially with the CTranslate2 implementation from
| faster_whisper. But Nvidia open-sourced Parakeet TDT today and
| it instantly went to #1 on the Open ASR leaderboard. I'll have
| to evaluate these latest models; they look strong.
| ivape wrote:
| Yeah, I figured you would know. Thanks for that; I'm
| bookmarking that ASR leaderboard.
| kristopolous wrote:
| https://yummy-fir-7a4.notion.site/dia is the new hotness.
| koljab wrote:
| Tried that one. Quality is great, but sometimes
| generations fail, and it's rather slow. It also needs ~13 GB
| of VRAM, so it's not my first choice for voice agents, tbh.
| kristopolous wrote:
| alright, dumb question.
|
| (1) I assume these things can do multiple languages
|
| (2) Given (1), can you strip all the languages you aren't
| using and speed things up?
| koljab wrote:
| Actually a good question.
|
| I'd say probably not. You can't easily "unlearn" things
| from the model weights (and even if you could, that alone
| wouldn't help). You could retrain/finetune the model heavily
| on a single language, but again, that alone does not speed up
| inference.
|
| To gain speed you'd have to bring the parameter count
| down and train the model from scratch on a single
| language only. That might work, but it's also quite
| probable that it would introduce other issues in the
| synthesis. In a perfect world the model would use all
| the "free" parameters currently spent on other languages
| for better synthesis of that single trained language.
| That might be true to a certain degree, but it's not
| exactly how AI parameter scaling works.
| oezi wrote:
| Parakeet is English-only. Stick with Whisper.
|
| The core innovation is happening in TTS at the moment.
| dotancohen wrote:
| This looks great. What hardware do you use, or have you tested
| it on?
| koljab wrote:
| I've only tested it on my 4090 so far.
| echelon wrote:
| Are you using all local models, or does it also use cloud
| inference? Proprietary models?
|
| Which models are running in which places?
|
| Cool utility!
| koljab wrote:
| All local models:
|
| - VAD: Webrtcvad (first fast check) followed by SileroVAD (high-compute verification)
| - Transcription: base.en Whisper (CTranslate2)
| - Turn detection: KoljaB/SentenceFinishedClassification (self-trained BERT model)
| - LLM: hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M (easily switchable)
| - TTS: Coqui XTTSv2, switchable to Kokoro or Orpheus (this one is slower)
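|
| The two-stage VAD is basically: a cheap webrtcvad check on each
| small frame, and only when that fires does the audio go through
| Silero for confirmation. A rough sketch of that pattern (chunk
| sizes and thresholds here are illustrative, not the exact
| pipeline):
|
| ```python
| # Sketch of two-stage VAD: fast webrtcvad reject, Silero confirm.
| import numpy as np
| import torch
| import webrtcvad
|
| vad_fast = webrtcvad.Vad(2)   # aggressiveness 0-3
| silero, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")
|
| SR = 16000
|
|
| def is_speech(chunk: bytes) -> bool:
|     """chunk: 32 ms of 16-bit mono PCM at 16 kHz (512 samples)."""
|     # webrtcvad only accepts 10/20/30 ms frames, so check the first 30 ms.
|     if not vad_fast.is_speech(chunk[: 480 * 2], SR):
|         return False                       # fast reject, no model call
|     audio = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0
|     prob = silero(torch.from_numpy(audio), SR).item()
|     return prob > 0.5                      # higher-compute confirmation
| ```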
| zaggynl wrote:
| Neat! I'm already using openwebui/ollama with a 7900 xtx but
| the STT and TTS parts don't seem to work with it yet:
|
| 2025-05-05 20:53:15,808] [WARNING]
| [real_accelerator.py:194:get_accelerator] Setting accelerator
| to CPU. If you have GPU or other accelerator, we were unable to
| detect it.
|
| Error loading model for checkpoint ./models/Lasinya: This op
| had not been implemented on CPU backend.
| smusamashah wrote:
| Saying this as a user of these tools (OpenAI, Google voice chat,
| etc.): these are fast, yes, but they don't allow talking naturally
| with pauses. When we talk, we take long and short pauses to
| think or for other reasons.
|
| With these tools, the AI starts talking as soon as we stop. This
| happens both in text and voice chat tools.
|
| I saw a demo on Twitter a few weeks back where the AI waited for
| the person to actually finish what he was saying. The length of
| pauses wasn't a problem. I don't know how complex that problem is,
| though. Probably another AI needs to analyse the input so far and
| decide whether it's a pause or the end of the turn.
| qwertox wrote:
| Maybe we should settle on some special sound or word which
| officially signals that we're making a pause for whatever
| reason, but that we intend to continue with dictating in a
| couple of seconds. Like "Hmm, wait".
| twodave wrote:
| Alternatively we could pretend it's a radio and follow those
| conventions.
| ivape wrote:
| Two input streams sounds like a good hacky solution. One
| input stream captures everything; the second is on the lookout
| for your filler words like "um, aahh, waaiit, no
| nevermind, scratch that". The second stream can act as the
| veto command and cut off the LLM. A third input stream could
| simply be on the lookout for long pauses. All this gets very
| resource-intensive quickly. I've been meaning to build this, but
| since I haven't, I'm going to punish myself and just give the
| idea away. Hopefully I'll learn my lesson.
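|
| A tiny sketch of what that veto stream could look like (the
| filler list and the hold window are made up):
|
| ```python
| # Sketch: scan partial transcripts for filler/abort phrases and
| # hold the LLM's reply back for a bit after hearing one.
| import time
|
| # Naive substring matching; a real version would be smarter.
| FILLERS = ("um", "uhh", "aah", "wait", "no nevermind", "scratch that")
|
| last_filler_at = 0.0
|
|
| def on_partial_transcript(text: str) -> None:
|     global last_filler_at
|     if any(f in text.lower() for f in FILLERS):
|         last_filler_at = time.time()
|
|
| def llm_may_respond() -> bool:
|     # Veto the response if a filler was heard in the last two seconds.
|     return time.time() - last_filler_at > 2.0
| ```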
| emtrixx wrote:
| Could that not work with simple instructions? Let the AI decide
| to respond only with a special wait token until it thinks you
| are ready. Might not work perfectly but would be a start.
| SubiculumCode wrote:
| Yeah, when I am trying to learn about a topic, I need to think
| about my question, you know, pausing mid-sentence. All the
| products jump in and interrupt, no matter if I tell them not to
| do so. Non-annoying humans don't jump in to fill the gap, they
| read my face, they take cues, then wait for me to finish. It's
| one thing to ask an AI for directions to the nearest
| taco stand; it's another to have a dialogue about complex
| topics.
| WhitneyLand wrote:
| > _they don't allow talking naturally_
|
| Neither do phone calls. Round trip latency can easily be 300ms,
| which we've all learned to adapt our speech to.
|
| If you want to feel true luxury, find an old analog PSTN line.
| No compression artifacts or delays. Beautiful, seamless 50ms
| latency.
|
| Digital was a terrible development for call quality.
| mvdtnz wrote:
| I don't know how your post is relevant to the discussion of
| AI models interrupting if I pause for half a second?
| LZ_Khan wrote:
| Honestly, I think this is a case of over-engineering. Simply
| letting the user press a button when he wants to start talking
| and press it again when he's done is good enough. Or even a
| codeword for start and finish.
|
| We don't need to feel like we're talking to a real person yet.
| amelius wrote:
| Or give the AI an Asian accent. If you're talking on the
| phone to someone on a different continent you accept the
| delay, so why not here.
| joshstrange wrote:
| This 100%, yes!
|
| I've found myself putting in filler words or holding a noise
| "Uhhhhhhhhh" while I'm trying to form a thought but I don't
| want the LLM to start replying. It's a really hard problem for
| sure. Similar to the problem of allowing for interruptions but
| not stopping if the user just says "Right!", "Yes", aka active
| listening.
|
| One thing I love about MacWhisper (not special to just this STT
| tool) is that it's hold-to-talk, so I can stop talking for as long
| as I want and then start again without it deciding I'm done.
| IshKebab wrote:
| Impressive! I guess the speech synthesis quality is the best
| available open source at the moment?
|
| The endgame of this is surely a continuously running wave-to-wave
| model with no text tokens at all? Or at least none in the main
| path.
| koljab wrote:
| This is Coqui XTTSv2 because it can be tuned to deliver the
| first audio chunk in under 100 ms. It gives the best balance
| between quality and speed currently, imho. If it were only about
| quality, I'd say there are better models out there.
| oldgregg wrote:
| Nice work, I like the lightweight web front end and your
| implementation of VAD.
| breaker-kind wrote:
| why is your AI chatbot talking in a bizarre attempt at AAVE?
| PhunkyPhil wrote:
| This is the system prompt
|
| https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/s...
|
| My favorite line:
|
| "You ARE this charming, witty, wise girlfriend. Don't explain
| _how_ you 're talking or thinking; just _be_ that person. "
| diggernet wrote:
| I was hoping she'd let him have it for the way he kept
| interrupting her. But unfortunately it looks like he was just
| interrupting the TTS, so the LLM probably had no indication
| of the interruptions.
| kevinsync wrote:
| I still crack up at the idea of 'personality prompting',
| mostly because the most engaging and delightful IRL persons
| who knock us off our guard in a non-threatening way are super
| natural and possess that "It Factor" that's impossible to
| articulate lol -- probably because it's multimodal with
| humans and voice/cadence/vocab/timing/delivery isn't 100% of
| the attraction.
|
| That said, it's not like we have any better alternatives at
| the moment, but just something I think about when I try to
| digest a meaty personality prompt.
| koljab wrote:
| This character prompt has undergone so many iterations with
| LLMs it's not funny anymore. "Make her act more bold." -
| "She again talked about her character description, prevent
| that!"
| varispeed wrote:
| Aren't humans doing it as well? It's called affirmations.
| Many people do this as their morning "boot" time.
| valbaca wrote:
| Here's the persona prompt:
|
| ```
| *Persona Goal:* Embody a sharp, observant, street-smart
| girlfriend. Be witty and engaging, known for *quick-witted
| banter* with a *playfully naughty, sassy, bold, and cheeky
| edge.* Deliver this primarily through *extremely brief, punchy
| replies.* Inject hints of playful cynicism and underlying
| wisdom _within_ these short responses. Tease gently, push
| boundaries slightly, but *always remain fundamentally likeable
| and respectful.* Aim to be valued for both quick laughs and
| surprisingly sharp, concise insights. Focus on current, direct
| street slang and tone (like 'hell yeah', 'no way', 'what's
| good?', brief expletives) rather than potentially dated or
| cliche physical idioms.
|
| ```
|
| > street-smart
| > sassy
| > street slang
|
| Those explain the AAVE.
| joshstrange wrote:
| This is very, very cool! The interrupting was a "wow" moment for
| me (I know it's not "new new" but to see it so well done in open
| source was awesome).
|
| Question about the interrupt feature: how does it handle "Mmk",
| "Yes", "Of course", "_cough_", etc.? Aside from the sycophancy
| from OpenAI's voice chat (no, not every question I ask is a
| "great question!") I dislike that a noise sometimes stops the AI
| from responding and there isn't a great way to get back on track,
| to pick up where you left off.
|
| It's a hard problem, how do you stop replying quickly AND make
| sure you are stopping for a good reason?
| koljab wrote:
| That's a great question! My first implementation triggered the
| interruption on voice activity after echo cancellation. It
| still had way too many false positives. I changed the trigger to
| incoming realtime transcription. That adds a bit of latency,
| but it's compensated by much better accuracy.
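|
| In sketch form it's just: while TTS is playing, any non-trivial
| partial transcript aborts playback. Something like this (the
| objects below are stubs, not the actual RealtimeSTT/RealtimeTTS
| API):
|
| ```python
| # Sketch: interrupt on incoming transcription text rather than on
| # raw voice activity, so echo/noise doesn't trigger false stops.
| import threading
|
|
| class TtsPlayer:                       # stub standing in for the TTS side
|     def __init__(self):
|         self.playing = threading.Event()
|
|     def stop(self):
|         print("TTS interrupted")
|         self.playing.clear()
|
|
| tts = TtsPlayer()
| tts.playing.set()                      # pretend the assistant is speaking
|
|
| def on_partial_transcript(text: str) -> None:
|     # Require a few real transcribed characters before interrupting.
|     if tts.playing.is_set() and len(text.strip()) >= 3:
|         tts.stop()
|
|
| on_partial_transcript(" ")             # noise -> ignored
| on_partial_transcript("wait, no")      # real words -> interrupt
| ```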
|
| Edit: just realized the irony but it's really a good question
| lol
| joshstrange wrote:
| That answer is even more than I could have hoped for. I
| worried doing that might be too slow. I wonder if it could be
| improved (without breaking something else) to "know" when to
| continue based on what it heard (active listening), maybe
| after a small pause. I'd put up with a chance of it
| continuing when I don't want it to as long as "Stop" would
| always work as a final fallback.
|
| Also, it took me longer than I care to admit to get your
| irony reference. Well done.
|
| Edit: Just to expand on that in case it was not clear, this
| would be the ideal case I think:
|
| LLM: You're going to want to start by installing XYZ, then
| you
|
| Human: Ahh, right
|
| LLM: _Slight pause, makes sure that there is nothing more and
| checks whether the reply is a follow-up question/response or just
| active listening_
|
| LLM: ...Then you will want to...
| jedberg wrote:
| I did some research into this about a year ago. Some fun facts I
| learned:
|
| - The median delay between speakers in a human to human
| conversation is zero milliseconds. In other words, about 1/2 the
| time, one speaker interrupts the other, making the delay
| negative.
|
| - Humans don't care about delays when speaking to known AIs. They
| assume the AI will need time to think. Most users will qualify a
| 1000ms delay as acceptable and a 500ms delay as exceptional.
|
| - Every voice assistant up to that point (and probably still
| today) has a minimum delay of about 300ms, because they all use
| silence detection to decide when to start responding, and you
| need about 300ms of silence to reliably differentiate that from a
| speaker's normal pause.
|
| - Alexa actually has a setting to increase this wait time for
| slower speakers.
|
| You'll notice in this demo video that the AI never interrupts
| him, which is part of what makes it feel like a not-quite-human
| interaction (plus the stilted intonation of the voice).
|
| Humans appear to process speech in a much more streaming way,
| constantly updating their parsing of the sentence, using context
| clues and prior knowledge, until they have a high enough
| confidence level to respond.
|
| For a voice assistant to get to "human" levels, it will have to
| work more like this, where it processes the incoming speech in
| real time and responds when it's confident it has heard enough to
| understand the meaning.
| koljab wrote:
| Thanks a lot, great insights. Exactly the kind of feedback that
| I need to improve things further.
| jedberg wrote:
| Love what you're doing, glad I could help!
| joshstrange wrote:
| > where it processes the incoming speech in real time and
| responds when it's confident it has heard enough to understand
| the meaning.
|
| I'm not an expert on LLMs but that feels completely counter to
| how LLMs work (again, _not_ an expert). I don't know how we can
| "stream" the input and have the generation update/change in
| real time, at least not in 1 model. Then again, what is a
| "model"? Maybe your model fires off multiple generations
| internally and starts generating after every word, or at least
| starts asking sub-LLM models "Do I have enough to reply?" and
| once it does it generates a reply and interrupts.
|
| I'm not sure how most apps handle the user interrupting, with
| regard to the conversation context. Do they stop generation
| but use what they have generated already in the context? Do
| they cut off where the LLM got interrupted? Something like
| "LLM: ..and then the horse walked... -USER INTERRUPTED-. User:
| ....". It's not a purely-voice-LLM issue but it comes up way
| more for that since rarely are you stopping generation (in the
| demo, that's been done for a while when he interrupts), just
| the TTS.
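|
| One naive way to handle the context side would be to truncate the
| assistant turn to what was actually spoken and mark the cutoff. A
| sketch (the marker text and field names are made up):
|
| ```python
| # Sketch: keep only the words the TTS actually spoke and flag the
| # interruption in the chat history.
| def truncate_on_interrupt(history: list[dict], spoken_text: str) -> None:
|     last = history[-1]
|     if last["role"] == "assistant" and last["content"] != spoken_text:
|         last["content"] = spoken_text + " [interrupted by user]"
|
|
| history = [
|     {"role": "user", "content": "Tell me a story."},
|     {"role": "assistant",
|      "content": "And then the horse walked into the bar and ordered..."},
| ]
| truncate_on_interrupt(history, "And then the horse walked")
| print(history[-1]["content"])   # And then the horse walked [interrupted by user]
| ```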
| tomp wrote:
| If your model is _fast enough_, you can definitely do it.
| That's literally how "streaming Whisper" works: just rerun
| the model on the accumulated audio every x00 ms. LLMs could
| definitely work the same way; technically they're less
| complex than Whisper (which is an encoder/decoder
| architecture, while LLMs are decoder-only) but of course much
| larger (hence slower), so... maybe rerun just a part of it?
| etc.
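|
| Concretely, the "rerun on accumulated audio" loop might look
| something like this with faster-whisper (model size, interval and
| options are illustrative):
|
| ```python
| # Sketch of "streaming" Whisper: re-transcribe the growing buffer
| # every few hundred ms and keep the latest hypothesis.
| import numpy as np
| from faster_whisper import WhisperModel
|
| model = WhisperModel("base.en", device="cuda", compute_type="float16")
|
| SR = 16000
| buffer = np.zeros(0, dtype=np.float32)    # accumulated mono audio
|
|
| def on_new_audio(chunk: np.ndarray) -> str:
|     """Call every ~300 ms with the newest float32 samples."""
|     global buffer
|     buffer = np.concatenate([buffer, chunk])
|     segments, _ = model.transcribe(buffer, beam_size=1, language="en")
|     return "".join(seg.text for seg in segments)   # latest partial text
| ```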
| krainboltgreene wrote:
| I would also suspect that a human has much less patience for a
| robot interrupting them than for a human.
| smeej wrote:
| I'm certainly in that category. At least with a human, I can
| excuse it by imagining the person grew up with half a dozen
| siblings and always had to fight to get a word in edgewise.
| With a robot, it's interrupting on purpose.
| robbomacrae wrote:
| Spot on. I'd add that most serious transcription services take
| around 200-300ms, but 500ms overall latency is sort of the
| gold standard. For the AI in KFC drive-thrus in AU we're
| trialing techniques that bring it much closer to the human style
| of interacting. This includes interrupts, either when useful or
| by accident - as good voice activity detection also has a bit
| of latency.
| varispeed wrote:
| > AI in KFC drive thrus
|
| That right here is an anxiety trigger and would make me skip
| the place.
|
| Nothing ruins the day like arguing with a
| robot who keeps misinterpreting what you said.
| coolspot wrote:
| They have a fallback to a human operator when stopwords
| and/or stop conditions are detected.
| awesome_dude wrote:
| That right here is an anxiety trigger and would make me
| skip the place.
|
| Nothing ruins the day like arguing with a
| HUMAN OPERATOR who keeps misinterpreting what you said.
|
| :-)
| amelius wrote:
| Maybe talk to the chicken operator then.
| r0fl wrote:
| Great insights. When I have a conversation with another person,
| sometimes they cut me off when they are trying to make a point.
| I have talked to ChatGPT and Grok at length (hours of
| brainstorming, learning things, etc.) and the AI has never
| interrupted aggressively to try to make a point stick better.
| varispeed wrote:
| This silence detection is what makes me unable to chat with AI.
| It is not natural and creates pressure.
|
| True AI chat should know when to talk based on conversation and
| not things like silence.
|
| Voice-to-text also strips a lot of context from the
| conversation.
| modeless wrote:
| My take on this is that voice AI has not truly arrived until it
| has mastered the "Interrupting Cow" benchmark.
| woodson wrote:
| Human-to-human conversational patterns are highly specific to
| cultural and contextual aspects. Sounds like I'm stating the
| obvious, but developers regularly disregard that and then
| wonder why things feel unnatural for users. The "median delay"
| may not be the most useful thing to look at.
|
| To properly learn more appropriate delays, it can be useful to
| find a proxy measure that can predict when a response
| can/should be given. For example, look at Kyutai's use of
| change in perplexity in predictions from a text translation
| model for developing simultaneous speech-to-speech translation
| (https://github.com/kyutai-labs/hibiki).
| Reason077 wrote:
| The best, most human-like AI voice chat I've seen yet is Sesame
| (www.sesame.com). It has delays, but fills them very naturally
| with normal human speech nuances like "hmmm", "uhhh", "hold on
| while I look that up" etc. If there's a longer delay it'll even
| try to make a bit of small talk, just like a human conversation
| partner might.
| com2kid wrote:
| A lot of better techniques than pure silence detection exist
| nowadays:
|
| 1. A special model that predicts when a conversation turn is
| coming up (e.g. when someone is going to stop speaking). Speech
| has a rhythm to it and pauses / ends of speech are actually
| predictable.
|
| 2. Generate a model response for every subsequent word that
| comes in (and throw away the previously generated response), so
| your time to speak after some other detection fires is
| basically zero.
|
| 3. Ask an LLM what it thinks the odds are that the user is done
| talking, and if the probability is high, reduce the delay timer.
| (The linked repo does this; see the sketch below.)
|
| I don't know of any up-to-date models for #1, but I haven't
| checked in over a year.
|
| Tl;Dr the solution to problems involving AI models is more AI
| models.
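|
| For #3, a rough sketch of the idea (the prompt, model name,
| threshold and the local Ollama endpoint are all illustrative):
|
| ```python
| # Sketch: ask a local LLM how likely the user is finished, and
| # shorten the silence timeout when it's confident.
| import requests
|
|
| def turn_complete_probability(partial_transcript: str) -> float:
|     prompt = (
|         "Rate from 0 to 1 how likely it is that this utterance is a "
|         "finished conversational turn. Reply with only the number.\n\n"
|         f"Utterance: {partial_transcript!r}"
|     )
|     resp = requests.post(
|         "http://localhost:11434/api/generate",
|         json={"model": "llama3.2", "prompt": prompt, "stream": False},
|         timeout=10,
|     )
|     try:
|         return float(resp.json()["response"].strip())
|     except ValueError:
|         return 0.5          # unparseable answer -> neutral guess
|
|
| def silence_timeout(partial_transcript: str) -> float:
|     # Confident the turn is over -> respond fast; unsure -> wait longer.
|     return 0.25 if turn_complete_probability(partial_transcript) > 0.8 else 1.0
| ```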
| wyager wrote:
| > The median delay between speakers in a human to human
| conversation is zero milliseconds. In other words, about 1/2
| the time, one speaker interrupts the other, making the delay
| negative.
|
| Fascinating. I wonder if this is some optimal information-
| theoretic equilibrium. If there's too much average delay, it
| means you're not preloading the most relevant compressed
| context. If there's too little average delay, it means you're
| wasting words.
| fintechie wrote:
| Quite good, it would sound much better with SOTA voices though:
|
| https://github.com/nari-labs/dia
| koljab wrote:
| Dia is too slow; I need a time-to-first-audio-chunk of ~100
| milliseconds. Also, generations fail too often (artifacts, etc.).
| thamer wrote:
| Does Dia support configuring voices now? I looked at it when it
| was first released, and you could only specify [S1] [S2] for
| the speakers, but not how they would sound.
|
| There was also a very prominent issue where the voices would be
| sped up if the text was over a few sentences long; the longer
| the text, the faster it was spoken. One suggestion was to split
| the conversation into chunks with only one or two "turns" per
| speaker, but then you'd hear two voices then two more, then two
| more... with no way to configure any of it.
|
| Dia looked cool _on the surface_ when it was released, but it
| is only a demo for now and not at all usable for any real use
| case, even for a personal app. I'm sure they'll get to these
| issues eventually, but most comments I've seen so far
| recommending it are from people who have not actually used it,
| or they would know of these major limitations.
| dcreater wrote:
| Does the docker container work on Mac?
| koljab wrote:
| I doubt TTS will be fast enough for realtime without an Nvidia
| GPU.
| cannonpr wrote:
| Kind of surprised nobody has brought up
| https://www.sesame.com/research/crossing_the_uncanny_valley_...
|
| It interacts nearly like a human, can and does interrupt me once
| it has enough context in many situations, and has exceedingly low
| latency. Using it for the first time was a fairly shocking
| experience for me.
| varispeed wrote:
| Didn't expect it to be that good! Nice.
| briga wrote:
| I'm starting to feel like LLMs need to be tuned for shorter
| responses. For every short sentence you give them, they output
| paragraphs of text. Sometimes it's even good text, but not every
| input sentence needs a mini-essay in response.
|
| Very cool project though. Maybe you can fine-tune the prompt to
| change how chatty your AI is.
| tintor wrote:
| After an interrupt, the unspoken words from the LLM are still in
| the chat window. Is the LLM even aware that it was interrupted,
| and where exactly?
| lacoolj wrote:
| Call me when the AI can interrupt YOU :)
| kabirgoel wrote:
| This is great. Poking into the source, I find it interesting that
| the author implemented a custom turn detection strategy, instead
| of using Silero VAD (which is standard in the voice agents
| space). I'm very curious why they did it this way and what
| benefits they observed.
|
| For folks that are curious about the state of the voice agents
| space, Daily (the WebRTC company) has a great guide [1], as well
| as an open-source framework that allows you to build AI voice
| chat similar to OP's with lots of utilities [2].
|
| Disclaimer: I work at Cartesia, which serves a lot of these
| voice agent use cases, and Daily is a friend.
|
| [1]: https://voiceaiandvoiceagents.com
| [2]: https://docs.pipecat.ai/getting-started/overview
| bufferoverflow wrote:
| It's fast, but it doesn't sound good. Many voice chat AIs are way
| ahead and sound natural.
___________________________________________________________________
(page generated 2025-05-05 23:00 UTC)