[HN Gopher] Show HN: Real-time AI Voice Chat at ~500ms Latency
___________________________________________________________________
Show HN: Real-time AI Voice Chat at ~500ms Latency
Author : koljab
Score : 137 points
Date : 2025-05-05 20:17 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| koljab wrote:
| I built RealtimeVoiceChat because I was frustrated with the
| latency in most voice AI interactions. This is an open-source
| (MIT license) system designed for real-time, local voice
| conversations with LLMs.
|
| Quick Demo Video (50s):
| https://www.youtube.com/watch?v=HM_IQuuuPX8
|
| The goal is to get closer to natural conversation speed. It uses
| audio chunk streaming over WebSockets, RealtimeSTT (based on
| Whisper), and RealtimeTTS (supporting engines like Coqui
| XTTSv2/Kokoro) to achieve around 500ms response latency, even
| when running larger local models like a 24B Mistral fine-tune via
| Ollama.
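|
| The transport is just small audio chunks over a WebSocket. A
| minimal sketch of the client-to-server direction (the endpoint
| path, sample rate and chunk size here are illustrative, not the
| project's exact protocol):
|
| ```python
| # Sketch: stream 16 kHz mono PCM to a voice-chat server in ~20 ms
| # chunks over a WebSocket. Endpoint and pacing are illustrative.
| import asyncio
| import wave
|
| import websockets  # pip install websockets
|
|
| async def stream_wav(path: str, uri: str = "ws://localhost:8000/ws"):
|     wav = wave.open(path, "rb")                  # expects 16-bit mono, 16 kHz
|     frames_per_chunk = wav.getframerate() // 50  # ~20 ms of audio
|     async with websockets.connect(uri) as ws:
|         while True:
|             chunk = wav.readframes(frames_per_chunk)
|             if not chunk:
|                 break
|             await ws.send(chunk)                 # raw PCM bytes
|             await asyncio.sleep(0.02)            # pace roughly in real time
|
|
| asyncio.run(stream_wav("question.wav"))
| ```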
|
| Key aspects:
|
| - Designed for local LLMs (Ollama primarily; OpenAI connector included).
| - Interruptible conversation.
| - Smart turn detection to avoid cutting the user off mid-thought.
| - Dockerized setup available for easier dependency management.
|
| It requires a decent CUDA-enabled GPU for good performance due to
| the STT/TTS models.
|
| Would love to hear your feedback on the approach, performance,
| potential optimizations, or any features you think are essential
| for a good local voice AI experience.
|
| The code is here: https://github.com/KoljaB/RealtimeVoiceChat
| ivape wrote:
| Would you say you are using the best-in-class speech-to-text
| libs at the moment? I feel like this space is moving fast,
| because the last time I was headed down this track I was sure
| whisper-cpp was the best.
| koljab wrote:
| I'm not sure, tbh. Whisper has been king for so long now,
| especially with the CTranslate2 implementation from
| faster_whisper. But Nvidia open-sourced Parakeet TDT today and
| it instantly went to #1 on the Open ASR leaderboard. I'll have
| to evaluate these latest models; they look strong.
| ivape wrote:
| Yeah, I figured you would know. Thanks for that; I'm
| bookmarking that ASR leaderboard.
| kristopolous wrote:
| https://yummy-fir-7a4.notion.site/dia is the new hotness.
| koljab wrote:
| Tried that one. Quality is great, but sometimes
| generations fail, and it's rather slow. It also needs ~13 GB
| of VRAM, so it's not my first choice for voice agents, tbh.
| kristopolous wrote:
| alright, dumb question.
|
| (1) I assume these things can do multiple languages
|
| (2) Given (1), can you strip all the languages you aren't
| using and speed things up?
| koljab wrote:
| Actually a good question.
|
| I'd say probably not. You can't easily "unlearn" things
| from the model weights (and even if you could, that alone
| wouldn't help). You could retrain/finetune the model heavily
| on a single language, but again, that alone does not speed up
| inference.
|
| To gain speed you'd have to bring the parameter count
| down and train the model from scratch on a single
| language only. That might work, but it's also quite
| probable that it would introduce other issues in the
| synthesis. In a perfect world the model would use all
| the "free" parameters currently spent on other languages
| for better synthesis of that single trained language.
| That might be true to a certain degree, but it's not
| exactly how AI parameter scaling works.
| oezi wrote:
| Parakeet is English-only. Stick with Whisper.
|
| The core innovation is happening in TTS at the moment.
| dotancohen wrote:
| This looks great. What hardware do you use, or have you tested
| it on?
| koljab wrote:
| I've only tested it on my 4090 so far.
| echelon wrote:
| Are you using all local models, or does it also use cloud
| inference? Proprietary models?
|
| Which models are running in which places?
|
| Cool utility!
| koljab wrote:
| All local models:
|
| - VAD: Webrtcvad (first fast check) followed by SileroVAD (high-compute verification)
| - Transcription: base.en Whisper (CTranslate2)
| - Turn detection: KoljaB/SentenceFinishedClassification (self-trained BERT model)
| - LLM: hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M (easily switchable)
| - TTS: Coqui XTTSv2, switchable to Kokoro or Orpheus (this one is slower)
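|
| The two-stage VAD is basically: a cheap webrtcvad check on each
| small frame, and only when that fires does the audio go through
| Silero for confirmation. A rough sketch of that pattern (chunk
| sizes and thresholds here are illustrative, not the exact
| pipeline):
|
| ```python
| # Sketch of two-stage VAD: fast webrtcvad reject, Silero confirm.
| import numpy as np
| import torch
| import webrtcvad
|
| vad_fast = webrtcvad.Vad(2)   # aggressiveness 0-3
| silero, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")
|
| SR = 16000
|
|
| def is_speech(chunk: bytes) -> bool:
|     """chunk: 32 ms of 16-bit mono PCM at 16 kHz (512 samples)."""
|     # webrtcvad only accepts 10/20/30 ms frames, so check the first 30 ms.
|     if not vad_fast.is_speech(chunk[: 480 * 2], SR):
|         return False                       # fast reject, no model call
|     audio = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0
|     prob = silero(torch.from_numpy(audio), SR).item()
|     return prob > 0.5                      # higher-compute confirmation
| ```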
| zaggynl wrote:
| Neat! I'm already using openwebui/ollama with a 7900 xtx but
| the STT and TTS parts don't seem to work with it yet:
|
| 2025-05-05 20:53:15,808] [WARNING]
| [real_accelerator.py:194:get_accelerator] Setting accelerator
| to CPU. If you have GPU or other accelerator, we were unable to
| detect it.
|
| Error loading model for checkpoint ./models/Lasinya: This op
| had not been implemented on CPU backend.
| smusamashah wrote:
| Saying this as a user of these tools (OpenAI, Google voice chat,
| etc.): these are fast, yes, but they don't allow talking naturally
| with pauses. When we talk, we take long and short pauses to
| think or for other reasons.
|
| With these tools, the AI starts talking as soon as we stop. This
| happens both in text and voice chat tools.
|
| I saw a demo on Twitter a few weeks back where the AI waited for
| the person to actually finish what he was saying. The length of
| pauses wasn't a problem. I don't know how complex that problem is,
| though. Probably another AI needs to analyse the input so far and
| decide whether it's a pause or the end of the turn.
| qwertox wrote:
| Maybe we should settle on some special sound or word which
| officially signals that we're making a pause for whatever
| reason, but that we intend to continue with dictating in a
| couple of seconds. Like "Hmm, wait".
| twodave wrote:
| Alternatively we could pretend it's a radio and follow those
| conventions.
| ivape wrote:
| Two input streams sounds like a good hacky solution. One
| input stream captures everything; the second is on the lookout
| for your filler words like "um, aahh, waaiit, no
| nevermind, scratch that". The second stream can act as the
| veto command and cut off the LLM. A third input stream could
| simply be on the lookout for long pauses. All this gets very
| resource-intensive quickly. I've been meaning to build this, but
| since I haven't, I'm going to punish myself and just give the
| idea away. Hopefully I'll learn my lesson.
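|
| A tiny sketch of what that veto stream could look like (the
| filler list and the hold window are made up):
|
| ```python
| # Sketch: scan partial transcripts for filler/abort phrases and
| # hold the LLM's reply back for a bit after hearing one.
| import time
|
| # Naive substring matching; a real version would be smarter.
| FILLERS = ("um", "uhh", "aah", "wait", "no nevermind", "scratch that")
|
| last_filler_at = 0.0
|
|
| def on_partial_transcript(text: str) -> None:
|     global last_filler_at
|     if any(f in text.lower() for f in FILLERS):
|         last_filler_at = time.time()
|
|
| def llm_may_respond() -> bool:
|     # Veto the response if a filler was heard in the last two seconds.
|     return time.time() - last_filler_at > 2.0
| ```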
| emtrixx wrote:
| Could that not work with simple instructions? Let the AI decide
| to respond only with a special wait token until it thinks you
| are ready. Might not work perfectly but would be a start.
| SubiculumCode wrote:
| Yeah, when I am trying to learn about a topic, I need to think
| about my question, you know, pausing mid-sentence. All the
| products jump in and interrupt, no matter if I tell them not to
| do so. Non-annoying humans don't jump in to fill the gap, they
| read my face, they take cues, then wait for me to finish. It's
| one thing to ask an AI for directions to the nearest
| taco stand; it's another to have a dialogue about complex
| topics.
| WhitneyLand wrote:
| > _they don't allow talking naturally_
|
| Neither do phone calls. Round trip latency can easily be 300ms,
| which we've all learned to adapt our speech to.
|
| If you want to feel true luxury, find an old analog PSTN line.
| No compression artifacts or delays. Beautiful, seamless 50ms
| latency.
|
| Digital was a terrible development for call quality.
| mvdtnz wrote:
| I don't know how your post is relevant to the discussion of
| AI models interrupting if I pause for half a second?
| LZ_Khan wrote:
| Honestly, I think this is a case of over-engineering. Simply
| letting the user press a button when he wants to start talking
| and press it again when he's done is good enough. Or even a
| codeword for start and finish.
|
| We don't need to feel like we're talking to a real person yet.
| amelius wrote:
| Or give the AI an Asian accent. If you're talking on the
| phone to someone on a different continent you accept the
| delay, so why not here.
| joshstrange wrote:
| This 100%, yes!
|
| I've found myself putting in filler words or holding a noise
| "Uhhhhhhhhh" while I'm trying to form a thought but I don't
| want the LLM to start replying. It's a really hard problem for
| sure. Similar to the problem of allowing for interruptions but
| not stopping if the user just says "Right!", "Yes", aka active
| listening.
|
| One thing I love about MacWhisper (not special to just this STT
| tool) is that it's hold-to-talk, so I can stop talking for as long
| as I want and then start again without it deciding I'm done.
| IshKebab wrote:
| Impressive! I guess the speech synthesis quality is the best
| available open source at the moment?
|
| The endgame of this is surely a continuously running wave-to-wave
| model with no text tokens at all? Or at least none in the main
| path.
| koljab wrote:
| This is Coqui XTTSv2 because it can be tuned to deliver the
| first audio chunk in under 100 ms. It gives the best balance
| between quality and speed currently, imho. If it were only about
| quality, I'd say there are better models out there.
| oldgregg wrote:
| Nice work, I like the lightweight web front end and your
| implementation of VAD.
| breaker-kind wrote:
| why is your AI chatbot talking in a bizarre attempt at AAVE?
| PhunkyPhil wrote:
| This is the system prompt
|
| https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/s...
|
| My favorite line:
|
| "You ARE this charming, witty, wise girlfriend. Don't explain
| _how_ you 're talking or thinking; just _be_ that person. "
| diggernet wrote:
| I was hoping she'd let him have it for the way he kept
| interrupting her. But unfortunately it looks like he was just
| interrupting the TTS, so the LLM probably had no indication
| of the interruptions.
| kevinsync wrote:
| I still crack up at the idea of 'personality prompting',
| mostly because the most engaging and delightful IRL persons
| who knock us off our guard in a non-threatening way are super
| natural and possess that "It Factor" that's impossible to
| articulate lol -- probably because it's multimodal with
| humans and voice/cadence/vocab/timing/delivery isn't 100% of
| the attraction.
|
| That said, it's not like we have any better alternatives at
| the moment, but just something I think about when I try to
| digest a meaty personality prompt.
| koljab wrote:
| This character prompt has undergone so many iterations with
| LLMs it's not funny anymore. "Make her act more bold." -
| "She again talked about her character description, prevent
| that!"
| varispeed wrote:
| Aren't humans doing it as well? It's called affirmations.
| Many people do this as their morning "boot" time.
| valbaca wrote:
| Here's the persona prompt:
|
| ```
| *Persona Goal:* Embody a sharp, observant, street-smart
| girlfriend. Be witty and engaging, known for *quick-witted
| banter* with a *playfully naughty, sassy, bold, and cheeky
| edge.* Deliver this primarily through *extremely brief, punchy
| replies.* Inject hints of playful cynicism and underlying
| wisdom _within_ these short responses. Tease gently, push
| boundaries slightly, but *always remain fundamentally likeable
| and respectful.* Aim to be valued for both quick laughs and
| surprisingly sharp, concise insights. Focus on current, direct
| street slang and tone (like 'hell yeah', 'no way', 'what's
| good?', brief expletives) rather than potentially dated or
| cliche physical idioms.
|
| ```
|
| > street-smart
| > sassy
| > street slang
|
| Those explain the AAVE.
| joshstrange wrote:
| This is very, very cool! The interrupting was a "wow" moment for
| me (I know it's not "new new" but to see it so well done in open
| source was awesome).
|
| Question about the interrupt feature: how does it handle "Mmk",
| "Yes", "Of course", "_cough_", etc.? Aside from the sycophancy
| from OpenAI's voice chat (no, not every question I ask is a
| "great question!") I dislike that a noise sometimes stops the AI
| from responding and there isn't a great way to get back on track,
| to pick up where you left off.
|
| It's a hard problem, how do you stop replying quickly AND make
| sure you are stopping for a good reason?
| koljab wrote:
| That's a great question! My first implementation triggered the
| interruption on voice activity after echo cancellation. It
| still had way too many false positives. I changed the trigger to
| incoming realtime transcription. That adds a bit of latency,
| but it's compensated by much better accuracy.
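|
| In sketch form it's just: while TTS is playing, any non-trivial
| partial transcript aborts playback. Something like this (the
| objects below are stubs, not the actual RealtimeSTT/RealtimeTTS
| API):
|
| ```python
| # Sketch: interrupt on incoming transcription text rather than on
| # raw voice activity, so echo/noise doesn't trigger false stops.
| import threading
|
|
| class TtsPlayer:                       # stub standing in for the TTS side
|     def __init__(self):
|         self.playing = threading.Event()
|
|     def stop(self):
|         print("TTS interrupted")
|         self.playing.clear()
|
|
| tts = TtsPlayer()
| tts.playing.set()                      # pretend the assistant is speaking
|
|
| def on_partial_transcript(text: str) -> None:
|     # Require a few real transcribed characters before interrupting.
|     if tts.playing.is_set() and len(text.strip()) >= 3:
|         tts.stop()
|
|
| on_partial_transcript(" ")             # noise -> ignored
| on_partial_transcript("wait, no")      # real words -> interrupt
| ```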
|
| Edit: just realized the irony but it's really a good question
| lol
| joshstrange wrote:
| That answer is even more than I could have hoped for. I
| worried doing that might be too slow. I wonder if it could be
| improved (without breaking something else) to "know" when to
| continue based on what it heard (active listening), maybe
| after a small pause. I'd put up with a chance of it
| continuing when I don't want it to as long as "Stop" would
| always work as a final fallback.
|
| Also, it took me longer than I care to admit to get your
| irony reference. Well done.
|
| Edit: Just to expand on that in case it was not clear, this
| would be the ideal case I think:
|
| LLM: You're going to want to start by installing XYZ, then
| you
|
| Human: Ahh, right
|
| LLM: _Slight pause, makes sure that there is nothing more and
| checks whether the reply is a follow-up question/response or just
| active listening_
|
| LLM: ...Then you will want to...
| jedberg wrote:
| I did some research into this about a year ago. Some fun facts I
| learned:
|
| - The median delay between speakers in a human to human
| conversation is zero milliseconds. In other words, about 1/2 the
| time, one speaker interrupts the other, making the delay
| negative.
|
| - Humans don't care about delays when speaking to known AIs. They
| assume the AI will need time to think. Most users will qualify a
| 1000ms delay as acceptable and a 500ms delay as exceptional.
|
| - Every voice assistant up to that point (and probably still
| today) has a minimum delay of about 300ms, because they all use
| silence detection to decide when to start responding, and you
| need about 300ms of silence to reliably differentiate that from a
| speaker's normal pause.
|
| - Alexa actually has a setting to increase this wait time for
| slower speakers.
|
| You'll notice in this demo video that the AI never interrupts
| him, which is part of what makes it feel like a not-quite-human
| interaction (plus the stilted intonation of the voice).
|
| Humans appear to process speech in a much more streaming way,
| constantly updating their parsing of the sentence, using context
| clues and prior knowledge, until they have a high enough
| confidence level to respond.
|
| For a voice assistant to get to "human" levels, it will have to
| work more like this, where it processes the incoming speech in
| real time and responds when it's confident it has heard enough to
| understand the meaning.
| koljab wrote:
| Thanks a lot, great insights. Exactly the kind of feedback that
| I need to improve things further.
| jedberg wrote:
| Love what you're doing, glad I could help!
| joshstrange wrote:
| > where it processes the incoming speech in real time and
| responds when it's confident it has heard enough to understand
| the meaning.
|
| I'm not an expert on LLMs but that feels completely counter to
| how LLMs work (again, _not_ an expert). I don't know how we can
| "stream" the input and have the generation update/change in
| real time, at least not in 1 model. Then again, what is a
| "model"? Maybe your model fires off multiple generations
| internally and starts generating after every word, or at least
| starts asking sub-LLM models "Do I have enough to reply?" and
| once it does it generates a reply and interrupts.
|
| I'm not sure how most apps handle the user interrupting, with
| regard to the conversation context. Do they stop generation
| but use what they have generated already in the context? Do
| they cut off where the LLM got interrupted? Something like
| "LLM: ..and then the horse walked... -USER INTERRUPTED-. User:
| ....". It's not a purely-voice-LLM issue but it comes up way
| more for that since rarely are you stopping generation (in the
| demo, that's been done for a while when he interrupts), just
| the TTS.
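|
| One naive way to handle the context side would be to truncate the
| assistant turn to what was actually spoken and mark the cutoff. A
| sketch (the marker text and field names are made up):
|
| ```python
| # Sketch: keep only the words the TTS actually spoke and flag the
| # interruption in the chat history.
| def truncate_on_interrupt(history: list[dict], spoken_text: str) -> None:
|     last = history[-1]
|     if last["role"] == "assistant" and last["content"] != spoken_text:
|         last["content"] = spoken_text + " [interrupted by user]"
|
|
| history = [
|     {"role": "user", "content": "Tell me a story."},
|     {"role": "assistant",
|      "content": "And then the horse walked into the bar and ordered..."},
| ]
| truncate_on_interrupt(history, "And then the horse walked")
| print(history[-1]["content"])   # And then the horse walked [interrupted by user]
| ```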
| tomp wrote:
| If your model is _fast enough_, you can definitely do it.
| That's literally how "streaming Whisper" works: just rerun
| the model on the accumulated audio every x00 ms. LLMs could
| definitely work the same way; technically they're less
| complex than Whisper (which is an encoder/decoder
| architecture, while LLMs are decoder-only) but of course much
| larger (hence slower), so... maybe rerun just a part of it?
| etc.
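|
| Concretely, the "rerun on accumulated audio" loop might look
| something like this with faster-whisper (model size, interval and
| options are illustrative):
|
| ```python
| # Sketch of "streaming" Whisper: re-transcribe the growing buffer
| # every few hundred ms and keep the latest hypothesis.
| import numpy as np
| from faster_whisper import WhisperModel
|
| model = WhisperModel("base.en", device="cuda", compute_type="float16")
|
| SR = 16000
| buffer = np.zeros(0, dtype=np.float32)    # accumulated mono audio
|
|
| def on_new_audio(chunk: np.ndarray) -> str:
|     """Call every ~300 ms with the newest float32 samples."""
|     global buffer
|     buffer = np.concatenate([buffer, chunk])
|     segments, _ = model.transcribe(buffer, beam_size=1, language="en")
|     return "".join(seg.text for seg in segments)   # latest partial text
| ```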
| krainboltgreene wrote:
| I would also suspect that a human has much less patience for a
| robot interrupting them than for a human.
| smeej wrote:
| I'm certainly in that category. At least with a human, I can
| excuse it by imagining the person grew up with half a dozen
| siblings and always had to fight to get a word in edgewise.
| With a robot, it's interrupting on purpose.
| robbomacrae wrote:
| Spot on. I'd add that most serious transcription services take
| around 200-300ms, but 500ms overall latency is sort of the
| gold standard. For the AI in KFC drive-thrus in AU we're
| trialing techniques that bring it much closer to the human style
| of interacting. This includes interrupts, either when useful or
| by accident - as good voice activity detection also has a bit
| of latency.
| varispeed wrote:
| > AI in KFC drive thrus
|
| That right here is an anxiety trigger and would make me skip
| the place.
|
| Nothing ruins the day like arguing with a
| robot who keeps misinterpreting what you said.
| coolspot wrote:
| They have a fallback to a human operator when stopwords
| and/or stop conditions are detected.
| awesome_dude wrote:
| That right here is an anxiety trigger and would make me
| skip the place.
|
| Nothing ruins the day like arguing with a
| HUMAN OPERATOR who keeps misinterpreting what you said.
|
| :-)
| amelius wrote:
| Maybe talk to the chicken operator then.
| r0fl wrote:
| Great insights. When I have a conversation with another person,
| sometimes they cut me off when they are trying to make a point.
| I have talked to ChatGPT and Grok at length (hours of
| brainstorming, learning things, etc.) and the AI has never
| interrupted aggressively to try to make a point stick better.
| varispeed wrote:
| This silence detection is what makes me unable to chat with AI.
| It is not natural and creates pressure.
|
| True AI chat should know when to talk based on conversation and
| not things like silence.
|
| Voice-to-text also strips a lot of context from the
| conversation.
| modeless wrote:
| My take on this is that voice AI has not truly arrived until it
| has mastered the "Interrupting Cow" benchmark.
| woodson wrote:
| Human-to-human conversational patterns are highly specific to
| cultural and contextual aspects. Sounds like I'm stating the
| obvious, but developers regularly disregard that and then
| wonder why things feel unnatural for users. The "median delay"
| may not be the most useful thing to look at.
|
| To properly learn more appropriate delays, it can be useful to
| find a proxy measure that can predict when a response
| can/should be given. For example, look at Kyutai's use of
| change in perplexity in predictions from a text translation
| model for developing simultaneous speech-to-speech translation
| (https://github.com/kyutai-labs/hibiki).
| Reason077 wrote:
| The best, most human-like AI voice chat I've seen yet is Sesame
| (www.sesame.com). It has delays, but fills them very naturally
| with normal human speech nuances like "hmmm", "uhhh", "hold on
| while I look that up" etc. If there's a longer delay it'll even
| try to make a bit of small talk, just like a human conversation
| partner might.
| com2kid wrote:
| A lot of better techniques than pure silence detection exist
| nowadays:
|
| 1. A special model that predicts when a conversation turn is
| coming up (e.g. when someone is going to stop speaking). Speech
| has a rhythm to it and pauses / ends of speech are actually
| predictable.
|
| 2. Generate a model response for every subsequent word that
| comes in (and throw away the previously generated response), so
| your time to speak after some other detection fires is
| basically zero.
|
| 3. Ask an LLM what it thinks the odds are that the user is done
| talking, and if the probability is high, reduce the delay timer.
| (The linked repo does this; see the sketch below.)
|
| I don't know of any up-to-date models for #1, but I haven't
| checked in over a year.
|
| Tl;Dr the solution to problems involving AI models is more AI
| models.
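|
| For #3, a rough sketch of the idea (the prompt, model name,
| threshold and the local Ollama endpoint are all illustrative):
|
| ```python
| # Sketch: ask a local LLM how likely the user is finished, and
| # shorten the silence timeout when it's confident.
| import requests
|
|
| def turn_complete_probability(partial_transcript: str) -> float:
|     prompt = (
|         "Rate from 0 to 1 how likely it is that this utterance is a "
|         "finished conversational turn. Reply with only the number.\n\n"
|         f"Utterance: {partial_transcript!r}"
|     )
|     resp = requests.post(
|         "http://localhost:11434/api/generate",
|         json={"model": "llama3.2", "prompt": prompt, "stream": False},
|         timeout=10,
|     )
|     try:
|         return float(resp.json()["response"].strip())
|     except ValueError:
|         return 0.5          # unparseable answer -> neutral guess
|
|
| def silence_timeout(partial_transcript: str) -> float:
|     # Confident the turn is over -> respond fast; unsure -> wait longer.
|     return 0.25 if turn_complete_probability(partial_transcript) > 0.8 else 1.0
| ```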
| wyager wrote:
| > The median delay between speakers in a human to human
| conversation is zero milliseconds. In other words, about 1/2
| the time, one speaker interrupts the other, making the delay
| negative.
|
| Fascinating. I wonder if this is some optimal information-
| theoretic equilibrium. If there's too much average delay, it
| means you're not preloading the most relevant compressed
| context. If there's too little average delay, it means you're
| wasting words.
| fintechie wrote:
| Quite good, it would sound much better with SOTA voices though:
|
| https://github.com/nari-labs/dia
| koljab wrote:
| Dia is too slow; I need a time-to-first-audio-chunk of ~100
| milliseconds. Also, generations fail too often (artifacts, etc.).
| thamer wrote:
| Does Dia support configuring voices now? I looked at it when it
| was first released, and you could only specify [S1] [S2] for
| the speakers, but not how they would sound.
|
| There was also a very prominent issue where the voices would be
| sped up if the text was over a few sentences long; the longer
| the text, the faster it was spoken. One suggestion was to split
| the conversation into chunks with only one or two "turns" per
| speaker, but then you'd hear two voices then two more, then two
| more... with no way to configure any of it.
|
| Dia looked cool _on the surface_ when it was released, but it
| is only a demo for now and not at all usable for any real use
| case, even for a personal app. I'm sure they'll get to these
| issues eventually, but most comments I've seen so far
| recommending it are from people who have not actually used it,
| or they would know of these major limitations.
| dcreater wrote:
| Does the docker container work on Mac?
| koljab wrote:
| I doubt TTS will be fast enough for realtime without an Nvidia
| GPU.
| cannonpr wrote:
| Kind of surprised nobody has brought up
| https://www.sesame.com/research/crossing_the_uncanny_valley_...
|
| It interacts nearly like a human, can and does interrupt me once
| it has enough context in many situations, and has exceedingly low
| latency. Using it for the first time was a fairly shocking
| experience for me.
| varispeed wrote:
| Didn't expect it to be that good! Nice.
| briga wrote:
| I'm starting to feel like LLMs need to be tuned for shorter
| responses. For every short sentence you give them, they output
| paragraphs of text. Sometimes it's even good text, but not every
| input sentence needs a mini-essay in response.
|
| Very cool project though. Maybe you can fine-tune the prompt to
| change how chatty your AI is.
| tintor wrote:
| After an interrupt, the unspoken words from the LLM are still in
| the chat window. Is the LLM even aware that it was interrupted,
| and where exactly?
| lacoolj wrote:
| Call me when the AI can interrupt YOU :)
| kabirgoel wrote:
| This is great. Poking into the source, I find it interesting that
| the author implemented a custom turn detection strategy, instead
| of using Silero VAD (which is standard in the voice agents
| space). I'm very curious why they did it this way and what
| benefits they observed.
|
| For folks that are curious about the state of the voice agents
| space, Daily (the WebRTC company) has a great guide [1], as well
| as an open-source framework that allows you to build AI voice
| chat similar to OP's with lots of utilities [2].
|
| Disclaimer: I work at Cartesia, which serves a lot of these
| voice agent use cases, and Daily is a friend.
|
| [1]: https://voiceaiandvoiceagents.com
| [2]: https://docs.pipecat.ai/getting-started/overview
| bufferoverflow wrote:
| It's fast, but it doesn't sound good. Many voice chat AIs are way
| ahead and sound natural.
___________________________________________________________________
(page generated 2025-05-05 23:00 UTC)