[HN Gopher] Show HN: Open-source, native audio turn detection model
___________________________________________________________________
Show HN: Open-source, native audio turn detection model
Our goal with this project is to build a completely open-source,
state-of-the-art turn detection model that can be used in any voice
AI application.

I've been experimenting with LLM voice conversations since GPT-4 was
first released. (There's a previous front-page Show HN about Pipecat,
the open-source voice AI orchestration framework I work on. [1]) It's
been almost two years, and for most of that time, I've been expecting
that someone would "solve" turn detection. We all built initial,
pretty good 80/20 versions of turn detection on top of VAD (voice
activity detection) models. And then, as an ecosystem, we kind of got
stuck.

A few production applications have recently started using Gemini 2.0
Flash to do context-aware turn detection. [2] But because latency is
~500ms, that's a more complicated approach than using a specialized
model. The team at LiveKit released an open-weights model that does
text-based turn detection. [3] I was really excited to see that, but
I'm not super-optimistic that a text-input model will ever be good
enough for this task. (A good rule of thumb in deep learning is that
you should bet on end-to-end.)

So ... I spent Christmas break training several little
proof-of-concept models and experimenting with generating synthetic
audio data. So, so, so much fun. The results were promising enough
that I nerd-sniped a few friends and we started working in earnest on
this.

The model now performs really well on a subset of turn detection
tasks. Too well, really. We're overfitting on a not-terribly-broad
initial data set of about 8,000 samples. Getting to this point was
the initial bar we set for doing a public release and seeing if other
people want to get involved in the project. There are lots of ways to
contribute. [4]

Medium-term goals for the project are:

  - Support for a wide range of languages
  - Inference time of <50ms on GPU and <500ms on CPU
  - A much wider range of speech nuances captured in training data
  - A completely synthetic training data pipeline. (Maybe?)
  - Text conditioning of the model, to support "modes" like credit
    card, telephone number, and address entry

If you're interested in voice AI or in audio model ML engineering,
please try the model out and see what you think. I'd love to hear
your thoughts and ideas.

[1] https://news.ycombinator.com/item?id=40345696
[2] https://x.com/kwindla/status/1870974144831275410
[3] https://blog.livekit.io/using-a-transformer-to-improve-end-o...
[4] https://github.com/pipecat-ai/smart-turn#things-to-do
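
For reference, the VAD-plus-timeout baseline described above usually
looks something like the sketch below. This is a minimal illustration,
not code from this project: the vad object and its is_speech() method
are hypothetical stand-ins for a real VAD (e.g. Silero or WebRTC VAD),
and the timeout value is arbitrary.

  import time

  SILENCE_TIMEOUT_S = 0.8  # trailing silence before declaring the
                           # turn over (the classic 80/20 heuristic)

  class VadTurnDetector:
      def __init__(self, vad, silence_timeout_s=SILENCE_TIMEOUT_S):
          self.vad = vad                  # hypothetical VAD wrapper
          self.silence_timeout_s = silence_timeout_s
          self.last_speech_time = None

      def process_frame(self, audio_frame) -> bool:
          """Return True when the user's turn appears to be over."""
          now = time.monotonic()
          if self.vad.is_speech(audio_frame):   # hypothetical API
              self.last_speech_time = now
              return False
          if self.last_speech_time is None:
              return False                      # no speech yet
          return (now - self.last_speech_time) >= self.silence_timeout_s

A fixed silence threshold like this can't distinguish a finished
thought from a mid-sentence pause, which is the gap a context-aware
audio model is meant to close.
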
Author : kwindla
Score : 119 points
Date : 2025-03-06 18:20 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| zamalek wrote:
| As a [diagnosed] HF autistic person, this is unironically
| something I would go for in an earpiece. How many parameters is
| the model?
| kwindla wrote:
| 580M parameters. More info about the model architecture:
| https://github.com/pipecat-ai/smart-turn?tab=readme-ov-file#...
| cyberbiosecure wrote:
| 580m, awesome, incredible
| meltyness wrote:
| ... but will the model learn when to interrupt you out of
| frustration with your ongoing statements, and start shouting?
|
| it seems like for the obvious use-cases there might need to
| be some sort of limit on how much this component _knows_
| written-beyond wrote:
| Having reviewed a few turn-based models, your implementation is
| pretty much in line with them. Excited to see how this matures!
| kwindla wrote:
| Can you say more? There's not much open-source work in this
| domain that I've been able to find.
|
| I'm particularly interested in architecture variations,
| approaches to the classification head design and loss function,
| etc.
| remram wrote:
| Ok what's turn detection?
| ry167 wrote:
| Detecting when one participant in a conversation has finished
| talking.
|
| It's a big deal for handling human speech when interacting
| with LLM systems.
| kwindla wrote:
| Turn detection is deciding when a person has finished talking
| and expects the other party in a conversation to respond. In
| this case, the other party in the conversation is an LLM!
| remram wrote:
| Oh I see. Not like segmenting a conversation where people
| speak in turn. Thanks.
| whiddershins wrote:
| huh. how is analyzing conversations in the manner you
| described NOT the way to train such a model?
| remram wrote:
| Did you reply to the wrong comment? No one is talking
| about training here.
| password4321 wrote:
| Speaker diarization is also still a tough problem for free
| models.
| woodson wrote:
| It's often called endpoint detection (in ASR).
| lelag wrote:
| Yes, weird that they didn't use that term for this project.
| kwindla wrote:
| I've talked about this a lot with friends.
|
| Endpoint detection (and phrase endpointing, and end of
| utterance) are terms from the academic literature about
| this, and related, problems.
|
| Very few people who are doing "AI Engineering" or even
| "Machine Learning" today know these terms. In the past, I
| argued that we should use the existing academic language
| rather than invent new terms.
|
| But then OpenAI released the Realtime API and called this
| "turn detection" in their docs. And that was that. It no
| longer made sense to use any other verbiage.
| lelag wrote:
| Thanks for the explanation. I guess it makes some sense,
| considering many people with no NLP background are using
| those models now...
| foundzen wrote:
| I got most of my answers from the README. Well written. I read
| most of it. Can you share what kind of resources (and how many
| of them) were required to fine-tune Wav2Vec2-BERT?
| kwindla wrote:
| It takes about 45 minutes to do the current training run on an
| L4 GPU with these settings:
|
|       # Training parameters
|       "learning_rate": 5e-5,
|       "num_epochs": 10,
|       "train_batch_size": 12,
|       "eval_batch_size": 32,
|       "warmup_ratio": 0.2,
|       "weight_decay": 0.05,
|
|       # Evaluation parameters
|       "eval_steps": 50,
|       "save_steps": 50,
|       "logging_steps": 5,
|
|       # Model architecture parameters
|       "num_frozen_layers": 20
|
| I haven't seen a run do all 10 epochs recently. There's
| usually an early stop after about 4 epochs.
|
| The current data set size is ~8,000 samples.
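|
| For anyone curious how those settings map to code, here's a
| simplified sketch of fine-tuning a Wav2Vec2-BERT classifier with
| the Hugging Face Trainer. It is not the repo's actual training
| script: dataset loading is omitted, and the attribute path used
| for layer freezing may differ between transformers versions.
|
|       from transformers import (
|           Trainer,
|           TrainingArguments,
|           EarlyStoppingCallback,
|           Wav2Vec2BertForSequenceClassification,
|       )
|
|       def build_trainer(train_ds, eval_ds):
|           # Binary head: "turn complete" vs. "turn not complete".
|           model = Wav2Vec2BertForSequenceClassification.from_pretrained(
|               "facebook/w2v-bert-2.0", num_labels=2
|           )
|
|           # Freeze the lower encoder layers ("num_frozen_layers": 20)
|           # so only the top layers and the classifier are updated.
|           for layer in model.wav2vec2_bert.encoder.layers[:20]:
|               for p in layer.parameters():
|                   p.requires_grad = False
|
|           args = TrainingArguments(
|               output_dir="smart-turn-checkpoints",
|               learning_rate=5e-5,
|               num_train_epochs=10,
|               per_device_train_batch_size=12,
|               per_device_eval_batch_size=32,
|               warmup_ratio=0.2,
|               weight_decay=0.05,
|               eval_strategy="steps",  # "evaluation_strategy" on
|                                       # older transformers versions
|               eval_steps=50,
|               save_steps=50,
|               logging_steps=5,
|               load_best_model_at_end=True,
|               metric_for_best_model="eval_loss",
|               greater_is_better=False,
|           )
|
|           return Trainer(
|               model=model,
|               args=args,
|               train_dataset=train_ds,  # ~8,000 labeled samples
|               eval_dataset=eval_ds,
|               # Early stopping is why runs end around epoch 4.
|               callbacks=[EarlyStoppingCallback(
|                   early_stopping_patience=3)],
|           )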
| kwindla wrote:
| A couple of interesting updates today:
|
| - 100ms inference using CoreML:
| https://x.com/maxxrubin_/status/1897864136698347857
|
| - An LSTM model (1/7th the size) trained on a subset of the data:
| https://github.com/pipecat-ai/smart-turn/issues/1
| cyberbiosecure wrote:
| forking...
| prophesi wrote:
| I'd love for Vedal to incorporate this into Neuro-sama's model.
| An osu! bot turned AI VTuber [0].
|
| [0] https://www.youtube.com/shorts/eF6hnDFIKmA
| xp84 wrote:
| I'm excited to see this particular technology developing more.
| From the absolute worst speech systems, such as Siri, which
| will happily interrupt to respond with nonsense at the
| slightest half-pause, to even ChatGPT voice mode, which at
| least tries, we haven't yet gotten computers to do a good job
| of this, and I feel it may be the biggest obstacle to making
| 'agents' that are competent at completing simple but useful
| tasks. There are so many situations where humans "just know"
| when someone hasn't yet completed a thought, but "AI" still
| struggles, and those errors can destroy the efficiency of a
| conversation or, worse, lead to severe errors in function.
| lostmsu wrote:
| Does this support multiple speakers?
| kwindla wrote:
| In general, for realtime voice AI you don't _want_ this model
| to support multiple speakers because you have a separate voice
| input stream for each participant in a session.
|
| We're not doing "speaker diarization" from a single audio
| track, here. We're streaming the input from each participant.
|
| If there are multiple participants in a session, we still
| process each stream separately either as it comes in from that
| user's microphone (locally) or as it arrives over the network
| (server-side).
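|
| In pseudo-Python, the shape of it is roughly this (the
| predict_endpoint() call is a hypothetical stand-in for the
| model, not an API from the repo):
|
|       class TurnDetectionSession:
|           def __init__(self, model):
|               self.model = model
|               self.buffers = {}   # participant_id -> audio frames
|
|           def on_audio(self, participant_id, frame):
|               # One buffer per participant; the model only ever
|               # sees a single speaker's audio at a time.
|               buf = self.buffers.setdefault(participant_id, [])
|               buf.append(frame)
|               if self.model.predict_endpoint(buf):  # hypothetical
|                   self.handle_end_of_turn(participant_id)
|                   buf.clear()
|
|           def handle_end_of_turn(self, participant_id):
|               ...  # hand the completed turn to the LLM pipeline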
| pzo wrote:
| I will have a look at this. I played with Pipecat before and
| it's great; I switched to sherpa-onnx, though, since I need
| something that compiles to native and can run on edge devices.
|
| I'm not sure turn detection can really be solved short of a
| dedicated push-to-talk button, like on a walkie-talkie. I've
| often tried the Google Translate app, and the problem is that
| when you're speaking a longer sentence you often stop or slow
| down a little to gather your thoughts before continuing
| (especially if you're not a native speaker). For this reason I
| avoid conversation mode in apps like Google Translate, and in
| the Perplexity app I prefer the push-to-talk mode over the new
| one.
|
| I think this could be solved, but we would need not only
| low-latency turn detection but also low-latency interruption
| detection, plus a very fast, low-latency LLM on device. And if
| there is an interruption, good recovery, so the system knows to
| continue the last sentence instead of discarding the previous
| audio and starting over.
|
| Lots of things can also be improved around I/O latency: using a
| low-latency audio API, very short audio buffers, a dedicated
| audio category and mode (on iOS), using wired headsets instead
| of the built-in speaker, and turning off system processing like
| the iPhone's audio boosting or polar pattern selection. And
| streaming mode for all of STT, transport (when using a remote
| LLM), and TTS. I'm not sure we can have TTS in streaming mode;
| I think most of the time they split by sentence.
|
| I think push-to-talk is a good solution if well designed: a big
| button placed where it's easily reached with your thumb,
| integration with the iPhone action button, haptic feedback,
| using an Apple Watch as a big push button, etc.
___________________________________________________________________
(page generated 2025-03-07 23:01 UTC)