[HN Gopher] Show HN: Open source framework OpenAI uses for Advan...
___________________________________________________________________
Show HN: Open source framework OpenAI uses for Advanced Voice
Hey HN, we've been working with OpenAI for the past few months on
the new Realtime API. The goal is to give everyone access to the
same stack that underpins Advanced Voice in the ChatGPT app. Under
the hood it works like this:

- A user's speech is captured by a LiveKit client SDK in the
  ChatGPT app
- Their speech is streamed using WebRTC to OpenAI's voice agent
- The agent relays the speech prompt over WebSocket to GPT-4o
- GPT-4o runs inference and streams speech packets (over WebSocket)
  back to the agent
- The agent relays the generated speech using WebRTC back to the
  user's device

The Realtime API that OpenAI launched is the WebSocket interface to
GPT-4o. This backend framework covers the voice agent portion.
Besides having additional logic like function calling, the agent
fundamentally proxies WebRTC to WebSocket. The reason for this split
is that WebSocket isn't the best choice for client-server
communication: the vast majority of packet loss occurs between the
server and the client device, and WebSocket doesn't provide
programmatic control or intervention in lossy network environments
like WiFi or cellular. Packet loss leads to higher latency and
choppy or garbled audio.
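To make the proxy pattern concrete, here's a minimal sketch in
Python using asyncio and the websockets package. The WebRTC side is
stubbed out with hypothetical helpers (webrtc_audio_frames,
publish_to_webrtc); this isn't LiveKit's actual API, just the shape
of the relay:

    import asyncio
    import websockets  # pip install websockets

    async def webrtc_audio_frames():
        # Hypothetical source: PCM frames decoded from the user's
        # incoming WebRTC audio track.
        while True:
            await asyncio.sleep(0.02)
            yield b"\x00" * 960  # 20ms of 24kHz mono 16-bit silence

    def publish_to_webrtc(packet: bytes) -> None:
        # Hypothetical sink: re-packetize generated speech and send
        # it back to the user over their WebRTC connection.
        pass

    async def relay(model_ws_url: str) -> None:
        async with websockets.connect(model_ws_url) as ws:

            async def upstream():
                # User speech: WebRTC in -> WebSocket out to the model.
                async for frame in webrtc_audio_frames():
                    await ws.send(frame)

            async def downstream():
                # Model speech: WebSocket in -> WebRTC out to the user.
                async for packet in ws:
                    publish_to_webrtc(packet)

            await asyncio.gather(upstream(), downstream())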
Author : russ
Score : 238 points
Date : 2024-10-04 17:01 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mycall wrote:
| I wonder when Azure OpenAI will get this.
| davidz wrote:
| I'm working on a PR now :)
| gastonmorixe wrote:
| Nice they have many partners on this. I see Azure as well.
|
| There's a common perception that the new Realtime API is not
| actually using the same Advanced Voice model/engine - or however
| it works - since at least the TTS part doesn't seem to be as
| capable as the one shipped with the official OpenAI app.
|
| Any idea on this?
|
| Source: https://github.com/openai/openai-realtime-api-
| beta/issues/2
| russ wrote:
| It's using the same model/engine. I don't have knowledge of the
| internals, but there's a different subsystem/set of dedicated
| resources for API traffic versus first-party apps.
|
| One thing to note: there is no separate TTS phase here. Speech
| generation happens internally within GPT-4o, in both the Realtime
| API and Advanced Voice.
| gastonmorixe wrote:
| Thanks
| FanaHOVA wrote:
| Olivier, Michelle, and Romain gave you guys a shoutout like 3
| times in our DevDay recap podcast if you need more testimonial
| quotes :) https://www.latent.space/p/devday-2024
| russ wrote:
| I had no idea! <3 Thank you for sharing this, made my weekend.
| shayps wrote:
| You guys are honestly the best
| pj_mukh wrote:
| Super cool! Didn't realize OpenAI is just using LiveKit.
|
| Does the pricing work out to be the same as having an OpenAI
| Advanced Voice socket open the whole time? It's like $9/hr!
|
| It would theoretically be cheaper to use this without keeping the
| Advanced Voice socket open the whole time: just use the GPT-4o
| streaming service [1] whenever inference is needed (pay per
| token) and use LiveKit's other components to do the rest (TTS,
| VAD, etc.).
|
| What's the trade off here?
|
| [1]: https://platform.openai.com/docs/api-reference/streaming
| davidz wrote:
| Currently it does: all audio is sent to the model.
|
| However, we are working on turn detection within the framework,
| so you won't have to send silence to the model when the user
| isn't talking. It's a fairly straightforward path to cutting
| the cost by ~50%.
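| A rough sketch of the gating idea (is_speech() is a hypothetical
| stand-in for a real VAD such as Silero; this isn't the
| framework's actual implementation):
|
|     HANGOVER_FRAMES = 15  # keep ~300ms of audio after speech stops,
|                           # assuming 20ms frames
|
|     def gate_frames(frames, is_speech):
|         # Forward frames while speech is active plus a short tail;
|         # long runs of silence are dropped and never reach the model.
|         silent_run = HANGOVER_FRAMES
|         for frame in frames:
|             silent_run = 0 if is_speech(frame) else silent_run + 1
|             if silent_run <= HANGOVER_FRAMES:
|                 yield frame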
| rukuu001 wrote:
| Working on this for an internal tool - detecting no speech
| has been a PITA so far. Interested to see how you go with
| this.
| balloob wrote:
| Use the voice activity detector we wrote for Home
| Assistant. It works very well:
| https://github.com/rhasspy/pymicro-vad
| ValentinA23 wrote:
| What if I'm watching TV and use the AI to control it? It
| should only react to my voice (a problem I had that
| forced me to use a wake word).
| davidz wrote:
| Currently we are using Silero VAD to detect speech:
| https://github.com/livekit/agents/blob/main/livekit-
| plugins/...
|
| It works well for voice activity, though it doesn't always
| detect end-of-turn correctly (humans often pause mid-
| sentence to think). We are working on improving this
| behavior.
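| A simple end-of-turn heuristic on top of per-frame VAD output
| would look something like this (speech_prob() is a hypothetical
| stand-in for the per-frame probability; the thresholds are made
| up for illustration, not what the plugin ships):
|
|     FRAME_MS = 20
|     END_OF_TURN_SILENCE_MS = 800  # longer than a typical
|                                   # mid-sentence pause
|
|     def end_of_turn(frames, speech_prob, threshold=0.5):
|         heard_speech = False
|         silence_ms = 0
|         for frame in frames:
|             if speech_prob(frame) >= threshold:
|                 heard_speech = True
|                 silence_ms = 0
|             else:
|                 silence_ms += FRAME_MS
|             if heard_speech and silence_ms >= END_OF_TURN_SILENCE_MS:
|                 return True  # the user has likely finished their turn
|         return False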
| pj_mukh wrote:
| Can I currently put a VAD module in the pipeline and only
| send audio when there is an active conversation? Feels like
| just that would solve the problem?
| npace12 wrote:
| You don't get charged per hour with the OpenAI Realtime API,
| only for tokens from detected speech and the response
| willsmith72 wrote:
| That was cool, but got up to $1 usage real quick
| russ wrote:
| We had our playground (https://playground.livekit.io) up for a
| few days using our key. Def racked up a $$$$ bill!
| wordpad25 wrote:
| How much is it per minute of talking?
| shayps wrote:
| It shakes out to around $0.15 per minute for an average
| conversation. If history is a guide though, this will get a
| lot cheaper pretty quickly.
| cdolan wrote:
| This is cheaper than old cellular calls, inflation
| adjusted
| russ wrote:
| 50% human speaking at $0.06/minute of tokens
|
| 50% AI speaking at $0.24/minute of tokens
|
| we (LiveKit Cloud) charge ~$0.0005/minute for each
| participant (in this case there would be 2)
|
| So blended is $0.151/minute
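| Worked out with those numbers:
|
|     human_rate = 0.06    # $/min of audio tokens while the human speaks
|     ai_rate    = 0.24    # $/min of audio tokens while the AI speaks
|     livekit    = 0.0005  # $/min per participant on LiveKit Cloud
|
|     blended = 0.5 * human_rate + 0.5 * ai_rate + 2 * livekit
|     print(round(blended, 3))  # 0.151 -> ~$0.15 per conversation-minute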
| solarkraft wrote:
| That's some crazy marketing for an "our library happened to
| support this relatively simple use case" situation. Impressive!
|
| By the way: The cerebras voice demo _also_ uses LiveKit for this:
| https://cerebras.vercel.app/
| russ wrote:
| There's a ton of complexity under the "relatively simple use
| case" when you get to a global, 200M+ user scale.
| gastonmorixe wrote:
| 80% of the time I'm experiencing choppy audio on my iPhone
| 15 Pro Max (18.1b) in Voice Mode (Standard and Advanced). My
| internet connection is FTTH with a state-of-the-art WiFi 7
| router.
|
| I wonder if this is because of bugs or the crazy load LiveKit
| may be going through given the popularity of ChatGPT's voice
| modes right now.
| russ wrote:
| Doesn't sound right. I'd love to dig into this some more.
| Would you mind shooting me a DM on X? @dsa
| lolpanda wrote:
| So WebRTC helps with the unreliable network between mobile
| clients and the server side. If the application is backend-only,
| would it still make sense to use WebRTC, or should I go directly
| to the Realtime API?
| spuz wrote:
| Is there anyone besides OpenAI working on a speech-to-speech
| model? I find it incredibly useful and it's the sole reason that
| I pay for their service but I do find it very limited. I'd be
| interested to know if any other groups are doing research on
| voice models.
| Ey7NFZ3P0nzAe wrote:
| Yes. Kyutai released an open model called Moshi:
| https://github.com/kyutai-labs/moshi
|
| There's also llama-omni and a few others. None of them are even
| close to 4o from an LLM standpoint. But Moshi is called a
| "foundational" model and I'm hopeful it will be enhanced. Also,
| there's not yet support for these on most backends like
| llama.cpp / ollama, etc. So I'd say we're in a trough, but we'll
| get there.
| 0x1ceb00da wrote:
| When I asked advanced voice mode it said that it receives input
| as audio and generates text as output.
| mbrock wrote:
| It is mistaken because it has no particular insight into its
| own implementation. In fact the whole point is that it
| directly consumes and produces audio tokens with no text.
| That's why it's able to sing, make noises, do accents, and so
| on.
| russ wrote:
| There's Ultravox as well (from one of the creators of WebRTC):
| https://github.com/fixie-ai/ultravox
|
| Their model builds a speech-to-speech layer into Llama. Last I
| checked they have the audio-in part working and they're working
| on the audio-out piece.
| throw14082020 wrote:
| This is really helpful, thanks!
|
| OpenAI hired the ex-fractional CTO of LiveKit, who created Pion,
| a popular WebRTC library/tool.
|
| I'd expect OpenAI to migrate off of LiveKit within 6 months.
| LiveKit is too expensive. Also, WebRTC is hard, and OpenAI, now
| being a less open company, will want to keep improvements to
| itself.
|
| Not affiliated with any competitors, but I did work at a PaaS
| company similar to LiveKit that used WebSockets instead.
| fidotron wrote:
| > LiveKit is too expensive
|
| Most of it is open source, especially the clients, although
| they do feel quite ad hoc and hacked together (a possible side
| effect of WebRTC's evolution).
|
| Would totally agree on OpenAI moving away. The description of
| the agent here sounds like a big hack just to get around the
| fact that, for now, the model server expects audio over sockets
| instead.
| russ wrote:
| Which components feel ad hoc?
|
| In most real applications, the agent has additional logic
| (function calling, RAG, etc.) beyond simply relaying a stream
| to the model server. In those cases, you want it to be a
| separate service/component that can be independently scaled.
| fidotron wrote:
| Essentially I think the LiveKit value is an SFU that works,
| with signalling, and the SDKs exist. My experience is that
| people radically overstate how hard signalling is, and
| underestimate SFU complexity, especially with fast
| failover.
|
| In terms of being a higher-level API, it's arguably doomed
| to failure, thanks to the madness of the domain. (The part
| that sticks in my mind is audio device switching on
| Android.) WebRTC products seem to always end up with the
| consumer needing to know way more of the internals than is
| healthy. As such I think once you are sufficiently good at
| using LiveKit you are less likely to pick it for your next
| product because you will be able to roll your own far more
| easily. That is unless the value you were getting from it
| actually was the SFU infrastructure and not the SDKs.
|
| The OpenAI case is so point-to-point that doing WebRTC for
| that is, honestly, really not hard at all.
| russ wrote:
| You really don't need to know about WebRTC at all when
| you use LiveKit. That's largely thanks to the SDKs
| abstracting away all the complexity. Having good SDKs
| that work across every platform with consistent APIs is
| more valuable than the SFU imo. There are other options
| for SFUs and folks like Signal have rolled their own. Try
| to get WebRTC running on Apple Vision Pro or tvOS and let
| me know if that's no big deal.
| fidotron wrote:
| > Try to get WebRTC running on Apple Vision Pro or tvOS
| and let me know if that's no big deal.
|
| [EDIT: I probably shouldn't mention that]. I have some
| experience getting WebRTC up on new platforms, and
| it's not as bad as all that. libwebrtc is a remarkably
| solid library, especially given the domain it's in.
|
| I obviously do not share your opinion of the SDKs.
| russ wrote:
| Heh, actually I'm pretty sure I've come across your X
| profile before. :) You're definitely in a small minority
| of folks with a deep(er) understanding of WebRTC.
| russ wrote:
| Field CTO -- hi @Sean-Der :wave:
|
| Fractional CTO sounds like a disaster lol
| 0x1ceb00da wrote:
| This suggests that the AI "brain" receives the user input as a
| text prompt (the agent relays the speech prompt to GPT-4o) and
| generates audio as output (GPT-4o streams speech packets back to
| the agent).
|
| But when I asked Advanced Voice Mode, it said the exact opposite:
| that it receives input as audio and generates text as output.
| meiraleal wrote:
| Who did you ask? ChatGPT? Not sure if you understand LLMs, but
| its knowledge is based on its training data; it can't reason
| about itself. It can only hallucinate in this case, sometimes
| correctly, most times incorrectly.
| hshshshsvsv wrote:
| This is also true for pretty much all humans, and bypassing
| this limitation is called enlightenment/self-realization.
|
| LLMs don't even have a self, so it can never be realized. Just
| the ego alone exists.
| TZubiri wrote:
| No, humans can self inspect just fine
| tempaccount420 wrote:
| How do you know that?
| mbrock wrote:
| A lot of psychologists would quibble with that...
| nialse wrote:
| Agreed. I am a psychologist.
| ada1981 wrote:
| Any evidence of that?
|
| Have you seen the current US political system? Or Hawk
| Tua?
| mbrock wrote:
| Both input and output are audio. This post is about bridging
| WebRTC audio I/O with an API that itself operates on simple TCP
| socket streams of raw PCM. For reliability and efficiency you
| want end users to connect with compressed, loss-tolerant, Zoom-
| style streams, and those go through a middleman that relays
| to the model API.
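| For a sense of what the model-side leg looks like: as I
| understand it, the Realtime API takes JSON events over the
| WebSocket with base64-encoded PCM16 audio. A sketch (treat the
| exact event and field names as illustrative):
|
|     import base64, json
|
|     def pcm_chunk_to_event(pcm16: bytes) -> str:
|         # Wrap a raw PCM16 chunk (decoded from the user's WebRTC
|         # stream) as an audio-append event for the model socket.
|         return json.dumps({
|             "type": "input_audio_buffer.append",
|             "audio": base64.b64encode(pcm16).decode("ascii"),
|         })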
| racecar789 wrote:
| Imagine being able to tell an app to call the IRS during the day,
| endure the on-hold wait times, then ask the question to the IRS
| rep and log the answer. Then deliver the answer when you get
| home.
|
| Or, have the app call a pharmacy every month to refill
| prescriptions. For some drugs, the pharmacy requires a manual
| phone call to refill which gets very annoying.
|
| So many use cases for this.
| TZubiri wrote:
| As the cost of humanlike communication decreases, so will the
| cost of Sybil attacks and spam.
|
| The IRS is notorious for resistance to tech change; don't be
| surprised if they unplug the phones and force you to walk in to
| ask your question.
|
| What is the value-add here? Saving some time for technocrats and
| techno-adjacents for a whole 3 years before victims of spam
| adapt?
|
| Also, this has been solved already: just mail your question like
| the rest of us mortals.
| ensignavenger wrote:
| It would be really nice if the IRS would ALLOW you to walk in
| and ask a question!
| daveguy wrote:
| That is very expensive: offices all around the country with
| personnel. We are going to have to fund them instead of
| griping about them to get that to happen.
| ensignavenger wrote:
| Yeah, that is why I doubt it will happen. Maybe a website
| where you can submit an issue and have it resolved in a
| reasonable number of days would be fine.
| andrew_eu wrote:
| Years ago my tax return was flagged as a possible fraud
| case -- I believe a direct consequence of a big data
| breach. I had to go into my "local" IRS office and present
| my passport to prove it was indeed me. Decidedly not nice.
|
| True to form, even with an appointment I waited 3 hours at
| the office and watched the guard staff turn away countless
| people. Finally saw a person, gave them my passport, and
| finished in a minute.
| ensignavenger wrote:
| I am going through that right now. The IRS owes me 3 years of
| refunds, but I can't even get an appointment to see them.
| They hang up on me when I call (after hours on hold), and
| won't let me just visit the local office. My current
| attempt is to work with my US Senator's office.
| beeboobaa3 wrote:
| They'll just put up "captchas" or whatever.
|
| The point of phone lines is to waste the client's time, not to
| have the client waste _their_ time.
| fosheezy wrote:
| We do this exact thing at getvibrato.com. You can schedule
| calls like these, or even do more advanced automation with
| Zapier.
___________________________________________________________________
(page generated 2024-10-05 23:01 UTC)