[HN Gopher] Show HN: An open source framework for voice assistants
___________________________________________________________________
Show HN: An open source framework for voice assistants
I've been obsessed for the past ~year with the possibilities of
talking to LLMs. I built a bunch of one-off prototypes, shared code
on X, started a Meetup group in SF, and co-hosted a big hackathon.

It turns out that there are a few low-level problems that everybody
building conversational/real-time AI needs to solve on the way to
building/shipping something that works well: low-latency media
transport, echo cancellation, voice activity detection, phrase
endpointing, pipelining data between models/services, handling
voice interruptions, and swapping out different models/services.

On the theory that something like a LlamaIndex or LangChain for
real-time/conversational AI would be useful, a few of us started
working on a Python library for voice (and multimodal) AI
assistants/agents. So ... Pipecat: a framework for building things
like personal coaches, meeting assistants, story-telling toys for
kids, customer support bots, virtual friends, and snarky social
bots.

Most of the core contributors to Pipecat so far work together at
our day jobs. This has been a kind of "20% time" thing at our
company. But we're serious about welcoming all contributions. We
want Pipecat to support any and all models, services, transport
layers, and infrastructure tooling.

If you're interested in this stuff, please check it out and let us
know what you think. Submit PRs. Become a maintainer. Join the
Discord. Post cool stuff. Post funny stuff when your voice agent
goes completely off the rails (as mine sometimes do).
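
To make the shape of that cascade concrete, here's a minimal
asyncio sketch of the speech-to-text -> LLM -> text-to-speech
pipeline described above. The service functions are stubs standing
in for real providers, and none of these names are Pipecat's actual
API; the point is just that every stage runs concurrently and
consumes partial output from the stage before it:

    import asyncio

    # Stub services standing in for real STT / LLM / TTS providers.
    def transcribe(frame: bytes) -> str:
        return frame.decode(errors="ignore")

    async def stream_tokens(prompt: str):
        for token in ("Sure,", " here", " you", " go."):
            yield token

    def synthesize(text: str) -> bytes:
        return text.encode()

    async def stt_stage(audio_in, text_out):
        # Turn audio frames into transcript phrases as they arrive.
        while (frame := await audio_in.get()) is not None:
            text_out.put_nowait(transcribe(frame))
        text_out.put_nowait(None)

    async def llm_stage(text_in, reply_out):
        # Start streaming tokens as soon as a phrase endpoint lands.
        while (phrase := await text_in.get()) is not None:
            async for token in stream_tokens(phrase):
                reply_out.put_nowait(token)
        reply_out.put_nowait(None)

    async def tts_stage(reply_in):
        # Synthesize and "play" each fragment without waiting for
        # the full LLM response -- this is what keeps latency low.
        while (token := await reply_in.get()) is not None:
            print("play:", synthesize(token))

    async def main():
        audio_in, text, reply = (asyncio.Queue() for _ in range(3))
        audio_in.put_nowait(b"what's the weather?")
        audio_in.put_nowait(None)  # end-of-stream sentinel
        await asyncio.gather(stt_stage(audio_in, text),
                             llm_stage(text, reply),
                             tts_stage(reply))

    asyncio.run(main())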
Author : kwindla
Score : 171 points
Date : 2024-05-13 17:21 UTC (5 hours ago)
| canadiantim wrote:
| Very cool, great work! I can def see myself using this when I
| start building in that direction.
| bamazizi wrote:
| I wonder how the just announced "GPT-4o" with real-time voice
| impacts projects like this?
|
| The demo of real-time multi-language conversation translation
| blew me away!
| avarun wrote:
| Yeah, same question here.
|
| Building pipelines that bridge LLMs, TTS, and STT models with
| lower latency is fine and all, but compared to a natively
| multimodal model like GPT-4o it seems strictly inferior. The
| future is clearly voice-native models that are able to understand
| nuances in voice and speech patterns, and it's not exactly a
| distant future.
| kwindla wrote:
| Here's a translation demo in Pipecat using the now ancient and
| arthritic GPT-4 Turbo model. :-) https://github.com/pipecat-
| ai/pipecat/tree/main/examples/tra...
|
| As soon as GPT-4o audio input is available through the APIs,
| we'll add 4o support to Pipecat. For bidirectional real-time
| audio, I think they'll need to make new WebSocket or WebRTC
| endpoints available.
| jshreder wrote:
| Just letting you know it's available right now; just specify
| `gpt-4o` -- for text streaming, anyway. I'd hazard a guess
| that the audio endpoints are open now, just not documented
| (like most of their recent launches)...
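|
| For anyone who wants to try that: a minimal text-streaming
| sketch with the official OpenAI Python SDK (v1.x) -- only the
| model string changes versus earlier GPT-4 calls:
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
|     # Stream a chat completion from the newly released model.
|     # Only text I/O is exposed here; audio isn't documented yet.
|     stream = client.chat.completions.create(
|         model="gpt-4o",
|         messages=[{"role": "user",
|                    "content": "Say hi in five words."}],
|         stream=True,
|     )
|     for chunk in stream:
|         delta = chunk.choices[0].delta.content
|         if delta:
|             print(delta, end="", flush=True)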
| kwindla wrote:
| Yeah, seems to be a drop-in replacement for the existing
| inference APIs. But I haven't found any docs yet for
| streaming audio/video input.
| ilaksh wrote:
| This is great, but we really need an open source audio-to-audio
| model like the one they demoed. Does anyone know of anything
| like that?
|
| Edit: someone found one:
| https://news.ycombinator.com/item?id=40346992
| makeitmore wrote:
| Most of the Pipecat examples we've been working on are focused
| on speech-to-speech. The examples guide you through how to do
| that (or you can give the hosted storytelling example a try:
| https://storytelling-chatbot.fly.dev/)
|
| We should probably update the example in the README to better
| represent that, thank you!
| ilaksh wrote:
| Your project is amazing and I'm not trying to take away from
| what you have accomplished.
|
| But... I looked at the code and didn't see any audio-to-audio
| service or model. Can you link to an example of that?
|
| I don't mean speech to text to LLM to text to speech. I mean
| speech-to-speech directly, as in the ML model takes audio as
| input and outputs audio. As they have now in OpenAI.
|
| I am very familiar with the typical multi-model workflow and
| have implemented it several times.
| kwindla wrote:
| An audio-to-audio model is definitely a step forward. And I do
| think that's where things are going to go, generally speaking.
|
| For context relating to real-time voice AI: once you're down
| below ~800ms things are fast enough to feel naturally
| responsive for most people and use cases.
|
| The GPT-4o announcement page says they average ~320ms time to
| first token from an audio prompt. Which is definitely next
| level and is really, really exciting. You can't get to 800ms
| with any pipeline that includes GPT-4 Turbo today, so this is a
| big deal.
|
| It's possible to do ~500ms time to first token by pipelining
| today's fastest transcription, inference, and TTS models. (For
| example, Deepgram transcription, Groq Llama-3, Deepgram Aura
| voices.)
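|
| To make that arithmetic concrete, here's a back-of-the-envelope
| budget. The per-stage numbers below are illustrative
| assumptions, not measurements, but they show how a cascaded
| pipeline lands near ~500ms and stays under the ~800ms mark:
|
|     # Illustrative time-to-first-response-audio budget for a
|     # cascaded pipeline; every per-stage figure is an assumption.
|     budget_ms = {
|         "audio transport + jitter buffer": 60,
|         "phrase endpointing (waiting for silence)": 200,
|         "streaming transcription finalization": 80,
|         "LLM time to first token (fast host, e.g. Groq)": 100,
|         "TTS time to first audio byte": 80,
|     }
|     for stage, ms in budget_ms.items():
|         print(f"{stage:48s} {ms:4d} ms")
|     print(f"{'total (vs. ~800 ms responsiveness limit)':48s} "
|           f"{sum(budget_ms.values()):4d} ms")   # ~520 ms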
| ilaksh wrote:
| I'm familiar with Deepgram, Groq, and ElevenLabs. I have
| recently built something on those, and it's really not too bad
| as far as latency. But OpenAI has shown that audio-to-audio
| can't be beat.
| tempusalaria wrote:
| Every opening phrase is a platitude like "sure, let's do it."
| So the OpenAI latency is probably higher; they are just using
| clever orchestration to generate some filler tokens that make
| the perceived latency lower. It's unlikely the initial response
| at OpenAI is coming from the main model.
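|
| If that theory is right, the trick is easy to sketch: kick off
| the slow generation immediately and speak a canned acknowledgment
| while it runs. A toy version (slow_model and play here are
| hypothetical stand-ins, not any vendor's API):
|
|     import asyncio, random
|
|     async def slow_model(prompt: str) -> str:
|         await asyncio.sleep(1.0)  # stand-in for model latency
|         return f"(full answer to: {prompt})"
|
|     async def play(text: str):
|         print("speak:", text)     # stand-in for TTS + playback
|
|     async def respond(prompt: str):
|         # Start the real generation first, then mask its latency
|         # with a filler; splice in the answer when it's ready.
|         answer = asyncio.create_task(slow_model(prompt))
|         await play(random.choice(["Sure, let's do it.",
|                                   "Okay, one sec."]))
|         await play(await answer)
|
|     asyncio.run(respond("plan my weekend"))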
| johnmaguire wrote:
| Siri came out in October 2011. Amazon Alexa made its debut in
| November 2014. Google Assistant's voice-activated speakers were
| released in May 2016.
|
| From what I can tell, Siri is still a dumpster fire that nobody
| is willing to use. And I have no personal experience with Alexa,
| so I can't speak to it. But I do have a few Google Home speakers
| and an Android phone, and I have seen no major improvements in
| years. In fact, it has gotten worse - for example, you can no
| longer add items directly to AnyList[0], only Google Keep.
|
| Or, as an incredibly simple example of something I thought we'd
| get a long time ago, it's still unable to interpret two-part
| requests, e.g. "please repeat that but louder," or "please turn
| off the kitchen and dining room lights."
|
| I find voice assistants very useful - especially when driving,
| lying in bed, cooking, or when I'm otherwise preoccupied. Yet
| they have stagnated almost since their debut. I can only imagine
| nobody has found a viable way to monetize them.
|
| What will it take to get a better voice assistant for consumers?
| Willow[1] doesn't seem to have taken off.
|
| [0] https://help.anylist.com/articles/google-assistant-overview/
|
| [1] https://heywillow.io/
|
| edit: I realize I hijacked your thread to dump something that's
| been on my mind lately. Pipecat looks really cool, and I hope it
| takes off! I hope to get some time to experiment this weekend.
| petemir wrote:
| > From what I can tell, Siri is still a dumpster fire that
| nobody is willing to use. I have no personal experience with
| Alexa, so I can't speak to it
|
| I use both (albeit more Alexa than Siri, and both just for a
| really limited set of functionality), and FWIW, I believe Alexa
| is worse than Siri. It can do two things at the same time,
| though (just like your example: "turn on X and turn off Y",
| "turn on X for Y seconds", and things like that).
|
| I also feel that it has gotten worse over the years. I read
| about the possibility of the microphones collecting dust and
| therefore capturing worse audio, so I got a dust blower (for
| other reasons, too), but it didn't solve anything.
|
| After listening in the app to what Alexa picks up (from an Echo
| and an Echo Dot, both 4th gen), I have to say that they use
| really shitty microphones. Furthermore, I have been testing
| Whisper extensively over the last month, with audio coming from
| low-quality sources, and I think a similar model would interpret
| my voice a lot better than whatever Amazon is using.
| sroussey wrote:
| For some activities, Siri is just fine. Things like "send a
| text to x" and "remind me to do x when I get home".
|
| And it does fine with no internet access.
|
| Except dictation. Much better with internet access than
| without.
| johnmaguire wrote:
| Those are about as basic an action as you can get. Every
| assistant supports them. But as soon as you want to know
| something like "how many teaspoons are in a cup," can Siri
| still handle it? What about "where is the aurora borealis
| visible tonight"?
|
| Another issue Siri used to struggle with was trying to play
| specific music on Spotify. Is that better these days?
| michaelmior wrote:
| I primarily use Google Home, but I do also have Echo Frames so
| I use Alexa semi-regularly. My use case is primarily home
| automation. In that scenario, I find Alexa to be much more
| responsive than Google Home. I do agree that it seems like
| Google Home has gotten worse in a number of ways. (As a happy
| AnyList user, I found that specific change frustrating.)
| magicalhippo wrote:
| > it's still unable to interpret two-part requests
|
| Our car has Google Assistant, and yeah that's annoying. Want to
| turn off steering wheel heater _and_ seat heater? Gotta do two
| individual requests.
|
| That said, it's actually quite nice to have voice control over
| these things. Especially when it's heavy traffic and snowing on
| top of the icy road, and you really want to have eyes on the
| traffic and both hands on the steering wheel.
| johnmaguire wrote:
| > That said, it's actually quite nice to have voice control
| over these things.
|
| Yes! I really think voice assistants are underrated. When I
| talk to iOS users, they have a much less favorable opinion of
| Siri (I've watched my partner give up on using it over the past
| 10 years), and given that iOS has dominant market share in the
| US, I suspect that's part of it.
|
| I also think there is just so much "low hanging fruit" that
| would drastically improve the experience. But I remember that
| even during "the race" for voice AI, everyone was wondering...
| how will they monetize this? And I'm not sure anyone ever truly
| figured that out.
| keb_ wrote:
| I have Alexa (Amazon Echo Show) and my use case is asking for a
| news briefing or the weather, playing music, or setting timers.
|
| Alexa is a dumpster fire and constantly getting dumber. She
| also completely disrespects your settings and will re-enable
| settings you have disabled. She constantly ignores my questions
| to ask me if I want to try some other new feature instead. She
| randomly decides to add news stations I have explicitly
| _removed_ from my Flash Briefing list.
|
| I am constantly baffled by how bad it is.
| awenix wrote:
| Nice to see an open source implementation, i have been seeing
| many startups get into this space like https://www.retellai.com/,
| https://fixie.ai/ etc. They always end up needing speech-to-
| speech models (current approach seems speech-text-text-speech
| with multiple agents handling 1 listening + 1 speaking), excited
| to see how this plays with recently announced gpt-4o
| kwindla wrote:
| Adding to your list: https://vapi.ai -- really nice tools.
|
| (I try to keep up with all the different layers/players in this
| space.)
| zkoch wrote:
| We're (fixie.ai) working on our SLM (speech language model).
| We'll release something soon to play with :)
| xan_ps007 wrote:
| We're also building Bolna, an open source voice orchestration
| framework: https://github.com/bolna-ai/bolna
| russ wrote:
| LiveKit Agents, which OpenAI uses in voice mode, is also open
| source:
|
| https://github.com/livekit/agents
___________________________________________________________________
(page generated 2024-05-13 23:00 UTC)