[HN Gopher] Show HN: Local voice assistant using Ollama, transfo...
       ___________________________________________________________________
        
       Show HN: Local voice assistant using Ollama, transformers and Coqui
       TTS toolkit
        
       Author : mezba
       Score  : 141 points
       Date   : 2024-06-20 22:48 UTC (1 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | aftbit wrote:
       | Looks interesting! Is the latency low enough for it to feel
       | natural? How's the Coqui speech quality?
        
         | lelag wrote:
          | It supports XTTSv2, which is currently the open-weight state of
          | the art. So, pretty damn good
          | (https://huggingface.co/coqui/XTTS-v2/blob/main/samples/en_sa...).
         | 
         | Too bad that the project is in limbo after Coqui (the company)
         | folded. The license limits the use of the weights to non-
         | commercial usage unless you buy a commercial license, and
         | there's nobody left to sell you one now.
        
           | qup wrote:
            | Is there anyone to sue you? (How does that work?)
        
             | nmstoker wrote:
             | I don't know the details in this case, but it seems
             | plausible that someone still owns the IP and thus might be
             | in a position to initiate legal proceedings.
        
               | qup wrote:
               | That same person could update the license, no?
        
           | NikkiA wrote:
            | Honestly, I don't think that sounds as human as Piper does,
            | but that's probably a function of the voice model files more
            | than anything. For example, the en_US 'amy' voice sounds
            | artificial, but hfc_female sounds more realistic in the Piper
            | samples.
           | 
           | https://rhasspy.github.io/piper-samples/
        
         | jsemrau wrote:
         | When I gave "Matt", my loyal local assistant[1], a voice xTTSv2
         | performed better for long form text. While in longform emotions
         | seemed well balanced in the text, in short replies the emotion
         | patterns frequently felt off and therefore unnatural. What I
         | liked about xTTsv2 though is that voice cloning is fairly easy
         | by just providing a .wav file with the intended voice pattern.
         | 
         | [1]https://open.substack.com/pub/jdsemrau/p/teaching-your-
         | agent...
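          | 
          | A minimal sketch of that cloning workflow with the Coqui TTS
          | Python API (the model name follows the Coqui docs; file paths
          | are placeholders):
          | 
          |     from TTS.api import TTS  # pip install TTS
          | 
          |     # Load the multilingual XTTS v2 model (weights download
          |     # on first use).
          |     tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
          | 
          |     # Clone the voice from a short reference clip and write
          |     # the synthesized reply to a file.
          |     tts.tts_to_file(
          |         text="Hello, I am your local assistant.",
          |         speaker_wav="reference_voice.wav",  # placeholder
          |         language="en",
          |         file_path="reply.wav",
          |     )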
        
           | pclmulqdq wrote:
            | XTTS is notoriously bad at generating short samples. It will
            | also hallucinate if you give it something short enough.
        
       | sleight42 wrote:
        | Ok, I need this, but with Majel Barrett cloned as the voice of
        | the Enterprise computer.
        
         | gavmor wrote:
          | Trivially done with a minute-long .wav file. Simply specify the
          | source sample in your june-va config.json.
        
       | xan_ps007 wrote:
        | We have made an open-source orchestration layer that lets you
        | plug in your own TTS/ASR/LLM for end-to-end voice conversations:
        | https://github.com/bolna-ai/bolna
        | 
        | We are also working on a complete open-source stack for
        | ASR+TTS+LLM and will be releasing it shortly.
        
         | bmicraft wrote:
          | Have you thought about supporting the Wyoming protocol? That
          | would make it pretty much plug-and-play with Home Assistant.
        
           | nmstoker wrote:
            | Hadn't heard of the Wyoming protocol before, but it's
            | interesting; thanks for mentioning it.
            | 
            | For others who also hadn't heard of it, here's an overview:
            | https://github.com/rhasspy/rhasspy3/blob/master/docs/wyoming...
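            | 
            | In short, Wyoming is a small peer-to-peer protocol in which
            | each event is a JSON header on its own line, optionally
            | followed by binary payload bytes such as PCM audio. A rough
            | sketch of sending a synthesize request over TCP (host, port,
            | and event fields are assumptions; the linked docs are the
            | actual spec):
            | 
            |     import json
            |     import socket
            | 
            |     # Connect to a hypothetical Wyoming TTS service.
            |     sock = socket.create_connection(("localhost", 10200))
            | 
            |     # One event = one JSON header line; a payload_length
            |     # field, when present, says how many raw bytes follow.
            |     event = {"type": "synthesize",
            |              "data": {"text": "Hello from Home Assistant"}}
            |     sock.sendall((json.dumps(event) + "\n").encode("utf-8"))
            | 
            |     # Replies come back the same way: an "audio-start"
            |     # header, then "audio-chunk" headers each followed by
            |     # raw PCM payload bytes, then "audio-stop".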
        
       | underlines wrote:
        | Honestly, there are so many projects on GitHub doing STT -> LLM ->
        | TTS that I've lost count. The only revolutionary thing that feels
        | like magic is when the STT supports voice activity detection and
        | LLM inference is low-latency (e.g. on Groq), so conversations feel
        | natural.
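        | 
        | For the VAD half, one common approach is frame-level detection
        | with webrtcvad; a small sketch (frame size and aggressiveness
        | are illustrative):
        | 
        |     import webrtcvad  # pip install webrtcvad
        | 
        |     vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) to 3
        |     SAMPLE_RATE = 16000     # 8/16/32/48 kHz mono supported
        |     FRAME_MS = 30           # frames must be 10, 20, or 30 ms
        |     FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit
        | 
        |     def is_speech(frame: bytes) -> bool:
        |         # True if this PCM frame appears to contain speech.
        |         return vad.is_speech(frame, SAMPLE_RATE)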
        
         | xan_ps007 wrote:
          | What we have learned is that big enterprises do not really want
          | to use closed-source models, because random bursts in usage
          | can run up their bills.
        
       | replete wrote:
       | I tried a similar project out last week, which uses Ollama,
       | FastWhisperAPI, and MeloTTS:
       | https://github.com/PromtEngineer/Verbi
       | 
        | Docker is a great option if you want lots of people to try out
        | your project, but not many apps in this space come with a
        | Dockerfile.
        
       | replete wrote:
       | How does the STT compare to Fastwhisper?
        
       | Gryph0n77 wrote:
        | How many GB of RAM does the model require?
        
       | modeless wrote:
       | Coqui's XTTSv2 is good for this because it has a streaming mode.
       | I have my own version of this where I got ~500ms end-to-end
       | response latency, which is much faster than any other open source
       | project I've seen. https://github.com/jdarpinian/chirpy
       | 
        | These are easy to make and fun to play with, and it's awesome to
        | have everything local. But it will take more to build something
        | truly usable. A truly natural conversational AI needs to
       | understand the nuances of conversation, most importantly when to
       | speak and when to wait. It also needs to know subtleties of the
       | user's voice that no speech recognizer can output, and it needs
       | control over the output voice more precise than any TTS provides.
       | Audio-to-audio models in the style of GPT-4o are clearly the way
       | forward. (And someday soon, video-to-video models for video
       | calling with a virtual avatar. And the step after that is
       | robotics for physical avatars).
       | 
       | There aren't any open source audio-to-audio models yet but there
       | are some promising approaches. https://ultravox.ai has the input
       | half at least. https://tincans.ai/slm has a cool approach too.
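        | 
        | For reference, the streaming path looks roughly like this with
        | the Coqui TTS low-level API (checkpoint paths, the reference
        | clip, and the play() helper are placeholders):
        | 
        |     from TTS.tts.configs.xtts_config import XttsConfig
        |     from TTS.tts.models.xtts import Xtts
        | 
        |     # Load the XTTS v2 checkpoint (paths are placeholders).
        |     config = XttsConfig()
        |     config.load_json("xtts_v2/config.json")
        |     model = Xtts.init_from_config(config)
        |     model.load_checkpoint(config, checkpoint_dir="xtts_v2/")
        |     model.cuda()
        | 
        |     # Condition on a short reference clip once, then stream
        |     # audio chunks as they are generated instead of waiting
        |     # for the full utterance.
        |     latents, speaker_emb = model.get_conditioning_latents(
        |         audio_path=["reference_voice.wav"]
        |     )
        |     for chunk in model.inference_stream(
        |         "Streaming keeps perceived latency low.",
        |         "en",
        |         latents,
        |         speaker_emb,
        |     ):
        |         play(chunk)  # hypothetical playback of each chunk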
        
         | zkstefan wrote:
         | > There aren't any open source audio-to-audio models yet
         | 
          | I think that's not true. See this, for example:
          | https://huggingface.co/facebook/seamless-m4t-v2-large
          | It's not general-purpose like GPT-4o, but translation still
          | seems pretty useful.
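          | 
          | A rough sketch of speech-to-speech translation with it via
          | transformers (class and argument names as given on the model
          | card; the input clip is a placeholder):
          | 
          |     import torchaudio
          |     from transformers import AutoProcessor, SeamlessM4Tv2Model
          | 
          |     name = "facebook/seamless-m4t-v2-large"
          |     processor = AutoProcessor.from_pretrained(name)
          |     model = SeamlessM4Tv2Model.from_pretrained(name)
          | 
          |     # Load a clip, resample to the 16 kHz the model expects,
          |     # and translate the speech directly into French audio.
          |     audio, rate = torchaudio.load("speech.wav")
          |     audio = torchaudio.functional.resample(audio, rate, 16000)
          |     inputs = processor(audios=audio, return_tensors="pt")
          |     out = model.generate(**inputs, tgt_lang="fra")[0]
          |     waveform = out.cpu().numpy().squeeze()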
        
           | modeless wrote:
           | I don't think SeamlessM4T qualifies as an end-to-end audio-
           | to-audio model. The paper states "the task of speech-to-
           | speech translation in SeamlessM4T v2 is broken down into
           | speech-to-text translation (S2TT) and then text-to-unit
           | conversion (T2U)". And while language translation is an
           | important application as you mention, it's strictly limited
           | to that. It wouldn't understand or produce non-speech audio
           | (e.g. singing, music, environmental sounds, etc) and you
           | can't have a conversation with it.
        
       | m3kw9 wrote:
        | How long until we get a standalone OS that makes AI usage a
        | first-class citizen?
        
       | skenderbeu wrote:
        | My very first multimodal AI star on GitHub. Hope we see more of
        | these in the future.
        
       | wkat4242 wrote:
        | I currently use Ollama + Open WebUI for this. It also has a really
        | serviceable voice mode, and it has many options like RAG
        | integration, custom models, memories to get to know you better,
        | vision, a great web interface, etc. But I'll have a look at this
        | thing.
        
       ___________________________________________________________________
       (page generated 2024-06-21 23:02 UTC)