[HN Gopher] Show HN: Local voice assistant using Ollama, transfo...
___________________________________________________________________
Show HN: Local voice assistant using Ollama, transformers and Coqui
TTS toolkit
Author : mezba
Score : 141 points
Date : 2024-06-20 22:48 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| aftbit wrote:
| Looks interesting! Is the latency low enough for it to feel
| natural? How's the Coqui speech quality?
| lelag wrote:
| It supports XTTSv2, which is currently the open-weight state of
| the art. So, pretty damn good
| (https://huggingface.co/coqui/XTTS-v2/blob/main/samples/en_sa...).
|
| Too bad that the project is in limbo after Coqui (the company)
| folded. The license limits the use of the weights to non-
| commercial usage unless you buy a commercial license, and
| there's nobody left to sell you one now.
| qup wrote:
| is there anyone to sue you? (how does that work?)
| nmstoker wrote:
| I don't know the details in this case, but it seems
| plausible that someone still owns the IP and thus might be
| in a position to initiate legal proceedings.
| qup wrote:
| That same person could update the license, no?
| NikkiA wrote:
| Honestly, I don't think that sounds as human as Piper does, but
| that's probably a function of the voice model files more than
| anything; e.g., the en_US 'amy' voice sounds artificial, while
| hfc_female sounds more realistic in the Piper samples.
|
| https://rhasspy.github.io/piper-samples/
| jsemrau wrote:
| When I gave "Matt", my loyal local assistant[1], a voice, XTTSv2
| performed better for long-form text. In long-form output the
| emotions seemed well balanced, but in short replies the emotion
| patterns frequently felt off and therefore unnatural. What I
| liked about XTTSv2, though, is that voice cloning is fairly easy:
| just provide a .wav file with the intended voice pattern.
|
| [1] https://open.substack.com/pub/jdsemrau/p/teaching-your-agent...
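The clone-from-one-.wav flow described above can be sketched as follows. The helper below (pure stdlib) sanity-checks a reference clip before it is handed to the model; the actual Coqui TTS cloning call is shown commented out, since it downloads multi-gigabyte weights, and the file names are placeholders.

```python
import wave

def check_reference_clip(path, min_seconds=3.0):
    """Sanity-check a voice-cloning reference clip: mono, 16-bit PCM,
    and at least a few seconds long. Returns the duration in seconds."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
        if w.getnchannels() != 1:
            raise ValueError("reference clip should be mono")
        if w.getsampwidth() != 2:
            raise ValueError("reference clip should be 16-bit PCM")
        if duration < min_seconds:
            raise ValueError(f"clip too short: {duration:.1f}s < {min_seconds}s")
        return duration

# Hedged sketch of the cloning call itself (Coqui TTS Python API;
# requires the `TTS` package and a large model download, so it is
# left commented out -- file names are placeholders):
#
# from TTS.api import TTS
# tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
# tts.tts_to_file(text="Hello there.",
#                 speaker_wav="my_voice.wav",   # the reference clip
#                 language="en",
#                 file_path="cloned.wav")
```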
| pclmulqdq wrote:
| XTTS is notoriously bad at generating short samples. It will
| also hallucinate if the input is too short.
| sleight42 wrote:
| Ok, I need this but cloning Majel Barrett as the voice of the
| Enterprise computer.
| gavmor wrote:
| Trivially done with a minute-long wav file. Simply specify the
| source sample in your june-va config.json
| xan_ps007 wrote:
| We have built an open-source orchestration layer that lets you
| plug in your own TTS/ASR/LLM for end-to-end voice conversations:
| https://github.com/bolna-ai/bolna.
|
| We are also working on a complete open source stack for
| ASR+TTS+LLM and will be releasing it shortly.
| bmicraft wrote:
| Have you thought about supporting the Wyoming protocol? That
| would make it pretty much plug-and-play with Home Assistant.
| nmstoker wrote:
| Hadn't heard of the Wyoming protocol before, but it's
| interesting, thanks for mentioning it.
|
| For others who also hadn't heard of it, here's an overview:
| https://github.com/rhasspy/rhasspy3/blob/master/docs/wyoming...
| underlines wrote:
| Honestly, there are so many projects on GitHub doing STT -> LLM ->
| TTS that I've lost count. The only thing that feels revolutionary,
| like magic, is when the STT supports voice activity detection
| (VAD) and LLM inference is low-latency (e.g., on Groq), so
| conversations feel natural.
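Production stacks usually use a trained VAD (WebRTC VAD, Silero); purely as a toy illustration of the idea, the sketch below flags 16-bit mono PCM frames whose RMS energy crosses a threshold, which is the crudest possible speech/silence gate. The threshold value is arbitrary.

```python
import math
import struct

def rms(frame):
    """Root-mean-square energy of a 16-bit little-endian mono PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def is_speech(frame, threshold=500.0):
    """Toy energy-based VAD. Real systems use trained models
    (e.g. Silero VAD or WebRTC VAD) plus hangover smoothing, so a
    brief pause mid-sentence doesn't cut the utterance in two."""
    return rms(frame) > threshold
```

In a pipeline, only frames flagged as speech (plus a short hangover window) would be buffered and sent to the STT engine.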
| xan_ps007 wrote:
| What we have learned is that big enterprises do not really want
| to use closed-source models, due to random bursts in usage that
| can run up their bills.
| replete wrote:
| I tried a similar project out last week, which uses Ollama,
| FastWhisperAPI, and MeloTTS:
| https://github.com/PromtEngineer/Verbi
|
| Docker is a great option if you want lots of people to try out
| your project, but not many apps in this space come with a
| Dockerfile.
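For projects that want to offer that option, a minimal Dockerfile for a Python STT/LLM/TTS app might look like the sketch below. The entrypoint and dependency file are placeholders, and Ollama is assumed to run on the host (reachable over the network) rather than inside this image.

```dockerfile
# Hypothetical Dockerfile for a Python voice-assistant app;
# main.py and requirements.txt are placeholder names.
FROM python:3.11-slim

# ffmpeg is commonly needed for audio decoding/resampling
RUN apt-get update \
    && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install dependencies first so this layer caches across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
CMD ["python", "main.py"]
```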
| replete wrote:
| How does the STT compare to Fastwhisper?
| Gryph0n77 wrote:
| How many GB of RAM does the model require?
| modeless wrote:
| Coqui's XTTSv2 is good for this because it has a streaming mode.
| I have my own version of this where I got ~500ms end-to-end
| response latency, which is much faster than any other open source
| project I've seen. https://github.com/jdarpinian/chirpy
|
| These are easy to make and fun to play with, and it's awesome to
| have everything local. But it will take more to build something
| truly usable. A truly natural conversational AI needs to
| understand the nuances of conversation, most importantly when to
| speak and when to wait. It also needs to know subtleties of the
| user's voice that no speech recognizer can output, and it needs
| control over the output voice more precise than any TTS provides.
| Audio-to-audio models in the style of GPT-4o are clearly the way
| forward. (And someday soon, video-to-video models for video
| calling with a virtual avatar. And the step after that is
| robotics for physical avatars).
|
| There aren't any open source audio-to-audio models yet but there
| are some promising approaches. https://ultravox.ai has the input
| half at least. https://tincans.ai/slm has a cool approach too.
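Much of a sub-second latency win like the ~500 ms figure above comes from overlapping the stages rather than running them back to back: start synthesizing as soon as the first sentence of the LLM reply is available, instead of waiting for the full response. A minimal sketch with stub stages (no real models; the function names and the stub reply are illustrative):

```python
def llm_tokens(prompt):
    """Stub LLM: yields reply tokens one at a time, standing in for a
    streaming Ollama/Groq response."""
    for tok in "Sure . Here is the answer .".split():
        yield tok

def sentences(tokens):
    """Group streamed tokens into sentence-sized chunks so TTS can
    start speaking before the full reply exists."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok in {".", "!", "?"}:
            yield " ".join(buf)
            buf = []
    if buf:
        yield " ".join(buf)

def speak_stream(prompt, tts=lambda text: f"<audio:{text}>"):
    """Pipeline: stream LLM tokens -> chunk into sentences -> synthesize
    each chunk. The first audio chunk is ready after the first sentence,
    not after the whole reply, which is where the latency saving comes
    from (a streaming-capable TTS like XTTSv2 plays the same role)."""
    return [tts(s) for s in sentences(llm_tokens(prompt))]
```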
| zkstefan wrote:
| > There aren't any open source audio-to-audio models yet
|
| I think that's not true. See this, for example:
| https://huggingface.co/facebook/seamless-m4t-v2-large
| It's not general-purpose like GPT-4o, but translation still seems
| pretty useful.
| modeless wrote:
| I don't think SeamlessM4T qualifies as an end-to-end audio-
| to-audio model. The paper states "the task of speech-to-
| speech translation in SeamlessM4T v2 is broken down into
| speech-to-text translation (S2TT) and then text-to-unit
| conversion (T2U)". And while language translation is an
| important application as you mention, it's strictly limited
| to that. It wouldn't understand or produce non-speech audio
| (e.g. singing, music, environmental sounds, etc) and you
| can't have a conversation with it.
| m3kw9 wrote:
| How long until a standalone OS that treats AI usage as a
| first-class citizen?
| skenderbeu wrote:
| My very first multimodal AI star on GitHub. Hope we see more of
| these in the future.
| wkat4242 wrote:
| I currently use Ollama + Open WebUI for this. It also has a
| really serviceable voice mode, plus many options: RAG
| integrations, custom models, memories to get to know you better,
| vision, a great web interface, etc. But I'll have a look at this.
___________________________________________________________________
(page generated 2024-06-21 23:02 UTC)