[HN Gopher] Show HN: WhisperFusion - Low-latency conversations w...
       ___________________________________________________________________
        
       Show HN: WhisperFusion - Low-latency conversations with an AI
       chatbot
        
        WhisperFusion builds upon the capabilities of the open source tools
        WhisperLive and WhisperSpeech to provide seamless conversations
        with an AI chatbot.
        
       Author : mfilion
       Score  : 228 points
       Date   : 2024-01-29 14:23 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | asynchronous wrote:
        | Very neat capability. We need to see more of this kind of
        | hyper-optimizing of models for one specific use case; this is a
        | great example of doing so.
        
       | quonn wrote:
        | It's what Siri and Alexa should have been. I think we will see
        | much more of this in the next few years. If - and only if - it
        | can run locally and not keep a permanent record, then the issue
        | of listening in the background would go away, too. This is
        | really the biggest obstacle to a natural interaction. I want to
        | first talk, perhaps to a friend, and later ask the bot to chime
        | in. And for that to work it really needs to listen for an
        | extended period. This could be especially useful for home
        | automation.
        
         | regularfry wrote:
         | This is using phi-2, so the first assumption would be that it's
         | local. It's a tiny little model in the grand scheme of things.
         | 
         | I've been toying around with something similar myself, only I
         | want push-to-talk from my phone. There's a route there with a
         | WebRTC SPA, and it feels like it should be doable just by
         | stringing together the right bits of various tech demos, but
         | just understanding how to string everything together is more
         | effort than it should be if you're not familiar with the tech.
         | 
          | What's really annoying is Whisper's latency. It's not really
          | designed for this sort of streaming use-case; they're only
          | masking its unsuitability here by throwing (comparatively)
          | ludicrous compute at it.
        
           | gpderetta wrote:
            | There are people trying to frankenstein-merge Mistral and
           | Whisper in a single multimodal model [1]. I wonder if this
           | could improve the latency.
           | 
           | [1] : https://paul.mou.dev/posts/2023-12-31-listening-with-
           | llm/
        
             | huac wrote:
             | yes (you skip a decoding step) but also no (when do you
             | start emitting?)
        
           | pilotneko wrote:
           | This project is using Mistral, not Phi-2. However, it is
           | clear from reading the README.MD that this runs locally, so
           | your point still stands. That being said, it looks like all
           | models have been optimized for TensorRT, so the Whisper
           | component may not be as high-latency as you suggest.
        
             | regularfry wrote:
             | Ah, so it is. I got confused by the video, where the
             | assistant responses are labeled as phi-2.
        
           | renus wrote:
           | For the transcription part, we are looking into W2v-BERT 2.0
           | as well and will make it available in a live-streaming
            | context. That said, Whisper, especially the small model
            | (<50 ms), is not
           | as compute-heavy; right now, most of the compute is consumed
           | by the LLM.
        
             | regularfry wrote:
              | No, it's not that it's especially compute-heavy; it's that
              | the model expects to work on 30-second samples. So if you
              | want sub-second latency, you have to do 30 seconds' worth
              | of processing more than once a second. It just multiplies
              | the problem up. If you can't offload it to a GPU it's
              | painfully inefficient.
             | 
             | As to why that might matter: my single 4090 is occupied
             | with most of a Mixtral instance, and I don't especially
             | want to take any compute away from that.
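              | 
              | (To illustrate the 30-second point above, here is a minimal
              | sketch, assuming the reference openai-whisper package is
              | installed: even a one-second chunk gets padded to the full
              | window before the encoder ever sees it.)
              | 
              |   import numpy as np
              |   import whisper
              | 
              |   one_second = np.zeros(16000, dtype=np.float32)  # 1 s of silence at 16 kHz
              |   padded = whisper.pad_or_trim(one_second)        # padded to 30 s (480000 samples)
              |   mel = whisper.log_mel_spectrogram(padded)       # encoder input spans the whole window
              |   print(padded.shape[0] / 16000)                  # -> 30.0
              |   print(mel.shape)                                # -> torch.Size([80, 3000])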
        
               | intalentive wrote:
               | For minimum latency you want a recurrent model that works
               | in the time domain. A Mamba-like model could do it.
        
       | yieldcrv wrote:
        | I like how ChatGPT-4 will stammer, stutter and pause. This would
       | be even better with a little "uhm" right when the speaker
       | finishes talking, or even a chat bot that interrupts you a little
       | bit, predicting when you're finishing - even incorrectly.
       | 
       | like an engaged but not-most-polite person does
        
         | pyryt wrote:
          | Knowing when to speak is actually a prediction task in itself.
          | See e.g. https://arxiv.org/abs/2010.10874
          | 
          | It would indeed be great to get something like this integrated
          | with Whisper, an LLM and TTS.
        
           | zachthewf wrote:
           | Hard for me to imagine that this could be solved in text
           | space. I think the prediction task needs to be done on the
           | audio.
        
             | stiffler01 wrote:
              | We thought about doing this in Whisper itself, since it's
              | already working in the audio space.
        
           | stiffler01 wrote:
            | Yes, this is something we want to look into in more detail;
            | we really appreciate you sharing the research.
        
       | cristyansv wrote:
       | Imagine porting this to a dedicated app that can access the
       | context of the open window and the text on the screen, providing
       | an almost real-time assistant for everything you do on screen.
        
         | column wrote:
         | Automatically take a screenshot and feed it to
         | https://github.com/vikhyat/moondream or similar? Doable. But
          | while very impressive, the results are a bit of a mixed bag
          | (some hallucinations).
        
           | cristyansv wrote:
            | I'm sure something like the accessibility API will have
            | lower latency.
           | 
           | https://developer.apple.com/library/archive/samplecode/UIEle.
           | ..
        
           | summarity wrote:
           | rewind.ai seems to be moving in this direction
        
             | cristyansv wrote:
             | this looks equally scary and incredible, especially the
             | "summarize what I worked on today" examples.
        
               | fragmede wrote:
               | it works really well, and locally too!
        
       | albertzeyer wrote:
       | See also the blog post: https://www.collabora.com/news-and-
       | blog/news-and-events/whis...
       | 
       | WhisperFusion, WhisperLive, WhisperSpeech, those are very
       | interesting projects.
       | 
        | I'm curious about the latency (of all those 3 systems
        | individually, and also the LLM) and the WER numbers of
        | WhisperLive. I did not really find any numbers on that, which is
        | a bit strange, as those are the most crucial pieces of
        | information about such models. Maybe I just looked in the wrong
        | places (the GitHub repos).
        
         | renus wrote:
          | WhisperLive builds upon the Whisper model; for the demo, we
          | used small.en, but you can also use large without introducing
          | more latency in the overall pipeline, since the transcription
          | process is decoupled from the LLM and text-to-speech process.
        
           | albertzeyer wrote:
           | Yes, but when you change Whisper to make it live, to get
           | WhisperLive, surely this has an effect on the WER, it will
           | get worse. The question is, how much worse? And what is the
           | latency? Depending on the type of streaming model, you might
           | be able to control the latency, so you get a graph, latency
           | vs WER, and in the extreme (offline) case, you have the
           | original WER.
           | 
            | How exactly does WhisperLive work? Did you reduce
           | the chunk size from 30 sec to something lower? To what? Is
           | this fixed or can it be configured by the user? Where can I
           | find information on those details, or even a broad overview
           | on how WhisperLive works?
        
             | renus wrote:
             | https://github.com/collabora/WhisperLive
        
               | albertzeyer wrote:
               | Yes I have looked there. I did not find any WER numbers
               | and latency numbers (ideally both together in a graph). I
               | also did not find the model being described.
               | 
               | *Edit*
               | 
               | Ah, when you write faster_whisper, you actually mean
               | https://github.com/SYSTRAN/faster-whisper?
               | 
               | And for streaming, you use
               | https://github.com/ufal/whisper_streaming? So, the model
               | as described in http://www.afnlp.org/conferences/ijcnlp20
               | 23/proceedings/main...?
               | 
               | There, for example in Table 1, you have exactly that,
               | latency vs WER. But the latency is huge (2.85 sec the
               | lowest). Usually, streaming speech recognition systems
               | have latency well beyond 1 sec.
               | 
               | But anyway, is this actually what you use in WhisperLive
               | / WhisperFusion? I think it would be good to give a bit
               | more details on that.
        
               | stiffler01 wrote:
                | WhisperLive supports both TensorRT and faster-whisper.
                | We didn't reduce the chunk size; rather, we use padding
                | based on the chunk size received from the client.
                | Reducing the segment size would be a more optimised
                | solution in the live scenario.
               | 
               | For streaming we continuously stream audio bytes of fixed
               | size to the server and send the completed segments back
               | to the client while incrementing the timestamp_offset.
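                | 
                | (A rough sketch of that server-side loop, assuming
                | 16 kHz float32 audio; transcribe and send_to_client are
                | hypothetical placeholders here, not the actual
                | WhisperLive API.)
                | 
                |   import numpy as np
                | 
                |   SAMPLE_RATE = 16000
                |   WINDOW_SECONDS = 30  # Whisper's fixed window
                | 
                |   def serve(audio_chunks, transcribe, send_to_client):
                |       buffer = np.zeros(0, dtype=np.float32)
                |       timestamp_offset = 0.0
                |       for chunk in audio_chunks:  # fixed-size chunks streamed by the client
                |           buffer = np.concatenate([buffer, chunk])
                |           pad = max(0, WINDOW_SECONDS * SAMPLE_RATE - len(buffer))
                |           # pad the buffer up to the window instead of shrinking the window
                |           segments = transcribe(np.pad(buffer, (0, pad)))
                |           # treat segments that end well before the newest audio as completed
                |           done = [s for s in segments
                |                   if s["end"] < len(buffer) / SAMPLE_RATE - 1.0]
                |           for seg in done:  # only completed segments go back to the client
                |               send_to_client({"text": seg["text"],
                |                               "start": seg["start"] + timestamp_offset,
                |                               "end": seg["end"] + timestamp_offset})
                |           if done:
                |               cut = done[-1]["end"]
                |               timestamp_offset += cut                   # advance the running offset
                |               buffer = buffer[int(cut * SAMPLE_RATE):]  # drop audio already emitted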
        
               | albertzeyer wrote:
               | Ah, but that sounds like a very inefficient approach,
               | which probably still has quite high latency, and probably
                | also performs badly in terms of word-error-rate (WER).
               | 
               | But I'm happy to be proven wrong. That's why I would like
               | to see some actual numbers. Maybe it's still okish
               | enough, maybe it's actually really bad. I'm curious. But
               | I don't just want to see a demo or a sloppy statement
               | like "it's working ok".
               | 
               | Note that this is a highly non-trivial problem, to make a
               | streamable speech recognition system with low latency and
               | still good performance. There is a big research community
               | working on just this problem.
               | 
               | I actually have worked on this problem myself. E.g. see
               | our work "Chunked Attention-based Encoder-Decoder Model
               | for Streaming Speech Recognition"
               | (https://arxiv.org/abs/2309.08436), which will be
               | presented at ICASSP 2024. E.g. for a median latency of
                | 1.11 sec, we get a WER of 7.5% on TEDLIUM-v2 dev, which
               | is almost as good as the offline model with 7.4% WER.
               | This is a very good result (only very minor WER
               | degradation). Or with a latency of 0.78 sec, we get 7.7%
               | WER. Our model currently does not work too well when we
               | go to even lower latencies (or the computational overhead
               | becomes impractical).
               | 
               | Or see Emformer (https://arxiv.org/abs/2010.10759) as
               | another popular model.
        
               | renus wrote:
               | We will add the details, thanks for pointing it out.
        
         | pyryt wrote:
         | Interesting project, thanks for sharing
        
       | pizzathyme wrote:
       | Whenever I walk my dog I find myself wanting a conversationalist
        | LLM layer to exist in the best form. LLMs now are great at
       | conversation, but the connective tissue between the LLM and
       | natural dialog needs a lot of work.
       | 
       | Some of the problems:
       | 
       | - Voice systems now (including ChatGPT mobile app) stop you at
       | times when a human would not, based on how long you pause. If you
       | said, "I think I'm going to...[3 second pause]" then LLM's stop
       | you, but a human would wait
       | 
       | - No ability to interrupt them with voice only
       | 
       | - Natural conversationalists tend to match one another's speed,
        | but these systems' speeds are fixed
       | 
       | - Lots of custom instructions needed to change from what works in
       | written text to what works in speech (no bullet points, no long
       | formulas)
       | 
       | On the other side of this problem is a super smart friend you can
       | call on your phone. That would be world changing.
        
         | renus wrote:
         | Good point; another area we are currently looking into is
         | predicting intention; often, when talking to someone, we have a
          | good idea of what that person might say next. That would not
          | only help with latency but also allow us to give better
          | answers and load the right context.
        
         | hombre_fatal wrote:
         | Yeah. While I like the idea of live voice chat with an LLM, it
         | turns out I'm not so good at getting a thought across without
         | pauses, and that gets interpreted as the LLM's turn to respond.
         | I'd need to be able to turn on a magic spoken word like
         | "continue" for it to be useful.
         | 
         | I do like the interface though.
        
           | renus wrote:
           | pyryt posted https://arxiv.org/abs/2010.10874, which might be
            | helpful here, but we will probably end up with personalized
            | models that learn from conversation styles. A magic
            | stop/processing word would be the easiest to add since you
            | already have the transcript, but it takes away from the
            | natural feel of a conversation.
        
         | visarga wrote:
         | I think the Whisper models need to predict end-of-turn based on
         | content. And if it still gets input after the EOT, it can just
         | drop the LLM generation and start over at the next EOT.
        
       | pyryt wrote:
       | Has anyone experimented with integrating real-time lipsync into a
       | low-latency audio bot? I saw some demos with d-id but their
       | pricing was closer to $1/minute which makes it rather prohibitive
        
       | tasty_freeze wrote:
       | I'm aching for someone to come up with a low latency round trip
       | voice recognition, LLM, speech generation tuned to waste the time
        | of phone scammers. There is one famous YouTube guy who has tried
       | this exact thing, but the one video I saw was very, very
       | primitive and unconvincing.
        
         | jsheard wrote:
          | OTOH the technology which allows that would just as easily,
          | and more likely, be used by the scammers themselves to fully
         | automate robocalling rather than having to outsource to call
         | centres like they currently do. Your time wasting robot would
         | just be wasting the time of another robot that's simultaneously
         | on the line with a thousand other people.
        
           | realo wrote:
           | correction:
           | 
           | "... simultaneously on the line with a thousand other
           | robots."
           | 
           | :)
        
             | jsheard wrote:
             | If it were that easy to detect a scam call and redirect it
             | to a robot then we could just block the scam calls in the
             | first place.
        
       | lxe wrote:
       | Oh this is neat! I was wondering how to get whisper to stream-
       | transcribe well. I have a similar project using whisper +
        | styletts with a similar goal of minimal delay:
       | https://github.com/lxe/llm-companion
        
         | dmw_ng wrote:
          | There must have been 100 folk with the same idea at the same
          | time. I'm very excited to have something like this running on
          | mics in my home, so long as it's running locally (and not
          | costing $30/mo. in electricity to operate). There are lots of
          | starter projects; it feels like a polished solution (e.g. easy
          | maintainability, good home assistant integration, etc.) is
          | right around the corner now.
         | 
          | I have been tempted to try and build something out myself;
          | there are tons of IP cameras around with 2-way audio. If the
          | mic was
         | reasonable enough quality, the potential for a multimodal LLM
         | to comment contextually on the scene as well as respond through
         | the speaker in a ceiling-mounted camera appeals to me a lot.
         | "Computer, WTF is this old stray component I found lying under
         | the sink?"
        
           | fragmede wrote:
           | What is SOTA for model-available vision systems? If there's a
           | camera, can it track objects so it can tell me where I put my
            | keys in the room without having to put a $30 AirTag on them?
        
             | dmw_ng wrote:
              | I think good in-home vision models are probably still a
              | little way off yet, but it already seems to be the case
              | that you could start to plan for their existence. It would
              | also be possible to fine-tune a puny model to trigger a
              | function that passes the image to a larger hosted model if
              | explicitly requested to. There are a variety of ways
              | things could be tiered to keep the processing that can
              | practically be done at home, at home, and still make it
              | possible to automatically (or on the user's request) defer
              | the query to a larger model operated by someone else.
        
       | wruza wrote:
       | Could someone please summarize the differences (or similarities)
        | of the LLM part versus a TGWUI+llama.cpp setup with offloading
       | layers to tensor cores?
       | 
       | Asking because 8x7B Q4_K_M (25GB, GGUF) doesn't seem to be
       | "ultra-low latency" on my 12GB VRAM + RAM. Like, at all. I can
       | imagine running 7-13GB sized model with that latency (cause I
       | did, but... it's a small model), or using 2x P40 or something.
        | Not sure what assumptions they make in the README. Am I
        | missing something? Can you try it without the TTS part?
        
         | freeqaz wrote:
         | The video example is using Phi-2 which is a 2.7bn param
         | network. I think that's part of how they're achieving the low
         | latency here!
         | 
         | Has anybody fine-tuned Phi-2? I haven't found any good
         | resources for that yet.
        
           | renus wrote:
           | We tested https://huggingface.co/cognitivecomputations/dolphi
            | n-2_6-phi... as well; in some tasks it performs better. That
            | said, you can use Mistral as well; we support a few models
           | through TensorRT-LLM.
        
       | monkeydust wrote:
       | This post reminded me of Vocode:
       | https://github.com/vocodedev/vocode-python
       | 
       | Discussion on them here from 10 months ago:
       | https://news.ycombinator.com/item?id=35358873
       | 
       | I tried the demo back then and was very impressed. Anyone using
       | it in dev or production?
        
         | domrdy wrote:
          | I think they did a pivot to LLM phone calls? I tried their
          | library the other day and it works quite well. It even has the
         | "interrupt feature" that is being talked about a few threads
         | up. Supports a ton of backends for transcribe/voice/LLM.
        
           | monkeydust wrote:
            | Yea, the interrupt worked well; I would guess (?) this could
            | be deployed for local conversation without the need for a
            | phone.
        
       | ramon156 wrote:
        | Great to hear it's seamless, real-time and ultra low-latency.
        | Hopefully the next iteration is blazingly fast too!
        
       | localhost wrote:
       | There are two things that I think are needed and that I'm not
       | sure if anyone provides yet to make this scenario work well:
       | 
        | 1. Interruption - I need to be able to say "hang on" and have
        | the LLM pause.
        | 
        | 2. Wait for a specific cue before responding. I like "What do
        | you think?"
       | 
       | That + low latency are crucial. It needs to feel like talking to
       | another person.
        
         | bilsbie wrote:
         | It would be cool if the Ai could interrupt too.
        
           | andai wrote:
           | "Imma let you finish, but..."
        
         | plufz wrote:
         | I agree, it is unnatural and a little stressful with current
         | implementations. It feels like I first need to figure out what
          | to say and then say it so I don't pause and mess up my input.
         | 
         | I hope the new improved Siri and Google assistant will be able
         | to chain actions as well. "Ok Google, turn off the lights. Ok
         | Google, stop music." Feels a bit cumbersome.
        
           | renus wrote:
           | A fast turnaround time is also super important; if the
           | transcription is not correct, waiting multiple seconds for
            | each turn would kill the application. E.g., ordering food
            | using voice is only convenient if it gets my order right all
            | the time; if not, I will fall back to the app.
        
         | stiffler01 wrote:
         | Indeed a great point. Waiting for a specific cue, before
         | responding, is an interesting idea. It would make the
         | interaction more natural, especially in situations where the
         | user is thinking aloud or formulating their thoughts before
         | seeking the AI's input.
         | 
         | Interruption is something that is already in the pipeline and
         | we are working on it. You should see an update soon.
        
           | localhost wrote:
           | Thanks! Really looking forward to interruptions.
           | 
           | I think about the cue as kind of being like "Hey
           | Siri/Alexa/Cortana" but in reverse.
        
         | Valgrim wrote:
          | In order to feel like a human, cues should not be a pre-
          | programmed phrase; the system should continuously listen to
          | the conversation and constantly evaluate whether speaking is
          | pertinent at that particular moment. Humans will cut into a
          | conversation if it's important, and such a system should be
          | able to do the same.
        
           | localhost wrote:
           | Totally agree with your take. But a pre-programmed phrase
           | would work today and hopefully wouldn't be too difficult to
           | implement. I would imagine that higher latency would be more
           | tolerable as well. But in the fullness of time, your approach
           | is better.
           | 
           | When I'm listening to someone else talk, I'm already
           | formulating responses or at least an outline of responses in
           | my head. If the LLM could do a progressive summarization of
           | the conversation in real-time as part of its context this
           | would be super cool as well. It could also interrupt you if
           | the LLM self-reflects on the summary and realizes that now
           | would be a good time to interrupt.
        
         | philsnow wrote:
         | > 2. Wait for a specific cue before responding. I like "What do
         | you think?"
         | 
         | "Over."
        
           | dr_kiszonka wrote:
           | "Over and out" closes the app. ;)
           | 
           | Saying "Go" to indicate it's the bot's turn would work for
           | me. (Or maybe pressing a button.) The bot should always stop
            | whenever I start speaking.
        
         | pksebben wrote:
         | I wrote a sort of toy version of this a little while ago using
         | Vosk and a variety of TTS engines, and the solution that worked
          | _mostly-well_ was to have a buffer that filled with audio
          | until a pause of so many seconds, then sent that to the LLM.
         | 
          | With the implementation of tools for GPT, I could see a way to
          | have the model check if it thinks it received a complete
         | thought, and if it didn't, send back a signal to keep appending
         | to the buffer until the next long pause. The addition of a
         | longer "pregnant pause" timeout could have the model check in
         | to see if you're done talking or whatever.
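          | 
          | (Roughly, the buffer looked something like this sketch,
          | assuming 16 kHz mono float32 frames; send_to_llm is a
          | hypothetical callback, not any particular library's API.)
          | 
          |   import numpy as np
          | 
          |   SAMPLE_RATE = 16000
          |   PAUSE_SECONDS = 1.5   # how long a silence counts as "done talking"
          |   SILENCE_RMS = 0.01    # crude energy threshold for silence
          | 
          |   def pause_buffer(frames, send_to_llm):
          |       buffered, silence = [], 0.0
          |       for frame in frames:  # small fixed-size audio frames from the mic
          |           buffered.append(frame)
          |           rms = float(np.sqrt(np.mean(frame ** 2)))
          |           silence = silence + len(frame) / SAMPLE_RATE if rms < SILENCE_RMS else 0.0
          |           if silence >= PAUSE_SECONDS and len(buffered) > 1:
          |               send_to_llm(np.concatenate(buffered))  # flush the utterance to the LLM
          |               buffered, silence = [], 0.0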
        
           | renus wrote:
            | To streamline the experience we don't wait until after the
            | pause to send the transcription to the LLM; we use the time
            | spent waiting for the end-of-sentence trigger (the pause) to
            | generate the LLM and text-to-speech output. So ideally, once
            | we have detected the pause, we have already processed
            | everything.
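            | 
            | (A hedged sketch of that idea: start generating on each
            | partial transcript and only speak the result once the pause
            | lands. generate_reply, speak and is_end_of_sentence are
            | hypothetical stand-ins for the LLM, TTS and end-of-sentence
            | stages, not the actual WhisperFusion code.)
            | 
            |   import asyncio
            | 
            |   async def speculative_loop(partials, generate_reply, speak,
            |                              is_end_of_sentence):
            |       text, reply = None, None
            |       async for partial in partials:  # keeps updating while the user pauses
            |           if partial != text:
            |               if reply:
            |                   reply.cancel()  # transcript changed, drop the stale speculation
            |               text = partial
            |               reply = asyncio.create_task(generate_reply(text))
            |           if is_end_of_sentence(text):
            |               await speak(await reply)  # reply is (mostly) ready by now
            |               text, reply = None, None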
        
         | zan2434 wrote:
         | I agree. Have been working on a 2 way interruptions system +
         | streaming like this. It's not robust yet, but when it works it
         | does feel magical.
        
         | lambdaba wrote:
         | > Interruption
         | 
         | Well, today is your lucky day!: https://persona-webapp-
         | beta.vercel.app/ and the demo https://smarterchild.chat/
        
           | dmw_ng wrote:
           | The latency on this (or lack thereof) is the best I've seen,
           | would love to know more about how it's achieved. I asked the
           | bot and it claimed you're using Google's speech recognition,
           | which I know supports streaming, but this result seems much
           | lower lag than I remember Google's stuff being capable of
        
           | napier wrote:
           | this crops up in my feed every now and then and it has vastly
           | superior perf vs. OAI's ChatGPT iOS app or anything else I've
           | found. truly outstanding. are you planning on developing it
           | further and/or monetizing it?
        
             | lambdaba wrote:
             | This isn't mine, it's from sindarin.tech, they already have
             | paid versions, with one plan being $450/50 hours of speech
             | (just checked and it's up from 30 hours).
        
           | spywaregorilla wrote:
           | The latency on smarterchild is very fast, but it doesn't seem
            | to be interruptible. The UI seems to be restricting me from
            | even entering anything in between my input and the AI
            | response?
        
         | irthomasthomas wrote:
          | I did a video demo of this. Tell it: respond only with OK to
          | every message, and only respond fully when I tell you I am
          | finished. Ok? Ok.
        
       | bilsbie wrote:
       | My dream is to do pair coding (and pair learning) with an AI.
       | 
       | It would be a live conversation and it can see whatever I'm doing
       | on my screen.
       | 
       | We're gradually getting closer.
        
         | andai wrote:
         | I wanted to say that's Copilot, but you meant speaking instead
         | of typing?
        
           | bilsbie wrote:
           | I envision it feeling like you're pair programming with a
           | person. (I have problems staying motivated.) But that might
           | be a good place to start.
        
         | pksebben wrote:
         | I have a similar dream; with one major caveat - it must
         | unequivocally be a local model. "See whatever I'm doing on my
         | screen" comes with "leaks information to the model" and that
         | could go real-bad-wrong real fast.
        
           | bilsbie wrote:
           | I have a feeling people will really demand local only for AI.
           | 
           | I'm not sure why the demand never materialized for other
           | highly personal services like search, photos, medical, etc.
           | 
           | But I just have this hunch we all really want it for AI.
        
             | blooalien wrote:
             | I have a feeling that a small subset of privacy-conscious
             | "computer savvy" folks will care about local-only for AI,
             | but that the vast majority of humanity simply won't know,
             | care, or care to know why they should even care. For proof,
             | just look at how _nobody_ cared about search, photos,
             | medical, or other data until _theirs_ got leaked, and
              | _still_ nobody cares about _them_ because "it's not _my_
              | data that got leaked".
             | 
             | We (we in the larger sense of computer users as a whole,
             | not just the small subset of "power-users") _should_ care
             | more about privacy and security and such, but most people
             | think of computers and networks in the same way they think
             | of a toaster or a hammer. To them it 's a tool that does
             | stuff when they push the right "magic button", and they
             | couldn't care less what's inside, or how it could harm them
             | if mis-used until it actually _does_ harm them (or come
             | close enough to it that they can no longer ignore it).
        
         | lambdaba wrote:
         | More than that, it can monitor your screen continuously and
         | have perfect recall, so it will be able to correct you
         | immediately, or remind you of relevant context.
         | 
         | I like to call it "Artificial Attention".
        
       | doctorpangloss wrote:
       | This is an excellent project with excellent packaging. It is
       | primarily a packaging problem.
       | 
       | Why does every Python application on GitHub have its own ad-hoc,
       | informally specified, bug ridden, slow implementation of half of
       | setuptools?
       | 
       | Why does TensorRT distribute the most essential part of what it
       | does in an "examples" directory?
       | 
       | huggingface_cli... man, I already have a way to download
       | something by a name, it's a zip file. In fact, why not make a
       | PyPi index that facades these models? We have so many ways
       | already to install and cache read only binary blobs...
        
         | traverseda wrote:
          | Well, the huggingface one is obvious enough: they want to
          | encourage vendor lock-in and make themselves the default. Same
          | reason why Docker downloads from Docker Hub unless you
          | explicitly request a full URL.
        
       | abdullahkhalids wrote:
        | We see tools such as this posted several times a week. Is
       | there any expectation they will be installable by the common
       | person? Where is the setup.exe, .deb, .rpm, .dmg?
        
         | renus wrote:
          | We are going to put the sample interface into the Docker
          | image, so it's simply:
          | 
          | > docker run --gpus all --shm-size 64G -p 80:80 -it
          | ghcr.io/collabora/whisperfusion:latest
          | 
          | instead of:
          | 
          | > docker run --gpus all --shm-size 64G -p 6006:6006 -p
          | 8888:8888 -it ghcr.io/collabora/whisperfusion:latest
          | > cd examples/chatbot/html
          | > python -m http.server
        
       | codethief wrote:
       | Seeing that this uses TensorRT (i.e. seems well optimized), what
       | GPUs are supported? Could I run this on a Jetson?
        
       | WhackyIdeas wrote:
        | I had a quick read through; maybe I missed something, but does
        | this all run locally or does it need API access to OpenAI's
        | remote system?
       | 
       | The reason I ask is that I'm building something that does both
        | TTS and STT using OpenAI, but I do not want to be sending a
        | never-ending stream of audio to OpenAI just for it to listen
        | for a single command I will eventually give it.
       | 
        | If I can do all of this locally and use Mistral instead, then I'd
       | give it a go too.
        
         | renus wrote:
          | Everything runs locally; we use:
          | 
          | - WhisperLive for the transcription -
          | https://github.com/collabora/WhisperLive
          | 
          | - WhisperSpeech for the text-to-speech -
          | https://github.com/collabora/WhisperSpeech
          | 
          | and an LLM (phi-2, Mistral, etc.) in between.
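          | 
          | (The overall shape of the pipeline, as a very rough sketch;
          | transcribe_stream, llm_reply and synthesize are hypothetical
          | wrappers around WhisperLive, the LLM and WhisperSpeech here,
          | not their real APIs.)
          | 
          |   def conversation_loop(mic_chunks, transcribe_stream,
          |                         llm_reply, synthesize, play):
          |       for utterance in transcribe_stream(mic_chunks):  # WhisperLive: audio -> text
          |           reply = llm_reply(utterance)                 # phi-2 / Mistral via TensorRT-LLM
          |           play(synthesize(reply))                      # WhisperSpeech: text -> audio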
        
           | WhackyIdeas wrote:
            | Thank you! When I read OpenAI I was thinking it would be going
           | through them. This revelation is perfect timing for me...
           | keeping user data even more private. Excellent!
        
       ___________________________________________________________________
       (page generated 2024-01-29 23:01 UTC)