[HN Gopher] Show HN: WhisperFusion - Low-latency conversations w...
___________________________________________________________________
Show HN: WhisperFusion - Low-latency conversations with an AI
chatbot
WhisperFusion builds upon the capabilities of open source tools
WhisperLive and WhisperSpeech to provide seamless conversations
with an AI chatbot.
Author : mfilion
Score : 228 points
Date : 2024-01-29 14:23 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| asynchronous wrote:
| Very neat capability. We need to see more hyper-optimizing of
| models for one specific use case; this is a great example of
| doing so.
| quonn wrote:
| It's what Siri and Alexa should have been. I think we will see
| much more of this in the coming years. If - and only if - it can
| run locally and not keep a permanent record, then the issue of
| listening in the background would go away, too. This is really
| the biggest obstacle to natural interaction. I want to first
| talk, perhaps to a friend, and later ask the bot to chime in. And
| for that to work it really needs to listen for an extended
| period. This could be especially useful for home automation.
| regularfry wrote:
| This is using phi-2, so the first assumption would be that it's
| local. It's a tiny little model in the grand scheme of things.
|
| I've been toying around with something similar myself, only I
| want push-to-talk from my phone. There's a route there with a
| WebRTC SPA, and it feels like it should be doable just by
| stringing together the right bits of various tech demos, but
| just understanding how to string everything together is more
| effort than it should be if you're not familiar with the tech.
|
| What's really annoying is Whisper's latency. It's not really
| designed for this sort of streaming use-case; they're only
| masking its unsuitability here by throwing (comparatively)
| ludicrous compute at it.
| gpderetta wrote:
| There are people trying to frankenstein-merge Mistral and
| Whisper in a single multimodal model [1]. I wonder if this
| could improve the latency.
|
| [1] : https://paul.mou.dev/posts/2023-12-31-listening-with-
| llm/
| huac wrote:
| yes (you skip a decoding step) but also no (when do you
| start emitting?)
| pilotneko wrote:
| This project is using Mistral, not Phi-2. However, it is
| clear from reading the README.MD that this runs locally, so
| your point still stands. That being said, it looks like all
| models have been optimized for TensorRT, so the Whisper
| component may not be as high-latency as you suggest.
| regularfry wrote:
| Ah, so it is. I got confused by the video, where the
| assistant responses are labeled as phi-2.
| renus wrote:
| For the transcription part, we are looking into W2v-BERT 2.0
| as well and will make it available in a live-streaming
| context. That said, Whisper, especially small (<50ms), is not
| as compute-heavy; right now, most of the compute is consumed
| by the LLM.
| regularfry wrote:
| No, it's not that it's especially compute-heavy; it's that
| the model expects to work on 30-second samples. So if you
| want sub-second latency, you have to do 30 seconds' worth of
| processing more than once a second. It just multiplies the
| problem up. If you can't offload it to a GPU it's painfully
| inefficient.
|
| As to why that might matter: my single 4090 is occupied
| with most of a Mixtral instance, and I don't especially
| want to take any compute away from that.
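|
| A rough sketch of what that re-processing looks like with the
| openai-whisper package (the buffer handling here is just
| illustrative, not how WhisperLive does it):
|
|     import numpy as np
|     import whisper  # openai-whisper
|
|     SAMPLE_RATE = 16_000
|     model = whisper.load_model("small.en")
|     buffer = np.zeros(0, dtype=np.float32)  # rolling audio buffer
|
|     def on_audio_chunk(chunk: np.ndarray) -> str:
|         """Called a few times per second with fresh mic samples."""
|         global buffer
|         buffer = np.concatenate([buffer, chunk])
|         buffer = buffer[-30 * SAMPLE_RATE:]
|         # pad_or_trim always hands the model a full 30 s window,
|         # so every call costs 30 s worth of encoder work no matter
|         # how little new audio arrived since the last call.
|         audio = whisper.pad_or_trim(buffer)
|         mel = whisper.log_mel_spectrogram(audio).to(model.device)
|         result = whisper.decode(
|             model, mel, whisper.DecodingOptions(language="en"))
|         return result.text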
| intalentive wrote:
| For minimum latency you want a recurrent model that works
| in the time domain. A Mamba-like model could do it.
| yieldcrv wrote:
| I like how ChatGPT 4 will stammer, stutter and pause. This would
| be even better with a little "uhm" right when the speaker
| finishes talking, or even a chat bot that interrupts you a little
| bit, predicting when you're finishing - even incorrectly.
|
| like an engaged but not-most-polite person does
| pyryt wrote:
| Knowing when to speak is actually a prediction task in itself.
| See eg https://arxiv.org/abs/2010.10874
|
| It would indeed be great to get something like this integrated
| with Whisper, an LLM and TTS.
| zachthewf wrote:
| Hard for me to imagine that this could be solved in text
| space. I think the prediction task needs to be done on the
| audio.
| stiffler01 wrote:
| We thought about doing this in Whisper itself, since it's
| already working in the audio space.
| stiffler01 wrote:
| Yes, this is something we want to look into in more detail,
| really appreciate you sharing the research.
| cristyansv wrote:
| Imagine porting this to a dedicated app that can access the
| context of the open window and the text on the screen, providing
| an almost real-time assistant for everything you do on screen.
| column wrote:
| Automatically take a screenshot and feed it to
| https://github.com/vikhyat/moondream or similar? Doable. But
| while very impressive, the results are a bit of a mixed bag (some
| hallucinations)
| cristyansv wrote:
| I'm sure something like the accessibility API will have a
| lower latency.
|
| https://developer.apple.com/library/archive/samplecode/UIEle.
| ..
| summarity wrote:
| rewind.ai seems to be moving in this direction
| cristyansv wrote:
| this looks equally scary and incredible, especially the
| "summarize what I worked on today" examples.
| fragmede wrote:
| it works really well, and locally too!
| albertzeyer wrote:
| See also the blog post: https://www.collabora.com/news-and-
| blog/news-and-events/whis...
|
| WhisperFusion, WhisperLive, WhisperSpeech, those are very
| interesting projects.
|
| I'm curious about the latency (of all those 3 systems
| individually, and also the LLM), and the WER numbers of
| WhisperLive. I did not really find any numbers on that, which is
| a bit strange, as those are the most crucial pieces of
| information about such models. Maybe I just looked in the wrong
| places (the GitHub repos).
| renus wrote:
| WhisperLive builds upon the Whisper model; for the demo, we
| used small.en, but you can also use large without introducing
| higher latency for the overall pipeline, since the transcription
| process is decoupled from the LLM and text-to-speech process.
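|
| Conceptually the decoupling is just three workers handing text
| down queues; here is a minimal sketch, where the three stand-in
| functions are placeholders for WhisperLive, the LLM and
| WhisperSpeech rather than their real APIs:
|
|     import queue, threading, time
|
|     def transcribe_stream():        # placeholder for WhisperLive
|         for text in ["hello there", "what's the weather like"]:
|             time.sleep(0.5)
|             yield text
|
|     def generate_reply(prompt):     # placeholder for the LLM
|         return f"(reply to: {prompt})"
|
|     def synthesize(text):           # placeholder for WhisperSpeech
|         return text.encode()        # pretend this is audio
|
|     transcripts, replies = queue.Queue(), queue.Queue()
|
|     def asr_worker():
|         for segment in transcribe_stream():
|             transcripts.put(segment)
|
|     def llm_worker():
|         while True:
|             replies.put(generate_reply(transcripts.get()))
|
|     def tts_worker():
|         while True:
|             audio = synthesize(replies.get())
|             # hand `audio` to the playback device here
|
|     for worker in (asr_worker, llm_worker, tts_worker):
|         threading.Thread(target=worker, daemon=True).start()
|     time.sleep(2)  # let the demo run
|
| Each stage drains its own queue, so swapping in a heavier Whisper
| model doesn't block the LLM or text-to-speech workers.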
| albertzeyer wrote:
| Yes, but when you change Whisper to make it live, to get
| WhisperLive, surely this has an effect on the WER; it will
| get worse. The question is, how much worse? And what is the
| latency? Depending on the type of streaming model, you might
| be able to control the latency, so you get a graph, latency
| vs WER, and in the extreme (offline) case, you have the
| original WER.
|
| How exactly does WhisperLive work actually? Did you reduce
| the chunk size from 30 sec to something lower? To what? Is
| this fixed or can it be configured by the user? Where can I
| find information on those details, or even a broad overview
| on how WhisperLive works?
| renus wrote:
| https://github.com/collabora/WhisperLive
| albertzeyer wrote:
| Yes I have looked there. I did not find any WER numbers
| and latency numbers (ideally both together in a graph). I
| also did not find the model being described.
|
| *Edit*
|
| Ah, when you write faster_whisper, you actually mean
| https://github.com/SYSTRAN/faster-whisper?
|
| And for streaming, you use
| https://github.com/ufal/whisper_streaming? So, the model
| as described in http://www.afnlp.org/conferences/ijcnlp20
| 23/proceedings/main...?
|
| There, for example in Table 1, you have exactly that,
| latency vs WER. But the latency is huge (2.85 sec at the
| lowest). Usually, streaming speech recognition systems
| have latency well beyond 1 sec.
|
| But anyway, is this actually what you use in WhisperLive
| / WhisperFusion? I think it would be good to give a bit
| more details on that.
| stiffler01 wrote:
| WhisperLive supports both TensorRT and faster-whisper. We
| didn't reduce the chunk size; rather, we use padding based on
| the chunk size received from the client. Reducing the
| segment size would be a more optimised solution in the
| live scenario.
|
| For streaming we continuously stream audio bytes of fixed
| size to the server and send the completed segments back
| to the client while incrementing the timestamp_offset.
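|
| In client terms that is roughly the following sketch; the
| endpoint and the JSON shape below are made up for illustration
| and are not the actual WhisperLive wire format:
|
|     import asyncio, json, wave
|     import websockets
|
|     URI = "ws://localhost:9090"   # hypothetical server endpoint
|     CHUNK_FRAMES = 4096           # fixed-size audio chunks
|
|     async def stream_file(path):
|         async with websockets.connect(URI) as ws:
|             async def receiver():
|                 # completed segments come back as they are ready
|                 async for message in ws:
|                     print(json.loads(message))
|             recv = asyncio.create_task(receiver())
|             with wave.open(path, "rb") as wav:
|                 rate = wav.getframerate()
|                 while chunk := wav.readframes(CHUNK_FRAMES):
|                     await ws.send(chunk)  # raw PCM bytes
|                     await asyncio.sleep(CHUNK_FRAMES / rate)
|             await asyncio.sleep(2)  # wait for trailing segments
|             recv.cancel()
|
|     asyncio.run(stream_file("speech.wav"))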
| albertzeyer wrote:
| Ah, but that sounds like a very inefficient approach,
| which probably still has quite high latency, and probably
| also performs badly in terms of word-error-rate (WER).
|
| But I'm happy to be proven wrong. That's why I would like
| to see some actual numbers. Maybe it's still okish
| enough, maybe it's actually really bad. I'm curious. But
| I don't just want to see a demo or a sloppy statement
| like "it's working ok".
|
| Note that this is a highly non-trivial problem, to make a
| streamable speech recognition system with low latency and
| still good performance. There is a big research community
| working on just this problem.
|
| I actually have worked on this problem myself. E.g. see
| our work "Chunked Attention-based Encoder-Decoder Model
| for Streaming Speech Recognition"
| (https://arxiv.org/abs/2309.08436), which will be
| presented at ICASSP 2024. E.g. for a median latency of
| 1.11 sec, we get a WER of 7.5% on TEDLIUM-v2 dev, which
| is almost as good as the offline model with 7.4% WER.
| This is a very good result (only very minor WER
| degradation). Or with a latency of 0.78 sec, we get 7.7%
| WER. Our model currently does not work too well when we
| go to even lower latencies (or the computational overhead
| becomes impractical).
|
| Or see Emformer (https://arxiv.org/abs/2010.10759) as
| another popular model.
| renus wrote:
| We will add the details, thanks for pointing it out.
| pyryt wrote:
| Interesting project, thanks for sharing
| pizzathyme wrote:
| Whenever I walk my dog I find myself wanting a conversationalist
| LLM layer to exist in its best form. LLMs now are great at
| conversation, but the connective tissue between the LLM and
| natural dialog needs a lot of work.
|
| Some of the problems:
|
| - Voice systems now (including the ChatGPT mobile app) stop you
| at times when a human would not, based on how long you pause. If
| you said, "I think I'm going to... [3-second pause]" then LLMs
| stop you, but a human would wait
|
| - No ability to interrupt them with voice only
|
| - Natural conversationalists tend to match one another's speed,
| but these systems' speeds are fixed
|
| - Lots of custom instructions needed to change from what works in
| written text to what works in speech (no bullet points, no long
| formulas)
|
| On the other side of this problem is a super smart friend you can
| call on your phone. That would be world changing.
| renus wrote:
| Good point; another area we are currently looking into is
| predicting intention; often, when talking to someone, we have a
| good idea of what that person might say next. That would not
| only help with latency but also allow us to give better
| answers and load the right context.
| hombre_fatal wrote:
| Yeah. While I like the idea of live voice chat with an LLM, it
| turns out I'm not so good at getting a thought across without
| pauses, and that gets interpreted as the LLM's turn to respond.
| I'd need to be able to turn on a magic spoken word like
| "continue" for it to be useful.
|
| I do like the interface though.
| renus wrote:
| pyryt posted https://arxiv.org/abs/2010.10874, which might be
| helpful here, but we will probably end up with personalized
| models that learn from conversation styles. A magic
| stop/processing word would be the easiest to add since you
| already have the transcript, but it takes away from the natural
| feel of a conversation.
| visarga wrote:
| I think the Whisper models need to predict end-of-turn based on
| content. And if the system still gets input after the EOT, it
| can just drop the LLM generation and start over at the next EOT.
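|
| A sketch of that drop-and-restart logic; generate_reply below
| is just a stand-in for the LLM call, and the end-of-turn flag
| is assumed to come from the transcription side:
|
|     import asyncio
|
|     async def generate_reply(transcript):
|         await asyncio.sleep(2)      # stand-in for the LLM call
|         return f"(reply to: {transcript!r})"
|
|     async def conversation(segments):
|         """segments yields (text, end_of_turn) pairs from the ASR."""
|         pending, transcript = None, ""
|         async for text, end_of_turn in segments:
|             transcript += text
|             if pending and not pending.done():
|                 pending.cancel()    # more speech: drop the draft
|             if end_of_turn:
|                 pending = asyncio.create_task(
|                     generate_reply(transcript))
|                 pending.add_done_callback(
|                     lambda t: t.cancelled() or print(t.result()))
|
|     async def demo():
|         async def fake_segments():
|             for item in [("I think we should ", False),
|                          ("order pizza.", True),
|                          (" Actually, make it sushi.", True)]:
|                 await asyncio.sleep(0.5)
|                 yield item
|         await conversation(fake_segments())
|         await asyncio.sleep(3)      # let the last reply finish
|
|     asyncio.run(demo())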
| pyryt wrote:
| Has anyone experimented with integrating real-time lipsync into a
| low-latency audio bot? I saw some demos with d-id but their
| pricing was closer to $1/minute, which makes it rather prohibitive.
| tasty_freeze wrote:
| I'm aching for someone to come up with a low-latency round-trip
| voice recognition + LLM + speech generation pipeline tuned to
| waste the time of phone scammers. There is one famous YouTube
| guy who has tried
| this exact thing, but the one video I saw was very, very
| primitive and unconvincing.
| jsheard wrote:
| OTOH the technology which allows that would just as easily, and
| more likely, be used by the scammers themselves to fully
| automate robocalling rather than having to outsource to call
| centres like they currently do. Your time wasting robot would
| just be wasting the time of another robot that's simultaneously
| on the line with a thousand other people.
| realo wrote:
| correction:
|
| "... simultaneously on the line with a thousand other
| robots."
|
| :)
| jsheard wrote:
| If it were that easy to detect a scam call and redirect it
| to a robot then we could just block the scam calls in the
| first place.
| lxe wrote:
| Oh this is neat! I was wondering how to get Whisper to stream-
| transcribe well. I have a similar project using Whisper +
| StyleTTS with the similar goal of minimal delay:
| https://github.com/lxe/llm-companion
| dmw_ng wrote:
| There must have been 100 folks with the same idea at the same
| time. I'm very excited about having something like this running
| on mics in my home, so long as it's running locally (and not
| costing $30/mo. in electricity to operate). There are lots of
| starter projects, and it feels like a polished solution (e.g.
| easy maintainability and good Home Assistant integration) is
| right around the corner now.
|
| I have been tempted to try and build something out myself;
| there are tons of IP cameras around with 2-way audio. If the
| mic were of reasonable enough quality, the potential for a
| multimodal LLM
| to comment contextually on the scene as well as respond through
| the speaker in a ceiling-mounted camera appeals to me a lot.
| "Computer, WTF is this old stray component I found lying under
| the sink?"
| fragmede wrote:
| What is SOTA for model-available vision systems? If there's a
| camera, can it track objects so it can tell me where I put my
| keys in the room without having to put a $30 AirTag on them?
| dmw_ng wrote:
| I think good in-home vision models are probably still a little
| way off yet, but it already seems you could start to plan for
| their existence. It would also be possible to fine-tune a puny
| model to trigger a function that passes the image to a larger
| hosted model if explicitly requested. There are a variety of
| ways things could be tiered to keep processing that can
| practically be done at home, at home, while still making it
| possible to automatically (or on the user's request) defer the
| query to a larger model operated by someone else.
| wruza wrote:
| Could someone please summarize the differences (or similarities)
| of the LLM part against a TGWUI+llama.cpp setup with offloading
| layers to tensor cores?
|
| Asking because 8x7B Q4_K_M (25GB, GGUF) doesn't seem to be
| "ultra-low latency" on my 12GB VRAM + RAM. Like, at all. I can
| imagine running a 7-13GB sized model with that latency (cause I
| did, but... it's a small model), or using 2x P40 or something.
| Not sure what assumptions they make in the README. Am I
| missing something? Can you try it without the TTS part?
| freeqaz wrote:
| The video example is using Phi-2, which is a 2.7bn param
| network. I think that's part of how they're achieving the low
| latency here!
|
| Has anybody fine-tuned Phi-2? I haven't found any good
| resources for that yet.
| renus wrote:
| We tested https://huggingface.co/cognitivecomputations/dolphi
| n-2_6-phi... as well; in some tasks it performs better. That
| said, you can use Mistral as well, we support a few models
| through TensorRT-LLM.
| monkeydust wrote:
| This post reminded me of Vocode:
| https://github.com/vocodedev/vocode-python
|
| Discussion on them here from 10 months ago:
| https://news.ycombinator.com/item?id=35358873
|
| I tried the demo back then and was very impressed. Anyone using
| it in dev or production?
| domrdy wrote:
| I think they did a pivot to LLM phone calls? I tried their
| library the other day and it works quite well. It even has the
| "interrupt feature" that is being talked about a few threads
| up. Supports a ton of backends for transcribe/voice/LLM.
| monkeydust wrote:
| Yea the interrupt worked well; I would guess (?) this could
| be deployed for local conversation without needing a phone.
| ramon156 wrote:
| Great to hear it's seamless, real-time and ultra low-latency.
| Hopefully the next iteration is blazingly fast too!
| localhost wrote:
| There are two things that I think are needed to make this
| scenario work well, and that I'm not sure anyone provides yet:
|
| 1. Interruption - I need to be able to say "hang on" and have
| the LLM pause.
|
| 2. Wait for a specific cue before responding. I like "What do
| you think?"
|
| That + low latency are crucial. It needs to feel like talking to
| another person.
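|
| The cue idea can be bolted on today purely on the transcript
| side; a trivial sketch (the cue phrases are just examples):
|
|     CUES = ("what do you think", "over")
|
|     buffer = []
|
|     def on_transcript_segment(text):
|         """Accumulate speech; only return a prompt after a cue."""
|         buffer.append(text)
|         spoken = " ".join(buffer).lower().rstrip(" .,!?")
|         if spoken.endswith(CUES):
|             prompt = " ".join(buffer)
|             buffer.clear()
|             return prompt   # caller sends this to the LLM
|         return None         # keep listening, don't respond yet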
| bilsbie wrote:
| It would be cool if the Ai could interrupt too.
| andai wrote:
| "Imma let you finish, but..."
| plufz wrote:
| I agree, it is unnatural and a little stressful with current
| implementations. It feels like I first need to figure out what
| to say and then say it so I don't pause and mess up my input.
|
| I hope the new improved Siri and Google assistant will be able
| to chain actions as well. "Ok Google, turn off the lights. Ok
| Google, stop music." Feels a bit cumbersome.
| renus wrote:
| A fast turnaround time is also super important; if the
| transcription is not correct, waiting multiple seconds for
| each turn would kill the application. E.g., ordering food
| using voice is only convenient if it gets it right all the
| time; if not, I will fall back to the app.
| stiffler01 wrote:
| Indeed a great point. Waiting for a specific cue before
| responding is an interesting idea. It would make the
| interaction more natural, especially in situations where the
| user is thinking aloud or formulating their thoughts before
| seeking the AI's input.
|
| Interruption is something that is already in the pipeline and
| we are working on it. You should see an update soon.
| localhost wrote:
| Thanks! Really looking forward to interruptions.
|
| I think about the cue as kind of being like "Hey
| Siri/Alexa/Cortana" but in reverse.
| Valgrim wrote:
| In order to feel human, cues should not be a pre-programmed
| phrase; the system should continuously listen to the
| conversation and constantly evaluate whether speaking is
| pertinent at that particular moment. Humans will cut into a
| conversation if it's important, and such a system should be
| able to do the same.
| localhost wrote:
| Totally agree with your take. But a pre-programmed phrase
| would work today and hopefully wouldn't be too difficult to
| implement. I would imagine that higher latency would be more
| tolerable as well. But in the fullness of time, your approach
| is better.
|
| When I'm listening to someone else talk, I'm already
| formulating responses or at least an outline of responses in
| my head. If the LLM could do a progressive summarization of
| the conversation in real time as part of its context, this
| would be super cool as well. It could also interrupt you if
| the LLM self-reflects on the summary and realizes that now
| would be a good time to interrupt.
| philsnow wrote:
| > 2. Wait for a specific cue before responding. I like "What do
| you think?"
|
| "Over."
| dr_kiszonka wrote:
| "Over and out" closes the app. ;)
|
| Saying "Go" to indicate it's the bot's turn would work for
| me. (Or maybe pressing a button.) The bot should always stop
| whenever I start speaking.
| pksebben wrote:
| I wrote a sort of toy version of this a little while ago using
| Vosk and a variety of TTS engines, and the solution that worked
| _mostly-well_ was to have a buffer that filled with audio until
| a pause of so many seconds and then sent the whole thing to the
| LLM.
|
| With the implementation of tools for GPT, I could see a way to
| have the model check whether it thinks it received a complete
| thought, and if it didn't, send back a signal to keep appending
| to the buffer until the next long pause. The addition of a
| longer "pregnant pause" timeout could have the model check in
| to see if you're done talking or whatever.
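|
| That buffer idea, roughly, as an energy-threshold sketch (the
| numbers are placeholders you would tune for your microphone):
|
|     import numpy as np
|
|     SAMPLE_RATE = 16_000
|     PAUSE_SECONDS = 1.5
|     SILENCE_RMS = 0.01      # "quiet enough to count as a pause"
|
|     chunks, silent_for = [], 0.0
|
|     def on_chunk(chunk):
|         """Collect audio until the speaker pauses, then flush."""
|         global silent_for
|         chunks.append(chunk)
|         rms = float(np.sqrt(np.mean(chunk ** 2)))
|         if rms < SILENCE_RMS:
|             silent_for += len(chunk) / SAMPLE_RATE
|         else:
|             silent_for = 0.0
|         if silent_for >= PAUSE_SECONDS and len(chunks) > 1:
|             utterance = np.concatenate(chunks)
|             chunks.clear()
|             silent_for = 0.0
|             return utterance  # hand off to the STT -> LLM chain
|         return None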
| renus wrote:
| To streamline the experience, we don't wait for the pause to
| send the transcription to the LLM; we use the time spent
| waiting for the end-of-sentence trigger (the pause) to generate
| the LLM and text-to-speech output. So ideally, once we detect
| the pause, everything has already been processed.
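|
| Roughly speaking, each transcript update can kick off a
| speculative reply and the newest one wins; a very simplified
| sketch of the idea (not our actual implementation), where
| generate_and_synthesize stands in for the real LLM and
| text-to-speech stages:
|
|     from concurrent.futures import ThreadPoolExecutor
|
|     pool = ThreadPoolExecutor(max_workers=2)
|     speculative = None
|
|     def generate_and_synthesize(transcript):
|         # stand-in for the real LLM + text-to-speech pipeline
|         return f"(audio reply for {transcript!r})".encode()
|
|     def on_partial_transcript(text):
|         """(Re)start generating while waiting for the pause."""
|         global speculative
|         if speculative is not None:
|             # best effort: only cancels jobs that haven't started
|             speculative.cancel()
|         speculative = pool.submit(generate_and_synthesize, text)
|
|     def on_pause_confirmed():
|         """When the pause trigger fires, the reply is often ready."""
|         return speculative.result()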
| zan2434 wrote:
| I agree. I have been working on a 2-way interruption system +
| streaming like this. It's not robust yet, but when it works it
| does feel magical.
| lambdaba wrote:
| > Interruption
|
| Well, today is your lucky day!: https://persona-webapp-
| beta.vercel.app/ and the demo https://smarterchild.chat/
| dmw_ng wrote:
| The latency on this (or lack thereof) is the best I've seen; I
| would love to know more about how it's achieved. I asked the
| bot and it claimed you're using Google's speech recognition,
| which I know supports streaming, but this seems like much
| lower lag than I remember Google's stuff being capable of.
| napier wrote:
| this crops up in my feed every now and then and it has vastly
| superior perf vs. OAI's ChatGPT iOS app or anything else I've
| found. truly outstanding. are you planning on developing it
| further and/or monetizing it?
| lambdaba wrote:
| This isn't mine; it's from sindarin.tech. They already have
| paid versions, with one plan being $450 for 50 hours of speech
| (just checked, and it's up from 30 hours).
| spywaregorilla wrote:
| The latency on smarterchild is very low, but it doesn't seem
| to be interruptible. The UI seems to be restricting me from
| even entering anything in between my input and the AI response?
| irthomasthomas wrote:
| I did a video demo of this. Tell it to respond only with OK to
| every message, and to respond fully only when you say you are
| finished. Ok? Ok.
| bilsbie wrote:
| My dream is to do pair coding (and pair learning) with an AI.
|
| It would be a live conversation and it can see whatever I'm doing
| on my screen.
|
| We're gradually getting closer.
| andai wrote:
| I wanted to say that's Copilot, but you meant speaking instead
| of typing?
| bilsbie wrote:
| I envision it feeling like you're pair programming with a
| person. (I have problems staying motivated.) But that might
| be a good place to start.
| pksebben wrote:
| I have a similar dream, with one major caveat - it must
| unequivocally be a local model. "See whatever I'm doing on my
| screen" comes with "leaks information to the model" and that
| could go real-bad-wrong real fast.
| bilsbie wrote:
| I have a feeling people will really demand local only for AI.
|
| I'm not sure why the demand never materialized for other
| highly personal services like search, photos, medical, etc.
|
| But I just have this hunch we all really want it for AI.
| blooalien wrote:
| I have a feeling that a small subset of privacy-conscious
| "computer savvy" folks will care about local-only for AI,
| but that the vast majority of humanity simply won't know,
| care, or care to know why they should even care. For proof,
| just look at how _nobody_ cared about search, photos,
| medical, or other data until _theirs_ got leaked, and
| _still_ nobody cares about _them_ because "it's not _my_
| data that got leaked".
|
| We (we in the larger sense of computer users as a whole,
| not just the small subset of "power-users") _should_ care
| more about privacy and security and such, but most people
| think of computers and networks in the same way they think
| of a toaster or a hammer. To them it's a tool that does
| stuff when they push the right "magic button", and they
| couldn't care less what's inside, or how it could harm them
| if misused until it actually _does_ harm them (or comes
| close enough to it that they can no longer ignore it).
| lambdaba wrote:
| More than that, it can monitor your screen continuously and
| have perfect recall, so it will be able to correct you
| immediately, or remind you of relevant context.
|
| I like to call it "Artificial Attention".
| doctorpangloss wrote:
| This is an excellent project with excellent packaging. It is
| primarily a packaging problem.
|
| Why does every Python application on GitHub have its own ad-hoc,
| informally specified, bug-ridden, slow implementation of half of
| setuptools?
|
| Why does TensorRT distribute the most essential part of what it
| does in an "examples" directory?
|
| huggingface_cli... man, I already have a way to download
| something by a name; it's a zip file. In fact, why not make a
| PyPI index that fronts these models? We have so many ways
| already to install and cache read-only binary blobs...
| traverseda wrote:
| Well, the huggingface one is obvious enough: they want to
| encourage vendor lock-in and make themselves the default. Same
| reason why docker downloads from Docker Hub unless you
| explicitly request a full URL.
| abdullahkhalids wrote:
| We see tools such as this posted several times a week. Is
| there any expectation they will be installable by the common
| person? Where is the setup.exe, .deb, .rpm, .dmg?
| renus wrote:
| We are going to put the sample interface into the Docker
| image, so it becomes simply:
|
| > docker run --gpus all --shm-size 64G -p 80:80 -it
| ghcr.io/collabora/whisperfusion:latest
|
| instead of:
|
| > docker run --gpus all --shm-size 64G -p 6006:6006 -p
| 8888:8888 -it ghcr.io/collabora/whisperfusion:latest
| > cd examples/chatbot/html
| > python -m http.server
| codethief wrote:
| Seeing that this uses TensorRT (i.e. seems well optimized), what
| GPUs are supported? Could I run this on a Jetson?
| WhackyIdeas wrote:
| I had a quick read through; maybe I missed something, but does
| this all run locally, or does it need API access to OpenAI's
| remote system?
|
| The reason I ask is that I'm building something that does both
| TTS and STT using OpenAI, but I do not want to be sending a
| never-ending stream of audio to OpenAI just for it to listen for a
| single command I will eventually give it.
|
| If I can do all of this locally and use Mistral instead, then I'd
| give it a go too.
| renus wrote:
| Everything runs locally; we use:
|
| - WhisperLive for the transcription -
| https://github.com/collabora/WhisperLive
|
| - WhisperSpeech for the text-to-speech -
| https://github.com/collabora/WhisperSpeech
|
| and an LLM (phi-2, Mistral, etc.) in between.
| WhackyIdeas wrote:
| Thank you! When I read OpenAI I was thinking it would be going
| through them. This revelation is perfect timing for me...
| keeping user data even more private. Excellent!
___________________________________________________________________
(page generated 2024-01-29 23:01 UTC)