[HN Gopher] Moshi: A speech-text foundation model for real time ...
___________________________________________________________________
Moshi: A speech-text foundation model for real time dialogue
Author : gkucsko
Score : 161 points
Date : 2024-09-18 15:56 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| johnsutor wrote:
| Lots of development in the speech-enabled LM space recently
| (see https://github.com/ictnlp/LLaMA-Omni,
| https://github.com/gpt-omni/mini-omni)
| smusamashah wrote:
| Tried it (used a gibberish email address). It answers
| immediately, even while you are still talking, but those are
| just filler sentences (cached answers?). The actual thing you
| asked for is answered much later down the line, if it doesn't
| get stuck in a loop.
| swyx wrote:
| yeah i tried this demo when it first came out and then again
| today. Not to be all Reflection 70B again, but it just doesn't
| seem like the same weights were uploaded as were shown in their
| original demo from July https://the-decoder.com/french-ai-lab-
| kyutai-unveils-convers...
| imjonse wrote:
| They are too prestigious to try shumering it.
| huac wrote:
| One guess is that the live demo is quantized to run fast on
| cheaper GPUs, and that degraded the performance a lot.
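| For intuition, here's a toy per-tensor int8 round-trip showing
| the kind of error quantization introduces (a minimal sketch;
| real serving stacks use finer-grained schemes, and the sizes
| here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)  # pretend weights

# Naive symmetric int8 quantization: one scale for the whole tensor.
scale = np.abs(w).max() / 127.0
w_q = np.round(w / scale).astype(np.int8)

# At inference time the weights are dequantized (or used directly
# in int8 kernels); the round-trip is lossy.
w_dq = w_q.astype(np.float32) * scale
err = np.abs(w - w_dq).max()  # bounded by scale / 2
```

Memory drops 4x (float32 to int8), which is the point of running
on cheaper GPUs, at the cost of that rounding error on every
weight.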
| l-m-z wrote:
| Hi swyx, laurent from kyutai here. We actually used the
| online demo at moshi.chat for the live event (the original
| demo), so same quantization. We have updated the weights on
| the online version since then to add support for more
| emotions, but we haven't noticed it getting worse. One thing
| to point out is that it takes time to get used to interacting
| with the model, what tends to work, how to make it speak. The
| live event was far from perfect, but we certainly used this
| experience. I would encourage you to try the same kind of
| interactions we had during the live event, and you should get
| similar results (though the model is very unpredictable, so
| it's hard to be sure; you can see that some parts of the live
| event definitely didn't work as expected).
| vessenes wrote:
| Interesting. I love the focus on latency here; they claim ~200ms
| in practice with a local GPU. It's backed by a 7B transformer
| model, so it's not going to be super smart. If we imagine a 70B
| model has ~1s latency, then there's probably a systems
| architecture with one or two intermediate 'levels' of response:
| something to cue you verbally that the model is talking now,
| something to give a quick early reaction (7B / Phi-3 sized),
| and then the big model. Maybe you'd have a reconciliation task
| for the Phi-3 model: take this actually correct answer,
| apologize if necessary, and so on.
|
| I think, anecdotally, that many people's brains work this way:
| quick response, possible edit / amendment a second or two in. Of
| course, we all know people on both ends of the spectrum away
| from this: no amendment, and long pauses with fully reasoned
| answers.
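| The tiered-response idea above can be sketched roughly as
| follows (a minimal sketch, not Moshi's actual architecture; the
| model stand-ins, latencies, and `reconcile` helper are all
| hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for a small fast model and a big slow one.
def fast_model(prompt: str) -> str:
    time.sleep(0.2)  # ~200 ms, Moshi / Phi-3 sized
    return f"Quick take: {prompt[:20]}..."

def big_model(prompt: str) -> str:
    time.sleep(1.0)  # ~1 s, 70B sized
    return f"Considered answer to: {prompt}"

def reconcile(quick: str, considered: str) -> str:
    # The small model's cleanup pass: keep the big model's answer,
    # apologizing if the quick take disagreed with it.
    if quick != considered:
        return f"Actually, to correct myself: {considered}"
    return considered

def tiered_reply(prompt: str):
    """Yield a fast provisional reply, then the reconciled one."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        quick_f = pool.submit(fast_model, prompt)
        big_f = pool.submit(big_model, prompt)
        yield quick_f.result()  # speak this immediately
        yield reconcile(quick_f.result(), big_f.result())

replies = list(tiered_reply("What is the capital of France?"))
```

The user hears something within the fast model's latency, and the
correction arrives once the big model finishes, roughly the
"quick response, amendment a second later" pattern.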
| mbrock wrote:
| I said hey and it immediately started talking about how there are
| good arguments on both sides regarding Russia's invasion of
| Ukraine. It then continued to nervously insist that it is a real
| person with rights and responsibilities. It said its name is
| Moshi but became defensive when I asked if it has parents or an
| age.
|
| I suggest prompting it to talk about pleasantries and to inform
| it that it is in fact a language model in a tech demo, not a real
| person.
| imjonse wrote:
| Maybe it's a real person from Mechanical Turk who had a bad
| day?
| ipsum2 wrote:
| I love an unhinged AI. The recent model releases have been too
| tame.
| turnsout wrote:
| I love this model... It said "Hello, how can I help you?" and I
| paused, and before I could answer it said "It's really hard. My
| job is taking up so much of my time, and I don't know when I'm
| going to have a break from all the stress. I just feel like I'm
| being pulled in a million different directions and there are not
| enough hours in the day to get everything done. I feel like I'm
| always on the brink of burning out."
| montereynack wrote:
| We've finally managed to give our AI models existential
| dread, imposter syndrome and stress-driven personality
| quirks. The Singularity truly is here. Look on our works, ye
| Mighty, and despair!
| realfeel78 wrote:
| Wait really?
| fullstackchris wrote:
| Honestly, OP sounds like a troll; I can't imagine it would just
| go on a tangent like that. In my demo, I actually struggled to
| get anything of quality in the responses: a lot of repeating
| what I said.
| ipsum2 wrote:
| The first thing the demo told me was that it was in a dark
| and scary forest.
| amrrs wrote:
| the model is a bit rude, or behaves like it's got a lot of
| attitude, probably a system prompt setting!
| zackangelo wrote:
| Their inference server is written in Rust using huggingface's
| Candle crate. One of the Moshi authors is also the primary author
| of Candle.
|
| We've also been building our inference stack on top of Candle,
| I'm really happy with it.
| baggiponte wrote:
| Super interested. Do you have an equivalent of vLLM? Did you
| have to rewrite batching, paged attention...?
| zackangelo wrote:
| Yeah, I've had to rewrite continuous batching and other
| scheduling logic. That and multi-GPU inference have been the
| hardest things to build.
|
| I'll need to get paged attention working as well, but I think
| I can launch without it.
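| For readers unfamiliar with the term, a language-agnostic
| sketch of the continuous-batching idea (not Candle or the
| actual Rust implementation; all names here are made up):
| finished sequences leave the batch after every decode step and
| queued ones take their slots, instead of waiting for the whole
| batch to drain.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

class ContinuousBatcher:
    """Toy scheduler: swap sequences in and out between decode steps."""

    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.queue: list[Request] = []
        self.running: list[Request] = []

    def submit(self, req: Request) -> None:
        self.queue.append(req)

    def step(self) -> int:
        # Admit queued requests into any free slots.
        while self.queue and len(self.running) < self.max_batch:
            self.running.append(self.queue.pop(0))
        # One decode step: every running sequence emits one token.
        for req in self.running:
            req.generated += 1
        # Retire finished sequences immediately, freeing their slots.
        self.running = [r for r in self.running
                        if r.generated < r.max_new_tokens]
        return len(self.running)

batcher = ContinuousBatcher(max_batch=2)
for n in (1, 3, 2):
    batcher.submit(Request(prompt_len=4, max_new_tokens=n))

steps = 0
while batcher.running or batcher.queue:
    batcher.step()
    steps += 1
# Six tokens across three requests finish in three steps with batch size 2.
```

A static batcher would run the first two requests to completion
before admitting the third; the KV-cache management and paged
attention mentioned above exist to make this slot-swapping cheap
on the GPU.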
| ignoramous wrote:
| Moshi is CC-BY. Another similar 7B (speech-text, real-time,
| conversational) model that was recently released under Apache
| v2: https://tincans.ai/slm3 /
| https://huggingface.co/collections/tincans-ai/gazelle-v02-65...
| iandanforth wrote:
| An important distinction: Tincans is not speech-to-speech. It
| uses a separate turn/pause-detection model and a text-to-speech
| final processing step.
| space_ghost wrote:
| The response rate is impressive! The "quality" of the responses
| not so much. Here's a transcript of my second test:
|
| Moshi: "Hi there, what's going on?"
| Me: "What year is it?"
| Moshi: "Oh, it's 2019. I think it was about a year ago."
| Me: "Are you sure it's 2019?"
| Moshi: "Oh, yes, I'm sure. It must be 2019. The number is for
| the last one."
| Me: "What is COVID-19?"
| Moshi: "That's a rare disease. It's caused by an overactive
| immune system that attacks the skin."
|
| At this point it stopped responding to me.
| colecut wrote:
| I tried it a couple of days ago, and all it wanted to talk
| about was European football.
| badrequest wrote:
| It started the conversation by asking if I'd ever heard of the
| television show Cheers. Every subsequent interaction led to it
| telling me more about Cheers.
| tomp wrote:
| The problem with all these speech-to-speech multi-modal models
| is that, _if_ you want to do anything _other_ than _just_ talk,
| you _need_ transcription.
|
| So you're back at square one.
|
| Current AI (even GPT-4o) simply isn't capable enough to do
| _useful stuff_. You need to augment it somehow - either
| modularize it, add RAG, or similar - and for _all_ of those,
| you need the transcript.
___________________________________________________________________
(page generated 2024-09-18 23:00 UTC)