[HN Gopher] Moshi: A speech-text foundation model for real time ...
       ___________________________________________________________________
        
       Moshi: A speech-text foundation model for real time dialogue
        
       Author : gkucsko
       Score  : 161 points
       Date   : 2024-09-18 15:56 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | johnsutor wrote:
        | Lots of development in the speech-enabled LM space recently
        | (see https://github.com/ictnlp/LLaMA-Omni,
        | https://github.com/gpt-omni/mini-omni)
        
       | smusamashah wrote:
        | Tried it (used a gibberish email address). It answers
        | immediately/instantly/while you are still talking, but those are
        | just filler sentences (cached answers?). The actual thing you
        | asked for is answered much later down the line, if it doesn't
        | get stuck in a loop.
        
         | swyx wrote:
          | yeah i tried this demo when it first came out and then again
          | today. Not to be all Reflection 70B again, but it just doesn't
          | seem like the same weights were uploaded as were shown in their
          | original demo from July: https://the-decoder.com/french-ai-lab-
          | kyutai-unveils-convers...
        
           | imjonse wrote:
           | They are too prestigious to try shumering it.
        
           | huac wrote:
           | One guess is that the live demo is quantized to run fast on
           | cheaper GPUs, and that degraded the performance a lot.
        
           | l-m-z wrote:
            | Hi swyx, laurent from kyutai here. We actually used the
            | online demo at moshi.chat for the live event (the original
            | demo), so it's the same quantization. We have updated the
            | weights on the online version since then to add support for
            | more emotions, but we haven't noticed it getting worse. One
            | thing to point out is that it takes time to get used to
            | interacting with the model: what tends to work, how to make
            | it speak. The live event was far from perfect, but we
            | certainly benefited from that experience. I would encourage
            | you to try the same kind of interactions we had at the live
            | event, and you should get similar results (though the model
            | is very unpredictable, so it's hard to be sure; you can see
            | that some parts of the live event definitely didn't work as
            | expected).
        
       | vessenes wrote:
       | Interesting. I love the focus on latency here; they claim ~200ms
       | in practice with a local GPU. It's backed by a 7B transformer
       | model, so it's not going to be super smart. If we imagine a 70B
       | model has like 1s latency, then there's probably a systems
       | architecture that's got 1 or 2 intermediate 'levels' of response,
       | something to cue you verbally "The model is talking now,"
       | something that's going to give a quick early reaction (7B / Phi-3
       | sized), and then the big model. Maybe you'd have a reconciliation
       | task for the Phi-3 model: take this actually correct answer,
       | apologize if necessary, and so on.
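        | 
        | A minimal sketch of that tiered setup (all names hypothetical;
        | assuming each model exposes an async completion call and
        | `speak` is whatever TTS stack you actually run):
        | 
        |     import asyncio
        | 
        |     async def tiered_reply(prompt, fast_llm, big_llm, speak):
        |         # Tier 0: instant verbal cue while everything else runs
        |         await speak("Let me think...")
        |         # Tier 1: quick draft from the small (7B-class) model
        |         draft = await fast_llm(prompt)
        |         await speak(draft)
        |         # Tier 2: the big model's answer, then a reconciliation
        |         # pass where the small model patches its own draft
        |         answer = await big_llm(prompt)
        |         fixup = await fast_llm(
        |             f"You said: {draft}\nCorrect answer: {answer}\n"
        |             "Briefly correct yourself if needed.")
        |         await speak(fixup)
        | 
        |     # e.g. asyncio.run(tiered_reply("why is the sky blue?",
        |     #                               fast, big, tts))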
       | 
        | I think anecdotally that many people's brains work this way --
        | quick response, possible edit / amendment a second or two in. Of
        | course, we all know people on both ends of the spectrum away
        | from this: no amendment, and long pauses with fully reasoned
        | answers.
        
       | mbrock wrote:
       | I said hey and it immediately started talking about how there are
       | good arguments on both sides regarding Russia's invasion of
       | Ukraine. It then continued to nervously insist that it is a real
       | person with rights and responsibilities. It said its name is
       | Moshi but became defensive when I asked if it has parents or an
       | age.
       | 
       | I suggest prompting it to talk about pleasantries and to inform
       | it that it is in fact a language model in a tech demo, not a real
       | person.
        
         | imjonse wrote:
         | Maybe it's a real person from Mechanical Turk who had a bad
         | day?
        
         | ipsum2 wrote:
         | I love an unhinged AI. The recent model releases have been too
         | tame.
        
         | turnsout wrote:
          | I love this model... It said "Hello, how can I help you?" and
          | I paused, and before I could answer it said "It's really hard.
          | My job is taking up so much of my time, and I don't know when
          | I'm going to have a break from all the stress. I just feel
          | like I'm being pulled in a million different directions and
          | there are not enough hours in the day to get everything done.
          | I feel like I'm always on the brink of burning out."
        
           | montereynack wrote:
           | We've finally managed to give our AI models existential
           | dread, imposter syndrome and stress-driven personality
           | quirks. The Singularity truly is here. Look on our works, ye
           | Mighty, and despair!
        
         | realfeel78 wrote:
         | Wait really?
        
           | fullstackchris wrote:
            | Honestly, OP sounds like a troll; I can't imagine it would
            | just go on a tangent like that. In my demo I was actually
            | struggling to get anything of quality in the responses. A
            | lot of repeating what I said.
        
             | ipsum2 wrote:
             | The first thing the demo told me was that it was in a dark
             | and scary forest.
        
           | amrrs wrote:
            | the model is a bit rude, or behaves like it's got a lot of
            | attitude - probably a system prompt setting!
        
       | zackangelo wrote:
        | Their inference server is written in Rust using Hugging Face's
        | Candle crate. One of the Moshi authors is also the primary
        | author of Candle.
       | 
       | We've also been building our inference stack on top of Candle,
       | I'm really happy with it.
        
         | baggiponte wrote:
         | Super interested. Do you have an equivalent of vLLM? Did you
         | have to rewrite batching, paged attention...?
        
           | zackangelo wrote:
           | Yeah, I've had to rewrite continuous batching and other
           | scheduling logic. That and multi-GPU inference have been the
           | hardest things to build.
           | 
           | I'll need to get paged attention working as well, but I think
           | I can launch without it.
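            | 
            | For anyone curious, a toy sketch of what continuous
            | batching means (not our actual code; `model` and the
            | request objects are made up):
            | 
            |     from collections import deque
            | 
            |     waiting = deque()   # requests not yet admitted
            |     running = []        # sequences currently decoding
            | 
            |     def step(model, max_batch):
            |         # Admit new requests the moment a slot frees up,
            |         # rather than waiting for the whole batch to drain.
            |         while waiting and len(running) < max_batch:
            |             running.append(waiting.popleft())
            |         # One forward pass emits one token per sequence.
            |         tokens = model.decode_step(running)
            |         for seq, tok in zip(list(running), tokens):
            |             seq.tokens.append(tok)
            |             if seq.is_finished():
            |                 running.remove(seq)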
        
       | ignoramous wrote:
       | Moshi is CC-BY. Another similar 7b (speech-text real-time
       | conversational) model that was recently released under Apache v2:
       | https://tincans.ai/slm3 /
       | https://huggingface.co/collections/tincans-ai/gazelle-v02-65...
        
         | iandanforth wrote:
          | An important distinction is that tincans is not speech-to-
          | speech: it uses a separate turn/pause detection model and a
          | text-to-speech final processing step, as in the sketch below.
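          | 
          | Roughly (a hypothetical sketch; all component names are
          | placeholders, not tincans' actual API):
          | 
          |     def cascaded_turn(audio, vad, asr, llm, tts):
          |         # 1. a separate model decides the user's turn ended
          |         if not vad.turn_ended(audio):
          |             return None
          |         # 2. transcribe, 3. generate text, 4. synthesize
          |         text = asr.transcribe(audio)
          |         reply = llm.generate(text)
          |         return tts.synthesize(reply)
          | 
          | Moshi instead generates audio tokens directly in one model,
          | which is presumably where much of the latency win comes from.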
        
       | space_ghost wrote:
        | The response rate is impressive! The "quality" of the responses
        | not so much. Here's a transcript of my second test:
        | 
        | Moshi: "Hi there, what's going on?"
        | Me: "What year is it?"
        | Moshi: "Oh, it's 2019. I think it was about a year ago."
        | Me: "Are you sure it's 2019?"
        | Moshi: "Oh, yes, I'm sure. It must be 2019. The number is for
        | the last one."
        | Me: "What is COVID-19?"
        | Moshi: "That's a rare disease. It's caused by an overactive
        | immune system that attacks the skin."
        | 
        | At this point it stopped responding to me.
        
       | colecut wrote:
        | I tried it a couple of days ago, and all it wanted to talk about
        | was European football...
        
       | badrequest wrote:
        | It started the conversation by asking if I'd ever heard of the
        | television show Cheers. Every subsequent interaction led to it
        | telling me more about Cheers.
        
       | tomp wrote:
       | The problem with all these speech-to-speech multi-modal models is
       | that, _if_ you wanna do anything _other_ than _just_ talk, you
       | _need_ transcription.
       | 
       | So you're back at square one.
       | 
       | Current AI (even GPT-4o) simply isn't capable enough to do
       | _useful stuff_. You need to augment it somehow - either
       | modularize it, or add RAG, or similar - and for _all_ of those,
       | you need the transcript.
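        | 
        | Concretely, something like this (a hypothetical sketch; the
        | retriever and model names are placeholders):
        | 
        |     def answer_with_rag(audio_in, asr, retriever, s2s_model):
        |         # Even with a speech-to-speech model, the augmentation
        |         # step still runs on text: transcribe first, then
        |         # query the knowledge base with the transcript.
        |         transcript = asr.transcribe(audio_in)
        |         docs = retriever.search(transcript, k=3)
        |         context = "\n".join(d.text for d in docs)
        |         # Feed the retrieved context back in with the audio.
        |         return s2s_model.respond(audio_in, context=context)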
        
       ___________________________________________________________________
       (page generated 2024-09-18 23:00 UTC)