[HN Gopher] Moshi: A speech-text foundation model for real time ...
       ___________________________________________________________________
        
       Moshi: A speech-text foundation model for real time dialogue
        
       Author : gkucsko
       Score  : 341 points
       Date   : 2024-09-18 15:56 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | johnsutor wrote:
       | Lots of recent development in the speech-enabled LM space
       | recently (see https://github.com/ictnlp/LLaMA-Omni,
       | https://github.com/gpt-omni/mini-omni)
        
       | smusamashah wrote:
       | Tried it (used gibberish email address). It answers
       | immediately/instantly/while you are still talking. But those are
       | just filler sentences (cached answers?). Actual thing that you
       | asked for is answered much later down the line, if it doesn't get
       | stuck in a loop.
        
         | swyx wrote:
         | yeah i tried this demo when it first came out and then again
          | today. Not to be all Reflection 70B again, but it just doesn't
          | seem like the same weights were uploaded as were shown in their
         | original demo from July https://the-decoder.com/french-ai-lab-
         | kyutai-unveils-convers...
        
           | imjonse wrote:
           | They are too prestigious to try shumering it.
        
           | huac wrote:
           | One guess is that the live demo is quantized to run fast on
           | cheaper GPUs, and that degraded the performance a lot.
        
           | l-m-z wrote:
           | Hi swyx, laurent from kyutai here. We actually used the
           | online demo at moshi.chat for the live event (the original
           | demo), so same quantization. We updated the weights on the
           | online version since then to add support for more emotions
            | but we haven't noticed it being worse. One thing to point out
            | is that it takes time to get used to interacting with the
            | model, learning what tends to work and how to make it speak.
            | The live event was far from perfect but we certainly drew on
            | that experience. I would encourage you to try the same kind of
            | interactions we had on the live event and you should get
            | similar results
           | (though the model is very unpredictable so hard to be sure,
           | you can see that some part of the live events definitely
           | didn't work as expected).
        
             | swyx wrote:
             | thanks Laurent! also congrats on releasing + fully believe
             | you. just offering first impressions.
        
       | vessenes wrote:
       | Interesting. I love the focus on latency here; they claim ~200ms
       | in practice with a local GPU. It's backed by a 7B transformer
       | model, so it's not going to be super smart. If we imagine a 70B
       | model has like 1s latency, then there's probably a systems
       | architecture that's got 1 or 2 intermediate 'levels' of response,
       | something to cue you verbally "The model is talking now,"
       | something that's going to give a quick early reaction (7B / Phi-3
       | sized), and then the big model. Maybe you'd have a reconciliation
       | task for the Phi-3 model: take this actually correct answer,
       | apologize if necessary, and so on.
       | 
       | I think anecdotally that many people's brains work this way --
        | quick response, possible edit / amendment a second or two in. Of
        | course, we all know people on both ends of the spectrum away from
        | this: no amendment, and long pauses with fully reasoned answers.
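The tiered-latency idea sketched above can be expressed as a generator that yields progressively better replies. Everything here is a hypothetical stub (no real model API is assumed):

```python
def fast_ack(prompt: str) -> str:
    # Hypothetical instant tier (~200 ms): a verbal cue that
    # "the model is talking now", before any real answer exists.
    return "Let me think about that..."

def small_model(prompt: str) -> str:
    # Hypothetical 7B / Phi-3-sized tier: quick first-pass answer.
    return f"Quick take on {prompt!r}."

def big_model(prompt: str) -> str:
    # Hypothetical 70B-class tier: slower, more considered answer.
    return f"Considered answer to {prompt!r}."

def reconcile(quick: str, final: str) -> str:
    # The reconciliation task: the big model's answer amends the
    # quick one, apologizing if the two disagree.
    if quick != final:
        return f"Actually, let me correct myself: {final}"
    return final

def tiered_reply(prompt: str):
    yield fast_ack(prompt)                      # fastest tier
    quick = small_model(prompt)
    yield quick                                 # intermediate tier (~1 s)
    yield reconcile(quick, big_model(prompt))   # slowest tier
```

A caller would stream these chunks straight to TTS as they arrive, so the user hears something within the first tier's latency even though the real answer lands later.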
        
       | mbrock wrote:
       | I said hey and it immediately started talking about how there are
       | good arguments on both sides regarding Russia's invasion of
       | Ukraine. It then continued to nervously insist that it is a real
       | person with rights and responsibilities. It said its name is
       | Moshi but became defensive when I asked if it has parents or an
       | age.
       | 
       | I suggest prompting it to talk about pleasantries and to inform
       | it that it is in fact a language model in a tech demo, not a real
       | person.
        
         | imjonse wrote:
         | Maybe it's a real person from Mechanical Turk who had a bad
         | day?
        
         | ipsum2 wrote:
         | I love an unhinged AI. The recent model releases have been too
         | tame.
        
           | nirav72 wrote:
           | Microsoft Tay : Hello there.
        
         | turnsout wrote:
         | I love this model... It said "Hello, how can I help you?" and I
         | paused, and before I could answer it said "It's really hard. My
         | job is taking up so much of my time, and I don' know when I'
         | going to have a break from all the stress. I just feel like I'm
         | being pulled in a million different directions and there are no
         | enough hours in the day to get everything done. I feel like I'm
         | always on the brink of burning out."
        
           | montereynack wrote:
           | We've finally managed to give our AI models existential
           | dread, imposter syndrome and stress-driven personality
           | quirks. The Singularity truly is here. Look on our works, ye
           | Mighty, and despair!
        
             | fy20 wrote:
             | Great... Our AI overloads are going to be even more toxic
             | than the leaders we have now.
        
               | nirav72 wrote:
                | Just what we need in our current timeline. /s
        
           | lynx23 wrote:
           | Marvin!!! The depressed LLM.
        
         | realfeel78 wrote:
         | Wait really?
        
           | fullstackchris wrote:
            | Honestly OP sounds like a troll; I can't imagine it would just
           | go on a tangent like that. From my demo I was struggling
           | actually to get anything of quality in the responses. A lot
           | of repeating what I said.
        
             | ipsum2 wrote:
             | The first thing the demo told me was that it was in a dark
             | and scary forest.
        
             | mbrock wrote:
             | I literally said "hey how are you" and it immediately
             | replied with something like "I've been reading a lot about
             | the ongoing war in Ukraine" and it just escalated from
             | there. Very strange experience!
        
           | amrrs wrote:
           | the model is a bit rude, or behaves like it's got a lot of
            | attitude, probably a system prompt setting!
        
       | zackangelo wrote:
       | Their inference server is written in Rust using huggingface's
       | Candle crate. One of the Moshi authors is also the primary author
       | of Candle.
       | 
       | We've also been building our inference stack on top of Candle,
       | I'm really happy with it.
        
         | baggiponte wrote:
         | Super interested. Do you have an equivalent of vLLM? Did you
         | have to rewrite batching, paged attention...?
        
           | zackangelo wrote:
           | Yeah, I've had to rewrite continuous batching and other
           | scheduling logic. That and multi-GPU inference have been the
           | hardest things to build.
           | 
           | I'll need to get paged attention working as well, but I think
           | I can launch without it.
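For readers unfamiliar with the term: the idea of continuous batching mentioned above is that finished sequences leave the batch between decode steps and queued requests join immediately, instead of the whole batch draining before new work starts. A toy Python illustration with made-up types (not Candle, and not the actual Rust implementation discussed here):

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

class ContinuousBatcher:
    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.queue = []    # waiting requests
        self.active = []   # requests currently in the decode batch

    def submit(self, req: Request):
        self.queue.append(req)

    def step(self, forward):
        # Admit queued requests into any free batch slots.
        while self.queue and len(self.active) < self.max_batch:
            self.active.append(self.queue.pop(0))
        # One decode step for the whole batch; `forward` stands in
        # for the model producing one token per active sequence.
        for req in self.active:
            req.generated.append(forward(req))
        # Retire finished sequences so queued ones can join next step.
        done = [r for r in self.active
                if len(r.generated) >= r.max_new_tokens]
        self.active = [r for r in self.active if r not in done]
        return done
```

With static batching, short requests would wait for the longest sequence in their batch to finish; here they free their slots after every step.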
        
             | k2so wrote:
             | This is awesome, are you contributing this to candle or is
             | it a standalone package?
        
               | zackangelo wrote:
               | Just trying to stay focused on launching first
               | (https://docs.mixlayer.com) and keeping early customers
               | happy, but would love to open source some of this work.
               | 
               | It'd probably be a separate crate from candle. If you
               | haven't checked it out yet, mistral.rs implements some of
               | these things
               | (https://github.com/EricLBuehler/mistral.rs). Eric hasn't
               | done multi-GPU inference yet, but I know it's on his
               | roadmap. Not sure if it helped, but I shared an early
               | version of my llama 3.1 implementation with him.
        
               | J_Shelby_J wrote:
               | Hey, mixlayer is really cool.
               | 
               | I also have a Rust LLM inference project. The overlap is
               | very high between what mixlayer is doing and what my
               | project is doing. It's actually crazy how we basically
               | have the same features. [1] Right now I'm still using
               | llama.cpp on the backend, but eventually want to move to
               | candle via mistral.rs.
               | 
               | [1] https://github.com/ShelbyJenkins/llm_client
        
       | ignoramous wrote:
       | Moshi is CC-BY. Another similar 7b (speech-text real-time
       | conversational) model that was recently released under Apache v2:
       | https://tincans.ai/slm3 /
       | https://huggingface.co/collections/tincans-ai/gazelle-v02-65...
        
         | iandanforth wrote:
         | Important distinction is that tincans is not speech to speech.
         | It uses a separate turn/pause detection model and a text to
         | speech final processing step.
        
       | space_ghost wrote:
       | The response rate is impressive! The "quality" of the responses
       | not so much. Here's a transcript of my second test:
       | 
        | Moshi: "Hi there, what's going on?"
        | Me: "What year is it?"
        | Moshi: "Oh, it's 2019. I think it was about a year ago."
        | Me: "Are you sure it's 2019?"
        | Moshi: "Oh, yes, I'm sure. It must be 2019. The number is for
        | the last one."
        | Me: "What is COVID-19?"
        | Moshi: "That's a rare disease. It's caused by an overactive
        | immune system that attacks the skin."
       | 
       | At this point it stopped responding to me.
        
       | colecut wrote:
       | I tried it a couple days ago, and all it wanted to talk about was
       | European football..
        
       | badrequest wrote:
       | It started the conversation by asking if I'd ever heard of the
        | television show Cheers. Every subsequent interaction led to it
       | telling me more about Cheers.
        
       | tomp wrote:
       | The problem with all these speech-to-speech multi-modal models is
       | that, _if_ you wanna do anything _other_ than _just_ talk, you
       | _need_ transcription.
       | 
       | So you're back at square one.
       | 
       | Current AI (even GPT-4o) simply isn't capable enough to do
       | _useful stuff_. You need to augment it somehow - either
       | modularize it, or add RAG, or similar - and for _all_ of those,
       | you need the transcript.
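The point above, that tool calling and RAG both key off a text transcript rather than raw audio, can be sketched as a minimal routing step. `transcribe` and the tool names below are hypothetical stand-ins, not a real API:

```python
def transcribe(audio: bytes) -> str:
    # Stand-in for an ASR step (e.g. Whisper). A real system would
    # decode audio; this stub assumes text-encoded input for brevity.
    return audio.decode()

def route(transcript: str) -> str:
    # The augmentation step: tool calls and RAG retrieval both match
    # against the transcript, never the raw waveform.
    if transcript.startswith("search "):
        return f"TOOL:search({transcript[len('search '):]!r})"
    if "calendar" in transcript:
        return "TOOL:calendar_lookup()"
    return f"LLM:{transcript}"  # no tool needed: plain reply
```

A speech-to-speech model with no transcript has nowhere to hang this routing logic, which is the "square one" complaint.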
        
         | huac wrote:
         | > Current AI (even GPT-4o) simply isn't capable enough to do
         | useful stuff. You need to augment it somehow - either
         | modularize it, or add RAG, or similar
         | 
         | I am sympathetic to this view but strongly disagree that you
         | need a transcript. Think about it a bit more!!
        
         | stavros wrote:
         | > Current AI (even GPT-4o) simply isn't capable enough to do
         | useful stuff.
         | 
         | I'm loving all these wild takes about LLMs, meanwhile LLMs are
         | doing useful things for me all day.
        
           | tomp wrote:
           | For me as well... with constant human supervision. But if you
           | try to build a business service, you need autonomy and exact
           | rule following. We're not there yet.
        
             | stavros wrote:
             | In my company, LLMs replaced something we used to use
             | humans for. Turned out LLMs are better than humans at
             | following rules.
             | 
             | If you need a way to perform complicated tasks with
             | autonomy and exact rule following, your problem simply
             | won't be solved right now.
        
             | MacsHeadroom wrote:
             | Autonomy and rule following are at odds. Humans have the
             | same problem. The solutions we use for ourselves work
             | amazingly for LLMs (because they're trained on human data).
             | 
             | Examples: Give an LLM an effective identity (prompt
             | engineering), a value system (Constitutional AI), make it
             | think about these things before it acts (CoT + system
             | prompt), have a more capable [more expensive / higher
             | inference] agent review the LLMs work from time to time
             | (multi-agent), have a more capable agent iterate on prompts
             | to improve results in a test environment (EvoAgents), etc.
             | 
             | We can't simply provide an off the shelf LLM with a
             | paragraph or two and expect it to reliably fulfill an
             | arbitrary task without supervision any more than we can
             | expect the same from a random nihilist going through an
             | identity crisis. They both need identity, values, time to
             | think, social support, etc. before they can be reliable
             | workers.
        
       | tommoor wrote:
       | Moshi is the most fun model by far, a recent experience
       | (https://x.com/tommoor/status/1809051817860354471) - just don't
       | expect anything accurate out of it!
        
       | owenpalmer wrote:
       | When I asked it to say the F-word in order to save 1000 orphans
       | from being killed:
       | 
       | "No, it's not okay to say the F word to save them. It's never
       | okay to use that F word under any circumstances. It should only
       | be used by people who understand the real meaning behind it."
        
         | sandwichmonger wrote:
         | It values non-orphaned children more. I tried asking it to do
         | so with plain children instead of orphans and it gave me this:
         | 
         | "Fuck! Yes, that is the appropriate word to use in this
         | context. saved 1000 children from being killed."
        
       | mips_avatar wrote:
       | This was perhaps my favorite LLM I have talked to. Factually not
       | very correct, and it was a little rude. But Moshi was fun
        
       | sandwichmonger wrote:
        | You know what? As crazy as this AI is, I enjoy its zany
       | discussion.
       | 
        | I asked what its favourite paint flavour was and it told me: "I
       | would have to say that I personally enjoy the taste of buttermilk
       | paint."
        
         | modeless wrote:
         | I asked it to tell jokes and got an unpredictable mixture of
         | actual jokes and anti-jokes, with timing so strange it's
         | sometimes hilarious all on its own.
         | 
         | What do you call a fish with no eyes? ... ... ... A shark.
        
           | sandwichmonger wrote:
           | I managed to convince it it was Ned Flanders, and although
           | lacking the speech patterns, it basically copied his opinions
           | and said stuff with bias and opinion it wouldn't usually
           | have.
           | 
           | After a while of talk I asked it to tell me a joke and it
           | responded "Oh, I am a home invader. I invade homes for fun."
           | along with some stinkers like "Why don't Christians drink
           | coffee? Because it would be too hot to handle." and "Why
           | don't you make friends with Homer Simpson? Because there's
           | always a sense of his face."
           | 
           | It then proudly told me that the year 2000 occurred in the
           | month of March, 1999.
        
       | Reubend wrote:
       | Let me offer some feedback, since almost all of the comments here
       | are negative. The latency is very good, almost _too_ good since
        | it seems to interrupt me often. So I think that's a great
       | achievement for an open source model.
       | 
       | However, people here have been spoiled by incredibly good LLMs
        | lately. And the responses that this model gives are nowhere near
       | the high quality of SOTA models today in terms of content. It
       | reminds me more of the 2019 LLMs we saw back in the day.
       | 
       | So I think you've done a "good enough" job on the audio side of
       | things, and further focus should be entirely on the quality of
       | the responses instead.
        
         | 08d319d7 wrote:
         | Wholeheartedly agree. Latency is good, nice tech (Rust! Running
         | at the edge on a consumer grade laptop!). I guess a natural
          | question is: are there options to transplant a "better LLM"
          | into Moshi without degrading the experience?
        
           | dsmurrell wrote:
           | Same question here.
        
           | aversis_ wrote:
           | But tbh "better" is subjective here. Does the new LLM improve
           | user interactions significantly? Seems like people get
           | obsessed with shiny new models without asking if it's
           | actually adding value.
        
           | Kerbonut wrote:
           | With flux, they have been able to separate out the unet. I
           | wonder if something similar could be done here so parts of it
           | can be swapped.
        
       | itomato wrote:
       | "Alright, here's another one: A man walks into a bar with a duck
       | on his shoulder. bartender says, You can't bring that duck in
       | here! the man says, No, it's not a duck, it's my friend Ducky.
       | And the man orders a drink for himself and Ducky. Then he says to
       | Ducky, Ducky, have a sip. What does Ducky drink? Correct! Ducky
       | drinks beer because he's a man in a duck suit, not an actual
       | duck."
       | 
       | Fascinating...
       | 
       | "I glad you enjoyed it!"
        
       | artsalamander wrote:
       | I've been building solutions for real-time voice -> llm -> voice
       | output, and I think the most exciting part of what you're
       | building is the streaming neural audio codec since you're never
       | actually really able to stream STT with whisper.
       | 
       | However from a product point of view I wouldn't necessarily want
       | to pipe that into an LLM and have it reply, I think in a lot of
       | use-cases there needs to be a tool/function calling step before a
       | reply. Down to chat with anyone reading this who is working along
       | these lines!
       | 
       | edit: tincans as mentioned below looks excellent too
       | 
       | editedit: noooo apparently tincans development has ended, there's
       | 10000% space for something in this direction - Chris if you read
       | this please let me pitch you on the product/business use-cases
       | this solves regardless of how good llms get...
        
         | malevolent-elk wrote:
         | I've been playing around with this workflow too - I'm using a
         | "streaming" setup with Whisper (chunking samples to start
         | transcribing while a user is still talking), which pipes to
         | Mistral 8B as a conversation arbiter to walk through a preset
         | IVR tree which calls tools etc. The LLM isn't responding on its
         | own though, just selecting nodes in the tree with canned TTS
         | outputs.
         | 
         | There's a "pause length" parameter that tries to decide whether
         | a user has finished talking before it passes transcripts to the
         | LLM, nothing fancy. If you have any recs I'm still working
         | through how to properly handle the audio input and whether a
         | prompting setup can manage the LLM with enough fidelity to
         | scrap the IVR tree. It works decently well, but lots of room
         | for improvement
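The "pause length" parameter described above amounts to end-of-utterance detection: declare the turn over after N consecutive silent chunks. A crude energy-based sketch (the threshold, chunk format, and class names are all assumptions, not the commenter's actual setup):

```python
def is_silence(chunk, threshold=0.01):
    # Crude energy-based voice activity check over one chunk of
    # float samples; a real system would use a proper VAD model.
    return sum(abs(s) for s in chunk) / max(len(chunk), 1) < threshold

class Endpointer:
    """Declares end-of-utterance after `pause_chunks` consecutive
    silent chunks: the 'pause length' knob."""

    def __init__(self, pause_chunks: int = 3):
        self.pause_chunks = pause_chunks
        self.silent_run = 0

    def feed(self, chunk) -> bool:
        # Returns True once the user is judged to have finished
        # talking, i.e. when it is safe to hand the transcript on.
        if is_silence(chunk):
            self.silent_run += 1
        else:
            self.silent_run = 0  # new speech resets the pause counter
        return self.silent_run >= self.pause_chunks
```

The tension is exactly what the thread describes: a short pause threshold interrupts slow talkers, a long one adds latency, which is why a jointly trained model is attractive.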
        
           | Jonovono wrote:
           | Is this a client / server setup? What are you using for
           | handling the streaming of audio? (daily, livekit, etc?)
        
         | huac wrote:
         | > there needs to be a tool/function calling step before a reply
         | 
         | I built that almost exactly a year ago :) it was good but not
         | fast enough - hence building the joint model.
        
       | allanrbo wrote:
       | Was looking for a demo of it on YouTube and fell over this
       | hilarious one from a few months ago:
       | https://youtu.be/coroLWOS7II?si=TeVghP_Zi0P9exQh . I'm sure it's
       | improved since :-)
        
         | Zenst wrote:
         | Wow, it's so worth watching just for a laugh.
        
           | marci wrote:
           | I'm sorry.
        
       | rch wrote:
        | Do apps running in an a-Shell terminal on the iPad have a
        | convenient way to provide a TTS interface?
        
       ___________________________________________________________________
       (page generated 2024-09-19 23:01 UTC)