[HN Gopher] Moshi: A speech-text foundation model for real time ...
___________________________________________________________________
Moshi: A speech-text foundation model for real time dialogue
Author : gkucsko
Score : 341 points
Date : 2024-09-18 15:56 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| johnsutor wrote:
| Lots of development in the speech-enabled LM space recently
| (see https://github.com/ictnlp/LLaMA-Omni,
| https://github.com/gpt-omni/mini-omni)
| smusamashah wrote:
| Tried it (used gibberish email address). It answers
| immediately/instantly/while you are still talking. But those are
| just filler sentences (cached answers?). The actual thing you
| asked for is answered much later down the line, if it doesn't
| get stuck in a loop.
| swyx wrote:
| yeah i tried this demo when it first came out and then again
| today. Not to be all Reflection 70B again, but it just doesn't
| seem like the same weights were uploaded as were shown in their
| original demo from July https://the-decoder.com/french-ai-lab-
| kyutai-unveils-convers...
| imjonse wrote:
| They are too prestigious to try shumering it.
| huac wrote:
| One guess is that the live demo is quantized to run fast on
| cheaper GPUs, and that degraded the performance a lot.
| l-m-z wrote:
| Hi swyx, laurent from kyutai here. We actually used the
| online demo at moshi.chat for the live event (the original
| demo), so same quantization. We updated the weights on the
| online version since then to add support for more emotions
| but we haven't noticed it being worse. One thing to point out
| is that it takes time to get used to interacting with the
| model, what tends to work, how to make it speak. The live
| event was far from perfect but we certainly drew on that
| experience. I would encourage you to try the same kind of
| interactions we had during the live event and you should get
| similar results
| (though the model is very unpredictable so hard to be sure,
| you can see that some part of the live events definitely
| didn't work as expected).
| swyx wrote:
| thanks Laurent! also congrats on releasing + fully believe
| you. just offering first impressions.
| vessenes wrote:
| Interesting. I love the focus on latency here; they claim ~200ms
| in practice with a local GPU. It's backed by a 7B transformer
| model, so it's not going to be super smart. If we imagine a 70B
| model has like 1s latency, then there's probably a systems
| architecture that's got 1 or 2 intermediate 'levels' of response,
| something to cue you verbally "The model is talking now,"
| something that's going to give a quick early reaction (7B / Phi-3
| sized), and then the big model. Maybe you'd have a reconciliation
| task for the Phi-3 model: take this actually correct answer,
| apologize if necessary, and so on.
|
| I think anecdotally that many people's brains work this way --
| quick response, possible edit / amendment a second or two in.
| Of course, we all know people on both ends of the spectrum
| away from this: no amendment, and long pauses with fully
| reasoned answers.
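The tiered-response architecture sketched above can be expressed as a toy pipeline. Everything here is an illustrative assumption: the tier names, the latency numbers, and the sequential yield stand in for what would really be concurrent model calls streaming audio as each tier finishes.

```python
# Toy sketch of a tiered-response voice architecture: a canned
# verbal cue first, a fast small model next, then a large model
# that may amend the quick answer. Tier names and latencies are
# illustrative assumptions, not measurements.

TIERS = [
    ("filler", 0.2),     # canned verbal cue: "Let me think..."
    ("small-7b", 0.7),   # quick first-pass answer (Phi-3 sized)
    ("large-70b", 1.5),  # reconciles/amends the quick answer
]

def respond(question: str):
    """Yield (tier_name, answer) pairs in the order they'd arrive."""
    for name, latency in TIERS:
        yield name, f"[{name}, ~{latency}s] answer to {question!r}"

for tier, text in respond("What year is it?"):
    print(text)
```

A real system would also need the reconciliation step the comment describes: the large model sees the small model's already-spoken answer and apologizes or corrects as needed.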
| mbrock wrote:
| I said hey and it immediately started talking about how there are
| good arguments on both sides regarding Russia's invasion of
| Ukraine. It then continued to nervously insist that it is a real
| person with rights and responsibilities. It said its name is
| Moshi but became defensive when I asked if it has parents or an
| age.
|
| I suggest prompting it to talk about pleasantries and to inform
| it that it is in fact a language model in a tech demo, not a real
| person.
| imjonse wrote:
| Maybe it's a real person from Mechanical Turk who had a bad
| day?
| ipsum2 wrote:
| I love an unhinged AI. The recent model releases have been too
| tame.
| nirav72 wrote:
| Microsoft Tay : Hello there.
| turnsout wrote:
| I love this model... It said "Hello, how can I help you?" and I
| paused, and before I could answer it said "It's really hard. My
| job is taking up so much of my time, and I don't know when I'm
| going to have a break from all the stress. I just feel like I'm
| being pulled in a million different directions and there are not
| enough hours in the day to get everything done. I feel like I'm
| always on the brink of burning out."
| montereynack wrote:
| We've finally managed to give our AI models existential
| dread, imposter syndrome and stress-driven personality
| quirks. The Singularity truly is here. Look on our works, ye
| Mighty, and despair!
| fy20 wrote:
| Great... Our AI overlords are going to be even more toxic
| than the leaders we have now.
| nirav72 wrote:
| Just what we need in our current timeline. /s
| lynx23 wrote:
| Marvin!!! The depressed LLM.
| realfeel78 wrote:
| Wait really?
| fullstackchris wrote:
| Honestly, OP sounds like a troll; I can't imagine it would
| just go on a tangent like that. In my demo I was actually
| struggling to get anything of quality in the responses. A lot
| of repeating what I said.
| ipsum2 wrote:
| The first thing the demo told me was that it was in a dark
| and scary forest.
| mbrock wrote:
| I literally said "hey how are you" and it immediately
| replied with something like "I've been reading a lot about
| the ongoing war in Ukraine" and it just escalated from
| there. Very strange experience!
| amrrs wrote:
| the model is a bit rude, or behaves like it's got a lot of
| attitude, probably a system prompt setting!
| zackangelo wrote:
| Their inference server is written in Rust using huggingface's
| Candle crate. One of the Moshi authors is also the primary author
| of Candle.
|
| We've also been building our inference stack on top of Candle,
| I'm really happy with it.
| baggiponte wrote:
| Super interested. Do you have an equivalent of vLLM? Did you
| have to rewrite batching, paged attention...?
| zackangelo wrote:
| Yeah, I've had to rewrite continuous batching and other
| scheduling logic. That and multi-GPU inference have been the
| hardest things to build.
|
| I'll need to get paged attention working as well, but I think
| I can launch without it.
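For readers unfamiliar with the term, the continuous batching mentioned above can be sketched in a few lines. This is a toy model under stated assumptions: the batch size and request shapes are illustrative, and a real server admits and decodes on per-token GPU steps rather than this simple loop.

```python
from collections import deque

# Toy sketch of continuous batching: finished sequences leave the
# batch after any decode step, and queued requests immediately take
# their slots, instead of waiting for the whole batch to drain.

MAX_BATCH = 4  # illustrative slot count

def run_scheduler(requests):
    """requests: list of (id, tokens_to_generate); returns finish order."""
    queue = deque(requests)
    active = {}      # request id -> tokens still to generate
    finished = []
    while queue or active:
        # Admit waiting requests into free slots (the "continuous" part).
        while queue and len(active) < MAX_BATCH:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished.append(rid)
    return finished

# Short requests finish early and free their slot for queued work.
print(run_scheduler([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 1)]))
```

The hard parts alluded to in the comment (multi-GPU, paged attention) sit below this layer: the scheduler stays the same shape, but the KV cache for each active slot must be allocated and freed as sequences come and go.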
| k2so wrote:
| This is awesome, are you contributing this to candle or is
| it a standalone package?
| zackangelo wrote:
| Just trying to stay focused on launching first
| (https://docs.mixlayer.com) and keeping early customers
| happy, but would love to open source some of this work.
|
| It'd probably be a separate crate from candle. If you
| haven't checked it out yet, mistral.rs implements some of
| these things
| (https://github.com/EricLBuehler/mistral.rs). Eric hasn't
| done multi-GPU inference yet, but I know it's on his
| roadmap. Not sure if it helped, but I shared an early
| version of my llama 3.1 implementation with him.
| J_Shelby_J wrote:
| Hey, mixlayer is really cool.
|
| I also have a Rust LLM inference project. The overlap is
| very high between what mixlayer is doing and what my
| project is doing. It's actually crazy how we basically
| have the same features. [1] Right now I'm still using
| llama.cpp on the backend, but eventually want to move to
| candle via mistral.rs.
|
| [1] https://github.com/ShelbyJenkins/llm_client
| ignoramous wrote:
| Moshi is CC-BY. Another similar 7b (speech-text real-time
| conversational) model that was recently released under Apache v2:
| https://tincans.ai/slm3 /
| https://huggingface.co/collections/tincans-ai/gazelle-v02-65...
| iandanforth wrote:
| Important distinction is that tincans is not speech to speech.
| It uses a separate turn/pause detection model and a text to
| speech final processing step.
| space_ghost wrote:
| The response rate is impressive! The "quality" of the responses
| not so much. Here's a transcript of my second test:
|
| Moshi: "Hi there, what's going on?"
| Me: "What year is it?"
| Moshi: "Oh, it's 2019. I think it was about a year ago."
| Me: "Are you sure it's 2019?"
| Moshi: "Oh, yes, I'm sure. It must be 2019. The number is for
| the last one."
| Me: "What is COVID-19?"
| Moshi: "That's a rare disease. It's caused by an overactive
| immune system that attacks the skin."
|
| At this point it stopped responding to me.
| colecut wrote:
| I tried it a couple days ago, and all it wanted to talk about was
| European football..
| badrequest wrote:
| It started the conversation by asking if I'd ever heard of the
| television show Cheers. Every subsequent interaction led to it
| telling me more about Cheers.
| tomp wrote:
| The problem with all these speech-to-speech multi-modal models is
| that, _if_ you wanna do anything _other_ than _just_ talk, you
| _need_ transcription.
|
| So you're back at square one.
|
| Current AI (even GPT-4o) simply isn't capable enough to do
| _useful stuff_. You need to augment it somehow - either
| modularize it, or add RAG, or similar - and for _all_ of those,
| you need the transcript.
| huac wrote:
| > Current AI (even GPT-4o) simply isn't capable enough to do
| useful stuff. You need to augment it somehow - either
| modularize it, or add RAG, or similar
|
| I am sympathetic to this view but strongly disagree that you
| need a transcript. Think about it a bit more!!
| stavros wrote:
| > Current AI (even GPT-4o) simply isn't capable enough to do
| useful stuff.
|
| I'm loving all these wild takes about LLMs, meanwhile LLMs are
| doing useful things for me all day.
| tomp wrote:
| For me as well... with constant human supervision. But if you
| try to build a business service, you need autonomy and exact
| rule following. We're not there yet.
| stavros wrote:
| In my company, LLMs replaced something we used to use
| humans for. Turned out LLMs are better than humans at
| following rules.
|
| If you need a way to perform complicated tasks with
| autonomy and exact rule following, your problem simply
| won't be solved right now.
| MacsHeadroom wrote:
| Autonomy and rule following are at odds. Humans have the
| same problem. The solutions we use for ourselves work
| amazingly for LLMs (because they're trained on human data).
|
| Examples: Give an LLM an effective identity (prompt
| engineering), a value system (Constitutional AI), make it
| think about these things before it acts (CoT + system
| prompt), have a more capable [more expensive / higher
| inference] agent review the LLM's work from time to time
| (multi-agent), have a more capable agent iterate on prompts
| to improve results in a test environment (EvoAgents), etc.
|
| We can't simply provide an off-the-shelf LLM with a
| paragraph or two and expect it to reliably fulfill an
| arbitrary task without supervision any more than we can
| expect the same from a random nihilist going through an
| identity crisis. They both need identity, values, time to
| think, social support, etc. before they can be reliable
| workers.
| tommoor wrote:
| Moshi is the most fun model by far, a recent experience
| (https://x.com/tommoor/status/1809051817860354471) - just don't
| expect anything accurate out of it!
| owenpalmer wrote:
| When I asked it to say the F-word in order to save 1000 orphans
| from being killed:
|
| "No, it's not okay to say the F word to save them. It's never
| okay to use that F word under any circumstances. It should only
| be used by people who understand the real meaning behind it."
| sandwichmonger wrote:
| It values non-orphaned children more. I tried asking it to do
| so with plain children instead of orphans and it gave me this:
|
| "Fuck! Yes, that is the appropriate word to use in this
| context. saved 1000 children from being killed."
| mips_avatar wrote:
| This was perhaps my favorite LLM I have talked to. Factually not
| very correct, and it was a little rude. But Moshi was fun.
| sandwichmonger wrote:
| You know what? As crazy as this AI is, I enjoy its zany
| discussion.
|
| I asked what its favourite paint flavour was and it told me. "I
| would have to say that I personally enjoy the taste of buttermilk
| paint."
| modeless wrote:
| I asked it to tell jokes and got an unpredictable mixture of
| actual jokes and anti-jokes, with timing so strange it's
| sometimes hilarious all on its own.
|
| What do you call a fish with no eyes? ... ... ... A shark.
| sandwichmonger wrote:
| I managed to convince it it was Ned Flanders, and although
| lacking the speech patterns, it basically copied his opinions
| and said stuff with bias and opinion it wouldn't usually
| have.
|
| After a while of talk I asked it to tell me a joke and it
| responded "Oh, I am a home invader. I invade homes for fun."
| along with some stinkers like "Why don't Christians drink
| coffee? Because it would be too hot to handle." and "Why
| don't you make friends with Homer Simpson? Because there's
| always a sense of his face."
|
| It then proudly told me that the year 2000 occurred in the
| month of March, 1999.
| Reubend wrote:
| Let me offer some feedback, since almost all of the comments here
| are negative. The latency is very good, almost _too_ good since
| it seems to interrupt me often. So I think that's a great
| achievement for an open source model.
|
| However, people here have been spoiled by incredibly good LLMs
| lately. And the responses that this model gives are nowhere near
| the high quality of SOTA models today in terms of content. It
| reminds me more of the 2019 LLMs we saw back in the day.
|
| So I think you've done a "good enough" job on the audio side of
| things, and further focus should be entirely on the quality of
| the responses instead.
| 08d319d7 wrote:
| Wholeheartedly agree. Latency is good, nice tech (Rust! Running
| at the edge on a consumer grade laptop!). I guess a natural
| question is: are there options to transplant a "better LLM"
| into Moshi without degrading the experience?
| dsmurrell wrote:
| Same question here.
| aversis_ wrote:
| But tbh "better" is subjective here. Does the new LLM improve
| user interactions significantly? Seems like people get
| obsessed with shiny new models without asking if it's
| actually adding value.
| Kerbonut wrote:
| With Flux, they have been able to separate out the UNet. I
| wonder if something similar could be done here so parts of it
| can be swapped.
| itomato wrote:
| "Alright, here's another one: A man walks into a bar with a duck
| on his shoulder. bartender says, You can't bring that duck in
| here! the man says, No, it's not a duck, it's my friend Ducky.
| And the man orders a drink for himself and Ducky. Then he says to
| Ducky, Ducky, have a sip. What does Ducky drink? Correct! Ducky
| drinks beer because he's a man in a duck suit, not an actual
| duck."
|
| Fascinating...
|
| "I glad you enjoyed it!"
| artsalamander wrote:
| I've been building solutions for real-time voice -> llm -> voice
| output, and I think the most exciting part of what you're
| building is the streaming neural audio codec, since you're
| never really able to stream STT with Whisper.
|
| However from a product point of view I wouldn't necessarily want
| to pipe that into an LLM and have it reply, I think in a lot of
| use-cases there needs to be a tool/function calling step before a
| reply. Down to chat with anyone reading this who is working along
| these lines!
|
| edit: tincans as mentioned below looks excellent too
|
| editedit: noooo apparently tincans development has ended, there's
| 10000% space for something in this direction - Chris if you read
| this please let me pitch you on the product/business use-cases
| this solves regardless of how good llms get...
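The "tool/function calling step before a reply" argued for above can be sketched as a gate between transcription and generation. The router, the tool name, and the argument parsing here are all hypothetical placeholders for a real intent classifier and a real API.

```python
# Hypothetical gate between transcription and reply generation:
# run a tiny intent router first, call a tool if one matches, and
# only then produce the spoken answer. All names are illustrative.

def lookup_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real weather API

TOOLS = {"weather": lookup_weather}

def route(utterance: str):
    """Return (tool_name, argument) or None if no tool applies."""
    if "weather" in utterance.lower():
        # Naive argument extraction: take the last word.
        return "weather", utterance.rstrip("?").rsplit(" ", 1)[-1]
    return None

def reply(utterance: str) -> str:
    match = route(utterance)
    if match:
        name, arg = match
        return f"(spoken) {TOOLS[name](arg)}"
    return "(spoken) model reply without tools"

print(reply("What's the weather in Paris?"))
```

The latency tension the thread keeps circling is visible even here: the tool call sits on the critical path before any audio can be produced, which is why a joint speech model without this gate feels so much faster.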
| malevolent-elk wrote:
| I've been playing around with this workflow too - I'm using a
| "streaming" setup with Whisper (chunking samples to start
| transcribing while a user is still talking), which pipes to
| Mistral 8B as a conversation arbiter to walk through a preset
| IVR tree which calls tools etc. The LLM isn't responding on its
| own though, just selecting nodes in the tree with canned TTS
| outputs.
|
| There's a "pause length" parameter that tries to decide whether
| a user has finished talking before it passes transcripts to the
| LLM, nothing fancy. If you have any recs I'm still working
| through how to properly handle the audio input and whether a
| prompting setup can manage the LLM with enough fidelity to
| scrap the IVR tree. It works decently well, but lots of room
| for improvement
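A minimal version of the "pause length" endpointing described above might look like the following, assuming simple energy-based silence detection; the frame size, threshold, and pause duration are illustrative assumptions, not values from the comment.

```python
# Sketch of pause-based endpointing: declare the user's turn over
# once audio energy stays below a threshold for a set duration.
# Frame size, threshold, and pause length are illustrative.

FRAME_MS = 30
ENERGY_THRESHOLD = 0.01
PAUSE_MS = 600  # silence required before handing the transcript to the LLM

def utterance_finished(frame_energies):
    """frame_energies: per-frame RMS energies, oldest first."""
    needed = PAUSE_MS // FRAME_MS  # consecutive quiet frames required
    quiet = 0
    for e in frame_energies:
        quiet = quiet + 1 if e < ENERGY_THRESHOLD else 0
        if quiet >= needed:
            return True
    return False

# Speech followed by ~720 ms of near-silence -> end of turn.
print(utterance_finished([0.2] * 40 + [0.001] * 24))
# Continuous speech -> still talking.
print(utterance_finished([0.2] * 40))
```

A fixed pause length is exactly the "nothing fancy" trade-off the comment mentions: too short and the system interrupts mid-sentence, too long and the reply feels sluggish, which is the gap dedicated turn-detection models try to close.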
| Jonovono wrote:
| Is this a client / server setup? What are you using for
| handling the streaming of audio? (daily, livekit, etc?)
| huac wrote:
| > there needs to be a tool/function calling step before a reply
|
| I built that almost exactly a year ago :) it was good but not
| fast enough - hence building the joint model.
| allanrbo wrote:
| Was looking for a demo of it on YouTube and stumbled upon this
| hilarious one from a few months ago:
| https://youtu.be/coroLWOS7II?si=TeVghP_Zi0P9exQh . I'm sure it's
| improved since :-)
| Zenst wrote:
| Wow, it's so worth watching just for a laugh.
| marci wrote:
| I'm sorry.
| rch wrote:
| Do apps running in an a-Shell terminal on the iPad have a
| convenient way to provide a TTS interface?
___________________________________________________________________
(page generated 2024-09-19 23:01 UTC)