[HN Gopher] Neural audio codecs: how to get audio into LLMs
___________________________________________________________________
Neural audio codecs: how to get audio into LLMs
Author : karimf
Score : 313 points
Date : 2025-10-21 12:55 UTC (10 hours ago)
(HTM) web link (kyutai.org)
(TXT) w3m dump (kyutai.org)
| mondainx wrote:
| Thanks for sharing this well-written post, which I will share
| with my team; we just recently started using audio/voice in our
| AI suite, and the details herein will be helpful and informative.
| amelius wrote:
| > Many LLMs have voice interfaces, but they usually work by
| transcribing your speech, generating the answer in text, and
| using text-to-speech to read the response out loud. That's
| perfectly fine in many cases (...), but it's a wrapper, not real
| speech understanding.
|
| But I can say the same about tokenization. LLMs first convert
| groups of characters to tokens, then use that to generate tokens,
| and then convert the tokens back to characters. That's not real
| understanding! If LLMs are so smart, we should be able to skip
| the tokenization step.
| Workaccount2 wrote:
| Nothing is real understanding because we have no benchmark for
| understanding because we don't mechanistically know what
| understanding is. The best we have is people "vibe knowing" a
| benchmark that they made up on the spot.
| trollbridge wrote:
| An ongoing question I have is why effort wasn't put into
| tokenising speech (instead of transcribed words) and then making
| an LLM out of that. There are huge amounts of speech available to
| train on.
| MichealCodes wrote:
| I don't think we've had the transformer moment for audio
| training yet, but yes, in theory audio-first models will be
| much more capable.
| trollbridge wrote:
| Particularly interesting would be transformations between
| tokenised audio and tokenised text.
|
| I recall someone telling me once that up to 90% of communication
| can be non-verbal, so when an LLM sticks to just text, it's
| only getting 10% of the data.
| benob wrote:
| Audio tokenization consumes at least 4x as many tokens as text,
| so there is an efficiency problem to start with. Then, is there
| enough audio data to train an LLM from scratch?
| trollbridge wrote:
| Start an MVNO that offers cheaper phone plans and train on all
| those phone calls.
|
| There are big libraries of old speeches.
|
| Simply capture all current radio/TV transmissions and
| train on that (we've already established copyright doesn't
| apply to LLM training, right?)
| miki123211 wrote:
| > Start an MVNO that offers cheaper phone plans and train on
| all those phone calls.
|
| q: What is 2+2?
|
| A: The warranty for your car has expired...
| 542354234235 wrote:
| Don't we have tens of thousands of hours (hundreds of
| thousands?) of closed-captioned TV shows and movies? How many
| hours of news broadcasts with transcripts do we have? Maybe I
| just don't understand what is needed, but it seems like we
| have a lot of data to work with.
| roboror wrote:
| Sure, but that needs to be licensed.
| cruffle_duffle wrote:
| Correct me if I'm wrong but you need more than just closed
| captions. You need precise timing too. I'd think you'd need
| the text to line up exactly with the audio so when the
| voice makes an "A" sound the text it aligns with is "A" as
| well.
|
| So while having the closed captions saves some of the work,
| there is probably much more needed to get everything lined
| up.
|
| But I'm absolutely not an expert at all. In fact, this is
| the first time I've ever even thought about it!
| vvolhejn wrote:
| Author here. Speech-to-text is more or less solved; it's
| easy to automatically get captions including precise
| timestamps. For training Moshi, Kyutai's audio LLM, my
| colleagues used whisper-timestamped to transcribe 7
| million hours of audio.
|
| See Section 4.2 in the Moshi paper:
| https://arxiv.org/pdf/2410.00037
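|
| For the curious, a rough sketch of word-level transcription with
| whisper-timestamped; the file name and model size are
| placeholders, and this only mirrors the library's README usage,
| not Kyutai's actual pipeline:
|
|     import whisper_timestamped as whisper
|
|     audio = whisper.load_audio("clip.wav")  # placeholder file
|     model = whisper.load_model("small")
|     result = whisper.transcribe(model, audio)
|
|     # Each segment carries per-word timestamps.
|     for segment in result["segments"]:
|         for word in segment.get("words", []):
|             print(word["text"], word["start"], word["end"])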
| cruffle_duffle wrote:
| Sweet!
| ca_tech wrote:
| There is data, but nowhere near the amount of written language,
| which is fairly normalized and doesn't require accounting for
| additional features such as language, dialect, intonation,
| facial expression, and hand gestures. Speech-to-text is used as
| the translation layer because it throws many of those other
| features away and contextualizes what remains into a set of
| tokens that are much more efficient to map between languages.
| mohsen1 wrote:
| It costs more to train on audio tokens, but I'm sure we will
| get there. Training a model on the transcript of a YouTube
| lecture vs. training on the audio of it will make a difference.
| nmfisher wrote:
| The article is talking about doing exactly that. The key
| question is how to convert an inherently continuous signal
| (speech/audio) into a discrete set of tokens. A single window
| of audio is usually somewhere between 10ms and 100ms. It's
| difficult to squish all that information down to a single
| "token" that represents the semantic and acoustic content for
| that window.
|
| That's why residual vector quantization is a useful technique -
| using multiple dictionaries to quantize a single timeslice,
| each conditioned on the previous residual level. You can also
| quantize a signal at different frequencies.
|
| There are samples towards the end of the post of their LLM
| trained on their Mimi audio codec.
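|
| To make the RVQ idea concrete, here is a toy numpy sketch of
| the greedy encode/decode loop, with tiny random codebooks
| standing in for the learned ones a real codec like Mimi trains
| (all sizes are illustrative):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     # 2 residual levels, 32 entries per codebook,
|     # 8-dimensional embeddings.
|     codebooks = rng.normal(size=(2, 32, 8))
|
|     def rvq_encode(x, codebooks):
|         # Each level quantizes what earlier levels missed.
|         residual, codes = x.copy(), []
|         for cb in codebooks:
|             dists = np.linalg.norm(cb - residual, axis=1)
|             idx = int(np.argmin(dists))
|             codes.append(idx)
|             residual -= cb[idx]
|         return codes
|
|     def rvq_decode(codes, codebooks):
|         # Reconstruction is the sum of the chosen entries.
|         return sum(cb[i] for cb, i in zip(codebooks, codes))
|
|     x = rng.normal(size=8)            # one frame's embedding
|     codes = rvq_encode(x, codebooks)  # a few small ints/frame
|     x_hat = rvq_decode(codes, codebooks)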
| CGMthrowaway wrote:
| _> The key question is how to convert an inherently
| continuous signal (speech /audio) into a discrete set of
| tokens. A single window of audio is usually somewhere between
| 10ms and 100ms. It's difficult to squish all that information
| down to a single "token" that represents the semantic and
| acoustic content for that window._
|
| I read the article and confess some of the modeling parts
| were above my comprehension. But I would like to add that as
| an audio engineer, the "key question" you describe is solved,
| just not applied to transformer models (?).
|
| An experienced engineer can look at a waveform in a DAW and
| identify specific consonants, vowels, specific words, etc
| quite fluently. And with tools like Melodyne - which already
| quantize audio semantically - they can identify (and
| manipulate) pitch and formants as well, turning an O vowel
| into an E vowel, or changing the inflection of a phrase (up-
| speak vs down-speak, for example).
|
| I don't know how to apply this to a neural codec, but it
| seems like it shouldn't be that hard (that's my naivete
| coming through)
| PaulDavisThe1st wrote:
| > An experienced engineer can look at a waveform in a DAW
| and identify specific consonants, vowels, specific words,
| etc quite fluently.
|
| As an experienced DAW author, I very, very much doubt this.
|
| What can be done relatively easily is to "see" or rather
| "follow along" in the waveform when listening to the audio.
| But I read your claim as being that someone could look at
| the waveform (which is already decimated from the original)
| and identify words or phonemes without hearing the
| associated audio. I am extremely skeptical that there is
| anyone anywhere in the world who can do this.
| CGMthrowaway wrote:
| I started in music but have since edited thousands of
| hours of podcasts. I cannot transcribe a track by looking
| at the waveform, except the word "um" haha. But without
| playing the audio I can tell you where words start and
| end, whether a peak is a B or a T or an A or an I
| sound... And melodyne can add layers to that and tell me
| the pitch, formants (vowels), quantize the syllables etc.
| If I can do all this, a computer ought to be able to do
| the same and more
| spudlyo wrote:
| Hundreds of hours here, and I can't even always reliably
| spot my own ums. I edit as many out as I possibly can for
| myself, my co-host and guest, as well as eliminating
| continuation signaling phrases like "you know" and
| "like". I also remove uninteresting asides and bits of
| dead air. This is boring and tedious work but it makes
| the end result considerably better I think.
|
| I feel like there should be a model that can do much of
| this for me but I haven't really looked into it,
| ironically due to laziness, but also because I edit
| across multiple tracks at this stage, and I'm afraid to
| feed the model an already mixed stereo track. I'm curious
| why you still do it manually, if you still do and if
| you've looked into alternatives.
| PaulDavisThe1st wrote:
| > I edit as many out as I possibly can for myself, my co-
| host and guest, as well as eliminating continuation
| signaling phrases like "you know" and "like". I also
| remove uninteresting asides and bits of dead air.
|
| Hopefully using Ardour's "Ripple - Interview" mode :))
| vvolhejn wrote:
| I use Descript to edit videos/podcasts and it works great
| for this kind of thing! It transcribes your audio and
| then you can edit it as if you were editing text.
| PaulDavisThe1st wrote:
| Yeah, that stuff is just freaking amazing. I don't know
| what the transcription quality is like, but if I was
| doing this as a job, and it was good at transcription,
| I'd definitely be using that all the time.
| jampekka wrote:
| > An experienced engineer can look at a waveform in a DAW
| and identify specific consonants, vowels, specific words,
| etc quite fluently.
|
| DAWs' rendered waveforms have so little information that
| such identification is likely impossible even in theory.
| Telling apart plosives and vowels maybe, but not much more
| than that.
|
| I work with phoneticians and they can (sometimes) read even
| words from suitably scaled spectrograms, but that's a lot
| more information than in waveforms.
| duped wrote:
| > the key question is how to convert an inherently continuous
| signal (speech/audio) into a discrete set of tokens
|
| Did Claude Shannon not answer this question in 1948? You need
| at least 1 bit per 6dB of dynamic range for each symbol and
| 2B symbols per second where B is the bandwidth of the signal.
|
| Compression techniques are all about getting below that
| fundamental limit but it's not like this is an unsolved
| problem. Or is 1kbaud too much for LLMs?
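|
| As a back-of-the-envelope check of those numbers (assuming a
| ~4 kHz telephone-band signal and ~96 dB / 16-bit dynamic range;
| both figures are illustrative):
|
|     bandwidth_hz = 4_000        # assumed speech bandwidth
|     dynamic_range_db = 96       # roughly 16-bit audio
|
|     bits_per_symbol = dynamic_range_db / 6  # 1 bit per 6 dB
|     symbols_per_sec = 2 * bandwidth_hz      # Nyquist rate
|     raw_bps = bits_per_symbol * symbols_per_sec
|     print(raw_bps)  # 128000.0 bits/s; codecs work below this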
| generuso wrote:
| One of the popular speech-to-text models is Whisper, which
| starts with the conventional spectral analysis of the speech
| signal, and then feeds the data into a Transformer model. It
| works quite well.
|
| https://openai.com/index/whisper/
|
| Such an approach dates back to the 1940s, when people were
| trained to read speech from spectrograms. There is a 1947 book
| "Visible Speech" by Potter, Kopp, and Green describing these
| experiments. Here is a slightly more recent 1988 review of
| the subject: "Formalizing Knowledge Used in Spectrogram
| Reading"
|
| https://apps.dtic.mil/sti/tr/pdf/ADA206826.pdf
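|
| That conventional front end is essentially a log-mel
| spectrogram. A small librosa sketch (the file name is a
| placeholder and the parameters are only approximately Whisper's
| settings):
|
|     import numpy as np
|     import librosa
|
|     y, sr = librosa.load("speech.wav", sr=16_000)
|     mel = librosa.feature.melspectrogram(
|         y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
|     log_mel = np.log(np.maximum(mel, 1e-10))
|     # (80 mel bands, num_frames): the 2D input the
|     # Transformer then reads much like an image.
|     print(log_mel.shape)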
| bkitano19 wrote:
| Awesome post!
| krackers wrote:
| Indeed, the title undersells it and I'm glad I didn't skip over
| it; the article is basically an information-dense but
| approachable summary of audio generation.
| robviren wrote:
| This has got to be one of the most visually pleasing explanations
| I have seen of these concepts. Congrats!
|
| I attempted some similar VQ-VAE work, instead trying to tokenize
| rendered text. I was curious if I could make a visual LLM working
| on 10 pt rendered font, but I also tried using PDF sources. The
| basic idea was to do what more advanced diffusion image models
| can do when they generate images of text: make a specific
| image-text diffusion model to do completions. Further, I wondered
| if I could embed things like document type and language so you
| could have a latent representation of text more abstracted than
| current dictionary tokenizers. Learned a lot and thought it was
| all beautifully displayed in this post.
| crazygringo wrote:
| This is fascinating.
|
| Obviously working directly with audio is vastly more complex than
| with text.
|
| But it is very exciting to see how part of making LLMs work
| natively with speech is finding a codec that is maximally
| efficient at encoding speech.
|
| I even have to wonder if, at some point, we ultimately create a
| popular voice codec usable with LLMs based not on the Fourier
| transform or similar, but rather on some kind of set of physical
| parameters describing vocal cord shape, tongue position,
| throat/chest/mouth shape, etc.
|
| I can imagine such a model being arrived at statistically
| (determining the necessary number of parameters), and then almost
| becoming "hard-coded" as a standard since human anatomy doesn't
| change much there, beyond certain ranges.
|
| I think it's called formant speech encoding, and it would be
| interesting if LLMs wind up massively advancing that field,
| since historically I think it's had more to do with speech
| synthesis than audio compression.
| quinndupont wrote:
| There's a long history of attempts at artificial speech that
| take this approach, recreating mouth parts and vibrating air.
| They are all pretty silly, like this work, which fails to
| understand how writing isn't just a derivative of speech.
| crazygringo wrote:
| > _They are all pretty silly,_
|
| Huh? How?
|
| > _like this work which fails to understand how writing isn't
| just a derivative of speech._
|
| The whole point of the article is that writing isn't just a
| derivative of speech. It's in the introduction.
| duped wrote:
| In speech coding/synthesis this is called a "source-filter"
| model
| (decompose speech production into a sound generator in the
| vocal folds and filter in the vocal tract, and parameterize
| them) and it's actually older than Tukey and Cooley's
| rediscovery of the FFT.
| vvolhejn wrote:
| Author here, thanks for the kind words! I think such a physics-
| based codec is unlikely to happen: in general, machine learning
| is always moving from handcrafted domain-specific assumptions
| to leaving as much as possible to the model. The more
| assumptions you bake in, the smaller the space of sounds you
| can model, so the quality is capped. Basically, modern ML is
| just about putting the right data into transformers.
|
| That being said, having a more constrained model can also lead
| to some really cool stuff. The DDSP paper learns how to control
| a synthesizer to mimic instruments:
| https://arxiv.org/abs/2001.04643
|
| You could probably do something similar for a speech model. The
| result would not sound as good, but you could get away with far
| fewer parameters, because much of the modelling work is done by
| the assumptions you put in.
|
| Compare also KokoroTTS, a tiny TTS that's so tiny because it
| uses a handcrafted system to turn text into phonemes, and then
| just synthesizes from those phonemes:
| https://huggingface.co/spaces/hexgrad/Kokoro-TTS
| bob1029 wrote:
| Why not normal audio codecs? How are JPEG and MP3 (i.e.,
| DCT/MDCT) not a reasonable way to go about tokenizing spatial and
| time domain signals for these kinds of models?
|
| Each MP3 frame is entirely self-contained and can completely
| reconstruct a few tens of milliseconds of original audio. It does
| not require other frames to do this. I think this is the most
| important element. At 128kbps CBR, each MP3 frame is ~418 bytes
| and covers ~26 milliseconds of time. This is a reduction of
| 10-11x over the raw PCM waveform. MP3 is also designed to
| eliminate the information that humans don't seem to care about.
|
| I don't know if it's possible to use 400 byte tokens in a
| transformer model, but I would be very compelled to try.
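|
| The arithmetic checks out, assuming 44.1 kHz audio and the
| MPEG-1 Layer III frame size of 1152 samples:
|
|     bitrate, sample_rate = 128_000, 44_100
|     samples_per_frame = 1_152          # MPEG-1 Layer III
|
|     frame_sec = samples_per_frame / sample_rate  # ~0.026 s
|     frame_bytes = bitrate * frame_sec / 8        # ~418 bytes
|     pcm_bitrate = sample_rate * 16 * 2     # 16-bit stereo PCM
|     print(frame_sec, frame_bytes, pcm_bitrate / bitrate)
|     # ~0.0261  ~417.9  ~11.0x reduction over raw PCM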
| PaulDavisThe1st wrote:
| The approach in TFA encodes into a 32 dimensional space. I
| suspect this is significantly more dimensions than any psycho-
| acoustic compression algorithm uses. Also, throwing away
| information that our hearing systems can't process very well is
| not particularly useful if your goal is speech (or more
| generally, audio) synthesis from scratch.
| bob1029 wrote:
| > throwing away information that our hearing systems can't
| process very well is not particularly useful if your goal is
| speech (or more generally, audio) synthesis from scratch.
|
| I'm not sure I follow. If there is a set of tokens that the
| average human cannot perceive, why wouldn't we want to
| eliminate them from the search space? Who is the target
| audience for this model?
| CaptainOfCoit wrote:
| Maybe because things outside our audible range could
| impact/influence things inside our audible range?
| 542354234235 wrote:
| I imagine it would be like if there were Rosetta Stones of
| text, written with a language you could read and a language
| you couldn't. For your purposes, discarding the text you
| can't read would be fine and you wouldn't lose anything.
| But if you were ingesting a bunch into an LLM, the
| additional text would give the LLM more context and help it
| make connections and relate words more accurately, even if
| you never were going to have it output anything in the
| language you don't understand.
|
| The inaudible sounds add context and additional datapoints
| on how the audible sounds are related.
| PaulDavisThe1st wrote:
| Humans that read (at least) Indo-European languages can
| read texts in their native language with all the vowels
| removed. Does that suggest that it would be a good idea to
| remove the vowels from text before using it for training
| text-based LLMs ?
|
| Presumably you want to train on as rich a set of data as
| possible, even if some of that data is redundant or
| irrelevant when it comes to human perception.
| Tangurena2 wrote:
| Generally, the difference between regional dialects is
| almost all in vowels (sample: 0). This is why SOUNDEX [1]
| eliminated vowels.
|
| 0 - https://www.acelinguist.com/2020/01/the-pin-pen-
| merger.html
|
| 1 - https://en.wikipedia.org/wiki/Soundex
| ACCount37 wrote:
| You _can_ try to train an adapter from a raw 400-byte MP3 frame
| to an embedding for a given LLM (4096+ floating point numbers,
| exact precision varies).
|
| But you'd need that information to be _digestible_ for a neural
| network. Otherwise, you'll have a very hard time getting that
| adapter to work.
|
| As a rule: neural networks love highly redundant data, and hate
| highly compressed data at their inputs. Tokenized text good,
| GZIP compressed bytestream bad. But who knows, really. It's a
| rule of thumb, not a mathematical law. So you could have some
| success getting that MP3-based adapter to work. I've seen
| weirder shit work.
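|
| For concreteness, a toy version of such an adapter (sizes are
| illustrative; a real one would need far more than a two-layer
| MLP to make raw MP3 bytes digestible):
|
|     import torch
|     import torch.nn as nn
|
|     class Mp3FrameAdapter(nn.Module):
|         # Maps one ~418-byte MP3 frame to an LLM-sized vector.
|         def __init__(self, frame_bytes=418, llm_dim=4096):
|             super().__init__()
|             self.net = nn.Sequential(
|                 nn.Linear(frame_bytes, 1024),
|                 nn.GELU(),
|                 nn.Linear(1024, llm_dim),
|             )
|
|         def forward(self, frames):  # (batch, 418), in [0, 1]
|             return self.net(frames)
|
|     adapter = Mp3FrameAdapter()
|     frames = torch.randint(0, 256, (2, 418)).float() / 255.0
|     print(adapter(frames).shape)  # torch.Size([2, 4096])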
| a-dub wrote:
| if you were able to normalize and quantokenize the distinct
| dct values in a consistent way, it could be an interesting
| approach. so yeah, undo the bit packing but keep the front
| end signal processing and compressed dct representation and
| voila! something quite weird that might actually work. :)
| WithinReason wrote:
| JPEG is a good idea:
|
| _The simple, elegant approach of training convolutional neural
| networks (CNNs) directly from RGB pixels has enjoyed
| overwhelming empirical success. But can more performance be
| squeezed out of networks by using different input
| representations? In this paper we propose and explore a simple
| idea: train CNNs directly on the blockwise discrete cosine
| transform (DCT) coefficients computed and available in the
| middle of the JPEG codec. Intuitively, when processing JPEG
| images using CNNs, it seems unnecessary to decompress a
| blockwise frequency representation to an expanded pixel
| representation, shuffle it from CPU to GPU, and then process it
| with a CNN that will learn something similar to a transform
| back to frequency representation in its first layers. Why not
| skip both steps and feed the frequency domain into the network
| directly? In this paper we modify libjpeg to produce DCT
| coefficients directly, modify a ResNet-50 network to
| accommodate the differently sized and strided input, and
| evaluate performance on ImageNet. We find networks that are
| both faster and more accurate, as well as networks with about
| the same accuracy but 1.77x faster than ResNet-50._
|
| https://proceedings.neurips.cc/paper_files/paper/2018/file/7...
|
| I suspect mp3 is also a good idea
| cubefox wrote:
| I believe language models usually use 2-byte (16-bit) tokens,
| which corresponds to a vocabulary of 2^16 = 65536 possible
| tokens. With 400 bytes per token, the vocabulary would be
| 2^(400*8), which is an extremely large number. Way too large to
| be practical, I assume.
| vvolhejn wrote:
| Author here. There are a few reasons, but the biggest one is
| simply the compression ratio.
|
| The OG neural audio codec SoundStream (whose first author is
| Neil, now at Kyutai) can sound decent at 3kbps, whereas MP3
| typically has around 128kbps, as you say. Interestingly, it was
| originally developed for audio compression for Google Meet, not
| for LLMs. Today's neural codecs have even better compression.
|
| The more modern MP3 alternative is Opus, which can work ok at
| 12kbps, but it's still less efficient than neural audio codecs.
| However, these traditional codecs are a lot less CPU-hungry, so
| they have that going for them.
| espadrine wrote:
| That makes sense.
|
| Why RVQ though, rather than using the raw VAE embedding?
|
| If I compare rvq-without-quantization-v4.png with
| rvq-2-level-v4.png, the quality seems oddly similar, but the
| former takes a 32-sized vector, while the latter takes two
| 32-sized (one-hot) vectors (2 = number of levels, 32 =
| number of quantization cluster centers). Isn't that more?
| vvolhejn wrote:
| I had a part about this but I took it out: for compression,
| you could keep the embeddings unquantized and it would
| still compress quite well, depending on the embedding
| dimension and the number of quantization levels.
|
| But categorical distributions are better for modelling.
| It's a little difficult to explain here without using
| diagrams. The intuition is that if you try to have a model
| predict the next _embedding_ and not the next _token_ , you
| can't model multimodal distributions - you'll end up
| predicting the mean of the possible continuations and not
| the mode, which is not what you want.
|
| Check out Section 5.3 and Figure 6 from PixelRNN, where
| they discuss this phenomenon:
| https://arxiv.org/pdf/1601.06759
|
| At the bottom of the blog, I link two articles that do make
| continuous embeddings work. One of them is the Kyutai paper
| Continuous Audio Language Models:
| https://arxiv.org/abs/2509.06926
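|
| A tiny numerical illustration of that intuition: if the next
| embedding is equally likely to be -1 or +1, the MSE-optimal
| single guess is 0, which is neither valid continuation, whereas
| a categorical model can put half its mass on each option:
|
|     import numpy as np
|
|     targets = np.array([-1.0, +1.0])  # two continuations
|     candidates = np.linspace(-1.5, 1.5, 301)
|     mse = [np.mean((targets - c) ** 2) for c in candidates]
|     best = candidates[int(np.argmin(mse))]
|     print(best)  # ~0.0: the mean, not either mode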
| HarHarVeryFunny wrote:
| Human audio perception is based on detecting frequency
| components, via what amounts to a filter bank in the inner ear
| (different-length hairs with different resonant frequencies).
|
| Speech perception builds upon frequencies and is based on
| "formants" - the frequency bands shaped by the vocal tract
| resonances created by articulation when the speech was
| generated. More specifically, most speech information is
| contained in formant _changes_, since these correspond to
| articulatory changes. There are also other articulatory
| artifacts in speech such as the onsets of speech energy
| corresponding to plosives ("puh", "buh"), and the high
| frequencies generated by fricatives like "sss".
|
| One problem with embedding MP3 frames as audio tokens would be
| that although MP3 compression is based on frequency
| representation, you've then got quantization, huffman encoding
| and the MP3 frame structure all on top of that, so the frame as
| a whole is going to be more of a black box. Presumably a
| transformer could still use MP3 frames to predict the text
| transcription, or any arbitrary encoding of speech audio for
| that matter (similar to how an LLM can predict text from Base64
| representation, or vice versa), but it's certainly not making
| it easier if the input is obfuscating the frequency components
| and formants etc that correspond to the generating process.
|
| Not having direct access to the frequency/formant information
| is also going to make generalization more difficult since that
| is based around formant structure and changes. When
| articulating the same word, the specific formant frequencies
| will differ between individuals, primarily based on vocal tract
| length, but humans have no problem generalizing across these
| and understanding speech from different individuals. I'm not
| sure if an LLM only trained to predict MP3 speech from, say,
| male adults, would necessarily have generalized enough to also
| be able to recognize child speech or that from a speech
| synthesizer.
| miki123211 wrote:
| > Try asking any of them "Am I speaking in a low voice or a high
| voice?" in a high-pitched voice, and they won't be able to tell
| you.
|
| I wonder how much of that is LLMs being bad, and how much is LLMs
| being (over) aligned not to do it.
|
| AFAIK, ChatGPT Voice mode had to have a lot of safeguards put on
| it to prevent music generation, accent matching (if you sound
| Indian, it shouldn't also sound Indian), and assuming ethnicity /
| biasing based on accents.
|
| It doesn't seem that impossible to me that some of these
| behaviors have been aligned out of these models out of an
| abundance of caution.
| sbrother wrote:
| I don't think it's just safeguards; they really don't seem to
| understand pitch at all. I tried asking ChatGPT's advanced
| voice mode to recognize a tune I was humming, and it insisted
| it was Beethoven's 5th -- multiple times. I think it must have
| basically tokenized my humming to "dun dun dun duuun".
| bigzyg33k wrote:
| Advanced voice mode operates on audio tokens directly; it
| doesn't transcribe them into "text tokens" as an intermediate
| step like the original version of voice mode did.
| cubefox wrote:
| But they behave just like models which use text tokens
| internally, which is also pointed out at the end of the
| above article.
| bigzyg33k wrote:
| we don't know if that's due to inherent limitations of
| the tokenisation of audio, or a byproduct of
| reinforcement learning. In my own usage, I noticed a
| significant degradation in capabilities over time from
| when they initially released advanced voice mode. The
| model used to be able to sing, whisper, imitate sounds
| and tone just fine, but I imagine this was not intended
| and has subsequently been stunted via reinforcement
| learning.
|
| I don't find the article's argument that this is due to
| tokenisation convincing.
| cubefox wrote:
| They didn't say it's due to tokenization.
|
| > This is likely because they're trained on a lot of data
| generated synthetically with text-to-speech and/or
| because understanding the tone of the voice (apparently)
| doesn't help the models make more accurate predictions.
| sbrother wrote:
| right, but either whatever audio tokenization it's doing
| doesn't seem to encode pitch, or there's ~nothing where
| pitch is relevant in the training set.
| oezi wrote:
| Absolutely correct! My simple test is whether it can tell the
| American and British English pronunciations of "tomato" and
| "potato" apart. So far it can't.
| fragmede wrote:
| Which "it" are you referring to? There are models that
| can.
| idonotknowwhy wrote:
| Qwen3 Omni's transcriber can do this. It can describe the voice
| and emotion very well.
| 85392_school wrote:
| I've also had luck with Gemini. If I made a few noises and
| asked which one was higher pitched, it could easily tell.
| tsol wrote:
| Did they respond differently depending on what race they
| thought you were? I'm surprised they would even do that
| honestly. I thought they were trained on text conversations
| which presumably wouldn't have any of that to learn from.
| OisinMoran wrote:
| You can often tell where someone is from from text alone!
| There are plenty of idiosyncrasies even in how different
| English speaking countries use the language.
| anotherhue wrote:
| Ah stop
| fragmede wrote:
| Like, what do you mean? Are there, like, particular
| mannerisms that people from some regions that are hella
| unique to those regions?
| robotresearcher wrote:
| I say old chap, what colour are your mummy's wellies?
| ctxc wrote:
| Clever!
| ElevenLathe wrote:
| You betcha!
| thwarted wrote:
| If it did, it responded based on the accent it picked up on,
| not race, because race and accent are orthogonal; correlation
| does not imply causation.
| dotancohen wrote:
| Are you denying that race and accent are highly correlated?
| j45 wrote:
| There are subtle differences in language where two groups can
| be speaking English and one is having a completely different
| conversation without saying much.
| dotancohen wrote:
| This is quite the reason my wife evolved into my ex-wife.
| vvolhejn wrote:
| Author here. I think it's more of a capability issue than a
| safety issue. Since learning audio is still harder than
| learning text, audio models don't generalize as well. To fix
| that, audio models rely on combining information from text and
| audio (having a single model that consumes/produces both text
| and audio tokens) and the audio tokens basically end up being
| an integrated speech-to-text/text-to-speech. This reflects my
| colleagues' experience working on Moshi, and it seems to be the
| case for other models too, see the Conclusion section.
|
| Part of the reason can also be synthetic data: if you fine-tune
| on data generated from text via a text-to-speech, the tone of
| the voice doesn't have any information, so the model learns to
| ignore it.
| j45 wrote:
| Accent detection or consciously ignoring it is a filter step.
| JoshTriplett wrote:
| Audio models for speech not understanding pitch seems similar
| to how text LLMs often don't understand spelling: it's not what
| they were trying to recognize.
| oezi wrote:
| > generated from text via a text-to-speech
|
| Yes, frustratingly we don't have good speech-to-text
| (STT/ASR) to transcribe such differences.
|
| I recently finetuned a TTS* to be able to emit laughter, and
| hunting for transcriptions which include non-verbal sounds was
| the hardest part of it. Whisper and other popular transcription
| systems will ignore sighs, sniffs, laughs, etc., and can't
| detect mispronunciations either.
|
| * = https://github.com/coezbek/PlayDiffusion
| jasonjayr wrote:
| IIRC -- the 15.ai dev was training on fan-made "My Little
| Pony" transcriptions, specificaly because they included
| more emotive clues in the transcription, and supported a
| syntax to control the emotive aspect of the speech.
| dotancohen wrote:
| Where can I read about this?
| smusamashah wrote:
| There was an example, on the OpenAI blog, of ChatGPT copying
| the speaker's voice and responding in it mid-conversation. This
| was presented as an example of misalignment.
| wordglyph wrote:
| I used aistudio and it understood pitch and even emotion with
| an uploaded mp3.
| bongodongobob wrote:
| Hmm, the last time I played with GPT voice mode it was able to
| do all kinds of different accents.
| quinndupont wrote:
| Y'all need to learn about the history and development of spoken
| language and writing. Writing isn't just a copy or derivation of
| speech. LLMs work because of the conceptual characteristics of
| writing (consider the distinctions between ideographic,
| logographic, alphabetical...). What a sloppy mess!
|
| Read some Wittgenstein and Goodman, but especially Derrida who
| calls this logocentrism.
| daxfohl wrote:
| Another interesting thing here is that the model presumably has
| some understanding of the passage of time. That's one thing that
| can be odd about chat models, in that they will respond the same
| no matter whether you respond a second later or a month later.
|
| I think even for text models, "streams" could be useful. Perhaps
| if the LLM sees too long of a pause after explaining something
| and asking a question, they could interject a "do you need help?"
| or something. Pure chat GPTs don't have that ability.
| daxfohl wrote:
| I wonder if a linear-space, constant-time model like RWKV or S4
| would work better here. For audio, I wouldn't think you'd need
| long range context, and all-to-all mapping seems like overkill.
|
| Maybe a transformer could be running in parallel, but much lower
| frequency, where the linear model feeds it "summary" tokens once
| per second, whose information would mostly be "text", but also
| some hint of emotion and other cues. Then the output of this
| could be fed back to the linear model so that it would know what
| it was saying and with what emotion. Basically the transformer
| would be the low frequency long range context thinker (and
| feeler), and the linear model would translate that to and from
| phonetics.
|
| They'd be trained in parallel, so those transformer tokens would
| attain meaning at training time, not something that would have to
| be pre-defined. So it'd still be purely phonetic e2e, no direct
| translation to text. It could even end up being a good way to
| compress text for LLMs, since low-value words might have smaller
| representation in the token.
|
| Probably would never reach the level of text based LLMs for logic
| and code and such, but that somewhat parallels humans anyway;
| it's pretty hard to explain an algorithm in detail in plain
| conversation.
| tehnub wrote:
| Write this paper please!
| daxfohl wrote:
| If anyone wants to buy me some GPU time I'd be happy to try
| it out! Fair warning: my only experience in deep learning
| thus far was training a CNN to count dots on an image, which
| worked semi-reliably up to 8, when the image was perfectly
| square black "dots" on a perfectly white background.
| smokel wrote:
| Off-topic, but it would be great if everyone who voiced
| their opinion on something would add a small disclaimer
| with their actual knowledge about the subject. Thanks for
| sharing :)
| fragmede wrote:
| Sure. what's your venmo?
| vvolhejn wrote:
| I don't know about linear models, but this kind of hierarchical
| modelling is quite a common idea in speech research. For
| example, OpenAI's Jukebox (2020) [1], which uses a proto-neural
| audio codec, has three levels of encoding that get coarser and
| coarser. They use a language model to predict continuations in
| the coarsest level and then have models to upscale to the finer
| levels and finally back to audio.
|
| The recent MiMo-audio bunches tokens into "patches" of four
| timesteps and has the model predict those. [2]
|
| [1] https://arxiv.org/abs/2005.00341
|
| [2] https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-
| Audi...
| lxe wrote:
| I've been messing around with Higgs Audio that actually uses the
| delay pattern. It has to apply it and then unapply it after the
| generation. I noticed it's actually really hard to chunk and
| stream audio correctly when you need to apply and reapply these
| patterns essentially to the "entire" output.
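|
| For anyone unfamiliar, a small numpy sketch of applying and
| unapplying a MusicGen-style delay pattern (PAD and the shapes
| are illustrative, not Higgs Audio's actual layout):
|
|     import numpy as np
|
|     PAD = -1
|
|     def apply_delay(codes):         # (levels, timesteps)
|         levels, T = codes.shape
|         out = np.full((levels, T + levels - 1),
|                       PAD, dtype=codes.dtype)
|         for k in range(levels):
|             out[k, k:k + T] = codes[k]  # shift level k by k
|         return out
|
|     def undo_delay(delayed, T):
|         levels = delayed.shape[0]
|         return np.stack([delayed[k, k:k + T]
|                          for k in range(levels)])
|
|     codes = np.arange(12).reshape(3, 4)  # 3 levels, 4 steps
|     assert (undo_delay(apply_delay(codes), 4) == codes).all()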
| mmaunder wrote:
| Thanks for posting; I wasn't aware of Kyutai, and it seems your
| work is perfect for something I'm working on.
| croemer wrote:
| Typo: "not even a the length of one word"
| vvolhejn wrote:
| merci, will fix tomorrow
| Rickasaurus wrote:
| I wouldn't mind so much if they cheated on the way back but
| listened in earnest. There are use cases, like teaching a
| language, where having the AI understand the sounds carefully
| matters a ton.
___________________________________________________________________
(page generated 2025-10-21 23:00 UTC)